GlenKPeterson / PdfLayoutManager

Adds line-breaking, page-breaking, tables, and styles to PDFBox

Should accept any PDFont over PDType1Font in TextStyle #5

Closed tveimo closed 9 years ago

tveimo commented 9 years ago

It would be better if TextStyle accepted any PDFont instance, e.g. PDTrueTypeFont, instead of only PDType1Font.

GlenKPeterson commented 9 years ago

I agree! Especially for supporting character sets outside of WinAnsiEncoding, AKA Windows Code Page 1252: http://en.wikipedia.org/wiki/Windows-1252

Unfortunately, this is going to be difficult. Most fonts that support a wide range of characters are big, like 10MB or more. Embedding such a font in every PDF file is totally unacceptable for most people who have to build PDF files on the fly. So the code is going to have to detect what characters are used, then embed an appropriate subset of characters in the resulting PDF. Different fonts are already divided into subsets, but not necessarily the same subsets.
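
To make the "detect what characters are used" step concrete, here is a minimal sketch in plain Java, assuming all of the document's text is available as strings before any font is embedded. The class and method names are hypothetical (they are not part of PdfLayoutManager or PDFBox), and the sketch stops short of the subsetting itself, which would be up to the font library.

```java
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical helper for illustration; not part of PdfLayoutManager or PDFBox.
final class UsedCodePoints {
    private UsedCodePoints() {}

    /** Returns the distinct Unicode code points that appear in the given strings, sorted. */
    static SortedSet<Integer> collect(List<String> texts) {
        SortedSet<Integer> used = new TreeSet<>();
        for (String text : texts) {
            // codePoints() joins surrogate pairs, so a character outside the
            // Basic Multilingual Plane is recorded once, not as two halves.
            text.codePoints().forEach(cp -> used.add(cp));
        }
        return used;
    }
}
```

The resulting set is the list of glyphs a subset font would need to cover; everything else in a 10MB font could, in principle, be left out of the file.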

Last time I looked, PDFBox had hard-coded a character encoding that made it difficult for me to work with alternative character encodings. What they did might be correct, but I looked at it, got confused, and gave up. Another volunteer started playing with this and gave up too.

That said, this is definitely a solvable problem. There is a broad spectrum of for-profit PDF-producing software. I think the main reason they can charge money for their products is precisely because this problem is so hard.

The 14 standard fonts exposed through PDType1Font are guaranteed by the PDF spec to be supported in all PDF readers without any font embedding. The character set they support is good enough for many purposes. My solution was to map every WinAnsi character from its Unicode equivalent, add transliteration for Russian, and convert any other character to a bullet, so that at least you'd see when some characters weren't being encoded properly (as a string of dots).

PdfLayoutManager supports a character set that includes the following languages:

Afrikaans (af), Albanian (sq), Basque (eu), Catalan (ca), Danish (da), Dutch (nl), English (en), Faroese (fo), Finnish (fi), French (fr), Galician (gl), German (de), Icelandic (is), Irish (ga), Italian (it), Norwegian (no), Portuguese (pt), Scottish Gaelic (gd), Spanish (es), Swedish (sv)

Romanized substitutions are used for the Cyrillic characters of the modern Russian (ru) alphabet according to ISO 9:1995 with the following phonetic substitutions: 'Ch' for Ч and 'Shch' for Щ.

Details: https://github.com/GlenKPeterson/PdfLayoutManager/blob/master/src/main/java/com/planbase/pdf/layoutmanager/PdfLayoutMgr.java#L841
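
For readers who want the shape of that mapping without opening the file, here is a simplified sketch. It is not the code at the link above: the class name, method name, and lookup table are assumptions made for illustration. The table covers only a handful of Cyrillic letters, and the pass-through test covers only printable ASCII and the Latin-1 range that Windows-1252 shares with Unicode, not the extra punctuation in the 0x80 to 0x9F block.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only; names and table contents are assumptions,
// not the actual PdfLayoutMgr implementation.
final class WinAnsiSketch {
    private static final int BULLET = 0x2022;

    // Deliberately incomplete transliteration table (Unicode code point -> Latin text).
    private static final Map<Integer, String> CYRILLIC = new HashMap<>();
    static {
        CYRILLIC.put(0x0410, "A");     // CYRILLIC CAPITAL LETTER A
        CYRILLIC.put(0x0414, "D");     // CYRILLIC CAPITAL LETTER DE
        CYRILLIC.put(0x0427, "Ch");    // CYRILLIC CAPITAL LETTER CHE
        CYRILLIC.put(0x0429, "Shch");  // CYRILLIC CAPITAL LETTER SHCHA
        CYRILLIC.put(0x0430, "a");     // CYRILLIC SMALL LETTER A
        CYRILLIC.put(0x0434, "d");     // CYRILLIC SMALL LETTER DE
        CYRILLIC.put(0x0447, "ch");    // CYRILLIC SMALL LETTER CHE
        CYRILLIC.put(0x0449, "shch");  // CYRILLIC SMALL LETTER SHCHA
    }

    private WinAnsiSketch() {}

    // Simplified check: printable ASCII, the Latin-1 range, and the bullet itself.
    // Control characters such as newlines are outside the scope of this sketch.
    private static boolean passesThrough(int cp) {
        return (cp >= 0x20 && cp <= 0x7E) || (cp >= 0xA0 && cp <= 0xFF) || cp == BULLET;
    }

    /** Keeps supported characters, transliterates known Cyrillic, bullets the rest. */
    static String toSafeText(String in) {
        StringBuilder out = new StringBuilder(in.length());
        in.codePoints().forEach(cp -> {
            String translit = CYRILLIC.get(cp);
            if (passesThrough(cp)) {
                out.appendCodePoint(cp);
            } else if (translit != null) {
                out.append(translit);
            } else {
                out.appendCodePoint(BULLET); // visible marker for an unsupported character
            }
        });
        return out.toString();
    }
}
```

With this sketch, `WinAnsiSketch.toSafeText("Да")` returns "Da", while a character with no mapping, such as a CJK ideograph, comes back as a single bullet.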

GlenKPeterson commented 9 years ago

I closed this issue and labeled it "wontfix", meaning I won't fix it NOW, because it's going to be challenging and I don't want anyone holding their breath. But I also labeled it "enhancement", meaning that I am very open to conversation or patches toward achieving this goal, and may even tackle it myself some day.