GlenKPeterson / PdfLayoutManager

Adds line-breaking, page-breaking, tables, and styles to PDFBox

TTF font handling #4

Closed (jartysiewicz closed 8 years ago)

jartysiewicz commented 9 years ago

Hi Glen

I just created a small patch for handling TTF fonts in your library. Test.pdf didn't display Russian and Chinese (and possibly some other languages) correctly, but the original version using Type1 fonts also renders them wrong. I don't know, maybe it's something on my machine.

Best, Jan Artysiewicz

GlenKPeterson commented 9 years ago

Wow - COOL! I'm going to list all the hurdles to including this patch, because that will be our TO-DO list for incorporating your patch into this project. But that could come across as negative and I want you to know that your contribution is not merely positive, it's AWESOME! I always wanted to have font support for more languages, but the first couple things I tried didn't work. You made it look easy! You may need to educate me a little bit for this to move forward...

1. This is the biggest item, but it's a discussion, not a requirement...

My understanding is that all PDF reading software has the PDFonts included. So for PDFs that don't actually need another font, there is no need to embed any fonts in the PDF file, thus decreasing file size. At work, we use PdfLayoutManager to generate PDFs on the fly from real-time data on a web server, so we have to worry about download time, server load, etc. I'd like to keep the option to use only PDFonts and let the user choose when to embed one or more PDType1Fonts instead.

What's covered by PDFonts and what are their limits? That's kind of an ugly topic. See here if you're interested (this is probably way too much information, just read the first few lines of comments and maybe pop over to the Wikipedia article): https://github.com/GlenKPeterson/PdfLayoutManager/blob/master/src/main/java/com/planbase/pdf/layoutmanager/PdfLayoutMgr.java#L521

This deserves some thought and maybe discussion, but not necessarily any action. Is there ever a case where we'd start using PDFonts, then encounter a character outside that character set and change font for just that one character on the fly? Could we keep track of what characters we used, and only embed the characters we need from the other font? Would that be too difficult, or would it throw off the page layout, or even just look bad to mix fonts like that? You are welcome to email me at my first name (which you spelled correctly - thank you) at organicdesign dot org, and I'll reply with my phone number which you are free to call 9AM-7PM US Eastern Time any day to discuss.

The simplest solution is probably just to make a second version of each of the methods that take a PDFont currently, so that the user can choose whether to use the PDFont, or the PDType1Font version of that method.

Hmm... The fonts are only 350K, 200K, and 200K. That's a lot smaller than I expected. Can you really include Chinese in such a small font? I sort of suspect that this version of the font does not include those characters and I think such a font would be much bigger. Especially for different size characters... I need more information and to think about it for a little bit, but maybe it's OK to always include a font or two?

2. When I look here: http://www.fonts2u.com/ubuntu-mono.font It looks like the font is licensed GNU/GPL. Is that correct? I don't know if including the font is considered "linking" for the purposes of the GPL, but it makes me very nervous to distribute anything GPL with Apache-licensed software. I need to use this project at work and sometimes distribute it for profit without open-sourcing all the code that we link to it.

I've never run across the issue of combining a GPL font with Apache code before, so I don't know what effect it has. We need to find out before we can incorporate the font. Not from some random discussion on the web. From an official statement by the Apache Software Foundation, the Free Software Foundation, or from the authors of the font. We need proof that including a GPL font will not contaminate Apache-licensed software.

3. Please don't change the compiler target/source from Java 1.5 to 1.7. It was a bit of work to say we support 1.5 (though I've only actually compiled it with 1.6). I think restricting users to Java 1.7 or newer will significantly diminish the potential audience, and supporting 1.5 isn't that hard. See Issue https://github.com/GlenKPeterson/PdfLayoutManager/issues/1 from October 2014.

4. Please do not use .* for imports ever. Set your IDE to only use .* with 99 files or something. See here: http://stackoverflow.com/questions/147454/why-is-using-a-wild-card-with-a-java-import-statement-bad
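On the question in item 1 of whether to switch fonts when a character falls outside the built-in set, a first step would be detecting such characters. The following is only a minimal sketch: it assumes, purely for illustration, that ISO-8859-1 stands in for what the built-in fonts can render (the real limits are the ones described in the linked source comments), and the class and method names are hypothetical, not part of PdfLayoutManager or PDFBox. It sticks to Java 1.5 language features.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper, not part of PdfLayoutManager or PDFBox.
final class FontFallbackSketch {
    // Assumption: ISO-8859-1 approximates the built-in fonts' character set.
    private static final Charset BASE = Charset.forName("ISO-8859-1");

    /** Returns the distinct chars in text that the stand-in charset cannot encode. */
    static Set<Character> unsupportedChars(String text) {
        CharsetEncoder encoder = BASE.newEncoder();
        Set<Character> missing = new LinkedHashSet<Character>();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (!encoder.canEncode(c)) {
                missing.add(Character.valueOf(c));
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // The Polish city name Lodz written with its diacritics (as escapes).
        // L-with-stroke (U+0141) and z-with-acute (U+017A) are reported;
        // o-with-acute (U+00F3) is not, because ISO-8859-1 covers it.
        System.out.println(unsupportedChars("\u0141\u00F3d\u017A"));
    }
}
```

An empty result would mean the text can stay on the built-in fonts; a non-empty one is the set of characters that would need an embedded font.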

THANK YOU for your contribution. I think all of these issues are speed bumps, not roadblocks. I'd love it if you made these changes. I could do it if necessary, though it may take me a few weeks.

jartysiewicz commented 9 years ago

Glen

I don't know what the legal effect is of mixing GPL fonts with Apache-style licensed software. We can just put an Apache-style licensed font in place of the Ubuntu ones, or we can delete them (they are used only for the test). A little bit about my motivation: I'm looking for a solution for generating a lot (tens of thousands) of PDF reports in a short amount of time. We want a fully open-source, "programmer friendly" solution, so this excludes Apache FOP (brain damage while using) and iText (slimy license). This led me to check out your project, and the first thing is: when you live in Europe you must deal with a lot of letters outside the ASCII charset. Embedded fonts are a necessity, not an option. I also used to work for companies that have their own typefaces (no licence issues!), and disk space/network throughput come second. And at least PDFont is the base class for PDType1Font, so you do not need to create multiple versions. I just created a commit which reverted some changes. I also must think for a while about your multilanguage questions (I'm from Poland and I'm not so quick with thinking in English).

Sorry for changing the JVM version, I have some kind of "reflex" when I see this :). I've reverted this (and the * in the static import) with a new commit.

GlenKPeterson commented 9 years ago

There is a list of languages that PdfLayoutManager supports fully and partially here: https://github.com/GlenKPeterson/PdfLayoutManager/blob/master/src/main/java/com/planbase/pdf/layoutmanager/PdfLayoutMgr.java#L841 That covers many of the most popular "Western" languages used around the world, plus "standard" transliteration for Russian. Many companies in countries that use languages not covered by that set use English for business, and this font set has proven good-enough so far where I work now.

I just looked at the ClearlyU font which has almost 10,000 characters (yay!), and it's 1.2 megabytes (boo!). That's bigger than most of the PDF files we currently produce, so including it in every PDF file would make each file take at least twice as long to download - not generally acceptable. Even 10K characters is a fraction of the 110K characters currently in Unicode.

Some fonts are split into several files containing the different scripts. Using a font like that, we could search each String that's written to the PDF file. The first time we encounter a character from a script, we know to embed the appropriate section of that font. Then we only end up embedding the huge font files when we actually need those characters. That's a fair amount of work, but that seems like the ideal design. Any ideas how to scale back from that, to what you actually need today?
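The "first time we encounter a character from a script" bookkeeping could be sketched roughly as follows. This is a sketch under assumptions, not the project's implementation: it uses `Character.UnicodeBlock` (available in Java 1.5) as a stand-in for "script", the class and method names are hypothetical, and the actual embedding of a font section is left out entirely.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical tracker, not part of PdfLayoutManager or PDFBox.
final class ScriptTrackerSketch {
    // Unicode blocks seen so far across all strings written to the document.
    private final Set<Character.UnicodeBlock> seen =
            new LinkedHashSet<Character.UnicodeBlock>();

    /**
     * Records the Unicode blocks used by text and returns only the blocks
     * encountered for the first time. A caller could embed the matching
     * font section at that point.
     */
    Set<Character.UnicodeBlock> noteText(String text) {
        Set<Character.UnicodeBlock> firstSeen =
                new LinkedHashSet<Character.UnicodeBlock>();
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            i += Character.charCount(codePoint);
            Character.UnicodeBlock block = Character.UnicodeBlock.of(codePoint);
            // of() returns null for code points outside any defined block.
            if (block != null && seen.add(block)) {
                firstSeen.add(block);
            }
        }
        return firstSeen;
    }
}
```

Each call reports only new blocks, so a document that never leaves Basic Latin would never trigger embedding of the larger font sections. Unicode blocks are finer-grained than scripts, so a real design would probably map several blocks to one font file.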

Yes, an Apache-friendly license font would be ideal! Here is a list of officially Apache-friendly licenses: https://www.apache.org/legal/resolved#category-a

Your English is excellent. I had no idea that it wasn't your first language.

Regarding your latest diff: You need to provide the minimum possible diff for me to merge. Don't change the style, reformat, or change spacing. Even if those changes make it better, it's harder to read the diff. I've had other people say that to me before and I didn't like hearing it, but I still need you to follow the minimum-diff rule.

If you are really interested in correcting only inconsistent formatting, you can submit formatting-only changes as a separate diff, but I think that's just a distraction from a potentially much larger contribution at the moment.