PhilterPaper / Perl-PDF-Builder

Extended version of the popular PDF::API2 Perl-based PDF library for creating, reading, and modifying PDF documents
https://www.catskilltech.com/FreeSW/product/PDF%2DBuilder/title/PDF%3A%3ABuilder/freeSW_full

[CTS 16] Consistent code point handling across font types #81

Open PhilterPaper opened 6 years ago

PhilterPaper commented 6 years ago

Ref RT 120048/#47

Core fonts and Type1 fonts are currently limited to single-byte encodings, and use the automap() method to map their glyphs over multiple planes (of up to 256 glyphs each). It would be good to extend them in some way to handle UTF-8 text, so that one would not need to constantly switch between subfonts (planes) to see and use all the glyphs in a font (see 020_corefonts, 021_psfonts, 021_synfonts). Is there any way to natively use UTF-8 with these font types? We want to avoid automatically running automap() and switching planes under the covers, as this would be very bulky and slow. Also, automap() does not guarantee that the same code point will map to the same glyph across different versions of a given font file!
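To illustrate the current plane-switching burden, here is a minimal sketch of automap() use with a core font (font choice, plane index, and byte value are all illustrative; which glyph is actually selected varies by font file):

```perl
use strict;
use warnings;
use PDF::Builder;

my $pdf  = PDF::Builder->new();
my $page = $pdf->page();
my $text = $page->text();

my $font = $pdf->corefont('Helvetica', 'encode' => 'latin1');
# plane 0 is the base font itself; automap() returns additional
# font objects, one per 256-glyph plane beyond the base encoding
my @planes = ($font, $font->automap());

$text->translate(72, 700);
$text->font($planes[0], 15);    # base plane: ordinary Latin-1 text
$text->text('Hello ');
$text->font($planes[1], 15);    # must switch planes to reach later glyphs
$text->text(chr(0x21));         # a single-byte code within that plane

$pdf->saveas('automap_demo.pdf');
```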

On the other hand, TrueType and OpenType fonts are UTF-8 ready, but utilities such as 021_synfonts need to be extended to show glyphs beyond the first page (plane 0). 022_truefonts shows plane 0 per the encoding, but everything else is listed by CID (Character ID), arranged by Unicode point. Perhaps automap() could be written to handle this? We want 021_synfonts to display all glyphs for a TrueType font.
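By contrast, a TrueType font can already take UTF-8 text directly, with no plane switching (font path illustrative):

```perl
use utf8;    # the literal text below is UTF-8
use PDF::Builder;

my $pdf  = PDF::Builder->new();
my $page = $pdf->page();
my $text = $page->text();

# path is illustrative; any Unicode-capable TTF/OTF will do
my $font = $pdf->ttfont('/usr/share/fonts/truetype/dejavu/DejaVuSans.ttf');

$text->font($font, 15);
$text->translate(72, 700);
$text->text("Grüße, Ελληνικά, здравствуйте");    # one font object, many scripts

$pdf->saveas('ttf_utf8_demo.pdf');
```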

The idea is to get consistent text handling, regardless of what kind of font (core, Type1, TrueType, etc.) happens to be used. If you're content to stay in a single byte encoding, you can do that (although automap should continue to be supported for legacy purposes). If you want to use UTF-8 with core or Type1 fonts, to seamlessly access all glyphs by Unicode point, you should be able to do that.

PhilterPaper commented 6 years ago

In Type1 (and possibly core) fonts, there is Unicode point information, so in theory we can determine the glyph number (GID, G+nnn) for any desired Unicode character at document creation time. However, the current output mechanism is based on a map from single byte to glyph name, and something else would have to be found.

PhilterPaper commented 3 years ago

For the single-byte encodings that Core and Type1 routines support, UTF-8 support could be added with a glue layer that builds one or more custom encoding tables per PDF (or per page). Thus, a given font file might see two or more subfonts, each with an encoding table of up to 220+ glyphs (Unicode points), with the subfont to use selected on the fly. This of course could mean frequent switching of font objects on a given page, but that would be the price to pay for allowing UTF-8 encoding (mapped to one of the subfonts).

Matters might be improved by building the subfont(s) per page, rather than globally. Fill up one subfont encoding table, then start the next one from scratch (empty). The second table would probably largely overlap the first (especially the ASCII content), but very little switching back and forth between subfonts would be needed. Hopefully, no more than one or two subfont tables would be needed for a page, and maybe for an entire document, assuming you're not doing something like a font dump (ttfont, synfont, etc.).

I think that UTF-8-to-single-byte mapping tables would be "first come, first served", rather than trying to maintain Unicode (or Latin-1) ordering over any part of them. That way you could fill all 256 glyph slots per subfont, and fit the next chunk of points so that frequent swapping isn't needed. Even fairly pathological cases such as font dumps, while needing perhaps dozens of subfont tables, are feasible.
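A rough sketch of that "first come, first served" allocation, in plain Perl (all names hypothetical, not existing PDF::Builder code): each new Unicode point takes the next free single-byte slot in the newest subfont table, and a fresh table is opened when one fills up.

```perl
use strict;
use warnings;

# Hypothetical first-come, first-served subfont slot allocator.
# Each table holds up to 256 single-byte slots.
my @tables;    # each element: { map => { codepoint => byte }, next => n }

sub assign_slot {
    my ($codepoint) = @_;
    # reuse an existing assignment if this point has been seen before
    for my $i (0 .. $#tables) {
        return ($i, $tables[$i]{map}{$codepoint})
            if exists $tables[$i]{map}{$codepoint};
    }
    # otherwise take the next free slot, opening a new table if needed
    push @tables, { map => {}, next => 0 }
        if !@tables || $tables[-1]{next} > 255;
    my $t    = $tables[-1];
    my $byte = $t->{next}++;
    $t->{map}{$codepoint} = $byte;
    return ($#tables, $byte);    # (which subfont, which single-byte code)
}
```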

Add: there is still the issue of font files changing over time (add/remove glyphs, change GIDs), so this might require embedding the font (or a subset of it) in the PDF, to ensure stability. If we're going to go through the trouble of doing that, it might be as effective to synthesize a UTF-8 TTF font from the T1, core, CJK, etc. font and save that.

Core fonts are often TTF, but this is hidden from us when we use corefont(), and they could still change over time.

PhilterPaper commented 2 years ago

If doing subfonts for a page, it might be best, when the next subfont is needed, to preload it with the entire 7-bit ASCII set, as these are going to be frequently used anyway (at least, for Western [Latin alphabet] languages). This still gives 128 extra characters to build up another chunk of page, which should be enough for most purposes. The font (subfont) will only have to be changed once or twice a page, rather than constantly going back and forth between subfonts. And, ASCII text will still be readable in the PDF (i.e., not shifted around). If you are working in a non-Western language and don't really need ASCII, perhaps there could be a flag to not initialize with the ASCII character set?

Keep in mind that when building up a subfont (ASCII plus the 0x80-0xFF range) for single-byte encodings such as corefont, the text itself will have to be translated to use the single-byte code point (which will vary) instead of the original Unicode code point. We also assume that the Unicode name of each glyph is available for placement in the subfont table as needed.
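Continuing the hypothetical allocator sketch above (with new_table() standing in for its bare table initializer), the ASCII preload and the text translation might look like this:

```perl
# Preload each new table with printable 7-bit ASCII, identity-mapped so
# plain ASCII stays readable in the PDF; custom glyphs fill 0x80..0xFF.
sub new_table {
    my %map = map { $_ => $_ } 0x20 .. 0x7E;
    return { map => \%map, next => 0x80 };
}

# Translate a Unicode string into runs of (subfont index, byte string),
# preferring the subfont already in use so ASCII never forces a switch.
sub translate_string {
    my ($str) = @_;
    my (@runs, $cur);
    for my $cp (map { ord } split //, $str) {
        my ($ti, $byte);
        if (defined $cur && exists $tables[$cur->[0]]{map}{$cp}) {
            ($ti, $byte) = ($cur->[0], $tables[$cur->[0]]{map}{$cp});
        } else {
            ($ti, $byte) = assign_slot($cp);    # from the earlier sketch
        }
        push @runs, ($cur = [$ti, '']) if !defined $cur || $cur->[0] != $ti;
        $cur->[1] .= chr($byte);
    }
    return @runs;    # emit each run with the matching subfont object
}
```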

PhilterPaper commented 2 years ago

If we can map a multibyte UTF-8 input character to a single byte (0x80-0xFF range) in one of the current page's subfonts, would that be enough? Or are there other aspects of TTF/OTF fonts, currently ignored in corefonts, psfonts, etc., that should be implemented? Kerning information? Alternate glyphs and ligatures? Of course, that information would have to be present in the file so it could be read! Would it be any good to determine whether a corefont/cjkfont, etc. is actually a TTF/OTF and directly use those facilities instead? If it's not a TTF/OTF font file at its base, HarfBuzz::Shaper won't be able to do anything with it. A user always has the option, if they know that it's really a TTF/OTF font, to use ttfont directly and skip all the nonsense about corefont and cjkfont, so it may not be worthwhile to go too far into this. However, UTF-8 support could still be quite useful for corefont and psfont.
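Whether a font file is really TTF/OTF at its base can at least be sniffed from its first four bytes (a sketch; signatures per the OpenType and Type1 specs):

```perl
use strict;
use warnings;

# Identify a font file flavor by its leading signature bytes:
#   0x00010000 or 'true' -> TrueType outlines
#   'OTTO'               -> OpenType with CFF outlines
#   'ttcf'               -> TrueType Collection
#   '%!'  or 0x80 0x01   -> Type1 (PFA or PFB)
sub font_flavor {
    my ($file) = @_;
    open my $fh, '<:raw', $file or return 'unreadable';
    read($fh, my $sig, 4) or return 'unreadable';
    close $fh;
    return 'ttf'   if $sig eq "\x00\x01\x00\x00" or $sig eq 'true';
    return 'otf'   if $sig eq 'OTTO';
    return 'ttc'   if $sig eq 'ttcf';
    return 'type1' if $sig =~ /^%!/ or substr($sig, 0, 2) eq "\x80\x01";
    return 'unknown';
}
```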

PhilterPaper commented 1 year ago

Keep in mind that Adobe is phasing out support for T1 (PS) fonts. Currently, new releases of their editors and other creation software supposedly will not handle T1 fonts. Eventually, it is possible that even their readers won't recognize T1 fonts. There's nothing that says third-party creation, editing, and reading software (such as PDF::Builder) can't continue to use T1 fonts, but be aware that PDFs created with T1 fonts will start having trouble being handled by Adobe products, presumably eventually including readers such as Acrobat Reader. In addition, new T1 fonts will probably be in short supply, as font houses may not bother creating new fonts in T1 format.

I recently added support for .t1 font types, but don't plan to go much beyond that for T1 font support. If I can support core fonts with UTF-8 input, I'll probably go ahead and copy that functionality to T1 fonts, as most of the work will have already been done and it should be just a SMOP to add it to psfont() support.

PhilterPaper commented 1 year ago

Apart from universal UTF-8 (multibyte character capability), are there any prospects for embedding non-TTF (core, Type1, etc.) fonts in a PDF? See #80 for more discussion. Even core fonts can vary slightly from installation to installation, and I'm not sure about T1 fonts (whether they're already embedded or not -- supposedly it is possible to embed T1). More research needs to be done to see what information is carried along -- perhaps a subset of (or full) core/[T1]/BDF/CJK/etc. font file can optionally be carried along as an embedded file? As with TTF/OTF fonts, don't forget to watch out for copyright restrictions on doing this!

terefang commented 1 year ago

It is currently only possible to construct such a table for embedded TTF fonts via CIDFontType2 and a CIDToGIDMap; this allows you to use UCS2/UTF16BE as CIDs.
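For reference, that object layout looks roughly like this (object numbers, font name, and stream references are illustrative; with /Identity-H the 2-byte string codes are taken as CIDs, and the CIDToGIDMap stream maps each one to a glyph ID):

```
10 0 obj    % the Type0 composite font the page resources reference
<< /Type /Font /Subtype /Type0
   /BaseFont /ABCDEF+SomeFont
   /Encoding /Identity-H            % 2-byte codes are used directly as CIDs
   /DescendantFonts [11 0 R]
>> endobj

11 0 obj    % the CIDFontType2 wrapper around the embedded TrueType
<< /Type /Font /Subtype /CIDFontType2
   /BaseFont /ABCDEF+SomeFont
   /CIDSystemInfo << /Registry (Adobe) /Ordering (Identity) /Supplement 0 >>
   /FontDescriptor 12 0 R           % points at the embedded font program
   /CIDToGIDMap 13 0 R              % stream of 2 bytes per CID, mapping
>> endobj                           %   UTF-16BE code point -> glyph ID
```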

For OTF you need to construct a CIDFontType0 with a CJK-style CMap. (I never managed to create working code, but I also never tried hard enough, relying on my CID-output mapper instead.)

Core/T1/T3 fonts are limited to single-byte encodings.

To embed a Core font in a manner not infringing on Adobe's rights, use the structures as-is, but then add properties as you would for a T1 font and embed a metrics-compatible font file from the TeX Gyre distribution instead. (That could be either pfb, ttf, or otf -- some Linux distros contain pfb versions of those fonts for exactly that purpose.)
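In PDF::Builder terms that substitution might look like this (file paths and the name map are illustrative; TeX Gyre Heros/Termes/Cursor are the metric-compatible stand-ins for Helvetica/Times/Courier):

```perl
use PDF::Builder;

# illustrative map from core font names to metrics-compatible TeX Gyre
# files (paths vary by system; some distros ship pfb versions instead)
my %texgyre = (
    'Helvetica'   => '/usr/share/fonts/opentype/texgyre/texgyreheros-regular.otf',
    'Times-Roman' => '/usr/share/fonts/opentype/texgyre/texgyretermes-regular.otf',
    'Courier'     => '/usr/share/fonts/opentype/texgyre/texgyrecursor-regular.otf',
);

my $pdf  = PDF::Builder->new();
# embed the stand-in instead of leaving the core font unembedded
my $font = $pdf->ttfont($texgyre{'Times-Roman'});
```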

PhilterPaper commented 1 year ago

Regardless of who owns the rights to a particular font (the font foundry which Adobe may have licenses with, or Adobe itself), copyright and licensing terms must of course be respected*. I don't know if anything is built into a font to indicate restrictions on embedding and permit automatic switching between embedding full/subset/none -- if there is, I don't think PDF::API2 or PDF::Builder is looking at it.

(*) I'm willing to provide functionality which might be used to circumvent copyright/license restrictions, but it should be up to the producer of the PDF whether or not they deliberately choose to violate terms. I don't want to be caught in the position of misleading or entrapping PDF::Builder users into making violations for which they might be prosecuted! An analogy might be that I sell you a nice big kitchen knife, whose stated use is to cut meat to recipe-sized chunks. If you choose to use it to murder someone, that's on you.

Note that some licenses allow only the glyphs used to be embedded as a subset, while others permit the entire font (or at least, a larger subset, such as the full ASCII). If you have a fillable form, not having all the characters someone might need to type in could be a problem! I'm not sure how much PDF::API2/Builder supports such forms, so that may currently be a moot point. In the future, once fillable forms are fully supported (if there's interest in it), this will have to be addressed.

Also note that Adobe is moving away from T1 (e.g., .pfb files), so I don't know if it would be a good idea to add a dependency on T1 fonts. I get the feeling that the whole field of T1 fonts may stagnate and few new ones will be produced. It's probably not worthwhile for me to put a lot of effort into adding UTF-8 capability to T1 fonts -- I'm hoping that much of the work I do for core fonts (multibyte encoding and embedding) can come along for T1 with minimal effort.

Finally, I don't know what to do about "CJK" fonts -- my understanding is that they are something of a half-way preliminary effort with TTF/OTF and may be a dead-end. It may not be worth doing much with them, and full TTF/OTF should be directly used instead. I haven't gotten much feedback from users of CJK fonts.

terefang commented 1 year ago

> Regardless of who owns the rights to a particular font (the font foundry which Adobe may have licenses with, or Adobe itself), copyright and licensing terms must of course be respected*. I don't know if anything is built into a font to indicate restrictions on embedding and permit automatic switching between embedding full/subset/none -- if there is, I don't think PDF::API2 or PDF::Builder is looking at it.

In the very early days of PDF::API it would look at the fsType flag of the OS/2 structure, but since at that time so many broken or ill-maintained font files were in use, PDF::API2 would simply turn a blind eye to it and let the producer do what they want.
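The fsType flags can still be read easily enough; a sketch using Font::TTF (bit meanings per the OpenType OS/2 table spec):

```perl
use strict;
use warnings;
use Font::TTF::Font;

# read the OS/2 fsType embedding-permission flags from a TTF/OTF file
my $f = Font::TTF::Font->open($ARGV[0]) or die "can't open font: $!";
$f->{'OS/2'}->read();
my $fsType = $f->{'OS/2'}{'fsType'};

# 0 = installable, 0x0002 = restricted (no embedding),
# 0x0004 = preview & print only, 0x0008 = editable,
# 0x0100 = no subsetting, 0x0200 = bitmap embedding only
printf "fsType = 0x%04x\n", $fsType;
print  "embedding not permitted\n"  if $fsType & 0x0002;
print  "subsetting not permitted\n" if $fsType & 0x0100;
$f->release();
```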

terefang commented 1 year ago

> Also note that Adobe is moving away from T1 (e.g., .pfb files), so I don't know if it would be a good idea to add a dependency on T1 fonts. I get the feeling that the whole field of T1 fonts may stagnate and few new ones will be produced. It's probably not worthwhile for me to put a lot of effort into adding UTF-8 capability to T1 fonts -- I'm hoping that much of the work I do for core fonts (multibyte encoding and embedding) can come along for T1 with minimal effort.

Core fonts are Type1 technology, just without embedding.

> Finally, I don't know what to do about "CJK" fonts -- my understanding is that they are something of a half-way preliminary effort with TTF/OTF and may be a dead-end. It may not be worth doing much with them, and full TTF/OTF should be directly used instead. I haven't gotten much feedback from users of CJK fonts.

I don't know if Adobe had CJK technology first or was trying to digest the TTC specification, but reading between the lines you can clearly see that there was a half-baked attempt to support CJK somehow with TTC and/or CFF which stopped mid-spec.