PhilterPaper / Perl-PDF-Builder

Extended version of the popular PDF::API2 Perl-based PDF library for creating, reading, and modifying PDF documents
https://www.catskilltech.com/FreeSW/product/PDF%2DBuilder/title/PDF%3A%3ABuilder/freeSW_full

[RT 128674] error "requested cmap '' not installed" with many CJK fonts #98

Closed PhilterPaper closed 5 years ago

PhilterPaper commented 5 years ago

perlbug [...] jgreely.com (ssimms/pdfapi2#32)

PDF::API2 version 2.033 Font::TTF version 1.06 Perl version 5.20.2

Sample code:

#!/usr/bin/env perl
use strict;
use warnings;
use PDF::API2;

my $pdf = PDF::API2->new();
$pdf->page();

$pdf->ttfont("NotoSansJP-Medium.otf");
# downloaded from here:
# https://github.com/googlei18n/noto-cjk/blob/master/NotoSansJP-Medium.otf

dies with:

requested cmap '' not installed at /Users/Shared/perlbrew/perls/perl-5.20.2/lib/site_perl/5.20.2/PDF/API2/Resource/CIDFont/TrueType/FontFile.pm line 27.

About a third of the CJK fonts I have don't work with PDF::API2. I used Noto Sans JP in the above example because it's freely available, but it's not the only one, and at least one other PostScript-flavored OTF font that I own does work (DFKyoKaShoStd-W4.otf). Is there a workaround or a fix?

-j

PhilterPaper commented 5 years ago

In FontFile.pm (as well as CJKFont.pm, if that one is called), the ROS (argument to _look_for_cmap()) is 'Adobe:Identity'. This is not one of the four supported CMaps: Adobe:Japan1, Adobe:Korea1, Adobe:CNS1 (traditional), and Adobe:GB1 (simplified). https://github.com/adobe-type-tools/cmap-resources seems to contain the up-to-date CMaps, so I'll have to take a look at what's involved in updating PDF::Builder's list of CMaps from there (open source, just have to keep the copyrights). At a minimum, that means adding Adobe:Identity, and possibly some others (as well as updates to the existing four). There's a lot of stuff there, so I want to understand what it all is before I bloat PDF::Builder with unnecessary CJK-related files. At the least, this means adding some more files to the CMap directory, and updating the internal lists of available CMaps in FontFile.pm and CJKFont.pm.

We may want to consider auto-generating the list of available CMAPs from reading the CMap directory (on the fly), so that users can add whatever CMAPs they want to that directory, rather than shipping PDF::Builder with everything under the sun.
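Such an on-the-fly scan might look something like this (a sketch only; the directory layout and the filename-to-CMap naming convention here are assumptions, not PDF::Builder's actual code):

```perl
use strict;
use warnings;

# Build the list of available CMaps by scanning the CMap directory at
# startup, instead of hard-coding the supported set. Each 'name.cmap'
# file becomes an entry; the caller still maps ROS strings such as
# 'Adobe:Japan1' onto these names.
sub available_cmaps {
    my ($cmap_dir) = @_;
    my %cmaps;
    opendir(my $dh, $cmap_dir) or die "cannot open '$cmap_dir': $!";
    for my $file (readdir $dh) {
        next unless $file =~ /^(\w+)\.cmap$/;
        $cmaps{$1} = "$cmap_dir/$file";
    }
    closedir $dh;
    return \%cmaps;
}
```

Users could then drop additional .cmap files into that directory and have them picked up automatically.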

PhilterPaper commented 5 years ago

There are some Perl tools for dealing with the CMap files at https://github.com/adobe-type-tools/perl-scripts; however, the output still doesn't look anything like the "cmap" files shipped with PDF::Builder. The PDF::Builder files contain Perl code mappings of Unicode-to-glyphID and vice versa, while the Adobe files seem to be just CID ranges, and are in some sort of PostScript format (even after "conversion" with cmap-tool.pl). I just don't see anything listed among the tools that claims to turn the file into a PDF::Builder-compatible cmap file. So, I'm just going to have to declare myself stuck on this one until someone familiar with CMaps comes along and can explain what such a file is supposed to be. There doesn't seem to be the information needed for the CMap file (Unicode to/from GlyphID) in the Adobe files. Perhaps it can be gotten from a font in some manner?
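For reference, the cidrange blocks in those Adobe CMap files are simple enough to expand mechanically; a rough sketch follows (not a shipped tool, and note the source codes in a cidrange are character codes in the CMap's own encoding, not necessarily Unicode, which is exactly the missing piece discussed here):

```perl
use strict;
use warnings;

# Expand begincidrange/endcidrange blocks of an Adobe CMap file into
# explicit code-to-CID pairs. Lines look like: <0020> <007e> 231
# (source range low, high, and the CID assigned to the low end).
sub parse_cidranges {
    my ($text) = @_;
    my %code2cid;
    my $in = 0;
    for my $line (split /\n/, $text) {
        if ($line =~ /begincidrange/) { $in = 1; next; }
        if ($line =~ /endcidrange/)   { $in = 0; next; }
        next unless $in;
        if (my ($lo, $hi, $cid) = $line =~ /^<([0-9a-fA-F]+)>\s+<([0-9a-fA-F]+)>\s+(\d+)/) {
            my ($l, $h) = (hex $lo, hex $hi);
            $code2cid{$_} = $cid + ($_ - $l) for $l .. $h;
        }
    }
    return \%code2cid;
}
```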

PhilterPaper commented 5 years ago

Date: Sun, 3 Mar 2019 14:23:23 +0100
From: "Alfred Reibenschuh" <alfredreibenschuh [...] gmx.net>

replace the section

/lib/PDF/API2/Resource/CIDFont/TrueType/FontFile.pm, in sub new:

-------------------------------------------

if (defined $data->{cff}->{ROS}) {
    my %cffcmap = (
        'Adobe:Japan1'   => 'japanese',
        'Adobe:Korea1'   => 'korean',
        'Adobe:CNS1'     => 'traditional',
        'Adobe:GB1'      => 'simplified',
        'Adobe:Identity' => 'identity', # NEW CMAP
    );
    my $ccmap = _look_for_cmap($cffcmap{"$data->{cff}->{ROS}->[0]:$data->{cff}->{ROS}->[1]"});
    $data->{u2g} = $ccmap->{u2g};
    $data->{g2u} = $ccmap->{g2u};
} else

-------------------------------------------

and create a new cmap file:

/lib/PDF/API2/Resource/CIDFont/CMap/identity.cmap

-------------------------------------------

$cmap->{identity} = {
    'ccs' => [
        'Adobe',    # registry
        'Identity', # ordering
        0,          # supplement
    ],
    'cmap' => { # perl unicode maps to adobe cmap (TBD)
        'ident' => [
            'ident',
            'Identity'
        ],
    },
    'g2u' => [
        0x0000 .. 0xffff
    ],
    'u2g' => {
        map { $_ => $_ } (0x0000 .. 0xffff)
    }
};

-------------------------------------------

It could be that the g2u/u2g data does not work, and you have to include the raw array and map instead.


Alfred Reibenschuh

PhilterPaper commented 5 years ago

Sun Mar 03 11:29:42 2019 PMPERRY@cpan.org - Correspondence added

Hi Alfred,

Good to see you're still active in this area!

Unless "Identity" is a very different beast than the other CMaps, I suspect that this won't work. For example, u2g[0x20] = space = GId[1], and the map is not monotonically increasing. Can someone show that it does work?

Per https://github.com/PhilterPaper/Perl-PDF-Builder/issues/98, I have found some CMap sources and tools from Adobe, but they don't seem to have the u2g/g2u information needed for the .cmap files used here. Do you have an idea of how to generate .cmap files? I'm thinking in terms of a tool that would take an Adobe CMap file that someone wants to use and generating a .cmap file from it. In any case, it would probably still be a good idea to add Adobe:Identity to the list.

regards, Phil Perry

terefang commented 5 years ago

Not that active, but this got me interested because I had similar problems in Java.

I have had a look at: https://github.com/adobe-type-tools/cmap-resources/blob/master/Adobe-Identity-0/CMap/Identity-H

which suggested my fix, but dumping the cmap table from Noto suggests otherwise.

Hmm ... the original code was written under the premise that CFF/OTF files did not contain a cmap but had everything stuffed into the cff table.

For building cmaps I suggest reading: http://blogs.adobe.com/CCJKType/2012/03/building-utf32-cmaps.html

A simpler fix would be checking for Adobe-Identity-0 and the existence of a cmap, and using that instead if present.

-- Alfred

PhilterPaper commented 5 years ago

OK, I can see that the "cidrange" blocks are giving the Unicode ("u") and equivalent CID/GlyphID ("g") values. However, that's not usable for PDF::Builder, as currently implemented. It needs explicit u2g[] and g2u[] mappings, and in Perl code. I kind of hate to use a preprocessor to expand those tens of thousands of entries into explicit u2g and g2u entries, although it could be done. I wonder if it might be better to handle "Identity" as a special case, where u=g (or whatever the proper mapping turns out to be), and g=fn_u2g(u) and u=fn_g2u(g) rather than using array lookups.
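If Identity really were a one-to-one u = g mapping, the computed special case would be trivial (a sketch only; the premise itself may not hold for real fonts):

```perl
use strict;
use warnings;

# Hypothetical computed mappings for "Identity", replacing a
# 64K-entry table with functions (only valid if u == g actually holds).
sub identity_u2g { my ($u) = @_; return $u; }  # Unicode -> glyph ID
sub identity_g2u { my ($g) = @_; return $g; }  # glyph ID -> Unicode
```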

I haven't tried your code yet, but I'm worried about whether "Identity" actually has G+0 = U+0000, G+1 = U+0001, etc. As for the other CMaps, G+0 is .notdef, and the mapping to ASCII starts with G+1 at U+0020. "Identity" would indeed be a horse of a different color if it is this way. I guess there's only one way to find out for sure!

Existing .cmap files suggest that there is really nothing to be gained by replacing them with such functions, as there is very little pattern to u2g and g2u, and the functions would be just as large as the present tables. In that case, a one-off identity.cmap in the usual format, along with a tool to convert online CMap resources to .cmap files, might be just as good (rather than adding code to treat Identity as a special case). I'll have to think about it -- it might work if PDF::Builder scans for .cmap files during startup, and builds the ROS[0]:ROS[1]-to-filename list. That way, someone could add more .cmap files if they need them.

I will have to look what the "cmap" element is doing. It includes an "Identity" component, but I'm guessing that there's nothing implemented. The whole element is marked "TBD".

jgreely commented 5 years ago

FYI, I tried making Alfred's suggested change in PDF::API2, and it didn't work. NotoSansJP-Medium.otf loaded, but every character was shifted by 31 ("11" became "PP", etc), and the resulting PDF rendered very slowly.

PhilterPaper commented 5 years ago

That's off by decimal 31? That could be accounted for if G+1 should map to U+0020, etc. What happens if you change Alfred's code to

     'g2u' => [
         0xFFFD,
         0x0020 .. 0xFFFF
     ],
     'u2g' => {
         map { $_ => ($_-31) } (0x0020 .. 0xFFFF),
         0xFFFD => 0
     }

? I haven't actually tried it, but the syntax should be close.

jgreely commented 5 years ago

That's sufficient to fix ASCII, but completely wrong for kanji (月 = U+6708 renders as 玞 = U+739E). Still renders incredibly slowly as well (about 10 seconds in MacOS Preview.app compared to the similarly-sized DFKyoKaShoStd-W4.otf, which comes up instantly; PDF size is about the same).

PhilterPaper commented 5 years ago

So does this Noto Sans JP cover just ASCII and CJK (or even just Japanese alphabets), rather than all scripts? In other words, just a subset of all characters? I thought the intent of Noto was to provide glyphs for every Unicode character, in order to avoid tofus. Maybe that was just too big a font file. And why is it claiming "Identity" mapping 0-ffff:0-ffff if that's not what it's providing?

Let me try to take a look at the contents of this font file and see if my font-dump routines (in examples/) give any clue as to what's going on.

jgreely commented 5 years ago

FontExplorer Pro says it has 18,570 characters and covers Cyrillic, Hangul, Bopomofo, Greek, IPA extensions, etc, so most likely it's just using the Japanese flavor of CJK glyphs. Looks like they build separate localized fonts for each target country.

PhilterPaper commented 5 years ago

OK, I finally got examples/022_truefonts to dump all the glyphs in the NotoSansJP-Regular.otf font file. It reports 17802 glyphs. They are rather scattered around, with large gaps between some CJK characters (when ordered by CID), and others in large contiguous chunks. They do not appear to be in any order per the Unicode standard. For example, I see Katakana (U+30Dx neighborhood) appearing near the end at around G+65300 (0xFF10).

I must conclude that either 1) this Noto file is in fact not ordered in some sort of "Identity" one-to-one mapping, or 2) Alfred's .cmap file is missing something or broken, or 3) I/we just don't understand what "Identity" is supposed to be and do when it comes to CMaps.

Since Unicode does not define every single code point between U+0000 and U+FFFF, I would expect gaps in the CID (G+nnnnn) sequence, or possibly some sort of dummy placeholders in the gaps. If this font is claiming almost 18,000 glyphs, that should cover much of Unicode, but I think there should probably be many more than that. My Unicode 3.0 book claims 49,194 characters (I'll take their word for it -- I'm not going to count the damned things).

Anyone out there have any ideas on where to go from here?

jgreely commented 5 years ago

Note: same error with Adobe's free SourceHanSans-Medium.otf.

I'm just stabbing blindly here, but it looks like you need to parse the cmap table directly to get correct results.

I've attached a crude Perl script that parses the output of ttx from Adobe's Font Development Kit and generates a suitable identity.cmap for replacing the stub Alfred suggested. This mostly works with NotoSansJP-Medium.otf and SourceHanSans-Medium.otf, but has to be generated for each font; they're quite different internally. Interestingly, in both cases, the character that comes out wrong is 金 (U+91D1) (my test script generates locale-specific calendars, so the only kanji are the days of the week; I'm sure there are a lot of other CJK errors I'm not seeing, but 6 out of 7 worked).

ttx -q -t cmap -o - NotoSansJP-Regular.otf 2>/dev/null | ./cmap2perl.pl > identity.cmap

#!/usr/bin/env perl

use strict;

my %g2u;
my $incmap = 0;
while (<>) {
    if (m|<cmap_format_4 platformID="0"|) {
        $incmap++;
        next;
    } elsif (m|</cmap_format_4>|) {
        last;
    }
    # <map code="0x29" name="cid00010"/><!-- RIGHT PARENTHESIS -->
    my ($code,$id) = m|<map code="0x([0-9a-fA-F]+)" name="cid(\d+)"|;
    $g2u{$id} = sprintf("0x%04X",hex($code));
}

print<<'EOF';
$cmap->{identity}={ 
    'ccs' => [ 
         'Adobe',      # registry
         'Identity',    # ordering
         0,               # supplement
     ],
     'cmap' => { # perl unicode maps to adobe cmap (TBD)
         'ident'         =>  [
              'ident',
              'Identity'
         ],
     },
     'g2u' => [
EOF
foreach my $id (sort {$a <=> $b} keys %g2u) {
    print "        $g2u{$id},\n";
}
print <<EOF;
     ],
     'u2g' => {
EOF
foreach my $id (sort {$g2u{$a} cmp $g2u{$b}} keys %g2u) {
    printf("        '%d' => '%d',\n",oct($g2u{$id}),$id);
}
print <<EOF;
     }
 };
EOF

PhilterPaper commented 5 years ago

If there is no one 'identity.cmap', but a new one has to be generated for each font file, well, that's absurd. We might as well just supply some tools with PDF::Builder and tell users to build their own .cmap files. I wonder if it would be better to just get rid of the whole .cmap file business. All it seems to be is a mapping of Unicode number to CID (u2g) and the corresponding inverse (g2u). Isn't all that information available in a TTF or OTF font file anyway? Or is the Unicode number for each glyph missing in at least some font files? Was the assumption that a "producer" might not have the appropriate font file in hand? If so, where do character widths and other information come from?

It makes me wonder if the supplied Japanese, Korean, and (two flavors of) Chinese .cmap files have errors in them, especially when applied to fonts following later revisions of the CMap standards (e.g., japanese.cmap is Rev 6, while the current is Rev 7). If they're "clean", why is Identity so fubarred? Note that even Latin-1 (Windows-1252/standard encoding) doesn't get done correctly for NotoSansJP in examples/022_truefonts!

jgreely commented 5 years ago

Playing with Adobe's Perl scripts (in particular, I rewrote my quick hack to use the output of unicode-list.pl -g), it looks like:

  1. an "identity" cmap is a framework for building mappings, not an actual mapping,

  2. glyph IDs are not a continuous range from 0..$num_glyphs,

  3. A single glyph ID can map to multiple Unicode code points. For instance, in SourceHanSans, glyph 22397 => [0x6A02, 0xF914, 0xF95C, 0xF9BF].
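Point 3 means any g2u table has to pick a single representative codepoint per glyph; one plausible convention (an illustration, not what the existing .cmap files do) is to keep the lowest:

```perl
use strict;
use warnings;

# Given a u2g hash where several codepoints may share a glyph ID,
# build a g2u hash keeping the lowest codepoint for each glyph.
sub build_g2u {
    my ($u2g) = @_;
    my %g2u;
    for my $u (sort { $a <=> $b } keys %$u2g) {
        my $g = $u2g->{$u};
        $g2u{$g} //= $u;   # first (lowest) codepoint wins
    }
    return \%g2u;
}
```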

PhilterPaper commented 5 years ago

So where does this leave us? Is a .cmap file applicable only to TrueType/OpenType font file(s) that use (or permit) that particular mapping? If that's the case, this is big trouble. Do TTF/OTF contain the Unicode point itself (if there is one) for each glyph ('cmap' field)? If we can read in the Unicode value from the font file, we can dispense with the .cmap files, but then you need to have the TTF or OTF file in hand to generate the PDF (was that why .cmap was handled as a separate file?).

It is true that there is not a one-to-one mapping between glyphs and Unicode points. There are many ligatures, swashes, and other such typographic effects which have no Unicode (or map to multiple Unicode, such as a 'ct' ligature to [0x0063, 0x0074]). Unless the PDF author can specify by Glyph ID (CID), which will be font-specific, they won't even be able to put such glyphs into the PDF (that sounds like a possible PDF::Builder enhancement, to specify glyphs by ID, including choosing swashes). Note that even if an author can specify a character by glyph (font file specific), or automatic substitution can be made (such as ligatures), the Unicode value(s) are still needed so that the PDF can be searched. Online documentation for 'cmap' suggests that even within a font file, there may be multiple cmaps (mappings between Unicode and CIDs).

The problems with "Identity" aside, can anyone tell if the four .cmap files currently supplied with PDF::Builder (and PDF::API2) are still valid, and for all font files claiming to use those mappings?

It's beginning to sound like this whole thing is fast becoming a big mess. I have a tendency to overthink these things and go down the rabbit hole of getting too complex a solution, so I'm hoping there is something better to do. Alfred, care to chime in on the subject, as you know the history of it?

jgreely commented 5 years ago

Does Font::FreeType provide everything you need?

PhilterPaper commented 5 years ago

Font::FreeType and Font::FreeType::Glyph don't appear to offer anything I need (unless I'm badly misreading their descriptions). Maybe Font::TTF::Cmap will do the job? Font::TTF is already a prereq.

What we're starting with is the Unicode value of a character we wish to output, and at some point in the process we need a CID/Glyph ID to tell the font exactly which one to use (the CID is what's output in the PDF file). For now, let's ignore the issue of automatic ligatures and other substitutions (GSUB) and positioning (GPOS) vital to Indic (#35) and Arabic family languages, optional ligatures for English, etc. (and their suppression), the choice of swashes and other glyph variants, and the use of font-specific glyphs with no Unicode value. Obviously, the encoding you use for your text needs to be one that has the correct mapping available, whether it's an external (xxx.cmap) or internal (cmap table). Once the appropriate glyph is found, its metrics (width) can be obtained.

I'm hoping that the Unicode information (mapping) I need can always be found in the font file, and we fall back to the .cmap files only if the font file is not at hand during the production of the PDF. Where font metrics come from in that case is an interesting question -- is it assumed that it's a fixed-pitch font? Would it be a great hardship to eliminate .cmap files and require that the font be present for the PDF writer?
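If the mapping is indeed in the font file, pulling it out via Font::TTF (already a prerequisite) might look roughly like the following; find_ms() and the 'val' hash are part of Font::TTF's documented interface, but this sketch is otherwise untested:

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Font::TTF::Font;

my $file = shift or die "usage: $0 font.ttf|font.otf\n";
my $font = Font::TTF::Font->open($file) or die "cannot open '$file'";

# find_ms() returns the best available Unicode-ish cmap table
my $table = $font->{'cmap'}->read->find_ms
    or die "no usable cmap table in '$file'";

# 'val' is a hash keyed by codepoint, storing the glyph ID
my %u2g = %{ $table->{'val'} };
my @g2u;
$g2u[ $u2g{$_} ] //= $_ for sort { $a <=> $b } keys %u2g;

printf "%d codepoints mapped\n", scalar keys %u2g;
```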

jgreely commented 5 years ago

How would you even generate a PDF if the font file wasn't available when the script ran?

jgreely commented 5 years ago

Answering the earlier question about the coverage of Noto Sans JP, Google has multiple packaging options for Noto, but the one I picked turned out to be the most generically compatible option.

PhilterPaper commented 5 years ago

How would you even generate a PDF if the font file wasn't available when the script ran?

It's done all the time with core fonts, which are usually TTF behind the scenes. PDF::Builder supplies its own mapping (Latin-1) and metrics files, and it is assumed that the appropriate file will be found on the reader's machine. I haven't gone through the deep details, but my understanding is that the writer doesn't read (or embed) the font file.

For TTF/OTF, Type1, etc., I'm not sure if they actually require the font to be present on the machine writing the PDF. Possibly they do, otherwise where would they get the metrics?

In general, font handling is a mess. It would be nice to have (at least) TTF/OTF and Type1/PS handled identically, or at least as close to that as possible. Core should probably not be used for serious publishing work, as the encoding is limited and metrics may not match up. Bitmapped (BDF) is probably only for novelty effects.

jgreely commented 5 years ago

Well, yes, but corefont() doesn't matter for this issue, because it's guaranteed that a PDF reader has a metrics-compatible font in some format; that predates Acrobat 1.0 from when they were the PostScript core fonts that shipped with every printer. I don't remember which version of Acrobat first bundled the fonts used by cjkfont(), but it was back in the Nineties.

If you call PDF::API2::ttfont(), you are definitely loading a complete font file from disk:

sub new {
    my ($class,$pdf,$file,%opts)=@_;
    my $data={};
    die "cannot find font '$file' ..." unless(-f $file);
    my $font=Font::TTF::Font->open($file);
    ...
PhilterPaper commented 5 years ago

OK, but to get the discussion back on track,

  1. can we come up with generic *.cmap files (especially identity) that are guaranteed to work?
  2. if not, can we read any TTF/OTF file to get its mapping(s), and compare to what the PDF writer is claiming for the encoding used? This would generate u2g and g2u lists on the fly. Do Type1 font files include this information?

Core fonts aside (whose use should probably be discouraged), it looks like other font methods need to have the font files present anyway at PDF generation.

Most TTF/OTF seem to have Unicode mapping(s) available, so it might be that we have to convert any single-byte encoded text to UTF-8 on the fly. We just have to be careful to leave room for GSUB (many-to-one Unicode-to-glyph possible) and GPOS work, optional ligatures and swashes, etc., and make sure the correct text is available for searching (e.g., 'ffl' ligature searched as f+f+l). Embedding fonts is good, too (since synthetic fonts can be embedded, perhaps Type1 can be too?).
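Converting single-byte input on the fly could be as simple as running it through the core Encode module before glyph lookup (a sketch; the encoding names are illustrative):

```perl
use strict;
use warnings;
use Encode qw(decode);

# Normalize single-byte-encoded text to Perl's internal Unicode form,
# so a single Unicode-based code path can serve all input text.
sub to_unicode {
    my ($bytes, $encoding) = @_;   # e.g. 'cp1252', 'latin1'
    return decode($encoding, $bytes);
}
```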

jgreely commented 5 years ago
  1. No, I don't think so. Apart from identity, the old ones seem to work well enough (I've printed entire Japanese novels with PDF::API2 without spotting any incorrect kanji mappings), so maybe the short term solution is simply to provide a better error message for anything that falls through to "identity".

  2. Type 1 fonts had a relatively small number of well-defined encodings, with all the interesting goodies stored in separate AFM/TFM/XFM files (and all sorts of workarounds for the limitations that imposed). I'm not even sure what box my old PostScript manuals are in these days, so I can't be more specific. Fortunately that's a completely different code path that doesn't involve g2u/u2g, and anyone wanting to use Unicode text won't miss them.

terefang commented 5 years ago

new resolution proposal:

if (defined $data->{'cff'}->{'ROS'}) {

to

if ((defined $data->{'cff'}->{'ROS'}) && ($data->{'cff'}->{'ROS'}->[1] ne 'Identity')) {

Then, if Identity pops up, the OpenType cmap will be used instead.

jgreely commented 5 years ago

That works for SourceHanSans, and at least gets NotoSansJP to render alphanumerics (but not kanji).

Out of curiosity, I tried the NotoSansCJKjp packaging, and that one does work, including kanji.

PhilterPaper commented 5 years ago

Hmm. Looking at (Alfred's revised) code, is it possible that there was no real 'cmap' in the file, and thus u2g and g2u were not populated? That is, it was depending on there being an identity.cmap of some sort? Or perhaps there were other version(s) of the cmap than the expected MS version (->find_ms())?

Certainly, we could restrict .cmap usage to the four currently supported versions, and otherwise look for the font file's cmap section(s), but apparently we can't depend on find_ms always working? If there is no MS cmap, what other ones should we look for? At the least, we could check that u2g and g2u of some sort were successfully loaded, before continuing.
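Such a guard might be as simple as the following (a hypothetical helper; the names and the u2g-hash/g2u-array shapes are assumptions based on this thread):

```perl
use strict;
use warnings;

# After whatever cmap source was consulted (.cmap file or font cmap),
# verify that both maps were actually populated before continuing,
# rather than failing later with an obscure error.
sub assert_maps_loaded {
    my ($data, $file) = @_;
    die "no Unicode-to-glyph map loaded for '$file'\n"
        unless ref $data->{u2g} eq 'HASH' && keys %{ $data->{u2g} };
    die "no glyph-to-Unicode map loaded for '$file'\n"
        unless ref $data->{g2u} eq 'ARRAY' && @{ $data->{g2u} };
    return 1;
}
```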

Add: First, a correction. Apparently (according to the Font::TTF::Cmap documentation) find_ms() looks first for the MS cmap, and then tries others. If there are other cmaps, it should use one of them.

I did some playing around, dumping cmap data in TrueType/FontFile.pm in the non-.cmap section. My file NotoSansJP-Regular.otf has 5 tables in the cmap section. The first 3 are Platform 0 and the last 2 are Platform 3. All are Ver 0. The encodings are 3,4,5,1,10. ms_enc() reports that the encoding is 10, so apparently it went with the fifth table.

Dumping out each u2g entry as it was created, there are a lot of gaps in the Unicode sequence once you get past Latin-1, but the Glyph IDs increase fairly steadily (although I did see some out-of-sequence). There are many Unicode entries in 1F1xx and 2xxxx ranges. All told there are 16697 entries, both by count and getting the size of u2g. I don't know if there are any duplicate glyphs or if they're all unique.

P.S. There is a third ROS array value, some sort of Revision number or something. Currently that's ignored when deciding to use, say, japanese.cmap. I'm wondering if using a later (or just different) revision number might cause problems.

terefang commented 5 years ago

I have done some further research (reading the OTF spec, dumping/diffing font cmaps).

Seems like the age of Font::TTF is showing ...

find_ms is picking the Windows NT cmap (1 = Unicode 2.0 BMP) or MS Unicode (10 = full Unicode complement).

The cmaps 3:10 and 0:4 from Noto are identical, but I would prefer 0:4 and 0:5 over 3:10 in a Unicode context.

Also, Font::TTF (as of 1.06) does not support format 14 cmaps (Noto's 0:5) from the OTF spec.

The OTF spec still refers to Unicode 2.0 -- :-((

so actually two things need to happen (in that order):

1 - patch PDF::Builder and PDF::API2 to switch to the font cmap if the CFF encoding is Identity (as above).
2 - patch Font::TTF::Cmap find_ms to prefer (platform:encoding) 0:4 instead of 3:10/3:1.
3 - patch Font::TTF::Cmap to support format 14 cmaps and add a preference for 0:5 above all.
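The preference order in item 2/3 could be expressed as a small custom picker over the 'Tables' array that Font::TTF::Cmap exposes (each entry carrying 'Platform' and 'Encoding' keys); a rough sketch, exercised here only on plain hashrefs rather than Font::TTF itself:

```perl
use strict;
use warnings;

# Preferred (platform, encoding) pairs, best first: 0:5, then 0:4,
# then the MS Unicode tables 3:10 and 3:1.
my @preference = ([0, 5], [0, 4], [3, 10], [3, 1]);

# Return the first cmap table matching the preference order, or undef.
sub pick_cmap_table {
    my (@tables) = @_;
    for my $want (@preference) {
        for my $t (@tables) {
            return $t if $t->{Platform} == $want->[0]
                      && $t->{Encoding} == $want->[1];
        }
    }
    return undef;
}
```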


PhilterPaper commented 5 years ago

Alfred, are your observations on preferable choices based only on looking at Noto fonts, or are they universal? We don't want to change PDF::Builder and/or Font::TTF based only on Noto (which could turn out to be an odd duck). If it's Noto-specific, perhaps an option could be added to ttfont() to prefer different cmaps (and/or .cmap files) based on the specific font you're using?

  1. No problem, but I'm becoming concerned about the age and currency of the .cmap files. Most of them have later revisions out (I'm assuming ROS[2] is a revision number of some sort). Should we consider preferably using the font's built-in cmap, and using the .cmap files only as a fallback? Why are there .cmap files in the first place, and do they still serve a purpose?
  2. If this is universally good, and you can get Bob Hallissy et al. to quickly change Font::TTF, that would work. Otherwise, I think there is enough information in the font file to bypass find_ms() and use our own code to do what you want (find the appropriate cmap, and populate u2g and g2u).
  3. I have no idea what's involved with this. Again, it's probably possible to bypass Font::TTF functions and work directly with the font data, if Bob can't provide a quick update.
terefang commented 5 years ago

Alfred, are your observations on preferable choices based only on looking at Noto fonts, or are they universal? We don't want to change PDF::Builder and/or Font::TTF based only on Noto (which could turn out to be an odd duck). If it's Noto-specific, perhaps an option could be added to ttfont() to prefer different cmaps (and/or .cmap files) based on the specific font you're using?

the observations are universal based on the otf-spec 1.8.3 (https://docs.microsoft.com/en-us/typography/opentype/spec/)

the real problem is that this breaks down based on the particular font-files themselves, since some are highly opinionated relative to one ttf/otf-spec or the other, based on apple or windows with unix/linux in between, or what works best observations on various points in time from 1991 to today.

  1. No problem, but I'm becoming concerned about the age and currency of the .cmap files. Most of them have later revisions out (I'm assuming ROS[2] is a revision number of some sort). Should we consider preferably using the font's built-in cmap, and using the .cmap files only as a fallback? Why are there .cmap files in the first place, and do they still serve a purpose?
  2. If this is universally good, and you can get Bob Hallissy et al. to quickly change Font::TTF, that would work. Otherwise, I think there is enough information in the font file to bypass find_ms() and use our own code to do what you want (find the appropriate cmap, and populate u2g and g2u).
  3. I have no idea what's involved with this. Again, it's probably possible to bypass Font::TTF functions and work directly with the font data, if Bob can't provide a quick update.

I have tested Font::TTF 1.06 and, besides not supporting the format 14 cmap, it is as good as you may want it.

I have looked at the format 14 spec and it deals largely with variant glyphs, which could impact CJK and/or Arabic probably some more, but I am not an expert on this.

With the identity fix in place, the general assumption can be made that newer CFF fonts will be better supported, as they seem to have internal cmaps, so you should be fine unless someone starts using the CFF2 format (which is currently only an Apple thingy).

TTF/OTF technology stopped being an exact science at the point Adobe, MS, and Apple shared the spec.

PhilterPaper commented 5 years ago

the .cmap files should be updated to their latest revision, since they are still needed to support legacy fonts and CFF fonts without an internal cmap. AFAIK each revision builds on top of the former, so higher revisions should be compatible with lower revisions apart from the actual glyph counts.

OK, I'll take a look at whether I can generate new .cmap files from the online sources. You wouldn't happen to still have any utilities for doing this? Are the four .cmap files sufficient (when updated), or should we support others? If so, it would probably be better just to ship some generator utility and let users create their own, rather than bloating PDF::Builder with lots of new .cmap files.

I have looked at the format 14 spec and it deals largely with variant glyphs, which could impact CJK and/or Arabic probably some more, but I am not an expert on this.

Indic languages (which do a lot of ligatures, character substitutions, and moving stuff around) might also have a major impact. One thing on my plate is RT 113700, which requires implementing GSUB and GPOS.

terefang commented 5 years ago

OK, I'll take a look at whether I can generate new .cmap files from the online sources. You wouldn't happen to still have any utilities for doing this? Are the four .cmap files sufficient (when updated), or should we support others? If so, it would probably be better just to ship some generator utility and let users create their own, rather than bloating PDF::Builder with lots of new .cmap files.

My former Perl utilities are all lost to bit rot or HD failures.

You should get away with only having to update those already present.

Indic languages (which do a lot of ligatures, character substitutions, and moving stuff around) might also have a major impact. One thing on my plate is RT 113700, which requires implementing GSUB and GPOS.

The only code I know of that implements this correctly is either "harfbuzz" or "libicu" (both C/C++).

bobh0303 commented 5 years ago

so actually two [sic] things need to happen (in that order)

1 - patch to builder and API2 to switch to font cmap if cff encoding is identity (as above).
2 - patch Font::TTF::Cmap find_ms to prefer (platform:encoding) of 0:4 instead of 3:10/3:1.
3 - patch Font::TTF::Cmap to support format 14 cmaps and add preference for 0:5 above all.

Re (3):

We're open to proposals (or PRs) for how to support format 14 cmaps. The immediate problems are that

A) the current interface assumes a cmap is a mapping from a codepoint to a glyph:

val
A hash keyed by the codepoint value (not a string) storing the glyph id

Format 14 cmaps are not such, but instead map each supported Unicode Variation Sequence (UVS) to a glyph.

B) the format 14 cmap is only a partial cmap that supplies the non-default glyphs for those UVS that override the default mapping contained in the font's "Unicode cmap" (quoting the spec).

Whatever structure we decide to represent the format 14 cmap, I don't think find_ms() should ever return such, but should return the corresponding Unicode cmap.

Re (2):

I'm not sure I understand the arguments for changing the preferences as suggested -- feel free to create an issue on Font::TTF and make such arguments.

You can, of course, write your own code to find your preferred cmap. And, in case you're not in control of all the calls to find_ms(), Perl allows you to replace Font::TTF::Cmap::find_ms() with your own function.
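For anyone unfamiliar with the technique, replacing a package's sub in Perl is a one-line typeglob assignment. A minimal, self-contained sketch follows, using a stand-in package so it runs without Font::TTF installed; for the real thing you would assign to *Font::TTF::Cmap::find_ms instead.

```perl
#!/usr/bin/env perl
# Sketch of the monkey-patch technique mentioned above, demonstrated on a
# stand-in package (My::Cmap) so the example has no dependency on Font::TTF.
use strict;
use warnings;

package My::Cmap;
sub find_ms { return "original" }

package main;
no warnings 'redefine';   # suppress the "subroutine redefined" warning

# replace the package's sub with our own via typeglob assignment
*My::Cmap::find_ms = sub {
    # ... custom cmap-preference logic would go here ...
    return "patched";
};

print My::Cmap::find_ms(), "\n";   # prints "patched"
```

Every caller of find_ms() anywhere in the program now gets the replacement, which is exactly what you want when you can't change the call sites.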

PhilterPaper commented 5 years ago

Hi Bob, thanks for joining in and giving us the Font::TTF perspective!

Alfred, I've been trying to generate replacement *.cmap files, but I haven't quite gotten there. The best data seems to be the "cid2code.txt" files, but I haven't been able to quite match the existing files. For example, trying to create a new japanese.cmap, I find the cid2code file has multiple Unicode values for some CIDs, and there seems no rhyme or reason to which one is in japanese.cmap for that CID. Some CIDs have no Unicode values at all, while some value is given in japanese.cmap (which I don't know where it came from). I've checked the first few hundred entries, and found quite a few of these problems. No way am I going to check all twenty-thousand-plus entries by hand! I've got to get something that can be fully automated.
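The kind of automated check described above might be sketched as follows. This is a rough sketch only: the column layout and the '*' no-code convention are assumptions about cid2code.txt, and the sample rows are invented, not taken from Adobe's files.

```perl
#!/usr/bin/env perl
# Rough sketch of scanning a cid2code.txt-style file for problem entries.
# Assumptions (hypothetical, verify against the real file): whitespace-separated
# columns, CID in the first column, a Unicode column second, '*' = no mapping,
# comma-separated values = multiple candidate Unicode points.
use strict;
use warnings;

my $sample = <<'END';
# CID  UniJIS   (invented sample rows)
0      *
1      0020
2      0021,FF01
3      0022
END

open my $fh, '<', \$sample or die "open: $!";
my (%cid2uni, @missing, @ambiguous);
while (my $line = <$fh>) {
    chomp $line;
    next if $line =~ /^#/ or $line !~ /\S/;      # skip comments and blanks
    my ($cid, $uni) = (split /\s+/, $line)[0, 1];
    if ($uni eq '*') {                           # no mapping given at all
        push @missing, $cid;
    } elsif ($uni =~ /,/) {                      # multiple candidate values
        push @ambiguous, $cid;
    } else {
        $cid2uni{$cid} = hex $uni;               # clean one-to-one entry
    }
}
printf "mapped=%d missing=%d ambiguous=%d\n",
    scalar(keys %cid2uni), scalar(@missing), scalar(@ambiguous);
```

Bucketing entries this way would at least let a script report exactly which CIDs need a human decision, instead of checking twenty-thousand-plus entries by hand.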

bobh0303 commented 5 years ago

I did some playing around, dumping cmap data in TrueType/FontFile.pm in the non-.cmap section. My file NotoSansJP-Regular.otf has 5 tables in the cmap section. The first 3 are Platform 0 and the last 2 are Platform 3. All are Ver 0. The encodings are 3,4,5,1,10. ms_enc() reports that the encoding is 10, so apparently it went with the fifth table.

FWIW, I retrieved https://github.com/googlei18n/noto-cjk/blob/master/NotoSansJP-Regular.otf and note there are 6 cmaps in it, the sixth being a Platform 1 (Macintosh) subtable in addition to the 5 listed above.

Note that the 0/3 and 3/1 maps are the exact same data in the font (not just a copy -- the two cmap headers point to the same location in the file), and similarly for 0/4 and 3/10.

PhilterPaper commented 5 years ago

Hmm. I don't recall seeing that last one (platform 1). Does Font::TTF properly handle it? It may be a moot point, if Apple doesn't want it used any more. As I said, Font::TTF apparently chose 3/10.

terefang commented 5 years ago

looking at https://docs.microsoft.com/en-us/typography/opentype/spec/name#platform-specific-encoding-and-language-ids-unicode-platform-platform-id--0

So the preference order would be 0/6, 0/4, 3/10, 0/3, 3/1 for script fonts, and 3/0 for symbol fonts. In Microsoft environments the preference for script fonts would change to 0/6, 3/10, 0/4, 3/1, 0/3.
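The preference-order idea above amounts to a first-match scan over the font's cmap subtables. A minimal sketch follows; the mock subtables imitate Font::TTF::Cmap's 'Platform'/'Encoding' keys (an assumption here -- real code would walk the cmap table's 'Tables' array instead of a hand-built list).

```perl
#!/usr/bin/env perl
# Sketch of first-match cmap selection by preference list. The subtables below
# are mock data; in real code they would come from the font's cmap table.
use strict;
use warnings;

my @tables = (
    { Platform => 3, Encoding => 1  },
    { Platform => 0, Encoding => 4  },
    { Platform => 3, Encoding => 10 },
);

# non-Microsoft preference order for script fonts, per the comment above
my @prefs = ([0, 6], [0, 4], [3, 10], [0, 3], [3, 1]);

sub pick_cmap {
    my ($tables, $prefs) = @_;
    for my $p (@$prefs) {               # most-preferred pair first
        for my $t (@$tables) {
            return $t if $t->{Platform} == $p->[0]
                      && $t->{Encoding} == $p->[1];
        }
    }
    return undef;   # caller falls back to find_ms() or a .cmap file
}

my $best = pick_cmap(\@tables, \@prefs);
printf "chose %d/%d\n", $best->{Platform}, $best->{Encoding};   # chose 0/4
```

Swapping in the Microsoft ordering is just a different @prefs list, which suggests the order could be a user-suppliable parameter rather than hard-coded.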

terefang commented 5 years ago

I recommend using the Adobe-*-UCS2 files from https://github.com/apache/pdfbox/tree/trunk/fontbox/src/main/resources/org/apache/fontbox/cmap for cmap generation.

PhilterPaper commented 5 years ago

I recommend using the Adobe-*-UCS2 files from https://github.com/apache/pdfbox/tree/trunk/fontbox/src/main/resources/org/apache/fontbox/cmap for cmap generation.

  1. Is, for example, Adobe-Japan1-UCS2 the follow-on to Adobe-Japan1-6? They don't seem to be in the same format, and other sources have an Adobe-Japan1-7 (though not with the information I need).
  2. In Adobe-Japan1-UCS2, I don't see anything that quite looks like the CID-to-Unicode mapping I need. Do you know how to read this file? The file I was trying to use has a decimal CID, with a variety of Unicode (and other) values for each (although some entries are missing any value, and some have multiple Unicode points).

Clear? Huh! Why a four-year-old child could understand this report. Run out and find me a four-year-old child. I can't make head or tail out of it. -- Groucho Marx (Rufus T. Firefly, Duck Soup)

PhilterPaper commented 5 years ago

Bob Hallissy on RT system said:

<URL: https://rt.cpan.org/Ticket/Display.html?id=128768 >

AFAICT, the request is to change the order in which find_ms() locates a cmap, specifically to prefer the PlatformID=0 ("Unicode") cmaps over the PlatformID=3 ("Microsoft") cmaps.

At this point I haven't seen any reasons put forward as to why this change was being requested, and since there are lots of fonts that have Microsoft cmaps but no Unicode cmaps, I see no benefit in making the change.

Additionally, application writers can, of course, write their own code to find their preferred cmap. And, in case they're not in control of all the calls to find_ms(), they can replace Font::TTF::Cmap::find_ms() with their own function.

RT 128768 has been rejected, so it sounds like Font::TTF will not be modified to accommodate Alfred's suggested revision. Where does that leave us? Should we replace the find_ms() method with our own version, and if so, what is the justification? Keep in mind that PDF::Builder can be used on a wide range of platforms, including Linux, Mac, and Windows, so a fixed lookup sequence may not be desirable. I assume that the Unicode-to-CID mapping is only of concern to the PDF writer (producer) and irrelevant to the reader (at least if the font is embedded... what happens if the font has to be on the reader's (consumer's) machine?).

Further thoughts, given that Font::TTF itself is not going to change? And does this tie in at all with the *.cmap file usage?

bobh0303 commented 5 years ago

If there is a functional reason to change find_ms() I'm open to doing it but I haven't heard a reason yet.

I guess it boils down to what behavior or problem are you having with the current code? It sounds to me like it is just a personal preference or bias against PlatformID=3. What am I not seeing?

PhilterPaper commented 5 years ago

I have no idea. Only Alfred (@terefang) can tell us the reasons for his request. My skin in the game is ensuring that PDF::Builder works correctly for as many people as possible -- I'm open to any and all reasonable ideas.

bobh0303 commented 5 years ago

Here ya go... https://gist.github.com/bobh0303/9e29be8b3dbf161a0a1cfe69f8c77153

PhilterPaper commented 5 years ago

So Alfred, would Bob's find_cmap() routine (I think it stands independently of Font::TTF innards) do the job for you? Would the lookup list be an option to ttfont() or a new parameter or something else? And how would you prefer to handle MS vs non-MS systems? Does this pertain to just the producer (writer), or also to the consumer (reader) depending on whether or not the font is embedded? What should default behavior be (e.g., same as today)?

Once the selection of a built-in cmap is out of the way, what remains to be done with the four *.cmap files? I still don't have a good update for them. Should they continue to be used in preference to any built-in font cmap, but just for those 4 specific cases?

bobh0303 commented 5 years ago

Bob's find_cmap() routine (I think it stands independently of Font::TTF innards)

Actually it does one thing that depends on knowledge of Font::TTF innards: it sets an internal object variable (' mstable') so as to remember the cmap that was actually found. This variable is then used by Font::TTF::Cmap::ms_lookup() to find glyphs using the designated cmap.

I did this on the assumption that your code is actually using the ms_lookup() function. If it is not, it would be cleaner to adjust the code in the gist to not set $cmap->{' mstable'} and then it would be completely free of any innards knowledge.

terefang commented 5 years ago

yes

On Sat, Mar 16, 2019 at 1:12 AM Phil Perry notifications@github.com wrote:

So Alfred, would Bob's find_cmap() routine (I think it stands independently of Font::TTF innards) do the job for you? Would the lookup list be an option to ttfont() or a new parameter or something else? And how would you prefer to handle MS vs non-MS systems? Does this pertain to just the producer (writer), or also to the consumer (reader) depending on whether or not the font is embedded? What should default behavior be (e.g., same as today)?


PhilterPaper commented 5 years ago

So how about tweaking the PDF::Builder code to do the following:

  1. Unless a (new) option to ttfont() says otherwise, try to use the existing 4 .cmap files* for CJK fonts.
  2. If the .cmap files fail for CJK fonts (and in all cases for non-CJK fonts), try the user-supplied list of internal cmaps (if given as a new option; otherwise fall back to a default list), using Bob's gist code (is it then unnecessary to set ' mstable'?).
  3. If still no joy, call find_ms() as before.

Would this work? What I'm still unclear about is when to specify a "Microsoft" cmap list and when to specify a "non-Microsoft" cmap list. Should PDF::Builder attempt to discover what platform it's running on? That could work if the font is to be embedded, but what if we specify "-noembed"? I suspect that the current code would favor the Microsoft cmap list. Should we ask the user for both lists to use, rather than just one, if they choose to specify their own cmap list, or do we leave the burden of figuring out what platform to the user (who supplies only the appropriate list)?

Should we just always force fonts to be embedded? That is, make "-noembed" a no-op? Are there likely to be CID (glyph number) mismatches if the font is not embedded?

so the preference order would be 0/6, 0/4, 3/10, 0/3, 3/1 for script fonts and 3/0 for symbol fonts. in microsoft environments the preference would change to 0/6, 3/10, 0/4, 3/1, 0/3 for script fonts

These would be the default cmap lists: always 0/6 first, followed by 0/4 and 3/10 (reversed in an MS environment), followed by 0/3 and 3/1 (reversed in an MS environment). Would a symbol font have only 3/0 anyway (so we can tack it onto the end of both lists), or do we need to look at the font and see whether it's a symbol font?

* I still need to update the .cmap files
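The use/fallback sequence proposed in steps 1-3 above could be sketched as a chain of strategies tried in order, each stage returning undef to fall through to the next. The stages here are stubs for illustration, not the real PDF::Builder or Font::TTF calls.

```perl
#!/usr/bin/env perl
# Sketch of the proposed lookup chain: try the .cmap files (if enabled), then a
# preferred platform/encoding list, then find_ms(). Stage callbacks are stubs.
use strict;
use warnings;

sub choose_cmap {
    my (%opt) = @_;
    my @chain = (
        sub { $opt{usecmf} ? $opt{cmf_lookup}->() : undef },  # 1: .cmap files
        sub { $opt{pref_lookup}->() },                        # 2: cmap preference list
        sub { $opt{find_ms}->() },                            # 3: find_ms() as before
    );
    for my $stage (@chain) {
        my $cmap = $stage->();
        return $cmap if defined $cmap;   # first stage to succeed wins
    }
    return undef;                        # nothing workable found
}

# toy demo: .cmap files disabled, preference list fails, find_ms() succeeds
my $got = choose_cmap(
    usecmf      => 0,
    cmf_lookup  => sub { "cmf" },
    pref_lookup => sub { undef },
    find_ms     => sub { "ms"  },
);
print "$got\n";   # ms
```

Structuring it this way keeps the open questions (which lists, which order, MS vs non-MS) confined to how the chain is configured rather than baked into the lookup logic.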

PhilterPaper commented 5 years ago

I can't move forward with implementing this until I have an idea of what would be considered proper behavior of the code.

  1. Is there any point in continuing to allow -noembed if (?) it permits a mismatch of CIDs and glyphs?
  2. If the font must be embedded, does it make sense for the code to determine MS/non-MS platform, and choose the appropriate default cmap list? If the user gives two lists, it would pick the right one; if the user gives one list, assume they already know which platform they're on.
  3. Is the given use/fallback list (4 given .cmaps unless "no, don't", user-specified cmap list, default cmap list, find_ms() function) a good one? Does it cover all the bases?
  4. For symbol fonts, is it safe to simply add "3/0" on to the end of the cmap list, or should the code do something to determine if this is a symbol font, and treat it differently? That is, will a symbol font have only a 3/0 cmap, and no non-symbol font will have a 3/0 cmap?

And where do I find the information to build updated .cmap files? Add: Is there a guarantee that a given Unicode point will always map to the same CID (and thus, correct glyph)? Per point 1, why would there be a fixed mapping for four specific CJK alphabets, but not for other alphabets? I'm wondering why we shouldn't use the built-in cmaps for those fonts, unless the problem is that they specifically don't have cmaps.

PhilterPaper commented 5 years ago

I have updated FontFile.pm to look first in the .cmap files (if -usecmf=>1), then in the list of Platform/Encodings given by -cmaps (or its default list), then via find_ms(), and finally back in the .cmap files (regardless of the -usecmf setting). One of those should hopefully find a workable CMap! There is also a -debug flag to show diagnostic information during the hunt for a CMap. I am leaving the ticket open for now as a reminder that the four .cmap files still need updating, and I have not yet found a suitable data source (Unicode-to-glyphID mapping) from which to generate new .cmap files. Many thanks to Alfred Reibenschuh (original PDF::API2 author) and Bob Hallissy (Font::TTF author) for their assistance.

Update: The issue of whether something should be done with -noembed remains open. Perhaps it should be no-op'd, but I'd like to get some feedback first.

PhilterPaper commented 5 years ago

I found some sources from Adobe and I think I have the four .cmap files updated to current status (2019-05-29). Therefore I am closing this issue (in GitHub, 3.016 release). Please feel free to reopen (or open a new one) if you find a better update to these files.