Closed PhilterPaper closed 5 years ago
In FontFile.pm (as well as CJKFont.pm, if that one is called), the ROS (argument to _look_for_cmap()) is 'Adobe:Identity'. This is not one of the four supported CMaps: Adobe:Japan1, Adobe:Korea1, Adobe:CNS1 (traditional), and Adobe:GB1 (simplified). https://github.com/adobe-type-tools/cmap-resources seems to contain the up-to-date CMaps, so I'll have to take a look at what's involved in updating PDF::Builder's list of CMaps from there (open source, just have to keep the copyrights). At the least, add Adobe:Identity, and possibly some others (as well as updates to the existing four). There's a lot of stuff there, so I want to understand what they are before I bloat the size of PDF::Builder with unnecessary CJK-related files. At the least, this is adding some more files to the CMap directory, and updating the internal lists of available CMaps in FontFile.pm and CJKFont.pm.
We may want to consider auto-generating the list of available CMAPs from reading the CMap directory (on the fly), so that users can add whatever CMAPs they want to that directory, rather than shipping PDF::Builder with everything under the sun.
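As a rough illustration of that idea, the available-CMap list could be built at load time by scanning the directory. This is only a sketch; the helper name and directory layout are hypothetical, not PDF::Builder's actual code:

```perl
use strict;
use warnings;

# Hypothetical helper: build a name => path list of every .cmap file
# found in a given CMap directory, so users can drop in their own files.
sub list_available_cmaps {
    my ($dir) = @_;
    opendir(my $dh, $dir) or die "cannot open CMap directory '$dir': $!";
    my %cmaps;
    for my $file (readdir $dh) {
        next unless $file =~ /^(.+)\.cmap$/;
        $cmaps{$1} = "$dir/$file";   # e.g. 'japanese' => "$dir/japanese.cmap"
    }
    closedir $dh;
    return %cmaps;
}
```

The point is that shipping new .cmap files then becomes optional: anything present in the directory is automatically available.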
There are some Perl tools for dealing with the CMap files at https://github.com/adobe-type-tools/perl-scripts, however, the output still doesn't look anything like the "cmap" files shipped with PDF::Builder. The PDF::Builder files contain Perl code mappings of Unicode-to-glyphID and vice-versa, while the Adobe files seem to be just CID ranges, and are in some sort of PostScript format (even after "conversion" with cmap-tool.pl). I just don't see anything listed among the tools that claims to turn the file into a PDF::Builder-compatible cmap file. So, I'm just going to have to declare myself stuck on this one until someone familiar with CMaps comes along and can explain what such a file is supposed to be. There doesn't seem to be the information needed for the CMap file (Unicode to/from GlyphID) in the Adobe files. Perhaps it can be gotten from a font in some manner?
Date: Sun, 3 Mar 2019 14:23:23 +0100
From: "Alfred Reibenschuh" <alfredreibenschuh [...] gmx.net>
replace the section
if (defined $data->{cff}->{ROS}) {
    my %cffcmap = (
        'Adobe:Japan1'   => 'japanese',
        'Adobe:Korea1'   => 'korean',
        'Adobe:CNS1'     => 'traditional',
        'Adobe:GB1'      => 'simplified',
        'Adobe:Identity' => 'identity', # NEW CMAP
    );
    my $ccmap = _look_for_cmap($cffcmap{"$data->{cff}->{ROS}->[0]:$data->{cff}->{ROS}->[1]"});
    $data->{u2g} = $ccmap->{u2g};
    $data->{g2u} = $ccmap->{g2u};
} else
and create a new cmap file:
$cmap->{identity}={
'ccs' => [
'Adobe', # registry
'Identity', # ordering
0, # supplement
],
'cmap' => { # perl unicode maps to adobe cmap (TBD)
'ident' => [
'ident',
'Identity'
],
},
'g2u' => [
0x0000 .. 0xffff
],
'u2g' => {
map { $_ => $_ } (0x0000 .. 0xffff)
}
};
it could be that the g2u/u2g data does not work and you have to include the raw array and map instead.
Alfred Reibenschuh
Sun Mar 03 11:29:42 2019 PMPERRY@cpan.org - Correspondence added
Hi Alfred,
Good to see you're still active in this area!
Unless "Identity" is a very different beast than the other CMaps, I suspect that this won't work. For example, u2g[0x20] = space = GID[1], and the map is not monotonically increasing. Can someone show that it does work?
Per https://github.com/PhilterPaper/Perl-PDF-Builder/issues/98, I have found some CMap sources and tools from Adobe, but they don't seem to have the u2g/g2u information needed for the .cmap files used here. Do you have an idea of how to generate .cmap files? I'm thinking in terms of a tool that would take an Adobe CMap file that someone wants to use and generate a .cmap file from it. In any case, it would probably still be a good idea to add Adobe:Identity to the list.
regards, Phil Perry
not that active but this got me interested because i had similar problems in java.
i have had a look at: https://github.com/adobe-type-tools/cmap-resources/blob/master/Adobe-Identity-0/CMap/Identity-H
which suggested my fix but dumping the cmap table from noto suggests otherwise.
hmm ... the original code was written under the premise that cff/otf files did not contain a cmap but have everything stuffed into the cff table.
for building cmaps i suggest reading: http://blogs.adobe.com/CCJKType/2012/03/building-utf32-cmaps.html
a simpler fix would be checking for Adobe-Identity-0 and the existence of a cmap and use that instead if present.
-- Alfred
OK, I can see that the "cidrange" blocks are giving the Unicode ("u") and equivalent CID/GlyphID ("g") values. However, that's not usable for PDF::Builder as currently implemented. It needs explicit u2g[] and g2u[] mappings, and in Perl code. I kind of hate to use a preprocessor to expand those tens of thousands of entries into explicit u2g and g2u entries, although it could be done. I wonder if it might be better to handle "Identity" as a special case, where u=g (or whatever the proper mapping turns out to be), and g=fn_u2g(u) and u=fn_g2u(g) rather than using array lookups.
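A minimal sketch of that special-case idea. The shift shown here (G+1 at U+0020, G+0 as .notdef) is only an unverified hypothesis about what "Identity" might mean, and fn_u2g/fn_g2u are hypothetical names; the point is only the shape of a function-based rather than table-based mapping:

```perl
use strict;
use warnings;

# Hypothetical computed "Identity" mapping: assumes G+0 = .notdef and
# that G+1 starts at U+0020 -- an unverified guess, for illustration only.
sub fn_u2g { my ($u) = @_; return $u >= 0x0020 ? $u - 31 : 0      }  # 0 = .notdef
sub fn_g2u { my ($g) = @_; return $g >= 1      ? $g + 31 : 0xFFFD }  # U+FFFD for .notdef
```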
I haven't tried your code yet, but I'm worried about whether "Identity" actually has G+0 = U+0000, G+1 = U+0001, etc. As for the other CMaps, G+0 is .notdef, and the mapping to ASCII starts with G+1 at U+0020. "Identity" would indeed be a horse of a different color if it is this way. I guess there's only one way to find out for sure!
Existing .cmap files suggest that there is really nothing to be gained by replacing them with such functions, as there is very little pattern to u2g and g2u, and the functions would be just as large as the present tables. In that case, a one-off identity.cmap in the usual format, along with a tool to convert online CMap resources to .cmap files, might be just as good (rather than adding code to treat Identity as a special case). I'll have to think about it -- it might work if PDF::Builder scans for .cmap files during startup, and builds the ROS[0]:ROS[1]-to-filename list. That way, someone could add more .cmap files if they need them.
I will have to look what the "cmap" element is doing. It includes an "Identity" component, but I'm guessing that there's nothing implemented. The whole element is marked "TBD".
FYI, I tried making Alfred's suggested change in PDF::API2, and it didn't work. NotoSansJP-Medium.otf loaded, but every character was shifted by 31 ("11" became "PP", etc), and the resulting PDF rendered very slowly.
That's off by decimal 31? That could be accounted for by G+1 should map to U+0020, etc. What happens if you change Alfred's code to
'g2u' => [
0xFFFD,
0x0020 .. 0xFFFF
],
'u2g' => {
map { $_ => ($_-31) } (0x0020 .. 0xFFFF),
0xFFFD => 0
}
? I haven't actually tried it, but the syntax should be close.
That's sufficient to fix ASCII, but completely wrong for kanji (月 = U+6708 renders as 玞 = U+739E). Still renders incredibly slowly as well (about 10 seconds in MacOS Preview.app compared to the similarly-sized DFKyoKaShoStd-W4.otf, which comes up instantly; PDF size is about the same).
So does this Noto Sans JP cover just ASCII and CJK (or even just Japanese alphabets), rather than all scripts? In other words, just a subset of all characters? I thought the intent of Noto was to provide glyphs for every Unicode character, in order to avoid tofus. Maybe that was just too big a font file. And why is it claiming "Identity" mapping 0-ffff:0-ffff if that's not what it's providing?
Let me try to take a look at the contents of this font file and see if my font-dump routines (in examples/) give any clue as to what's going on.
FontExplorer Pro says it has 18,570 characters and covers Cyrillic, Hangul, Bopomofo, Greek, IPA extensions, etc, so most likely it's just using the Japanese flavor of CJK glyphs. Looks like they build separate localized fonts for each target country.
OK, I finally got examples/022_truefonts to dump all the glyphs in the NotoSansJP-Regular.otf font file. It reports 17802 glyphs. They are rather scattered around, with large gaps between some CJK characters (when ordered by CID), and others in large contiguous chunks. They do not appear to be in any order per the Unicode standard. For example, I see Katakana (U+30Dx neighborhood) appearing near the end at around G+65300 (0xFF10).
I must conclude that either 1) this Noto file is in fact not ordered in some sort of "Identity" one-to-one mapping, or 2) Alfred's .cmap file is missing something or broken, or 3) I/we just don't understand what "Identity" is supposed to be and do when it comes to CMaps.
Since Unicode does not define every single code point between U+0000 and U+FFFF, I would expect gaps in the CID (G+nnnnn) sequence, or possibly some sort of dummy placeholders in the gaps. If this font is claiming almost 18,000 glyphs, that should cover much of Unicode, but I think there should probably be many more than that. My Unicode 3.0 book claims 49,194 characters (I'll take their word for it -- I'm not going to count the damned things).
Anyone out there have any ideas on where to go from here?
Note: same error with Adobe's free SourceHanSans-Medium.otf.
I'm just stabbing blindly here, but it looks like you need to parse the cmap table directly to get correct results.
I've attached a crude Perl script that parses the output of ttx from Adobe's Font Development Kit and generates a suitable identity.cmap for replacing the stub Alfred suggested. This mostly works with NotoSansJP-Medium.otf and SourceHanSans-Medium.otf, but has to be generated for each font; they're quite different internally. Interestingly, in both cases, the character that comes out wrong is 金 (U+91D1) (my test script generates locale-specific calendars, so the only kanji are the days of the week; I'm sure there are a lot of other CJK errors I'm not seeing, but 6 out of 7 worked).
ttx -q -t cmap -o - NotoSansJP-Regular.otf 2>/dev/null | ./cmap2perl.pl > identity.cmap
#!/usr/bin/env perl
use strict;
use warnings;

my %g2u;
my $incmap = 0;
while (<>) {
    if (m|<cmap_format_4 platformID="0"|) {
        $incmap++;
        next;
    } elsif (m|</cmap_format_4>|) {
        last;
    }
    next unless $incmap;   # only parse lines inside the cmap_format_4 block
    # <map code="0x29" name="cid00010"/><!-- RIGHT PARENTHESIS -->
    my ($code, $id) = m|<map code="0x([0-9a-fA-F]+)" name="cid(\d+)"|;
    next unless defined $id;   # skip non-<map> lines
    $g2u{$id} = sprintf("0x%04X", hex($code));
}
print<<'EOF';
$cmap->{identity}={
'ccs' => [
'Adobe', # registry
'Identity', # ordering
0, # supplement
],
'cmap' => { # perl unicode maps to adobe cmap (TBD)
'ident' => [
'ident',
'Identity'
],
},
'g2u' => [
EOF
foreach my $id (sort {$a <=> $b} keys %g2u) {
print " $g2u{$id},\n";
}
print <<EOF;
],
'u2g' => {
EOF
foreach my $id (sort {$g2u{$a} cmp $g2u{$b}} keys %g2u) {
printf(" '%d' => '%d',\n",oct($g2u{$id}),$id);
}
print <<EOF;
}
};
EOF
If there is no one 'identity.cmap', but a new one has to be generated for each font file, well, that's absurd. We might as well just supply some tools with PDF::Builder and tell users to build their own .cmap files. I wonder if it would be better to just get rid of the whole .cmap file business. All it seems to be is a mapping of Unicode number to CID (u2g) and the corresponding inverse (g2u). Isn't all that information available in a TTF or OTF font file anyway? Or is the Unicode number for each glyph missing in at least some font files? Was the assumption that a "producer" might not have the appropriate font file in hand? If so, where do character widths and other information come from?
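If the information is indeed in the font file, deriving both directions from a single codepoint-to-GID map is straightforward. This sketch assumes a hashref shaped like Font::TTF::Cmap's 'val' entry (codepoint => glyph ID); the helper name is hypothetical:

```perl
use strict;
use warnings;

# Hypothetical helper: given codepoint => glyph ID (as a font's cmap
# subtable provides), derive both u2g and g2u without any .cmap file.
sub maps_from_cmap_val {
    my ($val) = @_;                # hashref: Unicode codepoint => glyph ID
    my %u2g = %$val;
    my %g2u;
    for my $u (sort { $a <=> $b } keys %u2g) {
        # lowest codepoint wins if several map to the same glyph
        $g2u{ $u2g{$u} } = $u unless exists $g2u{ $u2g{$u} };
    }
    return (\%u2g, \%g2u);
}
```

The metrics question remains open, though: this only settles the mapping, not where widths come from if the font file is absent.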
It makes me wonder if the supplied Japanese, Korean, and (two flavors of) Chinese .cmap files have errors in them, especially when applied to fonts following later revisions of the CMap standards (e.g., japanese.cmap is Rev 6, while the current is Rev 7). If they're "clean", why is Identity so fubarred? Note that even Latin-1 (Windows-1252/standard encoding) doesn't get done correctly for NotoSansJP in examples/022_truefonts!
Playing with Adobe's perl scripts (in particular, I rewrote my quick hack to use the output of unicode-list.pl -g), it looks like:
- an "identity" cmap is a framework for building mappings, not an actual mapping,
- glyph IDs are not a continuous range from 0..$num_glyphs,
- a single glyph ID can map to multiple Unicode code points. For instance, in SourceHanSans, glyph 22397 => [0x6A02, 0xF914, 0xF95C, 0xF9BF].
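Given that last point, a faithful reverse map has to carry a list of codepoints per glyph. A small sketch (hypothetical helper name) of what g2u would look like in that case:

```perl
use strict;
use warnings;

# Hypothetical helper: reverse a codepoint => glyph ID map into
# glyph ID => [codepoints], tolerating many-to-one mappings like
# SourceHanSans' glyph 22397 covering 0x6A02 and its compatibility forms.
sub g2u_multi {
    my ($u2g) = @_;                # hashref: codepoint => glyph ID
    my %g2u;
    for my $u (sort { $a <=> $b } keys %$u2g) {
        push @{ $g2u{ $u2g->{$u} } }, $u;
    }
    return \%g2u;
}
```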
So where does this leave us? Is a .cmap file applicable only to TrueType/OpenType font file(s) that use (or permit) that particular mapping? If that's the case, this is big trouble. Do TTF/OTF contain the Unicode point itself (if there is one) for each glyph ('cmap' field)? If we can read in the Unicode value from the font file, we can dispense with the .cmap files, but then you need to have the TTF or OTF file in hand to generate the PDF (was that why .cmap was handled as a separate file?).
It is true that there is not a one-to-one mapping between glyphs and Unicode points. There are many ligatures, swashes, and other such typographic effects which have no Unicode point (or map to multiple Unicode points, such as a 'ct' ligature to [0x0063, 0x0074]). Unless the PDF author can specify by Glyph ID (CID), which will be font-specific, they won't even be able to put such glyphs into the PDF (that sounds like a possible PDF::Builder enhancement, to specify glyphs by ID, including choosing swashes). Note that even if an author can specify a character by glyph (font file specific), or automatic substitution can be made (such as ligatures), the Unicode value(s) are still needed so that the PDF can be searched. Online documentation for 'cmap' suggests that even within a font file, there may be multiple cmaps (mappings between Unicode and CIDs).
The problems with "Identity" aside, can anyone tell if the four .cmap files currently supplied with PDF::Builder (and PDF::API2) are still valid, and for all font files claiming to use those mappings?
It's beginning to sound like this whole thing is fast becoming a big mess. I have a tendency to overthink these things and go down the rabbit hole of getting too complex a solution, so I'm hoping there is something better to do. Alfred, care to chime in on the subject, as you know the history of it?
Does Font::FreeType provide everything you need?
Font::FreeType and Font::FreeType::Glyph don't appear to offer anything I need (unless I'm badly misreading their descriptions). Maybe Font::TTF::Cmap will do the job? Font::TTF is already a prereq.
What we're starting with is the Unicode value of a character we wish to output, and at some point in the process we need a CID/Glyph ID to tell the font exactly which one to use (the CID is what's output in the PDF file). For now, let's ignore the issue of automatic ligatures and other substitutions (GSUB) and positioning (GPOS) vital to Indic (#35) and Arabic family languages, optional ligatures for English, etc. (and their suppression), the choice of swashes and other glyph variants, and the use of font-specific glyphs with no Unicode value. Obviously, the encoding you use for your text needs to be one that has the correct mapping available, whether it's an external (xxx.cmap) or internal (cmap table). Once the appropriate glyph is found, its metrics (width) can be obtained.
I'm hoping that the Unicode information (mapping) I need can always be found in the font file, and we fall back to the .cmap files only if the font file is not at hand during the production of the PDF. In that case, where font metrics come from in that case is an interesting question -- is it assumed that it's a fixed-pitch font? Would it be a great hardship to eliminate .cmap files and require that the font be present for the PDF writer?
How would you even generate a PDF if the font file wasn't available when the script ran?
Answering the earlier question about the coverage of Noto Sans JP, Google has multiple packaging options for Noto, but the one I picked turned out to be the most generically compatible option.
How would you even generate a PDF if the font file wasn't available when the script ran?
It's done all the time with core fonts, which are usually TTF behind the scenes. PDF::Builder supplies its own mapping (Latin-1) and metrics files, and it is assumed that the appropriate font file will be found on the reader's machine. I haven't gone through the deep details, but my understanding is that the writer doesn't read (or embed) the font file.
For TTF/OTF, Type1, etc., I'm not sure if they actually require the font to be present on the machine writing the PDF. Possibly they do, otherwise where would they get the metrics?
In general, font handling is a mess. It would be nice to have (at least) TTF/OTF and Type1/PS handled identically, or at least as close to that as possible. Core should probably not be used for serious publishing work, as the encoding is limited and metrics may not match up. Bitmapped (BDF) is probably only for novelty effects.
Well, yes, but corefont() doesn't matter for this issue, because it's guaranteed that a PDF reader has a metrics-compatible font in some format; that predates Acrobat 1.0, from when they were the PostScript core fonts that shipped with every printer. I don't remember which version of Acrobat first bundled the fonts used by cjkfont(), but it was back in the Nineties.
If you call PDF::API2::ttfont(), you are definitely loading a complete font file from disk:
sub new {
    my ($class, $pdf, $file, %opts) = @_;
    my $data = {};
    die "cannot find font '$file' ..." unless (-f $file);
    my $font = Font::TTF::Font->open($file);
    ...
OK, but to get the discussion back on track,
Core fonts aside (whose use should probably be discouraged), it looks like other font methods need to have the font files present anyway at PDF generation.
Most TTF/OTF seem to have Unicode mapping(s) available, so it might be that we have to convert any single-byte encoded text to UTF-8 on the fly. We just have to be careful to leave room for GSUB (many-to-one Unicode-to-glyph possible) and GPOS work, optional ligatures and swashes, etc., and make sure the correct text is available for searching (e.g., 'ffl' ligature searched as f+f+l). Embedding fonts is good, too (since synthetic fonts can be embedded, perhaps Type1 can be too?).
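The on-the-fly conversion mentioned above can be done with the core Encode module; 'cp1252' here just stands in for whatever single-byte encoding the caller declared:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Decode declared single-byte text to Perl's internal Unicode form before
# any glyph lookup; cp1252 is only an example encoding.
my $bytes   = "\x93quoted\x94";            # Windows-1252 curly quotes
my $unicode = decode('cp1252', $bytes);    # U+201C quoted U+201D
```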
No, I don't think so. Apart from identity, the old ones seem to work well enough (I've printed entire Japanese novels with PDF::API2 without spotting any incorrect kanji mappings), so maybe the short term solution is simply to provide a better error message for anything that falls through to "identity".
Type 1 fonts had a relatively small number of well-defined encodings, with all the interesting goodies stored in separate AFM/TFM/XFM files (and all sorts of workarounds for the limitations that imposed). I'm not even sure what box my old PostScript manuals are in these days, so I can't be more specific. Fortunately that's a completely different code path that doesn't involve g2u/u2g, and anyone wanting to use Unicode text won't miss them.
new resolution proposal:
if (defined $data->{'cff'}->{'ROS'}) {
to
if ((defined $data->{'cff'}->{'ROS'}) && ($data->{'cff'}->{'ROS'}->[1] ne 'Identity')) {
then if identity pops up, the opentype cmap will be used instead.
That works for SourceHanSans, and at least gets NotoSansJP to render alphanumerics (but not kanji).
Out of curiosity, I tried the NotoSansCJKjp packaging, and that one does work, including kanji.
Hmm. Looking at (Alfred's revised) code, is it possible that there was no real 'cmap' in the file, and thus u2g and g2u were not populated? That is, it was depending on there being an identity.cmap of some sort? Or perhaps there were other version(s) of the cmap than the expected MS version (->find_ms())?
Certainly, we could restrict .cmap usage to the four currently supported versions, and otherwise look for the font file's cmap section(s), but apparently we can't depend on find_ms() always working? If there is no MS cmap, what other ones should we look for? At the least, we could check that u2g and g2u of some sort were successfully loaded before continuing.
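That last check could be a simple guard after font loading. A sketch with a hypothetical helper name and illustrative error text:

```perl
use strict;
use warnings;

# Hypothetical guard: refuse to continue if neither a .cmap file nor the
# font's internal cmap produced a usable Unicode-to-glyph mapping.
sub assert_maps_loaded {
    my ($data) = @_;
    unless (ref $data->{'u2g'} eq 'HASH' && keys %{ $data->{'u2g'} }
            && ref $data->{'g2u'}) {
        die "no usable u2g/g2u mapping found: no matching .cmap file "
          . "and no supported cmap table in the font\n";
    }
    return 1;
}
```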
Add:
First, a correction. Apparently (according to the Font::TTF::Cmap documentation) find_ms() looks first for the MS cmap, and then tries others. If there are other cmaps, it should use one of them.
I did some playing around, dumping cmap data in TrueType/FontFile.pm in the non-.cmap section. My file NotoSansJP-Regular.otf has 5 tables in the cmap section. The first 3 are Platform 0 and the last 2 are Platform 3. All are Ver 0. The encodings are 3, 4, 5, 1, 10. ms_enc() reports that the encoding is 10, so apparently it went with the fifth table.
Dumping out each u2g entry as it was created, there are a lot of gaps in the Unicode sequence once you get past Latin-1, but the Glyph IDs increase fairly steadily (although I did see some out-of-sequence). There are many Unicode entries in the 1F1xx and 2xxxx ranges. All told there are 16697 entries, both by count and by getting the size of u2g. I don't know if there are any duplicate glyphs or if they're all unique.
P.S. There is a third ROS array value, some sort of Revision number or something. Currently that's ignored when deciding to use, say, japanese.cmap. I'm wondering if using a later (or just different) revision number might cause problems.
i have done some further research (reading otf spec, dumping/diffing fonts cmap).
seems like the age of Font::TTF is showing ...
find_ms is picking the Windows NT cmap (1 = Unicode 2.0 BMP) or MS Unicode (10 = says Unicode full complement).
the cmaps 3:10 and 0:4 from noto are identical, but i would prefer 0:4 and 0:5 over 3:10 for unicode context.
also Font::TTF (as of 1.06) does not support format 14 cmaps (notos 0:5) from the otf spec.
the otf spec still refers to unicode 2.0 -- :-((
so actually two things need to happen (in that order)
1 - patch to builder and API2 to switch to font cmap if cff encoding is identity (as above).
2 - patch Font::TTF::Cmap find_ms to prefer (platform:encoding) of 0:4 instead of 3:10/3:1.
3 - patch Font::TTF::Cmap to support format 14 cmaps and add preference for 0:5 above all.
Alfred, are your observations on preferable choices based only on looking at Noto fonts, or are they universal? We don't want to change PDF::Builder and/or Font::TTF based only on Noto (which could turn out to be an odd duck). If it's Noto-specific, perhaps an option could be added to ttfont() to prefer different cmaps (and/or .cmap files) based on the specific font you're using?
the observations are universal based on the otf-spec 1.8.3 (https://docs.microsoft.com/en-us/typography/opentype/spec/)
the real problem is that this breaks down based on the particular font-files themselves, since some are highly opinionated relative to one ttf/otf-spec or the other, based on apple or windows with unix/linux in between, or what works best observations on various points in time from 1991 to today.
- No problem, but I'm becoming concerned about the age and currency of the .cmap files. Most of them have later revisions out (I'm assuming ROS[2] is a revision number of some sort). Should we consider preferring the font's built-in cmap, and using the .cmap files only as a fallback? Why are there .cmap files in the first place, and do they still serve a purpose?
- If this is universally good, and you can get Bob Hallissy et al. to quickly change Font::TTF, that would work. Otherwise, I think there is enough information in the font file to bypass find_ms() and use our own code to do what you want (find the appropriate cmap, and populate u2g and g2u).
- I have no idea what's involved with this. Again, it's probably possible to bypass Font::TTF functions and work directly with the font data, if Bob can't provide a quick update.
i have tested Font::TTF 1.06 and besides not supporting the format 14 cmap it is as good as you may want it.
i have looked at the format 14 spec and it deals largely with variant glyphs which could impact cjk and/or arabic probably some more, but i am not an expert on this.
with the identity fix in place, the general assumption can be made that newer cff-fonts will be better supported as they seem to have internal cmaps, so you should be fine unless someone starts using the cff2 format (which is currently only an apple thingy).
ttf/otf technology stopped being an exact science at the point adobe, ms, and apple shared the spec.
the .cmap files should be updated to their latest revision, since they are still needed to support legacy fonts and cff-fonts without an internal cmap. AFAIK each revision builds on top of the former, so higher revisions should be compatible with lower revisions besides the actual glyph-counts.
OK, I'll take a look at whether I can generate new .cmap files from the online sources. You wouldn't happen to still have any utilities for doing this? Are the four .cmap files sufficient (when updated), or should we support others? If so, it would probably be better just to ship some generator utility and let users create their own, rather than bloating PDF::Builder with lots of new .cmap files.
i have looked at the format 14 spec and it deals largely with variant glyphs which could impact cjk and/or arabic probably some more, but i am not an expert on this.
Indic languages (which do a lot of ligatures, character substitutions, and moving stuff around) might also have a major impact. One thing on my plate is RT 113700, which requires implementing GSUB and GPOS.
OK, I'll take a look at whether I can generate new .cmap files from the online sources. You wouldn't happen to still have any utilities for doing this? Are the four .cmap files sufficient (when updated), or should we support others? If so, it would probably be better just to ship some generator utility and let users create their own, rather than bloating PDF::Builder with lots of new .cmap files.
my former perl utilities are all lost to bit-rot or hd-failures.
you should get away with having only to update those already present.
Indic languages (which do a lot of ligatures, character substitutions, and moving stuff around) might also have a major impact. One thing on my plate is RT 113700, which requires implementing GSUB and GPOS.
the only code that i know that implements this correctly is either "harfbuzz" or "libicu" (both C/C++).
so actually two [sic] things need to happen (in that order)
1 - patch to builder and API2 to switch to font cmap if cff encoding is identity (as above).
2 - patch Font::TTF::Cmap find_ms to prefer (platform:encoding) of 0:4 instead of 3:10/3:1.
3 - patch Font::TTF::Cmap to support format 14 cmaps and add preference for 0:5 above all.
Re (3):
We're open to proposals (or PRs) for how to support format 14 cmaps. The immediate problems are that
A) the current interface assumes a cmap is a mapping from a codepoint to a glyph:
val - A hash keyed by the codepoint value (not a string) storing the glyph id
Format 14 cmaps are not such, but instead map each supported Unicode Variation Sequence (UVS) to a glyph.
B) the format 14 cmap is only a partial cmap that supplies the non-default glyphs for those UVS that override the default mapping contained in the font's "Unicode cmap" (quoting the spec).
Whatever structure we decide on to represent the format 14 cmap, I don't think find_ms() should ever return such, but should return the corresponding Unicode cmap.
Re (2):
I'm not sure I understand the arguments for changing the preferences as suggested -- feel free to create an issue on Font::TTF and make such arguments.
You can, of course, write your own code to find your preferred cmap. And, in case you're not in control of all the calls to find_ms(), Perl allows you to replace Font::TTF::Cmap::find_ms with your own function.
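That replacement can be done with a glob assignment. A sketch where my_find_ms is a hypothetical substitute (a real one would implement its own subtable preference order):

```perl
use strict;
use warnings;

# Hypothetical replacement for Font::TTF::Cmap::find_ms; here it just
# returns whatever was remembered in ' mstable' to show the mechanics.
sub my_find_ms {
    my ($self) = @_;
    # ... choose a cmap subtable by your own preference order here ...
    return $self->{' mstable'};
}

{
    no warnings 'redefine';        # find_ms may already be defined
    no strict 'refs';
    *{'Font::TTF::Cmap::find_ms'} = \&my_find_ms;
}
```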
Hi Bob, thanks for joining in and giving us the Font::TTF perspective!
Alfred, I've been trying to generate replacement *.cmap files, but I haven't quite gotten there. The best data seems to be the "cid2code.txt" files, but I haven't been able to quite match the existing files. For example, trying to create a new japanese.cmap, I find the cid2code file has multiple Unicode values for some CIDs, and there seems to be no rhyme or reason to which one is in japanese.cmap for that CID. Some CIDs have no Unicode values at all, while some value is given in japanese.cmap (which I don't know where it came from). I've checked the first few hundred entries, and found quite a few of these problems. No way am I going to check all twenty-thousand-plus entries by hand! I've got to get something that can be fully automated.
I did some playing around, dumping cmap data in TrueType/FontFile.pm in the non-.cmap section. My file NotoSansJP-Regular.otf has 5 tables in the cmap section. The first 3 are Platform 0 and the last 2 are Platform 3. All are Ver 0. The encodings are 3, 4, 5, 1, 10. ms_enc() reports that the encoding is 10, so apparently it went with the fifth table.
fwiw, I retrieved https://github.com/googlei18n/noto-cjk/blob/master/NotoSansJP-Regular.otf and note there are 6 cmaps:
Note that the 0/3 and 3/1 maps are the exact same data in the font (not just a copy -- the two cmap headers point to the same location in the file), and similarly for 0/4 and 3/10.
Hmm. I don't recall seeing that last one (platform 1). Does Font::TTF properly handle it? It may be a moot point, if Apple doesn't want it used any more. As I said, Font::TTF apparently chose 3/10.
so the preference order would be 0/6, 0/4, 3/10, 0/3, 3/1 for script fonts and 3/0 for symbol fonts. in microsoft environments the preference would change to 0/6, 3/10, 0/4, 3/1, 0/3 for script fonts
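The preference order above could be sketched as a small chooser over the subtable list ('Platform'/'Encoding' keys as Font::TTF::Cmap exposes them; the function name is hypothetical):

```perl
use strict;
use warnings;

# Hypothetical chooser: return the first cmap subtable matching a
# caller-supplied (platform, encoding) preference list.
sub pick_cmap_table {
    my ($tables, @prefs) = @_;
    for my $pref (@prefs) {
        my ($plat, $enc) = @$pref;
        for my $t (@$tables) {
            return $t if $t->{'Platform'} == $plat
                      && $t->{'Encoding'} == $enc;
        }
    }
    return undef;
}

# script fonts, non-MS environment:
# my $t = pick_cmap_table($tables, [0,6], [0,4], [3,10], [0,3], [3,1]);
```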
i recommend using the Adobe-*-UCS2 files from: https://github.com/apache/pdfbox/tree/trunk/fontbox/src/main/resources/org/apache/fontbox/cmap for cmap generation
i recommend using the Adobe-*-UCS2 files from: https://github.com/apache/pdfbox/tree/trunk/fontbox/src/main/resources/org/apache/fontbox/cmap for cmap generation
Is Adobe-Japan1-UCS2 the follow-on to Adobe-Japan1-6? They don't seem to be in the same format, and other sources have an Adobe-Japan1-7 (though not with the information I need). Looking at Adobe-Japan1-UCS2, I don't see anything that quite looks like the CID-to-Unicode mapping that I need. Do you know how to read this file? The file I was trying to use has a decimal CID, with a variety of Unicode (and other) values for each (although sometimes missing any, and sometimes having multiple Unicode points).
Clear? Huh! Why a four-year-old child could understand this report. Run out and find me a four-year-old child. I can't make head or tail out of it. -- Groucho Marx (Rufus T. Firefly, Duck Soup)
Bob Hallissy on RT system said:
<URL: https://rt.cpan.org/Ticket/Display.html?id=128768 >
AFAICT, the request is to change the order in which find_ms() locates a cmap, specifically to prefer the PlatformID=0 ("Unicode") cmaps over the PlatformID=3 ("Microsoft") cmaps.
At this point I haven't seen any reasons put forward as to why this change was being requested, and since there are lots of fonts that have Microsoft cmaps but no Unicode cmaps, I see no benefit in making the change.
Additionally, application writers can, of course, write their own code to find their preferred cmap. And, in case they're not in control of all the calls to find_ms(), they can replace Font::TTF::Cmap::find_ms with their own function.
128674 has been rejected. It sounds like Font::TTF will not be modified to accommodate Alfred's suggested revision. So where does that leave us? Should we replace the find_ms() method with our own version, and if so, what is the justification? Keep in mind that PDF::Builder can be used on a wide range of platforms, including Linux, Mac, and Windows. A fixed lookup sequence may not be desirable. I assume that the Unicode-to-CID mapping is only of concern to the PDF writer (producer), and irrelevant to the reader (at least, if the font is embedded... what happens if the font has to be on the reader (consumer) machine?).
Further thoughts, given that Font::TTF itself is not going to change? And does this tie in at all with the *.cmap file usage?
If there is a functional reason to change find_ms() I'm open to doing it but I haven't heard a reason yet.
I guess it boils down to: what behavior or problem are you having with the current code? It sounds to me like it is just a personal preference or bias against PlatformID=3. What am I not seeing?
I have no idea. Only Alfred (@terefang) can tell us the reasons for his request. My skin in the game is ensuring that PDF::Builder works correctly for as many people as possible -- I'm open to any and all reasonable ideas.
So Alfred, would Bob's find_cmap() routine (I think it stands independently of Font::TTF innards) do the job for you? Would the lookup list be an option to ttfont() or a new parameter or something else? And how would you prefer to handle MS vs non-MS systems? Does this pertain to just the producer (writer), or also to the consumer (reader), depending on whether or not the font is embedded? What should default behavior be (e.g., same as today)?
Once the selection of a built-in cmap is out of the way, what remains to be done with the four *.cmap files? I still don't have a good update for them. Should they continue to be used in preference to any built-in font cmap, but just for those 4 specific cases?
Bob's find_cmap() routine (I think it stands independently of Font::TTF innards)
Actually it does one thing that depends on knowledge of Font::TTF innards: it sets an internal object variable (' mstable') so as to remember the cmap that was actually found. This variable is then used by Font::TTF::Cmap::ms_lookup() to find glyphs using the designated cmap.

I did this on the assumption that your code is actually using the ms_lookup() function. If it is not, it would be cleaner to adjust the code in the gist to not set $cmap->{' mstable'}, and then it would be completely free of any innards knowledge.
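Bob's actual gist isn't reproduced in this thread, but a minimal sketch of such a preference-driven lookup, based on Font::TTF's documented cmap structure (a {'Tables'} arrayref whose entries carry 'Platform' and 'Encoding' keys), might look like the following. The routine name find_cmap_by_preference is hypothetical, not Bob's real function:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Font::TTF::Font;

# Sketch of a preference-driven cmap lookup: walk a caller-supplied
# Platform/Encoding preference list and return the first matching cmap
# subtable. Optionally remember it in ' mstable' so that
# Font::TTF::Cmap::ms_lookup() will use it afterwards (the innards
# dependency Bob describes above).
sub find_cmap_by_preference {
    my ($font, $set_mstable, @prefs) = @_;
    my $cmap = $font->{'cmap'}->read;
    foreach my $pref (@prefs) {
        my ($pid, $eid) = @$pref;
        foreach my $table (@{ $cmap->{'Tables'} }) {
            if ($table->{'Platform'} == $pid &&
                $table->{'Encoding'} == $eid) {
                $cmap->{' mstable'} = $table if $set_mstable;
                return $table;
            }
        }
    }
    return undef;    # no acceptable subtable found
}

# Illustrative usage (font path is an assumption):
# my $font  = Font::TTF::Font->open('NotoSansJP-Regular.otf');
# my $table = find_cmap_by_preference($font, 1,
#                 [0,6], [0,4], [3,10], [0,3], [3,1]);
```

Leaving $set_mstable false keeps the routine free of any Font::TTF innards knowledge, as Bob suggests.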
yes
On Sat, Mar 16, 2019 at 1:12 AM Phil Perry notifications@github.com wrote:
So Alfred, would Bob's find_cmap() routine (I think it stands independently of Font::TTF innards) do the job for you? Would the lookup list be an option to ttfont() or a new parameter or something else? And how would you prefer to handle MS vs non-MS systems? Does this pertain to just the producer (writer), or also to the consumer (reader) depending on whether or not the font is embedded? What should default behavior be (e.g., same as today)?
“The Principle of Priority states (a) you must know the difference between what is urgent and what is important, and (b) you must do what’s important first.” Steven Pressfield (born 1943) American writer
So how about tweaking the PDF::Builder code to do the following:

1. Unless the call to ttfont says otherwise, try to use the existing 4 .cmap files* for CJK fonts.
2. Otherwise, use Bob's find_cmap() routine with a default (or user-supplied) cmap preference list (setting ' mstable'?).
3. Finally, fall back to find_ms as before.

Would this work? What I'm still unclear about is when to specify a "Microsoft" cmap list and when to specify a "non-Microsoft" cmap list. Should PDF::Builder attempt to discover what platform it's running on? That could work if the font is to be embedded, but what if we specify "-noembed"? I suspect that the current code would favor the Microsoft cmap list. Should we ask the user for both lists to use, rather than just one, if they choose to specify their own cmap list, or do we leave the burden of figuring out the platform to the user (who supplies only the appropriate list)?
Should we just always force fonts to be embedded? That is, no-op "-noembed"? Are there likely to be CID (glyph number) mismatches if the font is not embedded?
so the preference order would be 0/6, 0/4, 3/10, 0/3, 3/1 for script fonts and 3/0 for symbol fonts. In Microsoft environments the preference would change to 0/6, 3/10, 0/4, 3/1, 0/3 for script fonts.

These would be the default cmap lists: always 0/6 first, followed by 0/4 and 3/10 (reversed for MS environments), followed by 0/3 and 3/1 (reversed for MS environments). Would a symbol font have only 3/0 anyway (tack it on to the end of both lists), or do we need to look at the font and see if it's a symbol font?
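For concreteness, those two default orders could be expressed as simple Perl lists of [Platform, Encoding] pairs (the variable names here are illustrative, not actual PDF::Builder internals):

```perl
use strict;
use warnings;

# Default cmap preference for non-Microsoft environments:
# 0/6 first, then 0/4 before 3/10, then 0/3 before 3/1.
my @cmap_prefs_default = ([0,6], [0,4], [3,10], [0,3], [3,1]);

# Microsoft environments reverse the 0/4 vs 3/10 and 0/3 vs 3/1 pairs:
my @cmap_prefs_ms      = ([0,6], [3,10], [0,4], [3,1], [0,3]);

# Symbol fonts use 3/0; per the discussion above, it could simply be
# tacked onto the end of both lists.
push @cmap_prefs_default, [3,0];
push @cmap_prefs_ms,      [3,0];
```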
* I still need to update the .cmap files
I can't move forward with implementing this until I have an idea of what would be considered proper behavior of the code.

1. Is the proposed lookup order (.cmap files, then the cmap preference list, then the find_ms() function) a good one? Does it cover all the bases?
2. Where do I find the information to build updated .cmap files?

Add: Is there a guarantee that a given Unicode point will always map to the same CID (and thus, the correct glyph)? Per point 1, why would there be a fixed mapping for four specific CJK alphabets, but not for other alphabets? I'm wondering why we shouldn't use the built-in cmaps for those fonts, unless the problem is that they specifically don't have cmaps.
I have updated FontFile.pm to look in .cmap files (if -usecmf=>1), then in the list of Platform/Encodings given by -cmaps (or its default list), then using find_ms(), and finally back to .cmap files (regardless of the -usecmf setting). One of those should hopefully find a workable CMap! There is also a -debug flag to show diagnostic information in the hunt for a CMap. I am leaving the ticket open for now as a reminder that the four .cmap files are still in need of updating, and I have not yet found a suitable data source (Unicode/glyphID mapping) to generate new .cmap files. Many thanks to Alfred Reibenschuh (original PDF::API2 author) and Bob Hallissy (Font::TTF author) for their assistance.
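As a usage sketch of the updated options described above (the font path is an assumption, and the exact value syntax accepted by -cmaps is not shown here; consult the PDF::Builder ttfont documentation for the precise format):

```perl
use strict;
use warnings;
use PDF::Builder;

my $pdf = PDF::Builder->new();

# -usecmf => 1 : try the shipped .cmap files first
# -debug  => 1 : trace the hunt for a workable CMap
# (a -cmaps option may also be given to override the default
#  Platform/Encoding search list)
my $font = $pdf->ttfont('NotoSansJP-Regular.otf',
                        '-usecmf' => 1,
                        '-debug'  => 1);
```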
Update: The issue of whether something should be done with -noembed remains open. Perhaps it should be no-op'd, but I'd like to get some feedback first.
I found some sources from Adobe and I think I have the four .cmap files updated to current status (2019-05-29). Therefore I am closing this issue (in GitHub, 3.016 release). Please feel free to reopen (or open a new one) if you find a better update to these files.
perlbug [...] jgreely.com (ssimms/pdfapi2#32)
PDF::API2 version 2.033 Font::TTF version 1.06 Perl version 5.20.2
Sample code:
dies with:
About a third of the CJK fonts I have don't work with PDF::API2. I used Noto Sans JP in the above example because it's freely available, but it's not the only one, and at least one other PostScript-flavored OTF font that I own does work (DFKyoKaShoStd-W4.otf). Is there a workaround or a fix?
-j