MicrosoftDocs / typography-issues

Encoding information is incomplete #814

Open DarrenHoffman opened 3 years ago

DarrenHoffman commented 3 years ago

In some cases multiple characters are represented by a single glyph-form, but the encoding info only covers single character glyph forms. Clearly Microsoft Word is aware of the mappings so they definitely exist. It would be good to add them to the documentation.

For example, via experimentation I have found that the two character string (in logical order) <U+644 U+627> maps to the single glyph at 0xF2DC. But simply by going off of the documentation at this page one would only know to map <U+644 U+627> to the two glyph string <0xF2BB 0xF242>. The difference in appearance is stark, though an Arabic Language reader may recognize these two forms as equivalent as far as I know (I myself do not know how to read Arabic at all), but surely the single glyph form is preferable if not absolutely required.

If I have made any mistakes or overlooked any relevant information, I apologize in advance.

Edit: after taking a closer look at the Arabic Unicode blocks, I now see there are single Unicode values for these multi-character presentation form glyphs. Nonetheless, those mappings are still not present in this Microsoft documentation page on these legacy encodings, so it would still be much appreciated if the page could be updated with the presentation form mappings. Otherwise I guess I will have to go through them one by one and painstakingly figure out what the correct mappings are.


PeterConstable commented 3 years ago

> The difference in appearance is stark, though an Arabic Language reader may recognize these two forms as equivalent as far as I know (I myself do not know how to read Arabic at all), but surely the single glyph form is preferable if not absolutely required.

The single glyph form is absolutely required.

The intent of the documentation was not to provide all information needed to support Arabic script, but only to document the legacy 8-bit encodings* associated with ARABIC_CHARSET_SIMPLIFIED and ARABIC_CHARSET_TRADITIONAL. I think the article does that.

* Implementations using the legacy symbol encoding would store strings as 8-bit characters, but then in the display pipeline map to 16-bit code points in the 0xFxxx range (which happens to correspond to Unicode PUA characters). This is mentioned in that article (see The nature of the font encodings).
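
The mapping described in this footnote can be sketched as follows. This is only an illustration, not Microsoft's implementation: the function names are hypothetical, and the 0xF200 base is the value reported elsewhere in this thread for ARABIC_CHARSET_TRADITIONAL fonts.

```python
# Assumption: 0xF200 is the base reported in this thread for TRADITIONAL fonts.
TRADITIONAL_PUA_BASE = 0xF200


def legacy_byte_to_pua(byte_value: int, base: int = TRADITIONAL_PUA_BASE) -> int:
    """Map one stored 8-bit character to its 16-bit code point in the 3,0 cmap."""
    if not 0 <= byte_value <= 0xFF:
        raise ValueError(f"not an 8-bit value: {byte_value!r}")
    return base | byte_value


def legacy_bytes_to_pua(data: bytes, base: int = TRADITIONAL_PUA_BASE) -> list:
    """Map a stored 8-bit string to the list of code points used for display."""
    return [legacy_byte_to_pua(b, base) for b in data]
```

For example, the stored byte 0xDC would be looked up in the font as 0xF2DC.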

DarrenHoffman commented 3 years ago

First of all, thanks for replying; I really do appreciate it. However, I don't think I explained things well. Maybe it's kind of complicated, or maybe I need to try a little harder to boil things down to the essence of the problem, but please see if you can parse what I'm saying below. I'm pretty sure that the documentation is missing some of the encoding information that is specific to the ARABIC_CHARSET_TRADITIONAL encoding.

So to be clear, I'm not asking for help with general Arabic script support. I am specifically working with legacy fonts that only have a 3,0 cmap table and where the upper byte of fsSelection is ARABIC_CHARSET_TRADITIONAL.

And, I am still not convinced the legacy 8-bit encodings are fully documented here. I have a set of these legacy fonts in hand at the moment. They contain almost 256 glyphs mapped to values in the 0xF200 to 0xF2FF range of the 3,0 cmap table. So that's essentially an 8-bit encoding. There is absolutely no other encoding info in these fonts.

And yet Microsoft Word (with a DOCX file, using only Arabic Unicode values from the Unicode block 0600-06FF) is able to map the pair of Unicode values I mentioned above to a presentation-form glyph in the font that is not included in the mappings provided by this web page. All the mappings provided by this web page are consistent with what appears in the font files I have. So this certainly implies Microsoft Word has some further information about the encodings in these fonts that is not provided in this web page.

I don't know if I am being clear enough with the details, so I think we can both agree on at least the following 3 statements:

  1. the ARABIC_CHARSET_TRADITIONAL tables on this web page cover Unicode values from 0x620 thru 0x65F.
  2. those Unicode values are mapped to up to four different (for the 4 basic forms) 8 bit values that are to be added to 0xF200 to get the PUA Unicode mapping for the glyph in the font.
  3. However, there is no mapping from any Unicode value (or sequence of values) to the PUA Unicode value 0xF2DC (which would just be listed as the 8 bit value DC in the ARABIC_CHARSET_TRADITIONAL encoding tables). It's not there.
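
To make the three statements concrete, here is a minimal sketch of the kind of per-form lookup the page describes. The table is deliberately partial: it holds only the two entries implied by the <0xF2BB 0xF242> example in the opening post (assuming LAM takes its initial form and ALEF its final form there). The names `Form`, `FORM_TABLE` and `glyph_code` are hypothetical.

```python
from enum import Enum
from typing import Optional


class Form(Enum):
    ISOLATED = "isolated"
    INITIAL = "initial"
    MEDIAL = "medial"
    FINAL = "final"


PUA_BASE = 0xF200  # added to the 8-bit value to get the cmap code point

# Partial, illustrative table: Unicode value -> {form: 8-bit value}.
# Only the entries implied by the opening post's example are filled in.
FORM_TABLE = {
    0x0644: {Form.INITIAL: 0xBB},  # LAM
    0x0627: {Form.FINAL: 0x42},    # ALEF
}


def glyph_code(char: int, form: Form) -> Optional[int]:
    """Return the PUA code for a character in a given form, or None if unlisted."""
    forms = FORM_TABLE.get(char)
    if forms is None or form not in forms:
        return None
    return PUA_BASE + forms[form]
```

A table of this shape can only ever produce one glyph per character, which is why no single-character lookup in it yields 0xF2DC.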

But in the fonts I have there is a glyph for 0xF2DC, and while I don't know Arabic I can certainly see that it is the combined isolate presentation form for the Unicode string <U+644 U+627>, for which there is also a single Unicode value in the Arabic Presentation Forms-B block of the Unicode standard, U+FEFB, I think.

And the final thing is that Microsoft Word (in the DOCX format, using valid Unicode values internally for text), given the string <U+644 U+627> and set to use one of the legacy fonts I am working with, is able to correctly select that glyph encoded as 0xF2DC, even though that code point does not appear on this web page. That implies there is further encoding information relevant to the ARABIC_CHARSET_TRADITIONAL encoding that is missing from this web page.

And let me be clear I have verified it is in fact pulling the 0xF2DC glyph from the legacy font I have, and not substituting the glyph from a different, Unicode supporting, font.

DarrenHoffman commented 3 years ago

So maybe these 3 screen shots can do a better job of illustrating my issue than all that text in the previous post:

So, HOW does Word know to use the glyph at 0xF2DC?! That one isn't included in the mapping on the ARABIC_CHARSET_TRADITIONAL's documentation here.

It must be that there is more non-standard legacy encoding information Word knows about ARABIC_CHARSET_TRADITIONAL fonts than is documented on this webpage. If there is any other explanation please let me know.

[three screenshots]

tiroj commented 3 years ago

Yes, Word is performing a many-to-one character mapping for the lam_alif ligatures (isolated and final forms, plain and carrying marks). These are orthographically required ligating forms in which the way the letters interact is graphically distinct from the combination of initial/medial+final forms of these letters. I agree this should probably be discussed in the legacy Arabic font article, for the sake of completeness, even though those fonts are pretty obsolete and presumably only supported in Microsoft software for backwards compatibility reasons.

It would be nice to know, for example, where in the display chain the ligature mapping is done: is it at the top level of Arabic character input, or after the joining form mapping?
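
As an illustration of the first option only (ligature matching at the top level, on the input characters), a many-to-one pass might look like the sketch below. It is not a description of what Word or Uniscribe actually does. The single table entry is the isolated LAM + ALEF value reported earlier in this thread; the sketch ignores joining context, so it cannot choose the final-form ligature, and it ignores marks. `LIGATURES` and `apply_ligatures` are hypothetical names.

```python
# Character sequence -> single PUA code. Only the isolated LAM + ALEF
# ligature reported in this thread is listed; a real table would be larger.
LIGATURES = {
    (0x0644, 0x0627): 0xF2DC,
}


def apply_ligatures(chars, ligatures=LIGATURES):
    """Replace known two-character sequences with one PUA code.

    Characters that are not part of a ligature are passed through unchanged,
    to be mapped to joining forms by a later step.
    """
    out = []
    i = 0
    while i < len(chars):
        pair = tuple(chars[i:i + 2])
        if pair in ligatures:
            out.append(ligatures[pair])
            i += 2
        else:
            out.append(chars[i])
            i += 1
    return out
```

Running the pass before joining-form mapping keeps the ligature table keyed on plain Unicode sequences; running it afterwards would require keying it on the per-form codes instead.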

DarrenHoffman commented 3 years ago

Do you happen to know offhand how many of these orthographically required ligating forms are used with ARABIC_CHARSET_TRADITIONAL? I can't personally tell just by eyeballing the glyphs in the font file, though I guess I could probably figure it out sooner or later with a close examination.

Update: after staring at the glyphs in one of these fonts for a couple days I can actually tell which ones are the ligatures now and what characters they are ligatures of.

According to Wikipedia lam_alif is the only required one, is that correct?

khaledhosny commented 3 years ago

The documentation indeed does not cover any ligatures. I had to reverse-engineer them by looking at the fonts and double-checking my guess with what Uniscribe does. The ligature table I came up with is here (part of https://github.com/harfbuzz/harfbuzz/pull/3063).

PeterCon commented 3 years ago

A number of glyphs are presentation ligature forms for combinations other than orthographically-required ligatures. In your screen shot above, F2D6 and F2D7 are initial and isolate forms for the combination DAL + HAMZA ABOVE (in Unicode, the sequence <U+062F, U+0654>).

PeterCon commented 3 years ago

@khaledhosny In your harfbuzz data, you have F21D, F21E and F21F (for Traditional) as

In the Royal Arabic font, though, these are final forms.

In Microsoft's legacy Arabic engine (25-year-old code) I can't easily tell which ligatures are assumed to be initial, medial, final or isolated (there's one array that combines all ordered by component count then legacy code point), but the comments for F212..F214 and F21C..F21F indicate those are final.

For ligatures beginning with LAM, I'm not familiar enough with Arabic script to tell in the Royal Arabic font whether these are isolate, final or intended to be used as both.

[screenshot]

Royal Arabic is just one representative of TRADITIONAL that I happen to have looked at. I have no way of knowing what to consider truth: a sample font or source code, either of which may be buggy. (I think I've found one bug in the array shown in the article: for 0638 isolate, I suspect 0xa3 should be 0xa6.)

DarrenHoffman commented 3 years ago

> The documentation indeed does not cover any ligatures. I had to reverse-engineer them by looking at the fonts and double-checking my guess with what Uniscribe does. The ligature table I came up with is here (part of harfbuzz/harfbuzz#3063).

thank you very much. That will probably be very helpful.

DarrenHoffman commented 3 years ago

> Royal Arabic is just one representative of TRADITIONAL that I happen to have looked at. I have no way of knowing what to consider truth: a sample font or source code, either of which may be buggy. (I think I've found one bug in the array shown in the article: for 0638 isolate, I suspect 0xa3 should be 0xa6.)

Yes, I am sure you are correct about that bug. 0xa6 is clearly the isolate (in the fonts I have at least). And I don't believe any of the characters should have exactly 3 distinct glyph forms, as would be the case if 0xa3 were correct. You either have 1 form (essentially the isolate glyph for all cases, but that's not common), or 2 forms (one for Isolate+Initial and one for Medial+Final, slightly more common), or all 4 distinct forms (which I believe most of the characters have). Any character that has at least 3 distinct glyphs would also have a distinct 4th form (it wouldn't make sense not to, I'm pretty sure). And you can see 0xa3 has the trailing connector stroke that would make it a distinct Initial form.
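
That rule of thumb (1, 2 or 4 distinct forms, never exactly 3) could be used as a sanity check over a table of form codes. A minimal sketch follows; the function names are hypothetical, and it only counts distinct codes rather than checking which forms share a glyph.

```python
def distinct_form_count(forms):
    """Count distinct codes in an (isolated, initial, medial, final) tuple."""
    return len(set(forms))


def is_plausible(forms):
    """Apply the rule of thumb: a character has 1, 2 or 4 distinct forms."""
    return distinct_form_count(forms) in (1, 2, 4)


def suspicious_entries(table):
    """Return the characters whose form tuples have exactly 3 distinct codes."""
    return sorted(ch for ch, forms in table.items() if distinct_form_count(forms) == 3)
```

For U+0638, if 0xa3 is listed as both isolate and initial, the entry has exactly 3 distinct codes and is flagged; with 0xa6 as the isolate it has 4 and passes. A check like this is a heuristic for spotting likely typos, not proof of an error.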

PeterConstable commented 3 years ago

@DarrenHoffman Do you know of fonts that use SIMPLIFIED charset?

DarrenHoffman commented 3 years ago

> @DarrenHoffman Do you know of fonts that use SIMPLIFIED charset?

no, I just have this one set of TRADITIONAL font faces. I can check with my customer but I think it's just this one family they are using.

ebraminio commented 3 years ago

> @DarrenHoffman Do you know of fonts that use SIMPLIFIED charset?

I found lots of such fonts here, and here are the simplified ones: 1, 2 and 3. There are also some others, like 4, which apparently differ from the rest: they don't have 0xB3 (traditional) or 0xB2 (simplified) but 0xEE, and we don't know what that means. I've listed the fonts on the page along with their codes here. Of them, 468 are "Traditional Arabic Windows 3.1 font page", 13 are "Simplified Arabic Windows 3.1 font page" and 3 have 0xEE00 (unknown font page?): "Phyllis ATT Italic.TTF", "LUCASIT.TTF" and "SIGNETRO.TTF".
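
A classification like the one above can be sketched as a check on the upper byte of the OS/2 fsSelection value. The helper name `legacy_charset` is hypothetical, and the sketch takes the 16-bit value as a plain integer; with fontTools, for example, it could be read as `TTFont(path)["OS/2"].fsSelection`.

```python
# Upper-byte values mentioned in this thread.
CHARSET_NAMES = {
    0xB2: "ARABIC_CHARSET_SIMPLIFIED",
    0xB3: "ARABIC_CHARSET_TRADITIONAL",
}


def legacy_charset(fs_selection: int) -> str:
    """Name the legacy charset stored in the upper byte of fsSelection."""
    code = (fs_selection >> 8) & 0xFF
    return CHARSET_NAMES.get(code, f"unknown (0x{code:02X})")
```

Values such as 0xEE00 fall through to the "unknown" branch, matching the three unexplained fonts listed above.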

DarrenHoffman commented 3 years ago

@ebraminio ah, thank you very much.

khaledhosny commented 3 years ago

> @khaledhosny In your harfbuzz data, you have F21D, F21E and F21F (for Traditional) as

You are right, I probably was looking into a font that didn’t differentiate enough between final and isolated forms of these ligatures (the table was based on some python tool I wrote long ago and I don’t remember all the details). I double checked by shaping with Uniscribe and the ligatures were indeed used in final form (i.e. between medial and final glyphs).

DarrenHoffman commented 3 years ago

> It would be nice to know, for example, where in the display chain the ligature mapping is done: is it at the top level of Arabic character input, or after the joining form mapping?

Well, I noticed that in the font I have there is a ligature for, I believe, "Allah", except that the ligature's glyph in the font file doesn't actually include the first character, the alif (I think that's the one). But the only Unicode representation I can find for the ligature includes the alif. So with that in mind, I guess when using these fonts it might be better to map Unicode characters to the font's PUA codes before performing ligature substitution, since there doesn't seem to be a clean Unicode mapping. But I'm not sure; that's just one thing I noticed. It seems to require slightly different ligature mapping rules than a modern Unicode font.

khaledhosny commented 3 years ago

There are at least three ligatures with no corresponding Unicode Presentation Forms code points: 0xF201, as you noted, as well as 0xF211 and 0xF2EE.

PeterCon commented 3 years ago

@DarrenHoffman @khaledhosny I've updated the page with complete details on the presentation encoding for Traditional Arabic. Please review—in particular, let me know if the presentation works.

At some point, if some representative Simplified fonts can be obtained, a similar update might be done for that.

khaledhosny commented 3 years ago

Looks good, but for entries like:

064B fathatan F2E7 or F2F5

I understand these are the high and low variants, but there is no explanation how they are used. I have always assumed this provides mset-like behavior, but stating this explicitly (if it is really the case) might be a good idea.
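
To show what I mean by mset-like behavior, here is a rough sketch in which the mark glyph is chosen according to the base glyph it follows. Everything beyond the two fathatan codes is an assumption for illustration: which code is the high variant, which bases call for the alternate, and the names `MARK_VARIANTS` and `pick_mark_glyph` are all hypothetical.

```python
# Mark -> (first listed code, second listed code), as given on the page.
# Which of the two is the "high" variant is NOT established here.
MARK_VARIANTS = {
    0x064B: (0xF2E7, 0xF2F5),  # FATHATAN
}


def pick_mark_glyph(mark, base_glyph, bases_needing_alternate):
    """Choose a mark variant based on the preceding base glyph.

    bases_needing_alternate is a caller-supplied set of base glyph codes;
    its contents are an assumption, since the page does not list them.
    """
    default, alternate = MARK_VARIANTS[mark]
    return alternate if base_glyph in bases_needing_alternate else default
```

If the engine really works this way, the page would need to say which bases trigger which variant.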

PeterCon commented 3 years ago

@khaledhosny I've updated per your comments.

khaledhosny commented 3 years ago

> @khaledhosny I've updated per your comments.

Looks good, thanks.