Wakamai-Fondue / wakamai-fondue-engine

The engine that powers Wakamai Fondue
Apache License 2.0

lang-tags-from-hb.py does not match HarfBuzz’s disambiguation strategy #10

Closed dscorbett closed 4 years ago

dscorbett commented 4 years ago

lang-tags-from-hb.py disambiguates language system tags that correspond to multiple BCP 47 tags by parsing hb_ot_ambiguous_tag_to_language. However, this does not catch ambiguous language system tags that are not handled by that function. For example, 'IWR ' should be converted to he, but currently it is not: https://github.com/Wakamai-Fondue/wakamai-fondue-engine/blob/c181dd619375325f27897f9d81c395bd473f016c/src/tools/ot-to-html-lang.js#L878-L881

HarfBuzz’s disambiguation strategy is to consult hb_ot_ambiguous_tag_to_language then, if there was no match, pick the alphabetically first match in ot_languages. Wakamai Fondue’s strategy is to consult hb_ot_ambiguous_tag_to_language then, if there was no match, pick the alphabetically last match in ot_languages. (If iw were the recommended code for Hebrew and he were the deprecated code, then HarfBuzz would have to handle Hebrew explicitly, but because the recommended code happens to alphabetically precede the other, it doesn’t.)
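
To make the difference concrete, here is a minimal Python sketch of the two fallbacks. The `CANDIDATES` table and both function names are hypothetical and exist only for illustration; the only data taken from this issue is that 'IWR ' can correspond to both he and iw.

```python
# Hypothetical excerpt: BCP 47 codes that share one language system tag.
# Only the 'IWR ' -> he/iw pair comes from the discussion in this issue.
CANDIDATES = {"IWR ": ["he", "iw"]}


def pick_first(tag: str) -> str:
    """HarfBuzz-style fallback: the alphabetically first match."""
    return min(CANDIDATES[tag])


def pick_last(tag: str) -> str:
    """The fallback described for lang-tags-from-hb.py: the alphabetically last match."""
    return max(CANDIDATES[tag])


print(pick_first("IWR "))  # he
print(pick_last("IWR "))   # iw
```

Switching the fallback from "last" to "first" is enough for this case, because the recommended code happens to sort before the deprecated one.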

RoelN commented 4 years ago

Thank you kindly for bringing this to my attention! I'm glad you caught this, as this was a serious mistake.

Looking more closely at the output, I noticed two things that might be bugs, and I wonder what you think about them:

First observation:

Some languages are commented out, for instance Acoli/Acholi ("ACH "). As far as I understand, this would result in an "x-hbot-AABBCCDD" code for "ACH ", instead of the "ach" code in the commented-out line:

/*{"ach",   HB_TAG('A','C','H',' ')},*/ /* Acoli -> Acholi */

So I decided to add these commented-out lines anyway, so I'd get the "ACH " → "ach" translation.

But other commented-out lines mess things up:

/*{"duj",   HB_TAG('D','U','J',' ')},*/ /* Dhuwal (retired code) */

Because "duj" is a deprecated code, it should be "dwu". So here I shouldn't use the commented-out line.

Is it an error that the Acoli/Acholi line is commented out?

Second observation:

For Divehi ("DHV "), the following tag is set. Note the comment saying this is deprecated:

{"dv",  HB_TAG('D','H','V',' ')},   /* Divehi (Dhivehi, Maldivian) (deprecated) */

As far as I can tell, this is the correct tag from the IANA data:

%%
Type: language
Subtag: dv
Description: Dhivehi
Description: Divehi
Description: Maldivian
Added: 2005-10-16
Suppress-Script: Thaa
%%

dscorbett commented 4 years ago

If neither hb_ot_ambiguous_tag_to_language nor ot_languages contains a mapping for an input language, but the input is three characters long, HarfBuzz capitalizes it to get the language system tag. This is an optimization to make ot_languages shorter. For example, "ach" is mapped to 'ACH ' which is mapped back to "ach-x-hbot-41434820". So it’s not a mistake that the "ach" line is commented out.
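
As a rough illustration of that round trip, here is a small Python sketch. The function names are hypothetical, and two details are assumptions rather than something stated in this thread: that the hex digits are lowercase, and that the lowercased three-letter prefix is added only when the tag is three ASCII letters followed by a space.

```python
def fallback_language_to_tag(code: str) -> str:
    """Fallback for a three-character code with no table entry: capitalize and pad."""
    if len(code) != 3:
        raise ValueError("this fallback only applies to three-character codes")
    return code.upper() + " "


def fallback_tag_to_language(tag: str) -> str:
    """Fallback for a tag with no table entry: private-use 'x-hbot' notation."""
    if len(tag) != 4:
        raise ValueError("a language system tag has exactly four characters")
    # Hex-encode the four tag bytes (lowercase digits are an assumption).
    private_use = "x-hbot-" + tag.encode("ascii").hex()
    letters = tag[:3]
    if tag[3] == " " and letters.isascii() and letters.isalpha():
        # Assumed condition: guess that the three letters are a language code.
        return letters.lower() + "-" + private_use
    return private_use


tag = fallback_language_to_tag("ach")
print(repr(tag))                      # 'ACH '
print(fallback_tag_to_language(tag))  # ach-x-hbot-41434820
```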

The "duj" line is commented out because "duj" corresponds to 'DUJ ', its capitalized equivalent. If it were not commented out but "duj" were still a deprecated subtag, hb_ot_ambiguous_tag_to_language would have to explicitly map 'DUJ ' to "dwu", the preferred subtag; however, since it is just a comment, HarfBuzz can ignore it.

So here is the algorithm I suggest for you. If a language system tag is listed in hb_ot_ambiguous_tag_to_language, use that mapping. Otherwise, if the tag is listed in ot_languages, use the first mapping (ignoring comments). Otherwise, if the tag is three non-spaces followed by a space and the tag appears in a comment in ot_languages, use the mapping in that comment. Otherwise, convert the tag to "x-hbot" notation.
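
The algorithm above could look roughly like this in Python. This is only a sketch under assumptions: the function name, the three table arguments, and their shapes are hypothetical, the sample data is limited to entries quoted in this thread, and the "x-hbot" fallback is reduced to a plain hex encoding with lowercase digits.

```python
from typing import Dict, List


def resolve(
    tag: str,
    ambiguous: Dict[str, str],     # from hb_ot_ambiguous_tag_to_language
    active: Dict[str, List[str]],  # uncommented ot_languages entries, in file order
    commented: Dict[str, str],     # commented-out ot_languages entries
) -> str:
    # 1. An explicit disambiguation always wins.
    if tag in ambiguous:
        return ambiguous[tag]
    # 2. Otherwise use the first uncommented match.
    if active.get(tag):
        return active[tag][0]
    # 3. Otherwise, for three non-spaces plus a space, fall back to a comment.
    three_plus_space = len(tag) == 4 and tag[3] == " " and " " not in tag[:3]
    if three_plus_space and tag in commented:
        return commented[tag]
    # 4. Otherwise use private-use notation (simplified; hex case is an assumption).
    return "x-hbot-" + tag.encode("ascii").hex()


# Sample data limited to entries mentioned in this thread.
AMBIGUOUS: Dict[str, str] = {}
ACTIVE = {"DHV ": ["dv"], "IWR ": ["he", "iw"]}
COMMENTED = {"ACH ": "ach"}

for t in ("IWR ", "DHV ", "ACH ", "ABCD"):
    print(repr(t), "->", resolve(t, AMBIGUOUS, ACTIVE, COMMENTED))
```

Passing the tables in as arguments keeps the lookup order visible in one place, which is the part that matters here.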

  {"dv",    HB_TAG('D','H','V',' ')},   /* Divehi (Dhivehi, Maldivian) (deprecated) */

That means that the OpenType language system tag 'DHV ' is deprecated, not the BCP 47 subtag "dv".

RoelN commented 4 years ago

Thanks again for your insights, this is really helpful.

Since I also need the language names for three-character tags whose BCP 47 code is simply the lowercased tag, I now take this approach:

  1. Take non-commented tags listed in ot_languages
  2. Replace any existing tag with ones from hb_ot_ambiguous_tag_to_language
  3. Append commented-out tags, when the previous two steps haven't added them
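
In Python, these three steps could be sketched roughly as follows. The function name and the input shapes are hypothetical, and the sample entries are illustrative: in particular, the uncommented "dwu" entry for 'DUJ ' is an assumption, as is keeping the first match when a tag has several uncommented entries.

```python
from typing import Dict, List, Tuple

Entry = Tuple[str, str]  # (BCP 47 code, language system tag)


def build_mapping(
    active: List[Entry],        # uncommented ot_languages entries, in file order
    ambiguous: Dict[str, str],  # tag -> code, from hb_ot_ambiguous_tag_to_language
    commented: List[Entry],     # commented-out ot_languages entries
) -> Dict[str, str]:
    mapping: Dict[str, str] = {}
    # Step 1: uncommented entries; keep the first code seen for each tag.
    for code, tag in active:
        mapping.setdefault(tag, code)
    # Step 2: explicit disambiguations replace (or add) entries.
    mapping.update(ambiguous)
    # Step 3: commented-out entries only fill tags that are still missing.
    for code, tag in commented:
        mapping.setdefault(tag, code)
    return mapping


# Illustrative data; the uncommented ("dwu", "DUJ ") entry is an assumption.
active = [("dv", "DHV "), ("dwu", "DUJ "), ("he", "IWR "), ("iw", "IWR ")]
commented = [("ach", "ACH "), ("duj", "DUJ ")]

mapping = build_mapping(active, {}, commented)
print(mapping["ACH "], mapping["DUJ "], mapping["IWR "])  # ach dwu he
```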

I also remove the string (deprecated) from the language name, as OT spec deprecation is not something I need to communicate.
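
For illustration, parsing one table line and cleaning up its name might look like the sketch below. The regular expression and the `parse_line` helper are hypothetical, and the line format is assumed from the two ot_languages lines quoted earlier in this thread; it is not the actual code in lang-tags-from-hb.py.

```python
import re
from typing import Optional, Tuple

# Line format assumed from the ot_languages lines quoted in this thread.
LINE_RE = re.compile(
    r'^\s*(?P<commented>/\*)?'               # optional start of a commented-out entry
    r'\{"(?P<code>[^"]+)",\s*'               # BCP 47 code
    r"HB_TAG\('(.)','(.)','(.)','(.)'\)\},"  # four tag characters (groups 3 to 6)
    r'(?:\*/)?\s*'                           # optional end of a commented-out entry
    r'/\*\s*(?P<name>.*?)\s*\*/\s*$'         # trailing comment with the language name
)


def parse_line(line: str) -> Optional[Tuple[str, str, str, bool]]:
    """Return (code, tag, name, is_commented_out), or None if the line does not match."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    tag = "".join(m.group(3, 4, 5, 6))
    # Drop the OpenType-side deprecation marker from the display name.
    name = re.sub(r"\s*\(deprecated\)", "", m.group("name"))
    return m.group("code"), tag, name, m.group("commented") is not None


print(parse_line("""{"dv",  HB_TAG('D','H','V',' ')},   /* Divehi (Dhivehi, Maldivian) (deprecated) */"""))
print(parse_line("""/*{"ach",   HB_TAG('A','C','H',' ')},*/ /* Acoli -> Acholi */"""))
```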

This fixes the Dhuwal tag issue.

I don't think I can use the "x-hbot" notation as I'm only interested in OT-to-BCP47 translation, to put inside an HTML tag's lang attribute.

Unless I'm missing things, I think I now have a valid OT-to-BCP47 mapping! :-)

dscorbett commented 4 years ago

There is one more problem, but this time it is in HarfBuzz: harfbuzz/harfbuzz#2669.

RoelN commented 4 years ago

@dscorbett Thanks for verifying my work again, greatly appreciated! I'll update our list once https://github.com/harfbuzz/harfbuzz/pull/2669 is merged.