Unicode normalisation across apertium tools

flammie commented 3 years ago

It seems to me that a good portion of apertium IRC traffic is people checking Unicode character variants, like:

10:43 +spectie> .u ô
10:43  begiak> U+006F LATIN SMALL LETTER O (o)
10:43  begiak> U+0302 COMBINING CIRCUMFLEX ACCENT (◌̂)
10:43 +spectie> .u ô
10:43  begiak> U+00F4 LATIN SMALL LETTER O WITH CIRCUMFLEX (ô)

I think this is something that the tools should take care of somehow. I'd suggest NFC normalization for all input, perhaps with a warning in compiler-type tools. NFC is the nicest for most FSA letter automata. If agreed, this might be a good starter task for GSoC candidates?
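As a rough illustration of what plain NFC does to the two variants from the IRC log above, here is a minimal Python sketch using the standard library's unicodedata module. The choice of Python and the variable names are assumptions made for this example only; nothing here is tied to a particular apertium tool.

```python
import unicodedata

decomposed = "o\u0302"   # U+006F followed by U+0302 COMBINING CIRCUMFLEX ACCENT
precomposed = "\u00f4"   # U+00F4 LATIN SMALL LETTER O WITH CIRCUMFLEX

# The two strings look the same on screen but differ as code point sequences.
same_before = decomposed == precomposed

# NFC maps both to the single precomposed code point.
nfc_decomposed = unicodedata.normalize("NFC", decomposed)
nfc_precomposed = unicodedata.normalize("NFC", precomposed)

print(same_before)
print(nfc_decomposed == nfc_precomposed)
print([f"U+{ord(ch):04X}" for ch in nfc_decomposed])
```

A dictionary that only lists the precomposed form would miss the decomposed input unless something like this runs first.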

TinoDidriksen commented 3 years ago

We need the non-destructive subset of NFC. E.g., we don't want "U+212B Å ANGSTROM SIGN" normalized to "U+00C5 Å LATIN CAPITAL LETTER A WITH RING ABOVE" or the other destructive transformations NFC performs.
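To make the concern concrete, a small Python sketch (standard library only, names chosen for illustration) shows that full NFC replaces the Ångström sign and that the original code point cannot be recovered afterwards:

```python
import unicodedata

angstrom_sign = "\u212b"  # ANGSTROM SIGN

# Full NFC replaces the sign with the ordinary letter.
nfc_angstrom = unicodedata.normalize("NFC", angstrom_sign)

# Decomposing again yields A + combining ring, not the original sign,
# so the distinction is lost for good.
back_again = unicodedata.normalize("NFD", nfc_angstrom)

print(unicodedata.name(angstrom_sign), "->", unicodedata.name(nfc_angstrom))
print([f"U+{ord(ch):04X}" for ch in back_again])
```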

mr-martian commented 3 years ago

ICU provides a way to define custom normalizations. The documentation isn't terribly helpful, but it looks to me like we just need to edit https://github.com/unicode-org/icu/blob/master/icu4c/source/data/unidata/norm2/nfc.txt to make a more conservative NFC and then use these instructions https://unicode-org.github.io/icu/userguide/transforms/normalization/ under this license https://www.unicode.org/license.html

flammie commented 3 years ago

> We need the non-destructive subset of NFC. E.g., we don't want "U+212B Å ANGSTROM SIGN" normalized to "U+00C5 Å LATIN CAPITAL LETTER A WITH RING ABOVE" or the other destructive transformations NFC performs.

Excellent point. I am personally not very worried about the Ångström sign, but there might be something useful there as well... Perhaps we should go through the list cooperatively somehow. The ICU text file is a bit hard to parse; maybe we should generate a Google doc with the actual letters and stuff for collaborative editing?

TinoDidriksen commented 3 years ago

From what I can see, we just don't want any of the > rules. E.g. rule 212A>004B says Kelvin sign should turn into capital K.
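One way to read this suggestion is as a line filter over the ICU data file. The sketch below is Python and rests on several assumptions: that each mapping sits on its own line, that `#` starts a comment, and that one-way mappings are exactly the lines containing `>`. The sample lines are illustrative rather than copied from the real file (only 212A>004B is quoted in this thread), and the function name is made up. Whether the filtered file builds into valid ICU normalization data has not been checked here.

```python
SAMPLE_RULES = """\
# illustrative excerpt, not the real nfc.txt
00C5=0041 030A
212A>004B
212B>00C5
0300..0314:230
"""

def drop_one_way_rules(text):
    """Return the rule text without lines that contain a one-way '>' mapping."""
    kept = []
    for line in text.splitlines():
        # Ignore anything after '#' when looking for the one-way marker.
        body = line.split("#", 1)[0]
        if ">" in body:
            continue
        kept.append(line)
    return "\n".join(kept) + "\n"

filtered = drop_one_way_rules(SAMPLE_RULES)
print(filtered)
```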

TinoDidriksen commented 3 years ago

A quick'n'dirty shortcut would be to use a transformation that only hits grapheme clusters with combining marks. For example:

echo -n 'ôôÅÅ' | uconv -x '([:^Nonspacing Mark:] [:Nonspacing Mark:]+) > &NFC($1)' | uconv -x any-name

yields

\N{LATIN SMALL LETTER O WITH CIRCUMFLEX}\N{LATIN SMALL LETTER O WITH CIRCUMFLEX}\N{ANGSTROM SIGN}\N{LATIN CAPITAL LETTER A WITH RING ABOVE}

It turns ô (U+006F U+0302) into ô (U+00F4), but doesn't touch Å (U+212B).

However, it would touch Å if that had any combining marks after it. I posit that is so rare we don't have to worry.
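The same idea can be approximated outside ICU. This Python sketch is an assumption-laden imitation of the uconv rule, not a port of it: it treats "a non-Mn character followed by one or more Mn characters" as the cluster, applies NFC only to such clusters, and passes everything else through. The function name is hypothetical, and edge cases may differ from ICU's transliterator.

```python
import unicodedata

def nfc_marked_clusters(text):
    """Apply NFC only to a base character followed by nonspacing marks."""
    out = []
    i = 0
    n = len(text)
    while i < n:
        j = i + 1
        # Only a non-mark character can start a cluster, as in the uconv rule.
        if unicodedata.category(text[i]) != "Mn":
            while j < n and unicodedata.category(text[j]) == "Mn":
                j += 1
        chunk = text[i:j]
        # Single characters (no trailing marks) are left untouched.
        out.append(unicodedata.normalize("NFC", chunk) if j - i > 1 else chunk)
        i = j
    return "".join(out)

sample = "o\u0302\u00f4\u212b\u00c5"
result = nfc_marked_clusters(sample)
print([unicodedata.name(ch) for ch in result])

# The caveat: an Ångström sign followed by a combining mark is rewritten.
caveat = nfc_marked_clusters("\u212b\u0301")
```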

mr-martian commented 3 years ago

> Excellent point. I am personally not very worried about the Ångström sign, but there might be something useful there as well... Perhaps we should go through the list cooperatively somehow. The ICU text file is a bit hard to parse; maybe we should generate a Google doc with the actual letters and stuff for collaborative editing?

https://gist.github.com/mr-martian/80d99c2ca29a36ac483cca84bbc4ec3a

Not quite collaborative editing, but hopefully at least a bit more readable

mr-martian commented 3 years ago

https://gist.github.com/mr-martian/11dd5c4dad3861b55054a209393c1e0c

And here's just the unconditional replacements, since I expect that's the part we're most interested in editing.

flammie commented 3 years ago

> https://gist.github.com/mr-martian/11dd5c4dad3861b55054a209393c1e0c
>
> And here's just the unconditional replacements, since I expect that's the part we're most interested in editing.

Hmm, this all looks OK to me, though I don't know most of the scripts in the list well. For Latin / generic, it doesn't seem to contain anything more problematic than Å for the Ångström sign and K for the Kelvin sign, as far as I can see?

unhammer commented 3 years ago

Should this be a step that apertium/apy runs before the pipeline, or something done within morphological analysis? (My first thought is that it seems easier and cleaner to do it before analysis.)

mr-martian commented 3 years ago

I would expect it to be in conjunction with format handling (either before or after, not sure which).

xavivars commented 3 years ago

Should deformatting take care of this? Or are you thinking of something in between deformatting and analysis?

mr-martian commented 3 years ago

Inserting a normalizer between deformatting and analysis would handle it without requiring every deformatter to be updated. It also deals with the issue (which I guess was discussed on IRC rather than here) that sooner or later someone might care about normalized vs. not and want to turn it off.

unhammer commented 3 years ago

Having it after deformatting would mean it could run on only the translated parts of the text and not touch formatting (so that when Word2022 exports an HTML page with combining chars in its class names, it will still look as ugly as intended).
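A rough Python sketch of such a post-deformatter step follows. It assumes the deformatter wraps formatting in square-bracket blanks with backslash escapes; the function name, the `enabled` switch and the `normalise` parameter are all hypothetical, and plain NFC stands in for whatever conservative normalization ends up being chosen.

```python
import re
import unicodedata

# Group 1: a bracketed blank (formatting). Group 2: a run of translatable text,
# where a backslash escapes the next character (so "\[" counts as text).
TOKEN = re.compile(r"(\[(?:\\.|[^\]\\])*\])|((?:\\.|[^\[\\])+)", re.DOTALL)

def normalise_outside_blanks(stream, enabled=True,
                             normalise=lambda s: unicodedata.normalize("NFC", s)):
    """Normalise text runs but copy bracketed formatting through unchanged."""
    if not enabled:
        return stream

    def fix(match):
        blank, text = match.group(1), match.group(2)
        return blank if blank is not None else normalise(text)

    return TOKEN.sub(fix, stream)

stream = '[<p class="o\u0302">]o\u0302[</p>]'
cleaned = normalise_outside_blanks(stream)
untouched = normalise_outside_blanks(stream, enabled=False)
print(cleaned == stream, untouched == stream)
```

Keeping the switch as a plain argument reflects the point that someone may eventually want to turn normalization off.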

TinoDidriksen commented 3 years ago

Relevant IRC log: https://tinodidriksen.com/pisg/freenode/logs/%23apertium/2021-02-11.log

TinoDidriksen commented 3 years ago

And here's a helper script I have for a similar task: https://gist.github.com/TinoDidriksen/aa6b8047e26fb6876b4b9f90c51988f3

apertium / organisation

Unicode normalisation across apertium tools #24