Open flammie opened 3 years ago
We need the non-destructive subset of NFC. E.g., we don't want "U+212B Å ANGSTROM SIGN" normalized to "U+00C5 Å LATIN CAPITAL LETTER A WITH RING ABOVE" or the other destructive transformations NFC performs.
ICU provides a way to define custom normalizations. The documentation isn't terribly helpful, but it looks to me like we just need to edit https://github.com/unicode-org/icu/blob/master/icu4c/source/data/unidata/norm2/nfc.txt to make a more conservative NFC and then use these instructions https://unicode-org.github.io/icu/userguide/transforms/normalization/ under this license https://www.unicode.org/license.html
Excellent point. I am personally not very worried about the Ångström sign, but there might be something useful there as well... Perhaps we should go through the list cooperatively somehow. The ICU text file is a bit hard to parse; maybe we should generate a Google doc with the actual letters and stuff for collaborative editing?
From what I can see, we just don't want any of the ">" rules. E.g. the rule 212A>004B says Kelvin sign should turn into capital K.
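To make that concrete, here is a minimal Python sketch of filtering an nfc.txt-style file so that only the ">" rules are dropped. The line syntax is assumed from the rule quoted above (a hex code point, or a `XXXX..YYYY` range, followed by `>`); the function name and the sample lines are illustrative, and whether the filtered file builds and behaves as wanted with ICU's tooling is untested here.

```python
import re

# A one-way mapping line starts with a hex code point (or a range) followed by ">".
# This pattern is an assumption based on the "212A>004B" rule quoted in the thread.
ONE_WAY = re.compile(r"^\s*[0-9A-Fa-f]{4,6}(\.\.[0-9A-Fa-f]{4,6})?\s*>")


def drop_one_way_mappings(lines):
    """Return the lines with every ">" rule removed; everything else is kept as-is."""
    return [line for line in lines if not ONE_WAY.match(line)]


# Illustrative sample, not a copy of the real file.
sample = [
    "# comment lines are kept",
    "00C0=0041 0300",
    "212A>004B",
    "212B>00C5",
    "0300..0314:230",
]

for line in drop_one_way_mappings(sample):
    print(line)
```

Comments, "=" mappings and ":" lines pass through unchanged, so the result stays diffable against the original file.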
A quick'n'dirty shortcut would be to use a transformation that only hits grapheme clusters with combining marks. For example:
echo -n 'ôôÅÅ' | uconv -x '([:^Nonspacing Mark:] [:Nonspacing Mark:]+) > &NFC($1)' | uconv -x any-name
yields
\N{LATIN SMALL LETTER O WITH CIRCUMFLEX}\N{LATIN SMALL LETTER O WITH CIRCUMFLEX}\N{ANGSTROM SIGN}\N{LATIN CAPITAL LETTER A WITH RING ABOVE}
It turns ô (U+006F U+0302) into ô (U+00F4), but doesn't touch Å (U+212B).
However, it would touch Å (U+212B) if that had any combining marks after it. I posit that is so rare we don't have to worry.
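For anyone who wants to try the same shortcut without uconv, here is a rough Python equivalent using only the standard `unicodedata` module. It is a sketch of the idea, not a drop-in replacement: the function names are made up for illustration, and Python's Unicode data version may differ from ICU's.

```python
import unicodedata


def is_nonspacing_mark(ch):
    """True for characters in general category Mn (Nonspacing Mark)."""
    return unicodedata.category(ch) == "Mn"


def compose_marked_clusters(text):
    """Apply NFC only to a non-mark character followed by one or more Mn marks."""
    out = []
    i = 0
    n = len(text)
    while i < n:
        j = i + 1
        if not is_nonspacing_mark(text[i]):
            # Extend the cluster over all directly following nonspacing marks.
            while j < n and is_nonspacing_mark(text[j]):
                j += 1
        cluster = text[i:j]
        if j - i > 1:
            # Only clusters that actually carry combining marks are normalized.
            cluster = unicodedata.normalize("NFC", cluster)
        out.append(cluster)
        i = j
    return "".join(out)


# Same input as the uconv example, written with escapes to avoid ambiguity:
# o + COMBINING CIRCUMFLEX, precomposed o-circumflex, ANGSTROM SIGN, A WITH RING ABOVE.
result = compose_marked_clusters("o\u0302\u00f4\u212b\u00c5")
print([unicodedata.name(c) for c in result])
```

As with the uconv rule, an ANGSTROM SIGN that is followed by a combining mark still gets rewritten, because the whole cluster goes through NFC.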
https://gist.github.com/mr-martian/80d99c2ca29a36ac483cca84bbc4ec3a
Not quite collaborative editing, but hopefully at least a bit more readable
https://gist.github.com/mr-martian/11dd5c4dad3861b55054a209393c1e0c
And here's just the unconditional replacements, since I expect that's the part we're most interested in editing.
Hmm, this all looks OK to me, though I have no good knowledge of most scripts in the list. For Latin / generic, it doesn't seem to have anything more problematic than Å for the Ångström sign and K for the Kelvin sign, afaics?
Should this be a step that apertium/apy runs before the pipeline? Or something done within morph analysis? (My first thought is that it seems easier and cleaner to do it before analysis.)
I would expect it to be in conjunction with format handling (either before or after, not sure which).
Should deformatting take care of this? Or are you thinking of something in between deformatting and analysis?
Inserting a normalizer between deformatting and analysis would handle it without requiring every deformatter to be updated and also deals with the issue (that I guess was discussed on IRC rather than here) that sooner or later someone might care about normalized vs not and want to turn it off.
Having it after deformatting would mean it could run on only the translated parts of the text, and not touch formatting (so that when Word2022 exports an html page with combining chars in its class names it will still look as ugly as intended)
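A rough Python sketch of what "after deformatting, text only" could look like. It assumes the deformatter output keeps formatting inside [...] superblanks and escapes reserved characters with a backslash; the function name is hypothetical, and the `normalize` argument is a placeholder (plain NFC by default here) for whatever conservative normalizer ends up being chosen. Passing an identity function turns normalization off.

```python
import unicodedata


def nfc(s):
    """Placeholder normalizer; a conservative variant could be swapped in."""
    return unicodedata.normalize("NFC", s)


def normalize_outside_superblanks(stream, normalize=nfc):
    """Normalize only the text between [...] blocks, leaving the blocks untouched."""
    out, text, in_blank = [], [], False
    i = 0
    while i < len(stream):
        ch = stream[i]
        if ch == "\\" and i + 1 < len(stream):
            # Backslash escapes are copied through as a pair and never open/close a blank.
            chunk = stream[i:i + 2]
            i += 2
        else:
            chunk = ch
            i += 1
            if ch == "[" and not in_blank:
                # Flush the pending text through the normalizer before the blank starts.
                out.append(normalize("".join(text)))
                text = []
                in_blank = True
                out.append(chunk)
                continue
            if ch == "]" and in_blank:
                in_blank = False
                out.append(chunk)
                continue
        (out if in_blank else text).append(chunk)
    out.append(normalize("".join(text)))
    return "".join(out)


# The class name keeps its combining character; the translatable text is composed.
print(normalize_outside_superblanks('[<p class="o\u0302">]o\u0302[</p>]'))
```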
Relevant IRC log: https://tinodidriksen.com/pisg/freenode/logs/%23apertium/2021-02-11.log
And here's a helper script I have for a similar task: https://gist.github.com/TinoDidriksen/aa6b8047e26fb6876b4b9f90c51988f3
It seems to me that a good portion of apertium IRC traffic is people checking on Unicode character variants like:
I think this is something that the tools should take care of somehow. I'd suggest NFC normalization for all input, perhaps with a warning in compiler-type tools. NFC is the nicest for most FSA letter automata. If agreed, this might be a good starter task for GSoC candidates?
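As a sketch of the "warning in compiler-type tools" part, something like the following Python check could flag dictionary entries that are not already in NFC. The function name and the sample entries are hypothetical; `unicodedata.is_normalized` needs Python 3.8 or newer.

```python
import sys
import unicodedata


def find_non_nfc(entries):
    """Return (index, entry) pairs for entries that are not in NFC."""
    return [
        (i, entry)
        for i, entry in enumerate(entries)
        if not unicodedata.is_normalized("NFC", entry)
    ]


# Hypothetical entries: the second one uses o + COMBINING CIRCUMFLEX.
entries = ["h\u00f4tel", "ho\u0302tel", "hotel"]

for index, entry in find_non_nfc(entries):
    names = ", ".join(unicodedata.name(c, "?") for c in entry)
    print(f"warning: entry {index} is not NFC-normalized: {names}", file=sys.stderr)
```

Note that this flags the ANGSTROM SIGN too, since plain NFC rewrites it; a tool built on a more conservative normalization would need its own check.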