Open unhammer opened 3 years ago
Hi! Thank you so much for your suggestions. This is becoming challenging for me. To be sincere, my technical skills are limited and I don't really know what I can do.
Best, Fatouma
On Wed, Apr 14, 2021 at 2:54 PM Kevin Brubeck Unhammer wrote:
$ echo "kũuni kũuni" | apertium -d . mos-morph
^kũuni/kũuni$ ^kũuni/*kũuni$
The first one is a single codepoint: x169 LATIN SMALL LETTER U WITH TILDE; the second is two codepoints: x75 LATIN SMALL LETTER U followed by x303 COMBINING TILDE.
The .dix file has an entry for the single-codepoint version, so we get an analysis for only that one.
.acx doesn't help here since it's two codepoints.
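To see the difference between the two spellings for yourself, here is a small Python sketch (not part of the original report; the helper name `describe` is made up for illustration). It lists the codepoints in each form of the character and shows that the two strings do not compare equal, which is why a dictionary lookup keyed on one form misses the other.

```python
import unicodedata


def describe(text: str) -> list[str]:
    """Return one 'U+XXXX NAME' string per codepoint in text."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]


composed = "\u0169"     # single precomposed codepoint
decomposed = "u\u0303"  # base letter followed by a combining tilde

print(describe(composed))
print(describe(decomposed))
print(composed == decomposed)  # the raw strings differ even though they render alike
```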
Possible solutions:
- use a pardef for every single tilde-entry in the .dix file – simple, but very ugly: <i>k</i><par n="ũ"/><i>un</i><par n="kũun/i__n"/>
- use some hfst-trickery to do basically the same thing on compile – slightly more complicated, but a one-time job for someone who knows how
- change lttoolbox to treat them equivalently – big job, but everyone wins
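As a rough illustration of the first option, the sketch below generates what a pardef named "ũ" might contain. The thread only shows an entry that references such a pardef, so the contents here are an assumption: one entry for the precomposed spelling, plus an analysis-only (`r="LR"`) entry for the decomposed spelling, both mapping to the precomposed form on the lexical side. The function name `tilde_pardef` is hypothetical.

```python
import unicodedata
import xml.etree.ElementTree as ET


def tilde_pardef(char: str) -> ET.Element:
    """Build a <pardef> that accepts both spellings of char (assumed layout)."""
    composed = unicodedata.normalize("NFC", char)
    decomposed = unicodedata.normalize("NFD", char)
    pardef = ET.Element("pardef", {"n": composed})
    # The decomposed spelling is restricted to analysis so generation stays unambiguous.
    for surface, attrs in ((composed, {}), (decomposed, {"r": "LR"})):
        entry = ET.SubElement(pardef, "e", attrs)
        pair = ET.SubElement(entry, "p")
        ET.SubElement(pair, "l").text = surface   # surface side
        ET.SubElement(pair, "r").text = composed  # lexical side
    return pardef


pardef = tilde_pardef("\u0169")
print(ET.tostring(pardef, encoding="unicode"))
```

Every vowel with a tilde would need its own pardef of this kind, and every entry would have to be split around it, which is what makes this option ugly.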
@fatkab @ftyers thoughts?
> - use some hfst-trickery to do basically the same thing on compile – slightly more complicated, but a one-time job for someone who knows how

You can use hfst-substitute to replace the pre-composed characters with an automaton containing the disjunction of both spellings, but it's a lot of hacking around.

> - change lttoolbox to treat them equivalently – big job, but everyone wins

The easiest thing is to use a spellrelax-type script, e.g. this one for Basaa.
As a stop-gap measure, I've added a normalizing morph mode, mos-nmorph, in 73cc4b4 that uses uconv -x any-nfc.
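For anyone without uconv at hand, the same kind of normalization can be sketched in Python with the standard `unicodedata` module. This is only an illustration of NFC normalization applied before analysis, not what the mos-nmorph mode actually runs; the function name `to_nfc` is made up.

```python
import unicodedata


def to_nfc(text: str) -> str:
    """Compose base letters and combining marks into precomposed codepoints."""
    return unicodedata.normalize("NFC", text)


decomposed_word = "ku\u0303uni"  # u + COMBINING TILDE
composed_word = "k\u0169uni"     # precomposed LATIN SMALL LETTER U WITH TILDE

# After normalization both spellings are the same string, so a dictionary
# that only lists the precomposed form can match either input.
print(to_nfc(decomposed_word) == composed_word)
print(to_nfc(composed_word) == composed_word)
```

Normalizing the input leaves the dictionary untouched, which is why it works as a stop-gap, but any other tool that reads the raw text still sees the decomposed form.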