handling both combined and non-combined characters equivalently

unhammer commented 3 years ago

$ echo "kũuni kũuni" | apertium -d . mos-morph
^kũuni/kũuni<n><sg>$ ^kũuni/*kũuni$

The first one is a single codepoint: x169 LATIN SMALL LETTER U WITH TILDE, the second is two codepoints: x75 LATIN SMALL LETTER U composed with x303 COMBINING TILDE.
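To make the difference visible, here is a minimal Python sketch that lists the codepoints of both spellings using the standard `unicodedata` module. The helper name `describe` and the two variables are only for illustration; they are not part of apertium or lttoolbox.

```python
import unicodedata

def describe(text):
    """Return (hex codepoint, Unicode name) for each character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]

composed = "k\u0169uni"      # ũ as a single codepoint
decomposed = "ku\u0303uni"   # u followed by COMBINING TILDE

print(describe(composed)[1])      # the precomposed letter
print(describe(decomposed)[1:3])  # base letter plus combining mark
print(composed == decomposed)     # the strings differ codepoint by codepoint
print(len(composed), len(decomposed))
```

The two strings render identically but compare as unequal, which is why a dictionary lookup keyed on one spelling misses the other.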

The .dix file has an entry for the single-codepoint version, so we get an analysis for only that one.

.acx doesn't help here since it's two codepoints.

Possible solutions:

- use a pardef for every single tilde-entry in the .dix file – simple, but very ugly: <i>k</i><par n="ũ"/><i>un</i><par n="kũun/i__n"/>
- use some hfst-trickery to do basically the same thing on compile – slightly more complicated, but a one-time job for someone who knows how
- change lttoolbox to treat them equivalently – big job, but everyone wins

@fatkab @ftyers thoughts?

fatkab commented 3 years ago

Hi! Thank you so much for your suggestions. This is becoming challenging. To be sincere, my technical skills are limited and I don't really know what I can do.

Best, Fatouma

flammie commented 3 years ago

> use some hfst-trickery to do basically the same thing on compile – slightly more complicated, but a one-time job for someone who knows how

You can use hfst-substitute to replace pre-composed characters with an automaton containing the disjunction, but it's a lot of hacking around.
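As a rough illustration of what "the disjunction" means here, the following Python sketch expands a word into every mix of precomposed and decomposed spellings. It does not call hfst-substitute or any HFST API; the function names `spellings` and `variants` are hypothetical, and the sketch only shows which strings such an automaton would need to accept.

```python
import itertools
import unicodedata

def spellings(ch):
    """Return the canonically equivalent spellings of one character."""
    return {unicodedata.normalize("NFC", ch), unicodedata.normalize("NFD", ch)}

def variants(word):
    """Expand a word into every composed/decomposed combination."""
    # Normalize first so each accented letter is a single codepoint to expand.
    word = unicodedata.normalize("NFC", word)
    choices = [sorted(spellings(ch)) for ch in word]
    return {"".join(combo) for combo in itertools.product(*choices)}

for form in sorted(variants("k\u0169uni")):
    print([f"U+{ord(ch):04X}" for ch in form])
```

For a word with a single accented letter this yields two forms; the count doubles with each further letter that has a distinct decomposition, which gives a sense of why doing this by hand in the dictionary gets ugly.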

> change lttoolbox to treat them equivalently – big job, but everyone wins

https://github.com/apertium/organisation/issues/24

ftyers commented 3 years ago

The easiest thing is to use a spellrelax-type script, e.g. this one for Basaa.

mr-martian commented 3 years ago

As a stop-gap measure, I've added a normalizing morph mode mos-nmorph in 73cc4b4 that uses uconv -x any-nfc.
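For readers without uconv at hand, the effect of NFC normalization on the input can be sketched in Python with the standard `unicodedata` module. This is only an approximation of the idea, not what the mos-nmorph mode actually runs; the function name `to_nfc` is made up for this example.

```python
import unicodedata

def to_nfc(text):
    """Normalize text to NFC so combining sequences become precomposed."""
    return unicodedata.normalize("NFC", text)

# One precomposed and one decomposed spelling of the same word.
raw = "k\u0169uni ku\u0303uni"
clean = to_nfc(raw)

first, second = clean.split()
print(first == second)  # both words now use the single-codepoint ũ
print(len(raw), len(clean))
```

Applied to each line of input before analysis, this would let a dictionary that only contains the single-codepoint spelling match both forms.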

apertium / apertium-mos

handling both combined and non-combined characters equivalently #2