apertium / apertium-mos

Apertium linguistic data for Mossi
GNU General Public License v3.0

handling both combined and non-combined characters equivalently #2

Open unhammer opened 3 years ago

unhammer commented 3 years ago
$ echo "kũuni kũuni" | apertium -d . mos-morph
^kũuni/kũuni<n><sg>$ ^kũuni/*kũuni$

The first one is a single codepoint: x169 LATIN SMALL LETTER U WITH TILDE, the second is two codepoints: x75 LATIN SMALL LETTER U composed with x303 COMBINING TILDE.
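To make the difference visible, here is a minimal Python sketch that lists the codepoints of both spellings. The variable names and the `describe` helper are made up for illustration; only the standard `unicodedata` module is used.

```python
import unicodedata

# The two spellings from the example above, written with explicit escapes
# so that an editor or terminal cannot silently normalize them.
precomposed = "k\u0169uni"   # U+0169 LATIN SMALL LETTER U WITH TILDE
decomposed = "ku\u0303uni"   # U+0075 followed by U+0303 COMBINING TILDE


def describe(text):
    """Return one 'U+XXXX NAME' string per codepoint in text (illustrative helper)."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]


for label, word in (("precomposed", precomposed), ("decomposed", decomposed)):
    print(label, len(word), describe(word))

# The strings render the same but compare as different codepoint sequences,
# which is why a dictionary lookup on one does not match the other.
print(precomposed == decomposed)
```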

The .dix file has an entry for the single-codepoint version, so we get an analysis for only that one.

.acx doesn't help here, since the second form is two codepoints.

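Possible solutions:

  • use a pardef for every single tilde-entry in the .dix file – simple, but very ugly: kun
  • use some hfst-trickery to do basically the same thing on compile – slightly more complicated, but a one-time job for someone who knows how
  • change lttoolbox to treat them equivalently – big job, but everyone wins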

@fatkab @ftyers thoughts?

fatkab commented 3 years ago

Hi! Thank you so much for your suggestions. This is becoming challenging. To be sincere, my technical skills are limited and I don't really know what I can do.

Best, Fatouma

On Wed, Apr 14, 2021 at 2:54 PM Kevin Brubeck Unhammer wrote:

Possible solutions:

  • use a pardef for every single tilde-entry in the .dix file – simple, but very ugly: kun
  • use some hfst-trickery to do basically the same thing on compile – slightly more complicated, but a one-time job for someone who knows how
  • change lttoolbox to treat them equivalently – big job, but everyone wins

@fatkab https://github.com/fatkab @ftyers https://github.com/ftyers thoughts?


flammie commented 3 years ago
  • use some hfst-trickery to do basically the same thing on compile – slightly more complicated, but a one-time job for someone who knows how

You can use hfst-substitute to replace pre-composed characters with an automaton containing the disjunction, but it's a lot of hacking around.

  • change lttoolbox to treat them equivalently – big job, but everyone wins

https://github.com/apertium/organisation/issues/24

ftyers commented 3 years ago

The easiest thing is to use a spellrelax-type script, e.g. this one for Basaa.
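As a rough illustration of what such a script has to accept, the Python sketch below enumerates every mix of precomposed and decomposed spellings for one word. It is not the Basaa script and not how a spellrelax transducer is actually written; `spelling_variants` is a hypothetical helper, and it only shows the set of surface forms that should map to the same dictionary entry.

```python
import itertools
import unicodedata


def spelling_variants(word):
    """Return every mix of precomposed and decomposed forms of word.

    Hypothetical helper: each character of the NFC form is offered either
    as-is or in its NFD decomposition, and all combinations are produced.
    """
    options = []
    for ch in unicodedata.normalize("NFC", word):
        decomposed = unicodedata.normalize("NFD", ch)
        # Plain letters have no decomposition, so they contribute one option.
        options.append((ch,) if decomposed == ch else (ch, decomposed))
    return {"".join(parts) for parts in itertools.product(*options)}


variants = spelling_variants("k\u0169uni")
for variant in sorted(variants, key=len):
    print(len(variant), [f"U+{ord(ch):04X}" for ch in variant])
```

The number of variants doubles with each accented letter in a word, which is one reason to express the alternation once as a rule over characters rather than listing variants per entry.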

mr-martian commented 3 years ago

As a stop-gap measure, I've added a normalizing morph mode mos-nmorph in 73cc4b4 that uses uconv -x any-nfc.
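The effect of NFC normalization on the example input can be sketched in Python with the standard `unicodedata` module. This is a stand-in for illustration, not the contents of the mos-nmorph mode; `to_nfc` is a made-up name.

```python
import unicodedata


def to_nfc(text):
    """Compose base letters and combining marks where a precomposed form exists."""
    return unicodedata.normalize("NFC", text)


# First word uses U+0169, second uses U+0075 followed by U+0303.
line = "k\u0169uni ku\u0303uni"
normalized = to_nfc(line)
first, second = normalized.split()

# After normalization both words are the same codepoint sequence, so both
# can match the single-codepoint entry in the dictionary.
print(first == second)
```

Normalizing the input only helps while the dictionary itself is stored in NFC; an entry typed with a combining tilde would need the same treatment.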