apertium / apertium-mar

Apertium linguistic data for Marathi
GNU General Public License v3.0

Zero-width [non-]joiners #2

Open shardulc opened 6 years ago

shardulc commented 6 years ago

The word निश्चित is apparently in the monodix:

<e lm="निश्चित">            <i>निश्चित</i><par n="/लाल__adj" /></e>

But when I tried to analyze a newspaper article, the word came back as unknown:

^निश्‍चित/*निश्‍चित$

Turns out that the monodix has श + ् + च + ि while the article has श + ् + \u200d + च + ि where \u200d is a zero-width joiner that forces श्‍च instead of the ligature श्च. This is also used in that same article for क्ल/क्‍ल, क्त/क्‍त, etc.
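
Because the joiner is invisible, the two spellings are easiest to tell apart by dumping their code points. A minimal sketch in Python, using only the standard library; the helper name `codepoints` is made up for illustration, and the strings are built from escapes so the joiner cannot get lost in copy and paste:

```python
import unicodedata

def codepoints(word):
    """Return (hex code point, Unicode name) pairs for each character in word."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN")) for ch in word]

# श, virama, च and the zero-width joiner, written as escapes.
SHA, VIRAMA, CA, ZWJ = "\u0936", "\u094d", "\u091a", "\u200d"

dictionary_form = SHA + VIRAMA + CA        # spelling in the monodix
article_form = SHA + VIRAMA + ZWJ + CA     # spelling in the newspaper article

for label, form in (("monodix", dictionary_form), ("article", article_form)):
    print(label, codepoints(form))
```

The article form should show an extra `U+200D` entry between the virama and the following consonant.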

@ftyers We were recently talking about this in #apertium but I didn't quite get the recommended solution. How do we handle this? Would it help to make an exhaustive list of ligatures where the ZWJ is used, or can we somehow just ignore that character?

ftyers commented 6 years ago

If it's in the stem, then something like:

<e lm="निश्चित" r="LR">     <p><l>निश्‍चित</l><r>निश्चित</r></p><par n="/लाल__adj" /></e>

Unfortunately it isn't possible to just "ignore" a character unless it's done in the lookup code.
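
One workaround outside the lookup code would be to strip the joiners from the input text before it reaches the analyser. This is only a sketch of that idea, not something Apertium provides; the function name `strip_joiners` is hypothetical, and it assumes that dropping every ZWJ and ZWNJ is acceptable for the text being analysed:

```python
# Assumption: every ZWJ (U+200D) and ZWNJ (U+200C) in the input can be dropped.
# Mapping a code point to None makes str.translate delete it.
JOINERS = {0x200D: None, 0x200C: None}

def strip_joiners(text):
    """Return text with zero-width joiners and non-joiners removed."""
    return text.translate(JOINERS)

# श + virama + ZWJ + च becomes श + virama + च, which matches the monodix spelling.
with_zwj = "\u0936\u094d\u200d\u091a"
print(strip_joiners(with_zwj) == "\u0936\u094d\u091a")
```

The trade-off is that the joiners are lost for good: the analysed text no longer records which rendering the author asked for, and any place where a joiner is meaningful would be flattened as well.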

shardulc commented 6 years ago

Thanks. I've added the ZWJ forms for everything involving श्च, क्ल, क्त in 5615280. Keeping this issue open because there might be more.
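
For anyone extending the list, entries of the shape suggested above could be generated rather than typed by hand. The following is a rough sketch under stated assumptions, not the method used for that commit: the names `zwj_variant` and `lr_entry` are made up, the conjunct list covers only the three clusters named in this thread, and only the variant with a ZWJ in every listed cluster is produced.

```python
from xml.sax.saxutils import escape, quoteattr

ZWJ = "\u200d"

# Conjuncts named in this thread: श्च, क्ल, क्त (consonant + virama + consonant).
CONJUNCTS = ["\u0936\u094d\u091a", "\u0915\u094d\u0932", "\u0915\u094d\u0924"]

def zwj_variant(stem):
    """Insert a ZWJ after the virama of every listed conjunct in stem."""
    for conjunct in CONJUNCTS:
        stem = stem.replace(conjunct, conjunct[:2] + ZWJ + conjunct[2:])
    return stem

def lr_entry(lemma, stem, paradigm):
    """Return an analysis-only (r="LR") entry mapping the ZWJ spelling to stem,
    or None if the stem contains none of the listed conjuncts."""
    variant = zwj_variant(stem)
    if variant == stem:
        return None
    return (
        f'<e lm={quoteattr(lemma)} r="LR">'
        f"<p><l>{escape(variant)}</l><r>{escape(stem)}</r></p>"
        f"<par n={quoteattr(paradigm)}/></e>"
    )

# निश्चित, spelled out as escapes, with the paradigm name from the entry above.
NISHCHIT = "\u0928\u093f\u0936\u094d\u091a\u093f\u0924"
print(lr_entry(NISHCHIT, NISHCHIT, "/\u0932\u093e\u0932__adj"))
```

A stem with several clusters would need mixed variants too if articles use the joiner inconsistently; this sketch does not attempt that.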