apertium / apertium-tat

Apertium linguistic data for Tatar
GNU General Public License v3.0

Ул/Бул #13

Open mansayk opened 5 years ago

mansayk commented 5 years ago

Is "Ул/бул" parsed correctly here?

echo "Ул ташламас сине." | apertium-destxt -n | lt-proc -z -w 'apertium-tat/tat.automorf.bin' | cg-proc -z 'apertium-tat/tat.rlx.bin' | apertium-retxt

^Ул/бул<v><tv><imp><p2><sg>/ул<prn><pers><p3><sg><nom>/бул<v><iv><imp><p2><sg>/ул<prn><dem><nom>$ ^ташламас/ташла<v><tv><neg><gpr_fut>/ташла<v><tv><neg><fut><p3><sg>$ ^сине/син<prn><pers><p2><sg><acc>$^./.<sent>$

IlnarSelimcan commented 5 years ago

The бул<v><iv> analysis was added to handle the 19th-century corpus texts I've been working on, where улмак = булмак shows up quite frequently. I think the way to go here is to mark all such archaic words with a flag and prune them at compile time, unless the user passes a compilation flag that keeps them.
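
A minimal sketch of that flag-and-prune idea, not taken from the apertium-tat build. The marker name `Use/Archaic`, the lexicon lines, and the continuation class names are all hypothetical and only illustrate the filtering step that would run before compilation:

```python
# Hypothetical flag: not an existing Apertium convention.
ARCHAIC_MARKER = "Use/Archaic"


def prune_archaic(lines, keep_archaic=False):
    """Return lexicon lines, dropping flagged ones unless keep_archaic is set."""
    if keep_archaic:
        return list(lines)
    return [line for line in lines if ARCHAIC_MARKER not in line]


# Illustrative entries only; the real lexicon format and class names may differ.
lexicon = [
    "бул:бул V-IV ; ! to be",
    "бул:ул V-IV ; ! Use/Archaic (улмак = булмак)",
    "ул:ул PRN-PERS ; ! he/she/it",
]

default_build = prune_archaic(lexicon)
archaic_build = prune_archaic(lexicon, keep_archaic=True)
```

With the default setting the archaic ул → бул entry is dropped, so "Ул" would no longer receive a бул verb reading. With `keep_archaic=True` the lexicon is passed through unchanged for work on the 19th-century texts.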

mansayk commented 5 years ago

I understand. Unfortunately, in my case this one is even chosen after disambiguation.

jonorthwash commented 5 years ago

@IlnarSelimcan, it would probably be fairly straightforward to write a disambiguation rule to deal with some of these.
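
To illustrate what such a rule might do, here is a rough Python sketch that post-processes the Apertium stream format shown above. It is not a CG-3 rule and not part of apertium-tat. The heuristic itself is an assumption: when the surface form "ул" has a pronoun reading, drop the competing imperative readings of бул. A real rule would live in the constraint grammar file and would need context conditions so that genuine imperatives survive.

```python
import re

# One lexical unit: ^surface/reading1/reading2...$
UNIT_RE = re.compile(r"\^([^/$]*)((?:/[^/$]*)+)\$")


def drop_archaic_bul(stream):
    """Remove imperative бул readings from 'ул' when a pronoun reading exists."""

    def fix(match):
        surface = match.group(1)
        readings = match.group(2).split("/")[1:]
        has_pronoun = any("<prn>" in r for r in readings)
        if surface.lower() == "ул" and has_pronoun:
            kept = [
                r for r in readings
                if not (r.startswith("бул<v>") and "<imp>" in r)
            ]
            # Never leave a unit without readings.
            if kept:
                readings = kept
        return "^" + surface + "/" + "/".join(readings) + "$"

    return UNIT_RE.sub(fix, stream)


analysed = (
    "^Ул/бул<v><tv><imp><p2><sg>/ул<prn><pers><p3><sg><nom>"
    "/бул<v><iv><imp><p2><sg>/ул<prn><dem><nom>$ "
    "^сине/син<prn><pers><p2><sg><acc>$"
)
print(drop_archaic_bul(analysed))
```

Applied to the first unit of the example sentence, this leaves only the two ул<prn> readings and passes the other units through untouched.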

Alternatively, sometimes it can make sense to just treat things like бул-/ул- as synonyms, and deal with them as such in later stages for translation.