apertium / apertium-kir

Apertium linguistic data for Kyrgyz
GNU General Public License v3.0

Transliteration to and from the Arabic script #13

Open alexeyev opened 5 months ago

alexeyev commented 5 months ago

Dear colleagues,

thank you for your work! Judging by the paper "Multi-script morphological transducers and transcribers for seven Turkic languages", this transducer can be used for transliteration between the Cyrillic and Arabic scripts.

If that is the case, may I ask you to share some instructions or at least some entry points?

Thank you in advance!

Best regards, Anton.

jonorthwash commented 5 months ago

Hi Anton!

There are a couple of ways to do this. Both involve compiling apertium-kir and then running make in dev/ortho. Then:

$ echo "кыргыз" | hfst-lookup cyr-ara.hfst
hfst-lookup: warning: It is not possible to perform fast lookups with OpenFST, std arc, tropical semiring format automata.
Using HFST basic transducer format and performing slow lookups
> кыргыз        قىرعىز  6.000000
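To use that output in a script, the result lines can be parsed and the lowest-weight candidate kept. This is a minimal sketch, not part of apertium-kir: it assumes each result line has three whitespace-separated fields (input, output, weight) as in the transcript above, possibly preceded by the "> " prompt, and that neither form contains spaces. The function name is hypothetical.

```python
def best_lookup_result(lines):
    """Return the lowest-weight (input, output, weight) triple, or None.

    Assumes hfst-lookup result lines look like "input output weight";
    warnings, blank lines, and anything else are skipped.
    """
    best = None
    for line in lines:
        fields = line.lstrip("> ").split()
        if len(fields) != 3:
            continue  # not a result line
        try:
            weight = float(fields[2])
        except ValueError:
            continue  # third field is not a weight
        if best is None or weight < best[2]:
            best = (fields[0], fields[1], weight)
    return best
```

In practice the lines would come from the captured standard output of the hfst-lookup command shown above.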

Or

$ hfst-fst2fst -Oo kir@Cyrl-kir@Arab.hfst cyr-ara.hfst
$ echo "кыргыз тили" | hfst-proc kir@Cyrl-kir@Arab.hfst
^кыргыз/قىرعىز$ ^тили/تئلئ/تئلى/تىلئ/تىلى$
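For downstream processing, the hfst-proc output above can be split into each surface form and its candidate transliterations. This is a minimal sketch, not part of apertium-kir: it assumes every unit has the form ^surface/candidate1/candidate2$ as in the example, and that the forms themselves contain no "/", "^" or "$" characters. The function name is hypothetical.

```python
import re

# One hfst-proc unit: ^surface/candidate1/candidate2$
UNIT = re.compile(r"\^([^/$^]*)((?:/[^/$^]*)*)\$")


def parse_hfst_proc(line):
    """Return a list of (surface, [candidates]) pairs from one output line."""
    pairs = []
    for match in UNIT.finditer(line):
        surface = match.group(1)
        # group(2) looks like "/cand1/cand2"; drop the empty leading piece
        candidates = [c for c in match.group(2).split("/") if c]
        pairs.append((surface, candidates))
    return pairs


output = "^кыргыз/قىرعىز$ ^тили/تئلئ/تئلى/تىلئ/تىلى$"
for surface, candidates in parse_hfst_proc(output):
    print(surface, len(candidates), " ".join(candidates))
```

Where a word has several candidates, as "тили" does here, choosing among them is left to the caller; the sketch only exposes the ambiguity.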

I can work on a slightly more user-friendly approach using an apertium mode (as I believe apertium-kaz has). Also note that it's currently set up for accepting Perso-Arabic script, not generating it accurately, so some additional fine-tuning of the mapping rules to Cyrillic would be needed if this is the use you plan for it. Let me know if that'd be helpful (and also feel free to contribute yourself).

alexeyev commented 4 months ago

Dear @jonorthwash, thank you for such a swift response!

I've read a paper on AgglutiFit and realized that an open-source tool for converting Perso-Arabic script into Cyrillic would be useful both for those interested in the Kyrgyz language in general and for NLP research. I found only online tools (web services), tried implementing something myself, and only then realized that the approach from your paper must be a perfect fit.

However, my curiosity is not yet rooted in any particular research project, so there is no rush (at least for me) to make the transliterator even more user-friendly.

I'll try out the scripts and instructions that you have kindly provided as soon as possible and will get back to you, if that is OK.