I added an Arabic-Latin transducer and modified the Arabic-Cyrillic transducer to fix a few issues. I may have done some things in suboptimal ways out of lack of experience with hfst, so please correct me!
I think things basically work fine when they plug into the morphological transducer, but there are a few little issues when the transducers are inverted:
The hamza rules for both Ara-Lat and Ara-Cyr are intended to enforce mandatory hamza insertion when going from Lat/Cyr -> Ara. E.g., always insert a hamza word-initially before a vowel. In practice, the inverted transducer spits out forms both with and without the hamza:
Here the apostrophe between n and g should be removed, but this removal is again treated as optional and it's subsequently replaced by a hamza. (Uyghur Latin script uses apostrophes both to represent (some) hamzas and to disambiguate sequences of n and g from the single sound ng.)
This is basically fine because one of them is the correct form (and the hamzas are removed by the morphological transducer anyways), but it would be nice to be able to rule out the hamza-less forms completely. Or is this even worth worrying about?
Both Arabic and Latin use bigrams that correspond to unigrams in Cyrillic/Arabic respectively. Running hfst-proc on the transducers produces the warning:
!! Warning: Transducer contains one or more multi-character symbols made up of
ASCII characters which are also available as single-character symbols. The
input stream will always be tokenised using the longest symbols available.
Use the -t option to view the tokenisation. The problematic symbol(s):
يا يۇ
This isn't really a problem (the longest tokenization is what we want), but it's sort of obnoxious. Any way to set things up differently?
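For readers unfamiliar with what the warning means by "the longest symbols available", here is a minimal Python sketch of greedy longest-match tokenisation. It is only an illustration of the idea, not hfst-proc's actual implementation; the function name `tokenise_longest` and the tiny symbol set are assumptions made up for this example.

```python
# Illustrative only: a greedy longest-match tokeniser, not hfst-proc's code.
YA = "\u064a\u0627"  # the two-character symbol from the warning (yeh + alef)
YU = "\u064a\u06c7"  # the two-character symbol from the warning (yeh + u)

# Hypothetical symbol table: the two bigrams plus their single characters.
SYMBOLS = {YA, YU, "\u064a", "\u0627", "\u06c7"}


def tokenise_longest(text, symbols):
    """Split text left to right, always taking the longest known symbol."""
    max_len = max(len(s) for s in symbols)
    tokens = []
    i = 0
    while i < len(text):
        # Try the widest window first, then shrink down to one character.
        for width in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + width]
            # A single character is always accepted as a fallback.
            if width == 1 or chunk in symbols:
                tokens.append(chunk)
                i += width
                break
    return tokens


tokens = tokenise_longest(YA + YU, SYMBOLS)
print(len(tokens))  # the four characters come out as two bigram tokens
```

With this strategy the bigram always wins over its component characters, which matches the behaviour the warning describes.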
Reversing a given transliterator will in theory work if it's implemented extremely precisely, but as you found out, there are limitations. The way I would do it is to just write a fresh transliterator for the other direction.
It's best to treat everything as an individual symbol, even if they're technically bigrams. This can mean multiple rules to deal with one thing, but that's okay.
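As a rough illustration of the two suggestions above (a separate transliterator for the Lat -> Ara direction, working one character at a time), here is a small Python sketch. It is not hfst code and not the transducer in this repository: the mapping table covers only a handful of letters, the function name `lat_to_ara` is made up, and treating every apostrophe that is not between n and g as a hamza is a simplifying assumption.

```python
# Illustrative sketch only; a deliberately tiny, hypothetical letter subset.
HAMZA = "\u0626"  # hamza on yeh, used before word-initial vowels
NG = "\u06ad"     # single Arabic-script letter for the sound "ng"

LAT2ARA = {
    "a": "\u0627",
    "l": "\u0644",
    "m": "\u0645",
    "n": "\u0646",
    "g": "\u06af",
}
VOWELS = {"a"}  # assumption: only the vowels present in LAT2ARA


def lat_to_ara(word):
    """Transliterate one Latin-script word, one character at a time."""
    out = []
    i = 0
    while i < len(word):
        ch = word[i]
        if ch == "'":
            between_n_g = word[i - 1:i] == "n" and word[i + 1:i + 2] == "g"
            if not between_n_g:
                # Simplifying assumption: any other apostrophe is a hamza.
                out.append(HAMZA)
            # Between n and g the apostrophe only disambiguates: drop it.
            i += 1
            continue
        if ch == "n" and word[i + 1:i + 2] == "g":
            # "ng" with no apostrophe is the single sound.
            out.append(NG)
            i += 2
            continue
        if i == 0 and ch in VOWELS:
            # Mandatory, not optional: exactly one output per input.
            out.append(HAMZA)
        out.append(LAT2ARA[ch])
        i += 1
    return "".join(out)


print(lat_to_ara("alma"))
```

Because each branch emits exactly one result, there is no hamza-less variant to filter out afterwards, which is the main appeal of writing the direction by hand rather than inverting.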