Best way to get a delimited transliteration with XSampa

ruohoruotsi commented 2 years ago

Hi @dmort27, thanks for your work on epitran. I'm using it for Yorùbá g2p to generate XSampa spellings. It's brilliant! Below is my usage, with a word that has underdot diacritics and tone marks.

import epitran
from epitran.xsampa import XSampa

epi = epitran.Epitran('yor-Latn')

# 'ọọ̀run' => IPA string: 'ɔɔ̀rũ'
ipa_word = epi.transliterate("ọọ̀run")

xs = XSampa()
xsampa_word = xs.ipa2xs(ipa_word)  # 'OO_Lru~'

This works great. However, I need my spelling space-delimited, like O O_L r u~.

So I tried:

1. trans_delimiter works, but only on IPA: if I pass that space-delimited IPA string to xs.ipa2xs, the delimiter gets removed (see 3. below).
2. So I tried xsampa_list, which looks like what I want, but unfortunately, as someone else has noted in this issue, this function uses strict_trans, which throws out all the tonal information 😬
3. Digging into the XSampa class, I see that, ultimately, line 67 is return ''.join(xsampa), so there is no chance of delimiting unless I modify this function to support an optional delimiter to join the list with.
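For illustration, the optional-delimiter idea could look roughly like the sketch below. This is not epitran's code: `IPA_TO_XSAMPA` is a tiny hypothetical stand-in for the library's mapping table, and `ipa_to_xsampa` is an illustrative name. It assumes a greedy longest-prefix match over NFD-normalized IPA, with the resulting tokens joined by a caller-supplied delimiter.

```python
import unicodedata

# Hypothetical, tiny stand-in for a real IPA-to-X-SAMPA table (assumption).
IPA_TO_XSAMPA = {
    "\u0254": "O",   # open-mid back rounded vowel
    "r": "r",
    "u": "u",
    "\u0300": "_L",  # combining grave accent: low tone
    "\u0301": "_H",  # combining acute accent: high tone
    "\u0303": "~",   # combining tilde: nasalization
}
MAX_KEY_LEN = max(len(key) for key in IPA_TO_XSAMPA)


def ipa_to_xsampa(ipa, delimiter=""):
    """Greedy longest-prefix conversion; tokens are joined with `delimiter`."""
    ipa = unicodedata.normalize("NFD", ipa)
    tokens = []
    i = 0
    while i < len(ipa):
        # Try the longest candidate first, then progressively shorter ones.
        for size in range(min(MAX_KEY_LEN, len(ipa) - i), 0, -1):
            chunk = ipa[i:i + size]
            if chunk in IPA_TO_XSAMPA:
                tokens.append(IPA_TO_XSAMPA[chunk])
                i += size
                break
        else:
            i += 1  # no mapping for this character: skip it
    return delimiter.join(tokens)
```

With this toy table, an empty delimiter reproduces 'OO_Lru~' for the example word, while a space delimiter gives 'O O _L r u ~': the tone and nasalization marks come out as separate tokens, so a plain join is not enough to get O O_L r u~.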

Sooooooo before I get too excitable and start hacking up stuff, I wanted to ask you if there's something I've missed or misused in order to get "space delimited" XSampa phonetic spellings? Thank you in advance 🙏

dmort27 commented 2 years ago

Closed by accident. Looking into this. It would be easy to make a method like xsampa_list that does not use strict_trans. But I think that fixing the tonal support for Yoruba would address this problem.

ruohoruotsi commented 2 years ago

Thanks for responding so quickly! Do you mean fixing tonal support with strict_trans for Yorùbá? Cos tonal support works great with regular transliterate. In my example above (sorry if this is dead obvious) O O_L r u~ captures the low-tone on the second o perfectly and as far as I can tell does a consistently good job on large sections of my lexicon.

ruohoruotsi commented 2 years ago

I need to read about what strict_trans is doing, to get more context on how to fix 🤔 👍 🙏

dmort27 commented 2 years ago

Sorry to have let this drop. I'll try to look into it in the next day or two. It shouldn't be too hard to fix, but there is a technical problem to be addressed with regard to the tones.

ruohoruotsi commented 2 years ago

Thanks David, I made a workaround by cloning the XSampa class within my g2p and heavily modifying the ipa2xs function. Initially I thought I could get away with a space-delimiting ' '.join(), but in the end I had to handle these special_phones = ['_L', '_H', '~'] (low tone, high tone and nasalization) and ensure they were not spaced out but stayed attached to their "parent" phoneme. So I have a little extra post-processing to ensure these are correctly placed. It's a bit hacky, but it works as a take-1.

The main changes are within the IO HAVOC comments. Incidentally, in testing out my code, I found some dodgy text that needed to be fixed in my corpora (tones without base chars and other oddities).
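The post-processing described in this comment might look something like the sketch below. It is not the author's actual code: `attach_modifiers` is a hypothetical helper name, and silently dropping a modifier that has no preceding phone is an assumption about how orphaned tone marks should be handled.

```python
# Modifier tokens named in the comment: low tone, high tone, nasalization.
SPECIAL_PHONES = ("_L", "_H", "~")


def attach_modifiers(tokens, special=SPECIAL_PHONES):
    """Glue each modifier token onto the phone that precedes it."""
    phones = []
    for token in tokens:
        if token in special:
            if phones:
                phones[-1] += token  # attach to the "parent" phoneme
            # Assumption: a modifier with no base phone is dropped.
        else:
            phones.append(token)
    return phones


# Tokens for the example word, one X-SAMPA symbol per list element.
print(" ".join(attach_modifiers(["O", "O", "_L", "r", "u", "~"])))
```

For the example word this prints O O_L r u~. Logging the orphaned modifiers instead of dropping them would be one way to surface the kind of corpus problems mentioned above.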

ruohoruotsi commented 2 years ago

In any case, I used the generated lexicon from this ☝️ ☝️ & epitran to make a Yorùbá ASR system using Kaldi. It is only GMM-HMM triphone thus far, since I have tiny data, but everything more or less worked. WER is still high 80%, but that is to be expected at this stage.

dmort27 / epitran

Best way to get a delimited transliteration with XSampa #99