dmort27 / epitran

A tool for transcribing orthographic text as IPA (International Phonetic Alphabet)
MIT License

Function xsampa_list() in _epitran.py deletes things a lot #48

Open m-wiesner opened 4 years ago

m-wiesner commented 4 years ago

For instance, in Cebuano:

felix --> [e, l, i]
x --> []

In Swedish:

och --> []

I fixed this (I think) by simply replacing the commented-out line below with the uncommented one. Maybe this is horribly wrong, but it seems to work now.

    # ipa_segs = self.ft.ipa_segs(self.epi.strict_trans(word, normpunc,
    #                                                   ligaturize))
    ipa_segs = self.ft.segs_safe(self.epi.transliterate(word, normpunc, ligaturize))

dmort27 commented 4 years ago

The deletion is by design. The applications for which this method was originally designed required that only segments that were converted from orthography to IPA be present in the X-SAMPA output. The Epitran.strict_trans method does that. Epitran.transliterate allows every character that cannot be mapped to IPA to "pass through" to the output. In noisy data this can produce some unexpected results. For example, many of the output segments will not be valid IPA and cannot be converted to X-SAMPA.
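The difference between the two behaviors can be illustrated with a toy sketch. This is not Epitran's implementation: `TOY_MAP`, `strict_trans_sketch`, and `transliterate_sketch` are hypothetical names, and the mapping is invented so that the output matches the "felix" example reported above. The real methods work on full mapping files and return strings rather than lists.

```python
# Toy grapheme-to-IPA table. Hypothetical: NOT Epitran's real Cebuano mapping.
TOY_MAP = {"e": "e", "l": "l", "i": "i"}


def strict_trans_sketch(word, mapping=TOY_MAP):
    """Keep only the characters that the mapping converts to IPA."""
    return [mapping[ch] for ch in word if ch in mapping]


def transliterate_sketch(word, mapping=TOY_MAP):
    """Convert mapped characters and pass unmapped ones through unchanged."""
    return [mapping.get(ch, ch) for ch in word]


strict = strict_trans_sketch("felix")   # unmapped "f" and "x" are dropped
loose = transliterate_sketch("felix")   # unmapped "f" and "x" pass through
print(strict)
print(loose)
```

In the strict variant the unmapped characters disappear, which is the deletion described in this issue. In the pass-through variant they survive, but nothing guarantees they are valid IPA.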

(I'm confused by the Swedish example, though. This appears to be due to errors in the mapping file.)

If you want something like this, the best solution is to add another method, rather than change the existing one which already does what we want it to do. Submit a pull request and I'll add this.
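A minimal sketch of what "add another method" could look like, using a toy class rather than Epitran's own code. Everything here is an assumption made for illustration: the class `XSampaSketch`, the method name `xsampa_list_passthrough`, and both lookup tables are hypothetical, and a real pull request would build on `Epitran.transliterate` and the library's actual IPA-to-X-SAMPA conversion.

```python
class XSampaSketch:
    """Hypothetical stand-in for the class that owns xsampa_list()."""

    # Toy tables, invented for this sketch (not Epitran data).
    G2P = {"e": "e", "l": "l", "i": "i", "ñ": "ɲ"}
    IPA_TO_XSAMPA = {"e": "e", "l": "l", "i": "i", "ɲ": "J"}

    def xsampa_list(self, word):
        """Existing behavior: characters with no IPA mapping are dropped."""
        ipa = [self.G2P[ch] for ch in word if ch in self.G2P]
        return [self.IPA_TO_XSAMPA[seg] for seg in ipa]

    def xsampa_list_passthrough(self, word):
        """New, separate method: unmapped characters are kept as they are."""
        ipa = [self.G2P.get(ch, ch) for ch in word]
        # Segments without an X-SAMPA equivalent are also passed through.
        return [self.IPA_TO_XSAMPA.get(seg, seg) for seg in ipa]


sketch = XSampaSketch()
print(sketch.xsampa_list("felix"))
print(sketch.xsampa_list_passthrough("felix"))
```

Keeping the strict method untouched means existing callers still get only segments that were converted, while callers who prefer pass-through opt in explicitly.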

m-wiesner commented 4 years ago

Thanks for the answer. I thought it might be something like that at first, but the Swedish example also seemed strange to me.