dmort27 / epitran

A tool for transcribing orthographic text as IPA (International Phonetic Alphabet)
MIT License

Incorrect grapheme-phoneme alignment in word_to_tuple response #44

Closed: mashabelyi closed this issue 4 years ago

mashabelyi commented 4 years ago

Thank you for this great tool! I was hoping to use Epitran to extract the frequencies of grapheme-phoneme alignments in different languages, but I am running into issues with the word_to_tuples and word_to_segs methods.

Here is the output of epi.word_to_tuples for the English word tough:

('L', 0, 't', 't', [('t', <map object at 0x113817c50>)])
('L', 0, 'o', 'ʌ', [('ʌ', <map object at 0x113817250>)])
('L', 0, 'u', 'f', [('f', <map object at 0x1120a06d0>)])
('L', 0, 'g', '', [(-1, [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])])
('L', 0, 'h', '', [(-1, [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])])

Here is the output for choice:

('L', 0, 'c', 't͡ʃ', [('t͡ʃ', <map object at 0x11380cad0>)])
('L', 0, 'h', 'o', [('o', <map object at 0x11380c5d0>)])
('L', 0, 'o', 'j', [('j', <map object at 0x11380cb10>)])
('L', 0, 'i', 's', [('s', <map object at 0x1120a0fd0>)])
('L', 0, 'c', '', [(-1, [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])])
('L', 0, 'e', '', [(-1, [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])])

I'd expect the phone /f/ in tough to correspond to either g or h, and the phone /s/ in choice to correspond to the second c. However, that's not the case: /f/ is paired with u, and /s/ is paired with i. Is this expected behavior or a bug?
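For reference, both outputs above look like what you would get by pairing letters and phones purely by position and padding the leftover letters with empty strings. The sketch below reproduces that pattern in plain Python. It only illustrates the output shown here: `positional_pairs` is a hypothetical helper, not part of Epitran, and this is not a claim about how Epitran builds its tuples internally.

```python
from itertools import zip_longest

def positional_pairs(graphemes, phones):
    """Pair each letter with the phone at the same index (hypothetical helper)."""
    # Pad the shorter sequence with '' so every item gets a partner.
    return list(zip_longest(graphemes, phones, fillvalue=''))

# Phone sequences copied from the word_to_tuples outputs above.
tough_phones = ['t', 'ʌ', 'f']
choice_phones = ['t͡ʃ', 'o', 'j', 's']

tough = positional_pairs('tough', tough_phones)
choice = positional_pairs('choice', choice_phones)
print(tough)
print(choice)
```

In both words the letter-to-phone pairs from this sketch match the third and fourth fields of the tuples above.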

dmort27 commented 4 years ago

Sorry for the late response. The answer is that Epitran was not made to do what you want to do (extract phoneme-grapheme alignments). The behavior you see is expected: these methods were added with a very specific application in mind, which did not require accurate alignments between the two representations, only some alignment. Perhaps this code should be removed. In any case, because of its architecture, Epitran will only get you part of the way to phoneme-grapheme alignments: it gives you the phonemic representations. You must do the rest with an aligner.
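As a rough illustration of what "the rest" could look like, here is a minimal sketch of a monotone letter-to-phone aligner using dynamic programming. Everything in it is an assumption made for illustration: `align` and `SCORES` are hypothetical names, the scores are hand-picked toy values for these two words, and none of this is part of Epitran. A real aligner would typically estimate such scores from a pronunciation lexicon and would also handle cases this sketch cannot, such as one letter producing several phones.

```python
# Hypothetical plausibility scores for (letter, phone) pairs; illustration only.
# A real aligner would estimate these from a pronunciation lexicon.
SCORES = {
    ('t', 't'): 2, ('o', 'ʌ'): 2, ('u', 'ʌ'): 1, ('g', 'f'): 1, ('h', 'f'): 1,
    ('c', 't͡ʃ'): 2, ('h', 't͡ʃ'): 1, ('o', 'o'): 2, ('i', 'j'): 2, ('c', 's'): 2,
}

def align(letters, phones, scores=SCORES, mismatch=-2):
    """Monotone alignment in which each letter takes one phone or '' (silent)."""
    n, m = len(letters), len(phones)
    neg = float('-inf')
    # best[i][j]: best score for aligning letters[:i] with phones[:j]
    best = [[neg] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0
    for i in range(1, n + 1):
        for j in range(m + 1):
            # Option 1: letter i-1 is silent (score unchanged).
            if best[i - 1][j] > best[i][j]:
                best[i][j], back[i][j] = best[i - 1][j], 'silent'
            # Option 2: letter i-1 is paired with phone j-1.
            if j > 0 and best[i - 1][j - 1] > neg:
                pair_score = scores.get((letters[i - 1], phones[j - 1]), mismatch)
                cand = best[i - 1][j - 1] + pair_score
                if cand > best[i][j]:
                    best[i][j], back[i][j] = cand, 'pair'
    if best[n][m] == neg:
        raise ValueError('more phones than letters; this sketch cannot align them')
    # Walk the back-pointers from the end to recover the alignment.
    pairs, i, j = [], n, m
    while i > 0:
        if back[i][j] == 'pair':
            pairs.append((letters[i - 1], phones[j - 1]))
            j -= 1
        else:
            pairs.append((letters[i - 1], ''))
        i -= 1
    return pairs[::-1]

# Phone sequences as produced for the two words in this issue.
tough_phones = ['t', 'ʌ', 'f']
choice_phones = ['t͡ʃ', 'o', 'j', 's']
print(align('tough', tough_phones))
print(align('choice', choice_phones))
```

With these toy scores, /f/ lands on g in tough and /s/ lands on the second c in choice. The result depends entirely on the score table, which is why the scores should come from data rather than being written by hand.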

mashabelyi commented 4 years ago

Got it, thanks for your response.