Problem with diacritics and transliterating to lists

dmort27 / epitran

A tool for transcribing orthographic text as IPA (International Phonetic Alphabet)

MIT License

import epitran

lang_code = 'fra-Latn'
epi = epitran.Epitran(lang_code)
print(epi.trans_list(u"mobilisèrent"))
print(epi.trans_delimiter(u"mobilisèrent"))
print(epi.trans_delimiter(u"mobilisèrent", delimiter='~'))

OK, for anyone facing the same issue, I have written a function that post-processes the delimited string:

import unicodedata

def split_ipa(transliterated_text, delimiter='|'):
    # Split the string based on the specified delimiter
    parts = transliterated_text.split(delimiter)

    # Initialize an empty list to hold the corrected segments
    corrected_parts = []

    # Loop through the parts to reattach any diacritics to their base character
    for part in parts:
        if part and corrected_parts and unicodedata.category(part[0]) == 'Mn':
            # If the part starts with a nonspacing mark (category Mn), attach it to the previous part
            corrected_parts[-1] += part
        else:
            # Otherwise, add the part to the list as a new segment
            corrected_parts.append(part)

    return corrected_parts

Now, if you run the following code, the delimited string is split correctly:

lang_code = 'fra-Latn' 
epi = epitran.Epitran(lang_code)
enc = epi.trans_delimiter(u"mobilisèrent", delimiter='|')
print("Original split:", enc.split('|'))
print("Corrected split:", split_ipa(enc))

This outputs:

Original split: ['m', 'ɔ', 'b', 'i', 'l', 'i', 'z', 'ə', '̀', 'ʀ', 'ɑ̃']
Corrected split: ['m', 'ɔ', 'b', 'i', 'l', 'i', 'z', 'ə̀', 'ʀ', 'ɑ̃']
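
For anyone who wants to try the idea without installing epitran, here is a minimal sketch of a slightly broader variant. It is not part of epitran: the name split_ipa_any_mark is hypothetical, and it assumes you may also want to reattach spacing or enclosing marks (any Unicode category starting with 'M') and to skip empty pieces left by doubled or trailing delimiters. The input is built by hand to mirror the "Original split" above rather than produced by epitran.

```python
import unicodedata

def split_ipa_any_mark(delimited, delimiter='|'):
    """Hypothetical variant of split_ipa: reattach any combining mark (Mn, Mc, Me)."""
    segments = []
    for part in delimited.split(delimiter):
        if not part:
            # Skip empty pieces produced by doubled or trailing delimiters
            continue
        if segments and unicodedata.category(part[0]).startswith('M'):
            # The piece starts with a combining mark: glue it onto the previous segment
            segments[-1] += part
        else:
            segments.append(part)
    return segments

# Hand-built input mirroring the "Original split" above (no epitran call involved).
# '\u0300' is COMBINING GRAVE ACCENT, '\u0303' is COMBINING TILDE.
sample = '|'.join(['m', 'ɔ', 'b', 'i', 'l', 'i', 'z', 'ə', '\u0300', 'ʀ', 'ɑ\u0303'])
print(split_ipa_any_mark(sample))
```

On this input the stray grave accent should end up on the 'ə' segment, as in the "Corrected split" above. Whether category Mn alone is sufficient depends on which marks the language's mapping emits, which this thread does not settle.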

Problem with diacritics and transliterating to lists #174