training for Persian

MahdiEsrafili commented 2 years ago

Hello, and thanks for your great work. I want to train the model on Persian data. In Persian, some words are linked based on context by the 'Ezafe', which is pronounced but not written. For example, here are two words with their phonemes: کیف: kif, من: man. But the phrase 'کیف من' is read as 'kife man', not 'kif man' (Persian is written right to left). A word's pronunciation can also differ depending on its meaning. How can I change the model to account for these issues? Thanks

cschaefer26 commented 2 years ago

Hi, these context dependencies are generally not easy to solve. One option could be to train the model on n-grams of words (e.g. produce training data with 3 words at once = trigrams) in which the ambiguity is already resolved, and then apply the model to the text in the same way. Another option could be to distinguish the words via some kind of flag or added text (e.g. use 'kife' instead of 'kif' according to the pronunciation) and resolve the ambiguity before you use the phonemizer. We are currently working on a similar problem, namely finding English inclusions in German text and phonemizing them in the correct language. We went for the latter solution: we first find the English inclusions with a NER system and then let the standard phonemizer do its job word-wise.
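
To make the n-gram option concrete, here is a minimal sketch of how such training samples could be built from sentences whose pronunciation is already resolved. The helper name `make_ngram_samples`, the language code `'fa'`, and the `(lang, text, phonemes)` tuple layout are assumptions for illustration, and the romanized phonemes are the toy ones from the question. Whether the model configuration accepts a space as an input symbol would need to be checked separately.

```python
from typing import List, Tuple

Sample = Tuple[str, str, str]  # assumed layout: (language, text, phonemes)


def make_ngram_samples(words: List[str],
                       phonemes: List[str],
                       n: int = 3,
                       lang: str = 'fa') -> List[Sample]:
    """Slide a window of n words over an aligned sentence.

    `words` and `phonemes` must be aligned one-to-one, with context
    effects such as Ezafe already reflected in `phonemes`.
    """
    if len(words) != len(phonemes):
        raise ValueError('words and phonemes must be aligned one-to-one')
    samples = []
    # A sentence shorter than n still yields one (shorter) sample.
    for i in range(max(len(words) - n + 1, 1)):
        text = ' '.join(words[i:i + n])
        phons = ' '.join(phonemes[i:i + n])
        samples.append((lang, text, phons))
    return samples


# Toy data from the question: 'kif' becomes 'kife' before 'man'.
bigrams = make_ngram_samples(['کیف', 'من'], ['kife', 'man'], n=2)
print(bigrams)
```

With `n=3` the same function produces trigrams. The isolated words (کیف: kif) would still be added as ordinary single-word samples so the model sees both forms.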

MahdiEsrafili commented 2 years ago

@cschaefer26 Thanks for your reply. It seems that resolving the ambiguity before using the phonemizer will work better.
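
A rough sketch of what resolving the ambiguity first could look like: mark the Ezafe explicitly in the text (here with the kasra diacritic, U+0650, which Persian can use to write it), then phonemize word by word. Everything below is hypothetical: `toy_needs_ezafe` is a stand-in for a real context-aware tagger, and `TOY_LEXICON` is a stand-in for the trained word-wise phonemizer, which would have to be trained on the marked forms as well.

```python
from typing import Callable, List

KASRA = '\u0650'  # diacritic that can spell out the otherwise unwritten Ezafe


def toy_needs_ezafe(words: List[str], i: int) -> bool:
    """Hypothetical stand-in for a real Ezafe tagger.

    It only links a word to a following 'من'; a real system would use
    a trained, context-aware model instead.
    """
    return i + 1 < len(words) and words[i + 1] == 'من'


def mark_ezafe(text: str,
               needs_ezafe: Callable[[List[str], int], bool] = toy_needs_ezafe) -> str:
    """Append the kasra to every word the tagger flags."""
    words = text.split()
    return ' '.join(w + KASRA if needs_ezafe(words, i) else w
                    for i, w in enumerate(words))


def phonemize_wordwise(text: str, phonemize_word: Callable[[str], str]) -> str:
    """Run a word-level phonemizer over whitespace-separated words."""
    return ' '.join(phonemize_word(w) for w in text.split())


# Toy lookup in place of a trained model: plain and marked forms differ.
TOY_LEXICON = {'کیف': 'kif', 'کیف' + KASRA: 'kife', 'من': 'man'}

marked = mark_ezafe('کیف من')
result = phonemize_wordwise(marked, TOY_LEXICON.__getitem__)
print(result)  # kife man
```

The point of the design is that the phonemizer itself stays word-wise; all context handling lives in the tagging step, which can be swapped out or improved independently.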

as-ideas / DeepPhonemizer

training for Persian #19