bootphon / phonemizer

Simple text to phones converter for multiple languages
https://bootphon.github.io/phonemizer/
GNU General Public License v3.0

Persian Language Support and Punctuation Preservation #158

Closed MmdrezaMolavi closed 9 months ago

MmdrezaMolavi commented 9 months ago

1: Farsi Language Support

Currently, the espeak backend only supports the "fa" locale for Farsi. Using "fa-latn" raises an error stating that the language "fa-latn" is not supported.

2: Punctuation Preservation

When text is converted to phonemes, the Persian comma (،) is not preserved as expected.

mmmaat commented 9 months ago

Hi,

In my case both work:

$ echo 'این یک امتحان است.' | phonemize -b espeak -l fa-latn
iːn jek emtehɑn ast 
$ echo 'این یک امتحان است.' | phonemize -b espeak -l fa
iːn jek emtehɑn ast
$ phonemize --version
phonemizer-3.2.1
available backends: espeak-ng-1.50, espeak-mbrola, festival-2.5.0, segments-2.2.1

In your case it may be an issue with your espeak installation.

  2. This may be a bug; please provide an example. In any case, you can specify which characters are treated as punctuation with the following option: https://github.com/bootphon/phonemizer/blob/master/phonemizer/backend/espeak/espeak.py#L38 (so you can ignore the comma)

MmdrezaMolavi commented 9 months ago

@mmmaat Thank you very much for your help!

To address Problem 1, I've updated the espeak-ng version.

For Problem 2, I've successfully preserved the '،' when converting to phonemes using the 'preserve_punctuation' parameter. Here's an example:

import phonemizer.backend

text = 'تست این ماژول، برای انجام تبدیل متن به فنوم، نتیجه قابل مشاهده است.'
farsi_phonemizer = phonemizer.backend.EspeakBackend(
    language='fa-latn',
    preserve_punctuation=True,
    punctuation_marks='،',
)
print(farsi_phonemizer.phonemize([text]))

This configuration has resolved the issue, and I appreciate your help with it.
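
For anyone reading later, here is a rough pure-Python sketch of the idea behind combining 'preserve_punctuation' with 'punctuation_marks': split the text at the listed marks, convert each chunk, then put the marks back. This is not phonemizer's actual implementation. `toy_phonemize` and `phonemize_preserving` are hypothetical names made up for illustration, and the stand-in converter just upper-cases letters instead of calling espeak. It does suggest why a mark that is not listed (the final period below) would be lost.

```python
import re

PERSIAN_COMMA = "\u060c"  # the '،' character discussed in this issue


def toy_phonemize(chunk):
    # Hypothetical stand-in for a real backend: keeps letters and spaces,
    # upper-cases them, and silently drops every other character.
    return "".join(c for c in chunk if c.isalpha() or c.isspace()).upper()


def phonemize_preserving(text, marks, phonemize_chunk=toy_phonemize):
    # With no marks to preserve, just convert the whole text.
    if not marks:
        return phonemize_chunk(text)
    # Split on any single character from `marks`. The capture group keeps
    # the separators, so they land at the odd indices of `pieces`.
    pieces = re.split("([" + re.escape(marks) + "])", text)
    result = ""
    for index, piece in enumerate(pieces):
        if index % 2 == 1:
            # A preserved mark: attach it directly to the previous chunk.
            result += piece
        elif piece.strip():
            # A text chunk: convert it and separate it with one space.
            converted = phonemize_chunk(piece.strip())
            result += (" " if result else "") + converted
    return result


sample = "salam" + PERSIAN_COMMA + " donya."
only_comma = phonemize_preserving(sample, marks=PERSIAN_COMMA)
comma_and_period = phonemize_preserving(sample, marks=PERSIAN_COMMA + ".")
print(only_comma)
print(comma_and_period)
```

In this sketch the first call keeps only the Persian comma, while the second also keeps the period because it is listed in `marks`. If the same holds for the real backend, passing both characters in 'punctuation_marks' may be worth trying when the sentence-final period should survive as well.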