bootphon / phonemizer

Simple text to phones converter for multiple languages
https://bootphon.github.io/phonemizer/
GNU General Public License v3.0

single character unicode has the language name prefix #160

Closed (dsplog closed this issue 3 months ago)

dsplog commented 10 months ago

Describe the bug

When phonemizing a single Unicode character with the language set to 'en-us', the phonemized name of the character's language is prepended to the output.

Phonemizer version

home@home-desktop:$ phonemize --version
phonemizer-3.2.1
available backends: espeak-ng-1.50, espeak-mbrola, festival-2.5.0, segments-2.2.1

System

home@home-desktop:$ uname -a
Linux home-desktop 5.15.0-88-generic #98~20.04.1-Ubuntu SMP Mon Oct 9 16:43:45 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Python 3.8.18 (default, Sep 11 2023, 13:40:15) [GCC 11.2.0] :: Anaconda, Inc. on linux

To reproduce

>>> import phonemizer
>>> phon = phonemizer.backend.EspeakBackend(language='en-us', preserve_punctuation=True, with_stress=True, language_switch='remove-flags')
>>> 
>>> text = 'ന'
>>> phon.phonemize([text], strip=True)
['mæleɪˈɑːləmnˈɐ']
>>> 
>>> text = '\u0d28'
>>> phon.phonemize([text], strip=True)
['mæleɪˈɑːləmnˈɐ']

Expected behavior

The prefix 'mæleɪˈɑːləm' is not expected. Is there a way to suppress it? By the way, if I initialize the backend with language 'ml', the prefix is not there:

>>> mlphon = phonemizer.backend.EspeakBackend(language='ml', preserve_punctuation=True, with_stress=True, language_switch='remove-flags')
>>> mlphon.phonemize([text], strip=True)
['nˈɐ']

Additional context

It looks like language_switch='remove-flags' does not handle single characters: the flags are removed, but the language-name prefix stays.

mmmaat commented 10 months ago

Hi, thanks for reporting. Unfortunately, this comes from espeak's behavior, not from phonemizer itself:

$ phonemize --version
phonemizer-3.2.1
available backends: espeak-ng-1.50, espeak-mbrola, festival-2.5.0, segments-2.2.1
$ echo 'ന' | espeak-ng -x -q --ipa -v en-us
mæleɪˈɑːləm(ml)nˈɐ(en-us)
$ echo 'ന' | espeak-ng -x -q --ipa -v ml
nˈɐ
$ echo 'ആനേ' | espeak-ng -x -q --ipa -v en-us
(ml)ˈaːneː(en-us)

I think this is a very special case: with a whole word the problem does not occur. I suggest writing custom post-processing code, or experimenting with the regex that detects language switches here.
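The custom post-processing suggested above could look like the following sketch. It is not part of phonemizer: `strip_language_name_prefix` is a hypothetical helper, and both regexes are assumptions based only on the espeak-ng outputs shown in this thread. It assumes the flags are still present in the text (for example, raw espeak-ng output, or a backend created with language_switch='keep-flags') and relies on the observation that the spoken language name is glued directly in front of the switch flag, with no space in between.

```python
import re


def strip_language_name_prefix(raw: str, main_language: str = "en-us") -> str:
    """Drop espeak language-switch flags and any phones glued in front of a
    switch away from `main_language`.

    Heuristic and hypothetical: it also removes legitimate main-language
    phones if they happen to touch a flag without whitespace.
    """
    main = re.escape(main_language)
    # A run of non-space, non-parenthesis characters directly followed by a
    # flag such as "(ml)" that is NOT the main-language flag.
    glued_prefix = re.compile(rf"[^\s()]+(?=\((?!{main}\))[^()]+\))")
    # Any remaining flag, e.g. "(ml)" or "(en-us)".
    any_flag = re.compile(r"\([^()]+\)")

    without_prefix = glued_prefix.sub("", raw)
    return any_flag.sub("", without_prefix).strip()


# The two espeak-ng outputs quoted in this thread:
print(strip_language_name_prefix("mæleɪˈɑːləm(ml)nˈɐ(en-us)"))  # nˈɐ
print(strip_language_name_prefix("(ml)ˈaːneː(en-us)"))          # ˈaːneː
```

Main-language text separated from the switch by a space is left alone, so "həlˈoʊ (ml)ˈaːneː(en-us)" becomes "həlˈoʊ ˈaːneː". If preserve_punctuation=True is used and the input text itself contains parentheses, the flag pattern would need to be tightened.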