bootphon / phonemizer

Simple text to phones converter for multiple languages
https://bootphon.github.io/phonemizer/
GNU General Public License v3.0

Unwanted Merging of Words in Phonemizer Output #172

Open shreeshailgan opened 1 month ago

shreeshailgan commented 1 month ago

There is no option to prevent phonemizer from merging consecutive words in its output. Take the following code, for instance:

from phonemizer.separator import Separator
from phonemizer.backend import EspeakBackend

sep = Separator(phone=' ', word=' | ')
backend = EspeakBackend('en-us', preserve_punctuation=True, with_stress=True)

text = "michael vaughan served as england captain for the test team"
ph = backend.phonemize([text], separator=sep, strip=True)[0]
print(ph)

The output is

m ˈaɪ k əl | v ˈɔː n | s ˈɜː v d | æ z | ˈɪ ŋ ɡ l ə n d | k ˈæ p t ɪ n | f ɚ ð ə | t ˈɛ s t | t ˈiː m

The consecutive words "for the" are merged into a single unit (f ɚ ð ə) instead of being separated, and I see no option to disable this merging. This is a problem when we want to map the output phones back to the input words.

Phonemizer Version

phonemizer-3.2.1
available backends: espeak-ng-1.50, segments-2.2.1
uninstalled backends: espeak-mbrola, festival

Python Version: Python 3.9.18

mmmaat commented 1 month ago

There is no solution at the phonemizer level because the merge occurs within espeak. Here is the raw output from espeak we get back in the phonemize function:

$ espeak-ng -x -q --ipa -l en-us "michael vaughan served as england captain for the test team"
mˈaɪkəl vˈɔːn sˈɜːvd az ˈɪŋɡlənd kˈaptɪn fəðə tˈɛst tˈiːm

As you can see, espeak merges "for the" in its output. I don't see any general rule that would let us implement a fix in phonemizer. The only thing I can suggest is to use the words-mismatch option to detect the problematic sentences and fix them manually in a post-processing step.
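For the detection part of that post-processing step, here is a minimal sketch in plain Python. It does not call phonemizer at all: it only compares the number of whitespace-separated words in the input with the number of word-separated chunks in the output, using the example from this issue. The helper names (count_words, count_phonemized_words, find_mismatches) are hypothetical and not part of the phonemizer API, and the sketch assumes '|' is used as the word separator, as in the original report.

```python
def count_words(text):
    """Count whitespace-separated words in the input text."""
    return len(text.split())


def count_phonemized_words(phonemized, word_sep="|"):
    """Count non-empty chunks between word separators in the output."""
    return len([chunk for chunk in phonemized.split(word_sep) if chunk.strip()])


def find_mismatches(texts, phonemized, word_sep="|"):
    """Return (index, n_input_words, n_output_words) for each utterance
    whose word count changed during phonemization.

    Hypothetical helper for illustration, not part of phonemizer.
    """
    mismatches = []
    for index, (text, phones) in enumerate(zip(texts, phonemized)):
        n_in = count_words(text)
        n_out = count_phonemized_words(phones, word_sep)
        if n_in != n_out:
            mismatches.append((index, n_in, n_out))
    return mismatches


# Input and output copied from the report above
text = "michael vaughan served as england captain for the test team"
phones = (
    "m ˈaɪ k əl | v ˈɔː n | s ˈɜː v d | æ z | ˈɪ ŋ ɡ l ə n d | "
    "k ˈæ p t ɪ n | f ɚ ð ə | t ˈɛ s t | t ˈiː m"
)

mismatches = find_mismatches([text], [phones])
print(mismatches)  # [(0, 10, 9)]: 10 input words, 9 output words
```

This only flags the sentences where a merge happened. It does not tell you which words were merged, so the flagged sentences still need to be fixed by hand or with some language-specific rule.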

If someone has a better idea, I'm all ears :smile: