Unwanted Merging of Words in Phonemizer Output

bootphon / phonemizer

Simple text to phones converter for multiple languages

GNU General Public License v3.0


There appears to be no option to prevent the phonemizer from merging consecutive words in its output. Take the following code, for instance:

from phonemizer.separator import Separator
from phonemizer.backend import EspeakBackend

sep = Separator(phone=' ', word=' | ')
backend = EspeakBackend('en-us', preserve_punctuation=True, with_stress=True)

text = "michael vaughan served as england captain for the test team"
ph = backend.phonemize([text], separator=sep, strip=True)[0]
print(ph)

The output is:

m ˈaɪ k əl | v ˈɔː n | s ˈɜː v d | æ z | ˈɪ ŋ ɡ l ə n d | k ˈæ p t ɪ n | f ɚ ð ə | t ˈɛ s t | t ˈiː m

The consecutive words "for" and "the" are not separated in the output: they come back as a single unit (f ɚ ð ə) with no word separator between them. I see no option to disable this merging. This is a problem when we want to map the output phones back to the input words.
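To make the mismatch concrete, here is a minimal sketch that counts the word-level units in the output above and, as one possible workaround, phonemizes each word as its own input so that the mapping stays one-to-one. The helper names (`split_units`, `is_aligned`, `phonemize_per_word`) and the `phonemize_fn` parameter are hypothetical, not part of phonemizer. The workaround assumes that `backend.phonemize` returns one output string per input string, and it drops sentence context, so some pronunciations (reduced forms of "for" or "the", for example) may differ from the full-sentence result.

```python
from typing import Callable, List, Tuple


def split_units(phonemized: str, word_sep: str = "|") -> List[str]:
    """Split phonemizer output into word-level units on the word separator."""
    return [unit.strip() for unit in phonemized.split(word_sep) if unit.strip()]


def is_aligned(text: str, phonemized: str, word_sep: str = "|") -> bool:
    """Return True if the output has exactly one unit per whitespace-separated input word."""
    return len(text.split()) == len(split_units(phonemized, word_sep))


def phonemize_per_word(
    text: str, phonemize_fn: Callable[[List[str]], List[str]]
) -> List[Tuple[str, str]]:
    """Phonemize each word as a separate input and pair it with its phones.

    phonemize_fn is a hypothetical wrapper that takes a list of strings and
    returns a list of phonemized strings of the same length.
    """
    words = text.split()
    phones = phonemize_fn(words)
    if len(phones) != len(words):
        raise ValueError("phonemize_fn must return one output per input word")
    return list(zip(words, phones))


# Input text and output copied from the report above.
text = "michael vaughan served as england captain for the test team"
ph = "m ˈaɪ k əl | v ˈɔː n | s ˈɜː v d | æ z | ˈɪ ŋ ɡ l ə n d | k ˈæ p t ɪ n | f ɚ ð ə | t ˈɛ s t | t ˈiː m"

n_words = len(text.split())      # 10 input words
n_units = len(split_units(ph))   # 9 output units, because "for the" was merged
print(n_words, n_units, is_aligned(text, ph))
```

With the backend and separator from the snippet above, the wrapper could be passed as `lambda ws: backend.phonemize(ws, separator=sep, strip=True)`. Whether the per-word output is acceptable depends on how much the lost context matters for the downstream task.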

Phonemizer Version

phonemizer-3.2.1
available backends: espeak-ng-1.50, segments-2.2.1
uninstalled backends: espeak-mbrola, festival

Python Version: Python 3.9.18


Unwanted Merging of Words in Phonemizer Output #172