bootphon / phonemizer

Simple text to phones converter for multiple languages
https://bootphon.github.io/phonemizer/
GNU General Public License v3.0

Phoneme Separator sometimes adds additional separators trailing the end of a word #31

Closed RhyanJohnson closed 4 years ago

RhyanJohnson commented 4 years ago
from phonemizer.separator import Separator
from phonemizer.phonemize import phonemize

_text = 'The lion and the tiger ran.'
_separator = Separator(phone='*')
test_ph = phonemize(_text, separator=_separator, strip=True, njobs=1, backend='espeak', language='en-us')

print(test_ph)

results in

ð*ə l*aɪə*n** æ*n*d ð*ə t*aɪ*ɡ*ɚ ɹ*æ*n

with two separators attached to the end of the phonemized 'lion'.

As opposed to

_separator = Separator(phone='*')
test_ph = phonemize('The lion ran.', separator=_separator, strip=True, njobs=1, backend='espeak', language='en-us')
print(test_ph)

resulting in

ð*ə l*aɪə*n ɹ*æ*n

without trailing separators.

I noticed this around the following samples as well:

'the hello but the' gives ð*ə h*ə*l*oʊ** b*ʌ*t ð*ə
'Here there and everywhere' gives h*ɪɹ ð*ɛɹ** æ*n*d ɛ*v*ɹ*ɪ*w*ɛɹ
'He was hungry and tired.' gives h*iː w*ʌ*z h*ʌ*ŋ*ɡ*ɹ*i** æ*n*d t*aɪɚ*d
'He was hungry but tired.' gives h*iː w*ʌ*z h*ʌ*ŋ*ɡ*ɹ*i** b*ʌ*t t*aɪɚ*d
'The tiger or the lion' gives ð*ə t*aɪ*ɡ*ɚ** ɔːɹ ð*ə l*aɪə*n
'The lion or the tiger' gives ð*ə l*aɪə*n** ɔːɹ ð*ə t*aɪ*ɡ*ɚ

It shows up before conjunctions like 'and', 'but', and 'or', but not always. In the first sample below only the second 'and' is affected, and in the second sample none is:

'Lions and tigers and bears, oh my!' gives l*aɪə*n*z æ*n*d t*aɪ*ɡ*ɚ*z** æ*n*d b*ɛɹ*z oʊ m*aɪ
'Lions and tigers run together' gives l*aɪə*n*z æ*n*d t*aɪ*ɡ*ɚ*z ɹ*ʌ*n t*ə*ɡ*ɛ*ð*ɚ
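
To check more samples quickly, a small helper along these lines might be useful. It is only a sketch: the function name `words_with_extra_separators` is made up for illustration, and it assumes words are separated by whitespace (the default word separator) and that the output was produced with strip=True, so no word should end with the phone separator.

```python
from typing import List


def words_with_extra_separators(phonemized: str, sep: str = "*") -> List[str]:
    """Return the words that end with `sep` or contain a doubled `sep`.

    Hypothetical helper: assumes whitespace-separated words and strip=True output.
    """
    return [
        word
        for word in phonemized.split()
        if word.endswith(sep) or sep * 2 in word
    ]


# Output string copied from the first sample in this report.
print(words_with_extra_separators("ð*ə l*aɪə*n** æ*n*d ð*ə t*aɪ*ɡ*ɚ ɹ*æ*n"))
```

An empty list means no word carries an extra separator.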

mmmaat commented 4 years ago

Thanks for reporting that, I'm investigating...

RhyanJohnson commented 4 years ago

Thank you! I am as well. It seems to originate in espeak itself, as a result of

command = '{} -v{} {} -q -f {} {}'.format(self.espeak_exe(), self.language, self.ipa, data.name, self.sep)
line = subprocess.check_output(shlex.split(command, posix=False)).decode('utf8')

before we swap the espeak separator with the one provided to phonemize.

Perhaps I should be reporting this to espeak/espeak-ng instead?

mmmaat commented 4 years ago

Indeed!

$ echo "The lion and the tiger ran" | espeak-ng -x --ipa -q --sep=_
ð_ə l_ˈaɪə_n__ a_n_d ð_ə t_ˈaɪ_ɡ_ə ɹ_ˈa_n

OK, please report it to espeak-ng as well. For now I'll add a fix in the phonemizer code.
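
For illustration, one way such a workaround could look is to clean up the raw espeak line before the separators are swapped. This is a minimal sketch and not necessarily the fix that went into phonemizer: the function name `collapse_separators` is hypothetical, and it assumes that a separator with no phone after it is never wanted at this stage.

```python
import re


def collapse_separators(line: str, sep: str = "_") -> str:
    """Remove the extra separators espeak sometimes emits at the end of a word.

    Hypothetical helper, not part of phonemizer.
    """
    escaped = re.escape(sep)
    # Two or more consecutive separators become a single one.
    line = re.sub(f"(?:{escaped}){{2,}}", lambda _: sep, line)
    # A separator directly before whitespace or end of line has no phone after it.
    line = re.sub(f"{escaped}(?=\\s|$)", "", line)
    return line


# Raw espeak-ng output copied from the command above.
print(collapse_separators("ð_ə l_ˈaɪə_n__ a_n_d ð_ə t_ˈaɪ_ɡ_ə ɹ_ˈa_n"))
```

The second substitution also drops a single trailing separator, so this sketch only fits the strip=True case reported here.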

RhyanJohnson commented 4 years ago

Sounds good - thanks very much for your promptness!

mmmaat commented 4 years ago

Just adding that seems to solve the problem, thank you again!

RhyanJohnson commented 4 years ago

Works for me - thanks for the quick fix! Happy to help (: