Is it possible to retrieve the phonemized text with punctuation?

bootphon / phonemizer

Simple text to phones converter for multiple languages

https://bootphon.github.io/phonemizer/

GNU General Public License v3.0

Is it possible to retrieve the phonemized text with punctuation? #32

Closed erogol closed 4 years ago

erogol commented 4 years ago

Here is my code to phonemize the text:

import phonemizer
from phonemizer import phonemize
language = 'en-us'  # assumed value; the original snippet does not define `language`
text = "Recent research at Harvard has shown meditating for as little as 8 weeks can actually increase, the grey matter in the parts of the brain responsible for emotional regulation and learning!"
separator = phonemizer.separator.Separator(' |', '', '|')
ph = phonemize(text, separator=separator, strip=False, njobs=1, backend='espeak', language=language)

Previously it would return commas for punctuation marks and I would fix them with a regex, but with the new version punctuation is ignored entirely. Is there any way to keep the punctuation intact?

mmmaat commented 4 years ago

This may be related to #26. I'm looking for a fix. Thanks for reporting!

mmmaat commented 4 years ago

This is implemented in phonemizer-2.1. Let me know if this is not working as expected.

erogol commented 4 years ago

Now I can get the punctuation, but the separator does not work well with it. Here is an example:

separator = phonemizer.separator.Separator(' |', '', '|')
text = "how are. you today, my friend?"
phonemize(text, separator=separator, strip=False, njobs=1, backend='espeak', language=language, preserve_punctuation=True)

This generates:

'h|aʊ| |ɑːɹ| |. j|uː| |t|ə|d|eɪ| |, m|aɪ| |f|ɹ|ɛ|n|d| |?'

but I think it should be:

'h|aʊ| |ɑːɹ|.| |j|uː| |t|ə|d|eɪ|,| |m|aɪ| |f|ɹ|ɛ|n|d|?'

I hope that is clear enough.
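
For anyone who needs the second form, a post-processing pass is one option. The sketch below is not part of phonemizer: `attach_punctuation` and `MARKS` are hypothetical names, and it assumes exactly the separators used above (phone `|`, word ` |`) and a small fixed set of punctuation marks.

```python
import re

# Hypothetical set of marks to handle; extend as needed.
MARKS = ".,;:!?"


def attach_punctuation(phonemized: str) -> str:
    """Move each punctuation mark so it directly follows the preceding word.

    Assumes phone separator '|' and word separator ' |', as in the
    example above. This is a post-processing sketch, not library code.
    """
    marks = re.escape(MARKS)
    # Mid-string: ' |<mark> ' becomes '<mark>| |'
    out = re.sub(rf" \|([{marks}]) ", r"\1| |", phonemized)
    # End of string: ' |<mark>' becomes '<mark>'
    out = re.sub(rf" \|([{marks}])$", r"\1", out)
    return out


actual = 'h|aʊ| |ɑːɹ| |. j|uː| |t|ə|d|eɪ| |, m|aɪ| |f|ɹ|ɛ|n|d| |?'
print(attach_punctuation(actual))
```

Applied to the generated string above, this yields the string I expected. It only covers this separator configuration; other separators would need a different pattern.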

mmmaat commented 4 years ago

Well... this depends on whether you consider the punctuation mark to be part of the word or not. I answered "no" to that question while implementing.

echo "how are. you today, my friend?" | phonemize -p' ' -w'\w ' --preserve-punctuation
h aʊ \w ɑːɹ \w . j uː \w t ə d eɪ \w , m aɪ \w f ɹ ɛ n d \w ?

In the output you expect, the punctuation mark becomes the last phoneme of a word. Implemented that way, it would actually produce output 2 below rather than output 1:


1. 'h|aʊ| |ɑːɹ|.| |j|uː| |t|ə|d|eɪ|,| |m|aɪ| |f|ɹ|ɛ|n|d|?'
2. 'h|aʊ| |ɑːɹ|. | |j|uː| |t|ə|d|eɪ|, | |m|aɪ| |f|ɹ|ɛ|n|d|?| '
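
The "punctuation is not part of any word" behaviour can be illustrated with a toy model: phonemize each punctuation-free chunk on its own, keep the trailing separators (as with `strip=False`), and paste the marks back between chunks untouched. This is only a sketch that happens to reproduce the two outputs shown in this thread. Whether phonemizer works this way internally is an assumption, and `join_chunk`, `restore_punctuation`, `chunks` and `marks` are hypothetical names with hand-written phone lists.

```python
def join_chunk(words, phone_sep, word_sep):
    """Join one punctuation-free chunk, keeping trailing separators."""
    return "".join(
        "".join(phone + phone_sep for phone in word) + word_sep
        for word in words
    )


def restore_punctuation(chunks, marks, phone_sep, word_sep):
    """Interleave joined chunks with the original marks, unmodified."""
    out = []
    for i, chunk in enumerate(chunks):
        out.append(join_chunk(chunk, phone_sep, word_sep))
        if i < len(marks):
            out.append(marks[i])  # separators are never applied to marks
    return "".join(out)


# Hand-written phones for "how are. you today, my friend?"
chunks = [
    [["h", "aʊ"], ["ɑːɹ"]],
    [["j", "uː"], ["t", "ə", "d", "eɪ"]],
    [["m", "aɪ"], ["f", "ɹ", "ɛ", "n", "d"]],
]
marks = [". ", ", ", "?"]

print(restore_punctuation(chunks, marks, phone_sep="|", word_sep=" |"))
print(restore_punctuation(chunks, marks, phone_sep=" ", word_sep="\\w "))
```

Because the marks are inserted between chunks rather than inside a word, no phone separator is ever placed around them, which is why `. ` and `, ` appear without a `|`.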

erogol commented 4 years ago

@mmmaat Thanks for pondering this issue.

I don't really see how 1 and 2 differ in practice. Yes, I see the additional space, but at least for my use case (training text-to-speech models) it makes no difference.

I'd just prefer to have punctuation separated from the word. I guess this is also what you prefer.
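
To make the "no difference" point concrete, here is a minimal sketch of a symbol tokenizer. It assumes a pipeline that splits on `|` and spaces and discards word boundaries, which is a simplification (many text-to-speech front ends keep a word-boundary symbol). `to_symbols` is a hypothetical helper, not part of phonemizer.

```python
import re


def to_symbols(phonemized: str) -> list[str]:
    """Split a phonemized string into phone and punctuation symbols.

    Runs of '|' and spaces are treated as delimiters and dropped, so
    word boundaries are not preserved in this simplified version.
    """
    return [tok for tok in re.split(r"[| ]+", phonemized) if tok]


current = 'h|aʊ| |ɑːɹ| |. j|uː| |t|ə|d|eɪ| |, m|aɪ| |f|ɹ|ɛ|n|d| |?'
variant_1 = 'h|aʊ| |ɑːɹ|.| |j|uː| |t|ə|d|eɪ|,| |m|aɪ| |f|ɹ|ɛ|n|d|?'
variant_2 = 'h|aʊ| |ɑːɹ|. | |j|uː| |t|ə|d|eɪ|, | |m|aɪ| |f|ɹ|ɛ|n|d|?| '

print(to_symbols(current))
print(to_symbols(current) == to_symbols(variant_1) == to_symbols(variant_2))
```

Under that assumption, the current output and both variants discussed above reduce to the same symbol sequence, with each punctuation mark as its own symbol.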

mmmaat commented 4 years ago

Yes indeed! So I'm closing the issue. Thanks for your feedback.