eginhard closed this 1 year ago
It is possible, but it needs a model that is capable of doing that. I'm closing this since it is not a dev issue.
The output is fine with the tts_models/en/ljspeech/vits model. E.g. "ABC CNN ESPN" is correctly read letter by letter because all words are marked as abbreviations in the Espeak dictionary.
Espeak doesn't know about other letter sequences, like "ARD" or "ABCDEFGHIJKLMNOPQRSTUVWXYZ", and tries to read them as a word. I can force it to phonemize them as letter sequences by adding periods between each letter, but Coqui strips all punctuation before calling Espeak. Changing https://github.com/coqui-ai/TTS/blob/bc0a532c7a7e165338da6711ea362cdbe9761820/TTS/tts/utils/text/phonemizers/base.py#L104 to return [text], [] fixes this and results in correct letter-by-letter output for "A.R.D" and "A.B.C.D.E.F.G.H.I.J.K.L.M.N.O.P.Q.R.S.T.U.V.W.X.Y.Z". However, I'm not sure whether there is a specific reason that Coqui strips the punctuation there, or whether changing it would cause other issues.
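As a rough sketch of the workaround described above, a small helper could insert the periods before the text is handed to the phonemizer. The name spell_out is hypothetical and is not part of Coqui or Espeak. It only builds the string, and it assumes the punctuation actually reaches Espeak (for example via the change to base.py mentioned above). It appends a period after every letter, matching the "A.R.D." form used with espeak-ng.

```python
def spell_out(word: str) -> str:
    """Return `word` with a period after each letter, e.g. "ARD" -> "A.R.D.".

    Hypothetical helper for illustration; not part of Coqui TTS or Espeak.
    Non-letter characters (digits, spaces, existing punctuation) are dropped.
    """
    letters = [ch.upper() for ch in word if ch.isalpha()]
    return "".join(f"{ch}." for ch in letters)


if __name__ == "__main__":
    print(spell_out("ARD"))
```

Whether the result is then read letter by letter still depends on the phonemizer receiving the periods intact, which is the open question in this issue.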
Describe the bug
I'd like to force the TTS model to pronounce a word letter by letter, e.g. "ARD" should be pronounced "A R D" (/ˌeɪˌɑːɹdˈiː/). In systems with SSML support (#752) you could use <speak><say-as interpret-as="verbatim">ard</say-as></speak>, but another way would be fine as well.
Espeak supports this even for words not in its dictionary by adding periods between the characters: espeak-ng --ipa -v en-us "A.R.D." is read /ˌeɪˌɑːɹdˈiː/.
This doesn't work in Coqui because the input for Espeak is split at punctuation characters and each chunk of ["A", "R", "D"] is phonemized separately: https://github.com/coqui-ai/TTS/blob/bc0a532c7a7e165338da6711ea362cdbe9761820/TTS/tts/utils/text/phonemizers/base.py#L129
This results in the word pronunciation of "a" being chosen instead of the letter pronunciation (ɐ instead of eɪ). I could change _phonemize_preprocess() to pass the input to Espeak with punctuation included, but I'm not sure about the side effects. Is there a specific reason to do it this way?
To Reproduce
Output:
'ˈɐ.ˈɑːɹ.d|ˈiː.'
Expected behavior
Expected output:
ˌeɪˌɑːɹdˈiː
Logs
No response
Environment
Additional context
No response