Mo-MR-123 opened this issue 4 months ago (status: Open)
You could run the synthesis separately for each sentence?

> Also, is there a reason to use `print` instead of a logger?

In our fork, `print` has been replaced with proper Python logging.
Hi @eginhard,

Thanks for the feedback. I didn't know there was a fork without `print`, thanks for noting that!
I can indeed run the synthesis for each sentence separately, but wouldn't that mean I need to manually merge the audio output of each synthesis? Or is there a cleaner way to approach this?

On a side note, I noticed that when I split a string into sentences (each sentence here ends with a period) and pass the whole batch to the model for synthesis, the output contains parts where the voice goes faster than normal or repeats fragments at the end of a sentence. Do you know what could cause these anomalies and how to get near-clean output audio? I saw many warnings about the text length exceeding the character limit and I suspect this is related.
> I can indeed run the synthesis for each sentence separately, but wouldn't that mean I need to manually merge the audio output of each synthesis? Or is there a cleaner way to approach this?
Yes, Coqui does the same internally after splitting sentences; you can see how the wavs are built here: https://github.com/idiap/coqui-ai-TTS/blob/6ea3b75b8466c064cf3a98645de5bab6060a2e43/TTS/utils/synthesizer.py#L290
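For reference, a minimal sketch of doing that merge outside the library, assuming the `TTS.api.TTS` Python entry point; the speaker file, pause length, and the 24 kHz XTTS v2 output rate are assumptions on my part:

```python
import numpy as np
import soundfile as sf
from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

sentences = ["First sentence.", "Second sentence."]
samples = []
for sentence in sentences:
    # tts() returns the waveform as a sequence of float samples.
    samples += list(tts.tts(text=sentence, speaker_wav="speaker.wav",
                            language="en"))
    # Optional short pause between sentences (XTTS v2 outputs 24 kHz audio).
    samples += [0.0] * int(0.25 * 24000)

sf.write("merged.wav", np.asarray(samples, dtype=np.float32), 24000)
```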
> On a side note, I noticed that when I split a string into sentences and pass the whole batch to the model for synthesis, the output contains parts where the voice goes faster than normal or repeats fragments at the end of a sentence. Do you know what could cause these anomalies and how to get near-clean output audio? I saw many warnings about the text length exceeding the character limit and I suspect this is related.
That's just how the XTTS model works; there's no way to completely avoid it. But splitting into sentences helps, because the model wasn't trained on very long inputs - that's what the warnings are about.
Also, in `TTS/utils/synthesizer.py` the language is always assumed to be English, so text in other languages is not processed correctly:

```python
self.seg = self._get_segmenter("en")
```
Good catch @sonipranjal!

Have you tested it with sentences from various languages? Did you see clear differences in how sentences are split across languages?

My advice would then be to pass the language given to `tts_to_file` down to the segmenter, either before creating it or by creating a new one with the correct language and discarding the one initialized in `__init__`. A rough sketch of that idea follows.
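A minimal sketch of that suggestion, assuming the segmenter is pysbd (which is what `_get_segmenter` wraps in `synthesizer.py`); the standalone function here is hypothetical and only illustrates the idea:

```python
import pysbd

def split_into_sentences(text: str, lang: str = "en"):
    # Build the segmenter for the language passed down from tts_to_file,
    # instead of always reusing the "en" one created in __init__.
    seg = pysbd.Segmenter(language=lang, clean=True)
    return seg.segment(text)

# Example: German text is now split with German sentence-boundary rules.
print(split_into_sentences("Hallo zusammen. Wie geht es dir?", lang="de"))
```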
I have now copied the whole repo locally and made the change; it seems to be working now.
> Also, in `TTS/utils/synthesizer.py` the language is always assumed to be English, so text in other languages is not processed correctly.
For XTTS there is also a different splitting method that at least treats some non-European languages separately: https://github.com/idiap/coqui-ai-TTS/blob/6ea3b75b8466c064cf3a98645de5bab6060a2e43/TTS/tts/layers/xtts/tokenizer.py#L22 But yes, depending on your language you might want to handle the splitting yourself.
FYI: I found the following repo of models trained to recognize sentence boundaries in different languages, which are shown to work better than the PySBD library in most, if not all, cases: https://github.com/segment-any-text/wtpsplit
Also check out their paper to see which of their models has the highest accuracy for your language. A usage sketch follows.
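A usage sketch based on the wtpsplit README at the time of writing; `sat-3l-sm` is one of their published SaT checkpoints, and both the model name and the API may have changed since:

```python
from wtpsplit import SaT

# Load a small Segment-any-Text model; larger ones trade speed for accuracy.
sat = SaT("sat-3l-sm")

# Splits without relying on punctuation-based heuristics.
print(sat.split("This is a test This is another test."))
```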
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. You might also look at our discussion channels.
Describe the bug
When passing a list of custom-split sentences using a custom split function, the TTS model (`tts_models/multilingual/multi-dataset/xtts_v2` to be specific) with `split_sentences=False` throws the following error:

After some fiddling around in the TTS code, I noticed that the `synthesizer.tts` function (in `TTS.utils`) always assumes the input is a string and not a list of strings (which is essential when a custom split function needs to be used). This is the case regardless of whether the `split_sentences` param is False or True, even though for `split_sentences=True` a list of strings is not expected, as splitting is done internally.

To Reproduce
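The original reproduction snippet isn't preserved here; the following is a sketch of the scenario described above, assuming the `TTS.api.TTS` entry point (`speaker.wav` and the sentences are placeholders):

```python
from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

# Sentences already split by a custom function, hence split_sentences=False.
sentences = ["First sentence.", "Second sentence."]

# Fails: synthesizer.tts assumes `text` is a string, not a list of strings.
tts.tts_to_file(
    text=sentences,
    speaker_wav="speaker.wav",
    language="en",
    file_path="output.wav",
    split_sentences=False,
)
```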
Expected behavior
I expect that a list of strings given as `text` should be acceptable. So passing a list of strings in place of `text` should be accepted when `split_sentences=False` is used.

Logs
No response
Environment
Additional context
An idea on how to solve this issue:

1. Use the tokenizer to check how many tokens are acceptable at once (assuming the `text` argument is a string). If the text doesn't fit within the maximum context accepted by the model, split it using a custom provided function (in the `split_sentences=False` case) or using the existing internal sentence-splitting function (in the `split_sentences=True` case). So `tts_to_file`, or any other function used to synthesize speech, should accept a param called e.g. `custom_split_fn` for the `split_sentences=False` case. This way `text` can always stay a string (see the sketch at the end of this section).
2. In `TTS/utils/synthesizer.py`, the string-only handling of the input should be temporarily changed until a cleaner solution (the idea noted above or similar) is implemented:

Also, is there a reason to use `print` instead of a logger? Why is `sens` printed here? IMO this should only be acceptable during debugging.
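A rough sketch of idea 1, with hypothetical names: `custom_split_fn` is a proposed, not an existing, parameter, and the tokenizer and splitters are passed in as stand-ins rather than the library's own functions:

```python
from typing import Callable, List, Optional

def prepare_text(
    text: str,
    count_tokens: Callable[[str], int],            # model tokenizer length fn
    max_tokens: int,                               # model's context limit
    split_sentences: bool,
    internal_split_fn: Callable[[str], List[str]], # existing internal splitter
    custom_split_fn: Optional[Callable[[str], List[str]]] = None,
) -> List[str]:
    # If the whole text fits the model's context, synthesize it in one go.
    if count_tokens(text) <= max_tokens:
        return [text]
    # Too long: use the caller's splitter when split_sentences=False and one
    # was provided, otherwise fall back to the internal sentence splitter.
    if not split_sentences and custom_split_fn is not None:
        return custom_split_fn(text)
    return internal_split_fn(text)

# Toy usage with stand-in tokenizer and splitters:
chunks = prepare_text(
    "A long text. With several sentences.",
    count_tokens=lambda s: len(s.split()),
    max_tokens=3,
    split_sentences=False,
    internal_split_fn=lambda s: s.split(". "),
    custom_split_fn=lambda s: [p + "." for p in s.rstrip(".").split(". ")],
)
print(chunks)
```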