coqui-ai / TTS

🐸💬 - a deep learning toolkit for Text-to-Speech, battle-tested in research and production
http://coqui.ai
Mozilla Public License 2.0

Error when passing a custom list with strings as text when `split_sentences=False`. #3826

Open Mo-MR-123 opened 3 months ago

Mo-MR-123 commented 3 months ago

Describe the bug

When passing a list of pre-split sentences produced by a custom split function, the TTS model (tts_models/multilingual/multi-dataset/xtts_v2, to be specific) with split_sentences=False throws the following error:

sent = sent.strip().lower()
           ^^^^^^^^^^
AttributeError: 'list' object has no attribute 'strip'

After some digging in the TTS code, I noticed that the synthesizer.tts function (in TTS.utils) always assumes the input is a string, never a list of strings (which is essential when a custom split function needs to be used). This happens regardless of whether the split_sentences param is False or True, even though for split_sentences=True a list of strings is not expected, since splitting is done internally.

To Reproduce

from TTS.api import TTS
import torch

# This example list of strings is normally generated by a custom splitting function.
example = [ "This is a sample sentence.", "Another sample sentence." ]

dev = "cuda" if torch.cuda.is_available() else "cpu"
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(dev)
tts.tts_to_file(text=example, speaker_wav="any/sample/wav/here", language="en", file_path="test.wav", split_sentences=False)

Expected behavior

I expect a list of strings passed as text to be accepted when split_sentences=False is used.

Logs

No response

Environment

- TTS version: 0.22.0
- Pytorch version: 2.3.0
- Python version: 3.11.9
- OS: Win 11
- CUDA version: 12.1
- installed pytorch using `python -m pip install torch==2.3.0 torchaudio==2.3.0 -i https://download.pytorch.org/whl/cu121`

Additional context

An idea for how to solve this issue:

1- Use the tokenizer to check how many tokens the model accepts at once (assuming the text argument is a string). If the text doesn't fit within the model's maximum context, split it using a custom user-provided function (when split_sentences=False) or using the existing internal sentence-splitting function (when split_sentences=True). tts_to_file, and any other synthesis entry point, would then accept a parameter such as "custom_split_fn" for the split_sentences=False case. This way, text can always stay a string (see the sketch after idea 2 below).

2- The following block in TTS/utils/synthesizer.py:

if text:
            sens = [text]
            if split_sentences:
                print(" > Text splitted to sentences.")
                sens = self.split_into_sentences(text)
            print(sens)

should temporarily be changed to the following, until a cleaner solution (the idea noted above or similar) is implemented:

if text:
            if isinstance(text, str):
                sens = [text]
            elif isinstance(text, list):
                sens = text
            else:
                raise ValueError(f"{text} is not of type string or list")

            if split_sentences:
                print(" > Text splitted to sentences.")
                # Only a plain string can be re-split; a list is assumed
                # to be pre-split already.
                if isinstance(text, str):
                    sens = self.split_into_sentences(text)
            print(sens)
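For illustration, here is a minimal sketch of what idea 1 could look like. The resolve_sentences helper and the custom_split_fn parameter are hypothetical, not part of the current TTS API:

from typing import Callable, List, Optional

def resolve_sentences(
    text: str,
    split_sentences: bool,
    custom_split_fn: Optional[Callable[[str], List[str]]] = None,
    internal_split_fn: Optional[Callable[[str], List[str]]] = None,
) -> List[str]:
    """Hypothetical helper: turn the text argument into a list of sentences."""
    if custom_split_fn is not None:
        # A user-provided splitter takes precedence (the split_sentences=False case).
        return custom_split_fn(text)
    if split_sentences and internal_split_fn is not None:
        # Otherwise fall back to the library's internal splitter.
        return internal_split_fn(text)
    return [text]

The synthesizer would then call something like sens = resolve_sentences(text, split_sentences, custom_split_fn, self.split_into_sentences), so text never needs to be a list.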

Also, is there a reason to use `print` instead of a logger? And why is `sens` printed here at all? IMO this should only happen during debugging.

eginhard commented 2 months ago

You could run the synthesis separately for each sentence?

Also, is there a reason to use `print` instead of a logger?

In our fork, `print` has been replaced with proper Python logging.
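For reference, the standard-library pattern looks something like this (a generic sketch, not the fork's actual code):

import logging

logger = logging.getLogger(__name__)

sens = ["This is a sample sentence.", "Another sample sentence."]
# Unlike a bare print(), debug-level output can be filtered out in production.
logger.debug("Text split into sentences: %s", sens)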

Mo-MR-123 commented 2 months ago

Hi @eginhard,

Thanks for the feedback. I didn't know there was a fork without print, thanks for noting that!

I can indeed run the synthesis for each sentence separately, but wouldn't that mean I need to manually merge the audio output of each synthesis? Or is there a cleaner way to approach this?

On a side note: when I split a string into sentences (each ending with a period) and pass the whole batch to the model for synthesis, the output contains parts where the voice speaks faster than normal or repeats fragments at the end of a sentence. Do you know what causes these anomalies and how to get near-clean output audio? I saw many warnings about the text length exceeding the character limit and suspect that's related.

eginhard commented 2 months ago

I can indeed run the synthesis for each sentence separately, but wouldn't that mean I need to manually merge the audio output of each synthesis? Or is there a cleaner way to approach this?

Yes, Coqui does the same internally after splitting sentences; you can see how the wavs are built here: https://github.com/idiap/coqui-ai-TTS/blob/6ea3b75b8466c064cf3a98645de5bab6060a2e43/TTS/utils/synthesizer.py#L290
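A minimal sketch of doing the same thing manually with the public API, assuming tts.tts() returns the waveform as a list of float samples and that the synthesizer exposes output_sample_rate; the 0.25 s pause and the soundfile dependency are my own choices:

import numpy as np
import soundfile as sf
import torch
from TTS.api import TTS

sentences = ["This is a sample sentence.", "Another sample sentence."]

dev = "cuda" if torch.cuda.is_available() else "cpu"
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(dev)

sample_rate = tts.synthesizer.output_sample_rate
wav = []
for sent in sentences:
    # tts() synthesizes one sentence and returns the raw samples.
    wav.extend(tts.tts(text=sent, speaker_wav="any/sample/wav/here", language="en"))
    # Insert a short pause between sentences (0.25 s is arbitrary).
    wav.extend([0.0] * int(0.25 * sample_rate))

sf.write("test.wav", np.asarray(wav, dtype=np.float32), sample_rate)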

On a side note: when I split a string into sentences (each ending with a period) and pass the whole batch to the model for synthesis, the output contains parts where the voice speaks faster than normal or repeats fragments at the end of a sentence. Do you know what causes these anomalies and how to get near-clean output audio? I saw many warnings about the text length exceeding the character limit and suspect that's related.

That's just how the XTTS model works; there's no way to completely avoid it. But splitting sentences helps, because the model wasn't trained on very long inputs - that's what the warnings are for.

sonipranjal commented 2 months ago

Also, in utils -> synthesizer.py, the segmenter always assumes the language is English, so texts in other languages are not processed correctly.

 self.seg = self._get_segmenter("en")

https://github.com/idiap/coqui-ai-TTS/blob/6ea3b75b8466c064cf3a98645de5bab6060a2e43/TTS/utils/synthesizer.py#L89C14-L89C17

Mo-MR-123 commented 2 months ago

Good catch, @sonipranjal!

Have you tested it with sentences from various languages? Did you see clear differences in how sentences are split across languages?

My advice would then be to pass the language given to tts_to_file through to the segmenter, either by creating it with the correct language in the first place or by creating a new one with the correct language and discarding the one initialized in __init__.
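Sketching that advice, and assuming the segmenter is pysbd (which is what _get_segmenter wraps), a language-aware construction could look like this; the English fallback is my assumption, not current library behavior:

import pysbd

def get_segmenter(lang: str) -> pysbd.Segmenter:
    # Build the segmenter from the requested language instead of
    # hard-coding "en"; fall back to English for languages pysbd
    # doesn't support (it raises ValueError for unknown codes).
    try:
        return pysbd.Segmenter(language=lang, clean=True)
    except ValueError:
        return pysbd.Segmenter(language="en", clean=True)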

sonipranjal commented 2 months ago

I have now copied the whole repo locally and made the change; it seems to be working.

eginhard commented 2 months ago

Also, in utils -> synthesizer.py, the segmenter always assumes the language is English, so texts in other languages are not processed correctly.

For XTTS there is also a different splitting method that at least treats some non-European languages separately: https://github.com/idiap/coqui-ai-TTS/blob/6ea3b75b8466c064cf3a98645de5bab6060a2e43/TTS/tts/layers/xtts/tokenizer.py#L22 But yes, depending on your language you might want to handle the splitting yourself.

Mo-MR-123 commented 2 months ago

FYI: I found the following repo of models trained to recognize sentence boundaries in many languages, which are shown to work better than the PySBD library in most, if not all, cases: https://github.com/segment-any-text/wtpsplit.

Also check out their paper to see which of their models has the highest accuracy for your language.
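A minimal usage sketch based on wtpsplit's README; "sat-3l-sm" is one of their smaller SaT models, and the model choice should follow the accuracy tables in their paper:

from wtpsplit import SaT

# Load a small "Segment any Text" model; larger models trade
# speed for accuracy (see the wtpsplit paper).
sat = SaT("sat-3l-sm")
sentences = sat.split("This is a test This is another test")
# sentences is a list of strings, usable as pre-split input for TTS.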

stale[bot] commented 1 month ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. You might also look at our discussion channels.