[Bug] If a sentence is too long, part of it is missing from the generated audio file #1680

Closed: hengway closed this issue 2 years ago

hengway commented 2 years ago

Describe the bug

If a sentence is too long (for example, several clauses separated by commas), part of it is missing from the generated audio.

Example: On April 1, 1942, Desmond Doss joined the United States Army. Little did he realize that three and a half years later, he would be standing on the White House lawn, receiving the nations highest award for his bravery and courage under fire. Of the 16 million men in uniform during World War 2, only 431 received the Congressional Medal of Honor.

The missing part is: he would be standing on the White House lawn, receiving the nations highest award for his bravery and courage under fire.

As a workaround, shorten the sentence by replacing the comma with a full stop: On April 1, 1942, Desmond Doss joined the United States Army. Little did he realize that three and a half years later. He would be standing on the White House lawn, receiving the nations highest award for his bravery and courage under fire. Of the 16 million men in uniform during World War 2, only 431 received the Congressional Medal of Honor.
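The manual workaround (breaking a long sentence at a comma so that no single sentence is too long) can be automated before the text is passed to tts. The following is a minimal sketch, assuming that splitting first at sentence ends and then at commas is acceptable for the input; `split_for_tts` and `max_chars` are hypothetical names, not part of the TTS CLI or Python API.

```python
import re


def split_for_tts(text: str, max_chars: int = 250) -> list[str]:
    """Split text into chunks of at most max_chars characters (hypothetical helper).

    Sentence boundaries are preferred; a sentence longer than max_chars is
    split further at commas, greedily packing clauses into each chunk.
    """
    chunks: list[str] = []
    # Treat ".", "!" or "?" followed by whitespace as a sentence end.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    for sentence in sentences:
        if not sentence:
            continue
        if len(sentence) <= max_chars:
            chunks.append(sentence)
            continue
        # Sentence is too long: pack comma-separated clauses up to max_chars.
        current = ""
        for clause in re.split(r"(?<=,)\s+", sentence):
            candidate = f"{current} {clause}" if current else clause
            if len(candidate) <= max_chars:
                current = candidate
            else:
                if current:
                    chunks.append(current)
                # A single clause longer than max_chars is kept whole, not cut mid-word.
                current = clause
        if current:
            chunks.append(current)
    return chunks
```

Each chunk can then be synthesized with a separate tts call and the resulting WAV files concatenated. Because the sentence regex is naive, abbreviations such as "U.S." would be split in the wrong place.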

To Reproduce

Run the following command:

tts --text "On April 1, 1942, Desmond Doss joined the United States Army. Little did he realize that three and a half years later, he would be standing on the White House lawn, receiving the nations highest award for his bravery and courage under fire. Of the 16 million men in uniform during World War 2, only 431 received the Congressional Medal of Honor. " --model_name "tts_models/en/ljspeech/tacotron2-DDC_ph" --out_path /var/data/The-unlikely-hero5.wav

Expected behavior

The whole text is synthesized into the audio file.


ubuntu@ubuntu:~$ tts --text "On April 1, 1942, Desmond Doss joined the United States Army. Little did he realize that three and a half years later, he would be standing on the White House lawn, receiving the nations highest award for his bravery and courage under fire. Of the 16 million men in uniform during World War 2, only 431 received the Congressional Medal of Honor. " --model_name "tts_models/en/ljspeech/tacotron2-DDC_ph" --out_path /opt/tts_output/The-unlikely-hero5.wav
Additional context

No response

erogol commented 2 years ago

For Tacotron models, there is a cap of 250 characters to avoid running out of memory. You need to set it manually if you want to change it.

genglinxiao commented 10 months ago

I'm also looking for a way to generate long sentences. What I've found is that the limit is actually in the tokenizer, and it is hard-coded:

```python
from tokenizers import Tokenizer


class VoiceBpeTokenizer:
    def __init__(self, vocab_file=None):
        self.tokenizer = None
        if vocab_file is not None:
            self.tokenizer = Tokenizer.from_file(vocab_file)
        # Hard-coded maximum input length (in characters) per language
        self.char_limits = {
            "en": 250, "de": 253, "fr": 273, "es": 239,
            "it": 213, "pt": 203, "pl": 224, "zh-cn": 82,
            "ar": 166, "cs": 186, "ru": 182, "nl": 251,
            "tr": 226, "ja": 71, "hu": 224, "ko": 95,
        }
```

So you can simply modify the limit. However, I'm not sure about the downstream effects.
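To see which inputs a given limit would flag, and how a raised limit changes that, the dictionary above can be used on its own. This is a minimal sketch under stated assumptions: `char_limit`, `exceeds_limit` and `custom_limits` are hypothetical names, the fallback of 250 for unknown languages is an assumption, and the sketch does not reproduce how the library itself consumes `char_limits`.

```python
# Per-language character limits, copied from the VoiceBpeTokenizer excerpt above.
DEFAULT_CHAR_LIMITS = {
    "en": 250, "de": 253, "fr": 273, "es": 239, "it": 213, "pt": 203,
    "pl": 224, "zh-cn": 82, "ar": 166, "cs": 186, "ru": 182, "nl": 251,
    "tr": 226, "ja": 71, "hu": 224, "ko": 95,
}


def char_limit(lang: str, limits: dict[str, int], fallback: int = 250) -> int:
    """Return the limit for lang (hypothetical helper).

    Tries the exact code first ("zh-cn"), then the base language
    ("en-us" -> "en"), then an assumed fallback value.
    """
    lang = lang.lower()
    if lang in limits:
        return limits[lang]
    return limits.get(lang.split("-")[0], fallback)


def exceeds_limit(text: str, lang: str,
                  limits: dict[str, int] = DEFAULT_CHAR_LIMITS) -> bool:
    """True if text is longer than the limit for lang. limits is only read, never mutated."""
    return len(text) > char_limit(lang, limits)


# Raise the English limit in a copy, leaving the defaults untouched.
custom_limits = {**DEFAULT_CHAR_LIMITS, "en": 400}
```

Copying the dictionary instead of editing it in place keeps the original limits available for comparison; whether a raised limit produces complete audio is exactly the open question about downstream effects.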

FurkanGozukara commented 10 months ago

Quoting erogol: "For Tacotron models there is a cap of 250 chars not to crash your memory. You need to set it manually if you wanna change it."

What is the limit for TTS V2? I saw 400 tokens in the code.

m000lie commented 6 months ago

Quoting erogol: "For Tacotron models there is a cap of 250 chars not to crash your memory. You need to set it manually if you wanna change it."

How much memory is it expected to use per character? I have access to 1x H100 SXM 80GB. Surely memory shouldn't be a problem, right?