Open ErfolgreichCharismatisch opened 3 years ago
Might be related to my issue #502, since I get the same error IIRC. I still don't know the answer either, though, sorry.
https://github.com/NVIDIA/tacotron2/blob/185cd24e046cc1304b4f8e564734d2498c6e2e6f/hparams.py#L59
You can change the max number of steps, but the model can have issues above 1000.
Normally this error means you have fed too much text into the model at inference time, or your model is not trained well.
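For reference, the cap linked above is just a field on the hyperparameter object, so it can be raised before building the model for inference. A minimal sketch, using a stand-in for the repo's `create_hparams()` (the real function in hparams.py returns many more fields):

```python
from types import SimpleNamespace

# Stand-in for create_hparams() from the repo's hparams.py; the real
# function returns a much larger set of fields, with max_decoder_steps=1000
# as the default.
def create_hparams():
    return SimpleNamespace(max_decoder_steps=1000)

hparams = create_hparams()
hparams.max_decoder_steps = 2000  # raise the cap before constructing the model
```

As noted above, raising this past ~1000 lets longer audio be generated but does not fix a model that stops attending; it just delays the cutoff.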
> your model is not trained well.

That's what I am referring to.
- Ensure you have a large amount of data (30+ minutes)
- Use a pretrained model as a base to `--warm_start` from
- Ensure all transcripts in your dataset match the audio perfectly
- Remove any files with background noise
- Perform inference in batches and filter out spectrograms with poor alignments, or use beam search/greedy search style inference with chunks of frames.

If things still fail:

- Decrease hop_length if your speaker speaks abnormally fast, and retrain the vocoder + Tacotron from scratch on a large dataset before transfer-learning, OR
- Use a different model that uses duration-based alignment
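The batch-filtering idea in the fifth bullet can be sketched as follows. This is a hedged illustration, not code from the repo: `focus_rate` and `filter_batch` are hypothetical names, and `alignment` is assumed to be a list of per-decoder-step attention distributions over encoder steps (each row sums to 1).

```python
def focus_rate(alignment):
    """Mean of the max attention weight per decoder step.

    Close to 1.0 means the attention is sharp (one input token per
    output frame); low values suggest a diffuse, unstable alignment.
    """
    return sum(max(row) for row in alignment) / len(alignment)

def filter_batch(outputs, threshold=0.5):
    """Keep only (mel, alignment) pairs whose alignment looks confident."""
    return [(mel, al) for mel, al in outputs if focus_rate(al) >= threshold]

# Toy example: one sharp alignment, one diffuse one.
sharp = [[0.9, 0.05, 0.05], [0.05, 0.9, 0.05], [0.05, 0.05, 0.9]]
diffuse = [[0.34, 0.33, 0.33]] * 3
kept = filter_batch([("mel_a", sharp), ("mel_b", diffuse)], threshold=0.5)
```

Generating several candidates per sentence and keeping the one with the best score is the "filtering out poor results + instability" mentioned below.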
This will be helpful for my issue too, I'm sure, but... I'm honestly new to this and have no clue what some of this means. Specifically:
2: where do you put `--warm_start` and such? How do you use that? 5: can you describe this again on a complete-idiot's level?
@Fennecai https://github.com/NVIDIA/tacotron2#training-using-a-pre-trained-model Example usage is on the README.
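For question 2, the command from that README section looks like the following (a sketch; the output/log directory names and the checkpoint filename are whatever you use locally):

```shell
# Warm-start training from a pretrained checkpoint; ignores layers that
# depend on the text/speaker setup so you can fine-tune on your own data.
python train.py --output_directory=outdir --log_directory=logdir \
    -c tacotron2_statedict.pt --warm_start
```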
> 5: can you describe this again on a complete-idiot's level?

Not really. This is more for people who can write the inference code than for people who are just trying to use it. But it's a method of filtering out poor results and instability that works quite well.
> that works quite well.
I have to agree. It's good to know that if there's an error, it's not in the original model, so the area where problems arise is very limited and under your control. I learnt the hard way that a fastidiously fostered dataset is the be-all and end-all.
@CookiePPP hi, can you tell me the maximum duration of a .wav file in the dataset? (Some files in my dataset are > 11.5 s, and all files are < 12 s.) Thanks
@toanil315
There is no limit to the maximum duration of your training/validation audio files. The max_decoder_steps setting is only used when generating new audio with Tacotron2 + WaveGlow from new text. In fact, training on longer audio files will increase the max_decoder_steps that you can use safely with the model (a dataset with a maximum duration of 5 s lets you use 1250 max_decoder_steps; a dataset with a maximum duration of 10 s lets you use 2000 max_decoder_steps).
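To connect clip lengths to decoder steps: each decoder step emits one mel frame, so the frame count for a clip follows from the repo's default `sampling_rate=22050` and `hop_length=256` in hparams.py. A back-of-the-envelope sketch (the safe limits quoted above include extra headroom beyond this raw frame count):

```python
# One decoder step == one mel frame, so a clip of `seconds` needs roughly
# seconds * sampling_rate / hop_length decoder steps to reproduce.
def frames_for(seconds, sampling_rate=22050, hop_length=256):
    return int(seconds * sampling_rate / hop_length)

frames_for(5)   # ~430 frames for a 5 s clip
frames_for(10)  # ~861 frames -- still under the default 1000-step cap
```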
@CookiePPP thanks for the reply, have a good day
Whenever you hit the max decoder steps limit during inference, your audio-text pairs have errors. You have to train from scratch again with good pairs.