harshbafna closed this issue 4 years ago
This is due to the decoder size (in Tacotron2) and the mean training-audio length (for the LJ-Speech dataset it is 10.10 s, corresponding to a mean of 17 words per training sample).
You should split the text into smaller sub-portions; you can use the mean value from the dataset. I don't know the total character length in LJ-Speech, but for my fine-tuned model it depends on the dataset, and the best-working examples are between 120 and 200 characters long.
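The splitting suggested above can be sketched as a small helper (a hypothetical function, not part of the repo): break long input at sentence boundaries so each chunk stays under a character budget in the 120-200 range, then synthesize the chunks separately and concatenate the audio.

```python
import re

def split_text(text, max_chars=200):
    """Split text into chunks of at most max_chars characters,
    breaking at sentence boundaries. A single sentence longer
    than max_chars becomes its own (oversized) chunk."""
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        candidate = (current + " " + sentence).strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be fed to Tacotron2 on its own, keeping every synthesis call well under the decoder-step limit.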
@harshbafna As for the 1001 KB file size, the maximum audio length is determined by the --max-decoder-steps variable, which is set to 2000 steps by default:
https://github.com/NVIDIA/DeepLearningExamples/blob/master/PyTorch/SpeechSynthesis/Tacotron2/tacotron2/arg_parser.py#L72 https://github.com/NVIDIA/DeepLearningExamples/blob/master/PyTorch/SpeechSynthesis/Tacotron2/tacotron2/model.py#L585
We could successfully run inference up to 2000 steps; beyond that, the audio started to lose quality for the reason @machineko explained.
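As a back-of-the-envelope check on the file size (a sketch assuming the repo's usual LJSpeech defaults of a 256-sample hop length, a 22050 Hz sampling rate, and 16-bit mono WAV output; these values are assumptions, not verified against this exact checkpoint), the 2000-step cap pins the output length, which lines up with the ~1001 KB files mentioned above:

```python
# Assumed Tacotron2/LJSpeech defaults (one mel frame per decoder step):
max_decoder_steps = 2000
hop_length = 256           # audio samples per mel frame
sampling_rate = 22050      # Hz
bytes_per_sample = 2       # 16-bit PCM, mono

samples = max_decoder_steps * hop_length          # 512_000 samples
duration_s = samples / sampling_rate              # ~23.2 seconds
wav_bytes = samples * bytes_per_sample + 44       # plus 44-byte WAV header

print(f"{duration_s:.1f} s, {wav_bytes / 1024:.0f} KiB")
```

Any input that would need more than ~23 seconds of audio gets truncated at the step limit, so every long text produces a file of roughly the same size.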
Thanks for the detailed explanation @machineko & @GrzegorzKarchNV .
Correct me if I am wrong, but
torch.hub.load('nvidia/DeepLearningExamples:torchhub', 'nvidia_tacotron2')
has a --max-decoder-steps value of 1000, not 2000.
Related to Model/Framework(s): WaveGlow model for generating speech from mel spectrograms (generated by Tacotron2)
PyTorch/SpeechSynthesis/Tacotron2
Describe the bug: I am trying to execute the pre-trained WaveGlow example given here: https://pytorch.org/hub/nvidia_deeplearningexamples_waveglow/ with a different text as input.
The audio generated is completely distorted.
conda install pytorch torchvision cudatoolkit=10.1 -c pytorch
pip install numpy scipy librosa unidecode inflect
I also observe the following warning message when I execute the above code:
It always generates a 1001 KB file, even for longer input text.
Expected behavior: Clear audio should be generated by the model.
Environment