[Waveglow/Pytorch] Pre-trained waveglow/tacotron model from torchhub does not work for longer text

harshbafna commented 4 years ago

Related to Model/Framework(s)

WaveGlow model for generating speech from mel spectrograms (generated by Tacotron2)

PyTorch/SpeechSynthesis/Tacotron2

Describe the bug

I am trying to run the pre-trained WaveGlow example given here: https://pytorch.org/hub/nvidia_deeplearningexamples_waveglow/ with a different text as input.

The audio generated is completely distorted.

Install latest pytorch for CUDA 10.1

conda install pytorch torchvision cudatoolkit=10.1 -c pytorch

Install dependency packages

pip install numpy scipy librosa unidecode inflect

Run the following script (the only change from the above example is the device type, "cuda:1" instead of the default "cuda"):

import torch
import numpy as np
from scipy.io.wavfile import write

waveglow = torch.hub.load('nvidia/DeepLearningExamples:torchhub', 'nvidia_waveglow')

waveglow = waveglow.remove_weightnorm(waveglow)
waveglow = waveglow.to('cuda')
waveglow.eval()

tacotron2 = torch.hub.load('nvidia/DeepLearningExamples:torchhub', 'nvidia_tacotron2')
tacotron2 = tacotron2.to('cuda')
tacotron2.eval()

text = """
Education bring positive changes in human life. It enhances the knowledge, skill, and intelligence of a person and enables him to lead a successful life.

Child education: Children or kids start going to school to get the primary or elementary education. It is considered a human right for every child to get the opportunity for education. School education lays the foundation stone for the child’s future.

A girl child is as important as a boy child. She too has the right to go to schools. Her rights to access education should not be compromised at any cost.

Education at colleges, universities and professional institutes: After completing education at schools, a student may consider joining a college, or a professional institute for higher studies. He can acquire a bachelors or a masters degree, or he can join a professional institute to acquire expertise in specific discipline.

Adult Literacy: Illiteracy is a social evil. An illiterate person finds it very difficult to cope up with various aspects of life that involves reading writing or arithmetical calculations. Nowadays, adult men and women are going to education centers to learn the basics of education. These adults also get health and hygiene related education.

Women Education: Educating women is an essential step towards strengthening the position of women in the society. A modern educated woman give due importance to her social life as well. Education broadens her outlook. It helps in developing her personality.

Advantages of educationEducation makes us humble. Education creates awareness and expands our vision. We become more aware about our-self, about the society, about everything that surrounds and affect our life.It helps us develop a disciplined life. And, discipline is essential for everything that a person wants to achieve in life.An educated person commands respect in the society.Education enables us to earn our livelihood. Education empowers us to get a good job.We need money to make our living. With the advancement of science and technology, our needs have increased. Besides the basic needs of life such as food, shelter and clothing, we also need other comforts such as mobile phones, air-conditioners, car, etc. A fulfilling career ensures a satisfied life.It is a known fact that an educated person gets better earning opportunities. After completing education, we can consider starting your own business. We can also become a consultant in the area of our expertise.The study of computer science, software, and information technology will empower us to make a choice in the field of fast growing IT and internet industry.We can help illiterate adults to learn the basic skills of reading, writing and arithmetic.Importance
Education is of utmost importance for eradicating the unemployment problem of our country. It is also essential to improve the trade and commerce, and to bring prosperity to our country. However, apart from an improved system of general education, there is a great need for the growth of vocational education.

Conclusion
A student must be familiar with the history, geography, religion, culture and tradition, through general education. Therefore, general education should aim at educating all students up to the secondary standard. Thereafter, depending upon the aptitude of the student, he should either opt for advanced academic education or join a vocational training institute for skill-based training.
"""

sequence = np.array(tacotron2.text_to_sequence(text, ['english_cleaners']))[None, :]
sequence = torch.from_numpy(sequence).to(device='cuda', dtype=torch.int64)

with torch.no_grad():
    _, mel, _, _ = tacotron2.infer(sequence)
    audio = waveglow.infer(mel)
audio_numpy = audio[0].data.cpu().numpy()
rate = 22050

write("audio.wav", rate, audio_numpy)

I also see the following warning message when I run the above code:

ubuntu@ip-172-31-78-27:~$ python waveglow_sample.py 
Using cache found in /home/ubuntu/.cache/torch/hub/nvidia_DeepLearningExamples_torchhub
Using cache found in /home/ubuntu/.cache/torch/hub/nvidia_DeepLearningExamples_torchhub
Warning! Reached max decoder steps

It always generates a 1001 KB file, even when the input text is made longer.

Expected behavior

Clear audio should be generated by the model.

Environment

PyTorch installed manually in a fresh conda environment using the command given above
8 NVIDIA Tesla V100 GPUs (Using AWS p3.8xlarge instance type)
Driver Version: 440.33.01

machineko commented 4 years ago

This is due to the decoder size (in Tacotron2) and the mean training audio length (for the LJ-Speech dataset it is 10.10 s, corresponding to a mean of 17 words per training sample).

You should split the text into smaller portions; you can use the mean value from the dataset as a guide. I don't know the typical character count per sample in LJ, but for my fine-tuned model it depends on the dataset, and the best-working examples are between 120 and 200 characters long.
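The splitting suggested above could look roughly like the sketch below. It is only an illustration: `split_text` and its `max_chars` parameter are hypothetical names, not part of the Tacotron2 repository, and the 200-character limit is simply taken from the range mentioned above.

```python
import re

def split_text(text, max_chars=200):
    """Pack whole sentences into chunks of at most max_chars characters.

    Hypothetical helper for illustration; a single sentence longer than
    max_chars is kept as its own (over-long) chunk rather than cut mid-word.
    """
    # Collapse all whitespace, then split after sentence-ending punctuation.
    sentences = re.split(r'(?<=[.!?])\s+', ' '.join(text.split()))
    chunks, current = [], ''
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f'{current} {sentence}'.strip()
        if len(candidate) <= max_chars or not current:
            # Still fits, or this is the first sentence of a new chunk.
            current = candidate
        else:
            # Adding the sentence would exceed the limit: start a new chunk.
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk would then be passed through `tacotron2.text_to_sequence`, `tacotron2.infer` and `waveglow.infer` as in the script above, and the resulting audio arrays joined (for example with `np.concatenate`) before writing the WAV file.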

ghost commented 4 years ago

@harshbafna as for the 1001KB file size, the maximum audio length is determined by the --max-decoder-steps variable which is set by default to 2000 steps: https://github.com/NVIDIA/DeepLearningExamples/blob/master/PyTorch/SpeechSynthesis/Tacotron2/tacotron2/arg_parser.py#L72 https://github.com/NVIDIA/DeepLearningExamples/blob/master/PyTorch/SpeechSynthesis/Tacotron2/tacotron2/model.py#L585

We could successfully run inference up to 2000 steps; beyond that, the audio started to lose quality for the reason @machineko explained.

harshbafna commented 4 years ago

Thanks for the detailed explanation, @machineko and @GrzegorzKarchNV.

apthagowda97 commented 4 years ago

Correct me if I am wrong, but the model loaded with torch.hub.load('nvidia/DeepLearningExamples:torchhub', 'nvidia_tacotron2') has --max-decoder-steps set to 1000, not 2000.
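A rough size estimate can help check which limit is in effect. The sketch below assumes one mel frame per decoder step, a hop length of 256 samples per frame, and 32-bit float audio samples; none of these values are stated in this thread, so treat them as assumptions. The 22050 Hz sample rate comes from the script above.

```python
def estimated_wav_kib(decoder_steps, hop_length=256, bytes_per_sample=4):
    """Approximate WAV payload size in KiB (header ignored).

    hop_length and bytes_per_sample are assumed values, not taken from the thread.
    """
    samples = decoder_steps * hop_length
    return samples * bytes_per_sample / 1024

def estimated_seconds(decoder_steps, hop_length=256, sample_rate=22050):
    """Approximate audio duration in seconds for a given number of decoder steps."""
    return decoder_steps * hop_length / sample_rate

for steps in (1000, 2000):
    print(steps, estimated_wav_kib(steps), round(estimated_seconds(steps), 2))
```

Under these assumptions, 1000 steps gives about 1000 KiB of audio data (roughly 11.6 s), which is close to the reported 1001 KB file, whereas 2000 steps would give about 2000 KiB (roughly 23.2 s).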

NVIDIA / DeepLearningExamples

[Waveglow/Pytorch] Pre-trained waveglow/tacotron model from torchhub does not work for longer text #497