NVIDIA / flowtron

Flowtron is an auto-regressive flow-based generative network for text to speech synthesis with control over speech variation and style transfer
https://nv-adlr.github.io/Flowtron
Apache License 2.0
892 stars 177 forks

Problems running "Inference demo" #74

Open latentcode opened 4 years ago

latentcode commented 4 years ago

I tried running the "Inference demo". The spectrograms seem reasonable (attached), but the wav file that was generated (sid0_sigma0.5.wav, 367 KB) has no sound. What follows is a list of all the issues I had, offered in the context of improvement for all, not to be construed as lack of appreciation for the work that was done so far! Thank you BTW.

1) I used this Docker image: nvcr.io/nvidia/pytorch:20.09-py3, but it's missing tensorboardX. NOTE: I've also run waveglow, but the TOT version of that repo doesn't work in this Docker. It leads to this error:

torch.nn.modules.module.ModuleAttributeError: 'Conv1d' object has no attribute '_non_persistent_buffers_set'

I had success running waveglow with: nvcr.io/nvidia/pytorch:20.03-py3

2) The instructions indicate:

python inference.py -c config.json -f models/flowtron_ljs.pt -w models/waveglow_256channels_v4.pt -t "It is well know that deep generative models have a deep latent space!" -i 0

but waveglow_256channels_v4.pt is not provided. I found a few workarounds and tried models from:

I didn't get a valid .wav file with any of these.

3) I had to fix a problem with the code in inference.py. Here's the output:

    Loaded checkpoint 'models/flowtron_ljs.pt')
    Number of speakers : 1
    Traceback (most recent call last):
      File "inference.py", line 132, in <module>
        args.id, args.n_frames, args.sigma, args.gate, args.seed)
      File "inference.py", line 76, in infer
        axes[0].imshow(mels[0].cpu().numpy(), origin='bottom', aspect='auto')
      File "/opt/conda/lib/python3.6/site-packages/matplotlib/__init__.py", line 1438, in inner
        return func(ax, *map(sanitize_sequence, args), **kwargs)
      File "/opt/conda/lib/python3.6/site-packages/matplotlib/axes/_axes.py", line 5521, in imshow
        resample=resample, **kwargs)
      File "/opt/conda/lib/python3.6/site-packages/matplotlib/image.py", line 905, in __init__
        **kwargs
      File "/opt/conda/lib/python3.6/site-packages/matplotlib/image.py", line 246, in __init__
        cbook._check_in_list(["upper", "lower"], origin=origin)
      File "/opt/conda/lib/python3.6/site-packages/matplotlib/cbook/__init__.py", line 2257, in _check_in_list
        .format(v, k, ', '.join(map(repr, values))))
    ValueError: 'bottom' is not a valid value for origin; supported values are 'upper', 'lower'

I updated inference.py:

#axes[0].imshow(mels[0].cpu().numpy(), origin='bottom', aspect='auto')
#axes[1].imshow(attention[:, 0].transpose(), origin='bottom', aspect='auto')
axes[0].imshow(mels[0].cpu().numpy(), origin='lower', aspect='auto')
axes[1].imshow(attention[:, 0].transpose(), origin='lower', aspect='auto')

[Attached images: sid0_sigma0.5_attnlayer0, sid0_sigma0.5_attnlayer1]

maggiezha commented 4 years ago

I had similar issues when I first ran it. I can now generate audio correctly, so please try the following:

  1. I also used the NVIDIA NGC PyTorch container (same version as yours). After launching the container, you need to run:

    export PATH=/usr/local/nvidia/bin:/usr/local/cuda/bin:${PATH}
    apt-get update -y
    apt-get install -y ffmpeg libsndfile1 sox locales vim
    pip install --upgrade pip
    pip install -U numpy
    pip install librosa soundfile audioread sox matplotlib Pillow tensorflow==1.15.2 tensorboardX inflect unidecode natsort pandas jupyter tgt srt peakutils --ignore-installed certifi

Then I "docker commit" to a new image for my future use.

  2. For inference, I got this warning:

    /opt/conda/lib/python3.6/site-packages/torch/serialization.py:659: SourceChangeWarning: source code of class 'torch.nn.modules.conv.ConvTranspose1d' has changed. you can retrieve the original source code by accessing the object's source attribute or set torch.nn.Module.dump_patches = True and use the patch tool to revert the changes.
      warnings.warn(msg, SourceChangeWarning)

So I added: torch.nn.Module.dump_patches = True to inference.py

Also, for the WaveGlow checkpoint, I tried your first one too, but it only generated silent audio. I found that this one works: https://ngc.nvidia.com/catalog/models/nvidia:waveglow_ljs_256channels

  3. I did the same as you.

Now I can generate good audio.
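For anyone repeating the setup above, a quick way to see which Python packages are still missing from a container before running inference.py might look like the sketch below. The helper name `missing_modules` and the module list are illustrative assumptions, not part of the Flowtron repo; the list is only a subset of the pip packages mentioned in this thread, and it assumes each package's import name matches the name shown.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Assumed subset of the packages discussed in this thread (import names).
WANTED = ["tensorboardX", "librosa", "soundfile", "matplotlib", "inflect", "unidecode"]

if __name__ == "__main__":
    missing = missing_modules(WANTED)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All listed modules were found.")
```

Running this inside the container before and after the pip installs should show whether tensorboardX (reported missing in 20.09-py3 above) is now available.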

serg06 commented 4 years ago

I was able to get it working on Ubuntu without NGC by using Waveglow V3.

Then I tried the exact same thing on Windows and I can't get it to work. It's just silence every time.
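When comparing checkpoints or platforms, it can help to check the generated file numerically instead of by ear. The following is a minimal sketch using only the Python standard library; the function names `read_wav_samples` and `summarize` are hypothetical, and it assumes the output is either 16-bit PCM or 32-bit float WAV. It also counts non-finite samples (NaN or inf) separately as one possible cause worth ruling out; nothing in this thread confirms that is what happens here.

```python
import array
import math
import struct
import sys


def read_wav_samples(path):
    """Read samples from a 16-bit PCM or 32-bit float RIFF/WAVE file."""
    with open(path, "rb") as f:
        data = f.read()
    if data[:4] != b"RIFF" or data[8:12] != b"WAVE":
        raise ValueError("not a RIFF/WAVE file")
    fmt_tag = bits = None
    pos = 12
    while pos + 8 <= len(data):
        chunk_id, size = struct.unpack_from("<4sI", data, pos)
        body = data[pos + 8:pos + 8 + size]
        if chunk_id == b"fmt ":
            # Format tag 1 is integer PCM, 3 is IEEE float.
            fmt_tag, _ch, _rate, _brate, _align, bits = struct.unpack_from("<HHIIHH", body)
        elif chunk_id == b"data":
            if fmt_tag == 1 and bits == 16:
                samples = array.array("h")
            elif fmt_tag == 3 and bits == 32:
                samples = array.array("f")
            else:
                raise ValueError("unsupported format in this sketch")
            usable = len(body) - len(body) % samples.itemsize
            samples.frombytes(body[:usable])
            if sys.byteorder == "big":
                samples.byteswap()  # WAV sample data is little-endian
            return samples
        pos += 8 + size + (size & 1)  # chunks are padded to even length
    raise ValueError("no data chunk found")


def summarize(samples):
    """Report sample count, number of NaN/inf samples, and finite peak."""
    finite = [abs(s) for s in samples if math.isfinite(s)]
    return {
        "count": len(samples),
        "non_finite": len(samples) - len(finite),
        "peak": max(finite, default=0),
    }
```

For example, `summarize(read_wav_samples("sid0_sigma0.5.wav"))` with a peak of 0, or a large `non_finite` count, would indicate that the file itself contains no usable signal, as opposed to a playback problem.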

rafaelvalle commented 3 years ago

Your issue may be related to the PyTorch version. Make sure you're running the latest version, and try running inference in fp32.