NVIDIA / flowtron

Flowtron is an auto-regressive flow-based generative network for text to speech synthesis with control over speech variation and style transfer
https://nv-adlr.github.io/Flowtron
Apache License 2.0

Inference on pre-trained model (flowtron_ljs) speaking nonsense. #144

Closed SornrasakC closed 2 years ago

SornrasakC commented 2 years ago

So, I tried to run the inference demo: python inference.py -c config.json -f models/flowtron_ljs.pt -w models/waveglow_256channels_v4.pt -t "It is well know that deep generative models have a rich latent space!" -i 0

However, I could only find the waveglow_256channels_universal_v5 checkpoint online, which is why I changed the command to: python inference.py -c config.json -f models/flowtron_ljs.pt -w models/waveglow_256channels_universal_v5.pt -t "It is well know that deep generative models have a rich latent space!" -i 0

The result I got is shown in the screenshot [image attached], and the recording below had no sound at all.

https://user-images.githubusercontent.com/43643389/142388395-54a01d4d-4bc5-4986-8248-6153c3a95c74.mov

And here is the result after I found this comment: https://github.com/NVIDIA/flowtron/issues/141#issuecomment-950121008

https://user-images.githubusercontent.com/43643389/142388023-1aaa2d35-aad9-4dee-b562-ed926eca8817.mov

Any ideas where I went wrong? (I am doing all of this in Colab, if it matters.)

Thank you.

Bahm9919 commented 2 years ago

It's normal; it's probably an attention problem (or maybe a gate problem). How many frames do you provide? Try providing more or fewer frames at inference with this option: python inference.py -c config.json -f models/flowtron_ljs.pt -w models/waveglow_256channels_universal_v5.pt -t "It is well know that deep generative models have a rich latent space!" -i 0 -n 400. Here -n 400 sets the number of frames (400 is the default); try 200 or 300 and let me know.
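If you want to sweep several frame counts in one go from Colab, a minimal sketch like the one below works; the checkpoint paths and the text are just copied from the command above, and the loop simply shells out to inference.py.

```python
# Sketch: sweep the -n (number of frames) option and run inference.py for each value.
# Paths and the output location are assumptions copied from the command above.
import subprocess

TEXT = "It is well know that deep generative models have a rich latent space!"

for n_frames in (200, 300, 400, 500):
    cmd = [
        "python", "inference.py",
        "-c", "config.json",
        "-f", "models/flowtron_ljs.pt",
        "-w", "models/waveglow_256channels_universal_v5.pt",
        "-t", TEXT,
        "-i", "0",
        "-n", str(n_frames),
    ]
    print("Running:", " ".join(cmd))
    subprocess.run(cmd, check=True)  # inference.py writes the audio to its results directory
```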

SornrasakC commented 2 years ago

Thank you for the reply.

Here are the frame counts I tried (200 to 500):

https://user-images.githubusercontent.com/43643389/144173687-ed3efc81-e8e4-49c0-b182-5c514f4b30e3.mov

https://user-images.githubusercontent.com/43643389/144173723-ca057e4e-438a-4248-9e15-c2bb4d0afe04.mov

https://user-images.githubusercontent.com/43643389/144173727-68572d88-d323-4c77-9183-45e32e494c1a.mov

https://user-images.githubusercontent.com/43643389/144173730-d16bf6b5-bec8-4cb5-9676-289c4f77cc3e.mov

There are roughly 230 frames for n=400 and 290 frames for n=500, and they are all still speaking an alien language.

Bahm9919 commented 2 years ago

Check your text cleaners; maybe you are using a different ARPAbet or phoneme representation, or you didn't include the CMU dict.
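As a quick sanity check, you can print the text-processing settings your config actually uses. The key names below ("text_cleaners", "cmudict_path", "p_arpabet") are assumed from the stock Flowtron config.json and may differ if the file has been edited.

```python
# Print the text-processing settings from config.json (key names are assumptions
# based on the stock Flowtron config and may differ in a modified repo).
import json

with open("config.json") as f:
    config = json.load(f)

data_config = config.get("data_config", {})
for key in ("text_cleaners", "cmudict_path", "p_arpabet"):
    print(key, "=", data_config.get(key, "<missing>"))
```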


SornrasakC commented 2 years ago

So, I tried installing a fresh clone of the repo from the main branch, without any of my commits, then fixed a few lines in inference.py (removed .half(), changed 'bottom' to 'lower'), and now I can generate audio that says exactly what the input text is.
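For reference, here is a minimal sketch of the two kinds of edits, using stand-in objects rather than the actual Flowtron code; the exact lines in inference.py may differ.

```python
# Illustrative only: stand-ins for the two fixes, not the actual inference.py code.
import numpy as np
import matplotlib.pyplot as plt
import torch

# 1) Skip the fp16 cast: keep the model and inputs in fp32,
#    i.e. drop calls like `model = model.half()` / `tensor.half()`.
model = torch.nn.Linear(4, 4)   # stand-in for the Flowtron model, left in fp32
x = torch.randn(1, 4)           # stand-in input, left in fp32
y = model(x)

# 2) Newer matplotlib no longer accepts origin='bottom'; use origin='lower'.
mel = np.random.rand(80, 200)   # stand-in mel spectrogram
fig, ax = plt.subplots()
ax.imshow(mel, aspect="auto", origin="lower", interpolation="none")
fig.savefig("mel.png")
```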

Then I went back to my original repo, suspecting that my symbols.py was the problem, since I had removed all the ARPAbet symbols from it.

After adding all the ARPAbet symbols back, it now works just like the fresh clone.
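For anyone hitting the same thing: as far as I can tell, the stock symbols.py (Flowtron uses a keithito/tacotron-style text module) builds its symbol list roughly as below. If the ARPAbet entries are dropped, the text-to-ID mapping no longer matches the one the pretrained checkpoint was trained with, so the model babbles. This is only a sketch and details may differ slightly from the repo.

```python
# Sketch of the stock symbols.py layout (details may differ slightly in the repo).
from text import cmudict  # Flowtron's bundled CMU dict module

_pad = '_'
_punctuation = "!'(),.:;? "
_special = '-'
_letters = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'

# ARPAbet phonemes are prefixed with '@' so they don't collide with plain letters.
_arpabet = ['@' + s for s in cmudict.valid_symbols]

# Removing _arpabet shrinks and shifts the symbol table, so a checkpoint trained
# with ARPAbet symbols maps text to the wrong embeddings -> "alien" speech.
symbols = [_pad] + list(_special) + list(_punctuation) + list(_letters) + _arpabet
```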

Thank you for your help @Bahm9919