NVIDIA / flowtron

Flowtron is an auto-regressive flow-based generative network for text to speech synthesis with control over speech variation and style transfer
https://nv-adlr.github.io/Flowtron
Apache License 2.0

Unconditioned Flowtron #47

Open adrianastan opened 4 years ago

adrianastan commented 4 years ago

Hi,

Did anybody try to train the Flowtron flow architecture in an unconditioned manner, for density estimation for example? If so, any hints and tips you could share?

Thanks!

rafaelvalle commented 4 years ago

Train a model with 1 step of flow first. Then use this model to warm-start a model with 2 steps of flow.
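A minimal sketch of that warm-start step, assuming a standard PyTorch checkpoint with a `state_dict` key; `build_model` is a hypothetical stand-in for Flowtron's config-driven model construction, so only the name-and-shape-matching logic is the point here:

```python
# Warm-start a 2-step-of-flow model from a trained 1-step checkpoint.
# `build_model` is hypothetical; the 'state_dict' checkpoint layout is an
# assumption following common PyTorch conventions.
import torch

model_2flow = build_model(n_flows=2)  # hypothetical constructor

ckpt = torch.load("flowtron_1flow.pt", map_location="cpu")
pretrained = ckpt["state_dict"]

# Copy only parameters whose names and shapes match; the second flow
# step stays randomly initialized and is trained from scratch.
own = model_2flow.state_dict()
matched = {k: v for k, v in pretrained.items()
           if k in own and v.shape == own[k].shape}
own.update(matched)
model_2flow.load_state_dict(own)

print(f"warm-started {len(matched)}/{len(own)} tensors from the 1-flow model")
```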

adrianastan commented 4 years ago

Hi,

Thanks for your reply. I did start training a 1-flow model on the LibriSpeech train-clean-100 data with a modified, unconditioned version of Flowtron. I then used the trained flow to warm-start a 2-flow architecture. However, at inference the output is nothing but noise: https://drive.google.com/file/d/1V7sX3Ma3RFBo6lNSCUxSsNjP3Y_HmAZo/view?usp=sharing

[audio sample: sid0_sigma0.5]

I was expecting at least some babble noise.
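For reference, this is roughly what the unconditioned density-estimation objective looks like: the flow maps mel frames to latents and we minimize the negative log-likelihood under a Gaussian prior. `flow` is a stand-in for the modified, conditioning-free Flowtron, and its return signature (latents plus summed log-determinant) is an assumption, not the repo's exact API:

```python
# Negative log-likelihood for an unconditioned normalizing flow under a
# zero-mean Gaussian prior with standard deviation `sigma`.
import torch

def flow_nll(flow, mels, sigma=1.0):
    z, log_det_sum = flow(mels)  # assumed: latents + summed log|det|
    prior = 0.5 * (z ** 2).sum() / sigma ** 2
    return (prior - log_det_sum) / z.numel()
```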

Any hints on when is a good point to start training the second flow? Should I train longer? Should I lower the learning rate? Below are the loss curves for the first flow:

[screenshot: training and validation loss curves for the 1-flow model]

Thanks!

rafaelvalle commented 4 years ago

The validation loss for your 1-step-of-flow model is starting to plateau. Use this model to warm-start a 2-steps-of-flow model; I expect the validation loss will go down. Alternatively, you can try the same experiment on LJS.

adrianastan commented 4 years ago

I warm-started a 2-flow model from the 1-flow weights and continued training. The training and validation losses are below:

[screenshot: 2-flow training and validation loss curves]

[audio sample: 2flows_sid0_sigma0.5]

Still no speech-like output at inference. https://drive.google.com/file/d/19OC2cSfPgfvrS0mrRx73bkLLKp0yt0v8/view?usp=sharing

I subsequently started a 3-flow model as well:

[screenshot: 3-flow training and validation loss curves]

The output is as follows:

[audio sample: 3flows_sid0_sigma0.5]

https://drive.google.com/file/d/1F7lXcEqx5_gqMDog4KgyahfKDGx7-175/view?usp=sharing

So I assume this architecture might not be complex enough to model a multi-speaker latent space. I will try the same thing on LJSpeech -- perhaps the single-speaker setting is simpler.

Thanks!

rafaelvalle commented 4 years ago

@adrianastan if you trained a model with speaker embeddings, what happens if you do this: `flowtron.infer(flowtron.forward(audio, speaker), other_speaker)`
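A hedged sketch of that round trip: push real audio through the forward (analysis) pass with its true speaker, then decode the resulting latents with a different speaker id. The signatures here are deliberately simplified; the real Flowtron forward and infer also take text, lengths, and attention-related inputs:

```python
# `flowtron`, `mel`, `speaker_id`, and `other_speaker` are assumed to be
# prepared as in the repo's inference script; signatures are simplified.
import torch

with torch.no_grad():
    z, *_ = flowtron.forward(mel, speaker_id)        # audio -> latents
    mel_transfer = flowtron.infer(z, other_speaker)  # latents -> audio, new speaker
```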

adrianastan commented 4 years ago

I did not use speaker embeddings, just a multi-speaker dataset. I removed all conditioning from the flow.
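For concreteness, an illustrative sketch of what such a conditioning-free setup can look like: an autoregressive affine step whose scale and shift are predicted from previous mel frames only, with no text or speaker inputs. This is a simplification for illustration, not the repo's actual code:

```python
# One unconditioned autoregressive affine flow step: frame t is
# transformed using a scale/shift predicted from frames < t only.
import torch
import torch.nn as nn

class UnconditionedARStep(nn.Module):
    def __init__(self, n_mel=80, hidden=1024):
        super().__init__()
        self.rnn = nn.LSTM(n_mel, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, 2 * n_mel)  # predicts log_s and b

    def forward(self, x):
        # x: (batch, time, n_mel); shift input so frame t only sees < t
        shifted = torch.cat([torch.zeros_like(x[:, :1]), x[:, :-1]], dim=1)
        h, _ = self.rnn(shifted)
        log_s, b = self.proj(h).chunk(2, dim=-1)
        z = torch.exp(log_s) * x + b
        return z, log_s.sum()  # latents and summed log-determinant
```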