NVIDIA / flowtron

Flowtron is an auto-regressive flow-based generative network for text to speech synthesis with control over speech variation and style transfer
https://nv-adlr.github.io/Flowtron
Apache License 2.0

Bad attention weights #43

Open · dmitrii-obukhov opened 4 years ago

dmitrii-obukhov commented 4 years ago

Hello

I am trying to train Flowtron on LJSpeech. Unfortunately, after 24 hours of training the attention weights are still bad. Server configuration: 4 instances with 8xV100.

[attention weight plots]

Do you have any ideas?

rafaelvalle commented 4 years ago

Yes, train it by progressively adding steps of flow as the model learns to attend on each step of flow. Start with 1 step of flow, train it until it learns attention, use this model to warm-start a model with 2 steps of flow, and so on...
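A minimal sketch of that warm-start in PyTorch (the `model_config` keys, checkpoint layout, and file names here are assumptions based on this repo's config.json and train.py; adjust them to your setup):

```python
import json

import torch

from flowtron import Flowtron  # this repo's model

# Build the larger model: same config as before, but with 2 steps of flow.
with open("config.json") as f:
    model_config = json.load(f)["model_config"]
model_config["n_flows"] = 2
model = Flowtron(**model_config)

# Load the converged 1-flow checkpoint. The "model" key is an assumption
# about how the checkpoint was saved; it may also be a bare state dict.
ckpt = torch.load("outdir/model_1flow.pt", map_location="cpu")
state_dict = ckpt["model"] if "model" in ckpt else ckpt

# strict=False copies every matching weight (embedding, encoder, the first
# flow step) and leaves the new second flow step randomly initialized.
model.load_state_dict(state_dict, strict=False)
```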

dmitrii-obukhov commented 4 years ago

@rafaelvalle Thanks for your reply

I started a new training run with model_config.n_flows=1, but after 16 hours the attention weights still look bad:

[attention weight plot]

In one of the threads I read that good alignment is produced in less than 24 hours.

So, what could be wrong?

rafaelvalle commented 4 years ago

Can you share your tensorboard plots?

dmitrii-obukhov commented 4 years ago

Yes

[tensorboard plots]

rafaelvalle commented 4 years ago

Does it have good attention around 60k iters?

dmitrii-obukhov commented 4 years ago

No. The attention looks the same at all iterations.

rafaelvalle commented 4 years ago

Make sure you trim silences from the beginning and end of your audio files

dmitrii-obukhov commented 4 years ago

I use the LJSpeech dataset for training. Any instructions on how to trim the files?

Could the problem be that I use distributed training?

Also, I set the flag fp16_run=true

adrianastan commented 4 years ago

> Make sure you trim silences from the beginning and end of your audio files

Should there be no silence at all at the beginning and end, or should there be at least, say, 0.1 seconds of silence?

adrianastan commented 4 years ago

> I use the LJSpeech dataset for training. Any instructions on how to trim the files?

The simplest way would be to use librosa.effects.trim()

rafaelvalle commented 4 years ago

There should be no silence at all at the beginning and end of each audio file. sox and librosa.effects.trim can be used to trim silences from the beginning and end.
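For reference, a minimal trimming sketch with librosa (the file names and the top_db value are assumptions; as discussed below, top_db is dataset-dependent):

```python
import librosa
import soundfile as sf

# Trim leading/trailing silence from one file. top_db controls how quiet a
# frame must be, relative to the peak, to count as silence: too low clips
# speech, too high leaves silence in. Values around 20-30 come up in this
# thread; tune it per dataset.
y, sr = librosa.load("LJ001-0001.wav", sr=None)  # keep the original sample rate
y_trimmed, _ = librosa.effects.trim(y, top_db=30)
sf.write("LJ001-0001_trimmed.wav", y_trimmed, sr)
```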

kurbobo commented 4 years ago

I have a similar problem.

> I use the LJSpeech dataset for training. Any instructions on how to trim the files?
>
> Could the problem be that I use distributed training?
>
> Also, I set the flag fp16_run=true

Have you solved this problem? I also tried to predict LPC spectrograms instead of mel spectrograms, but I always get a picture like this. Does anybody know what the problem is?

[attention plot]

dmitrii-obukhov commented 4 years ago

The problem remains unresolved. I tried trimming silences from the beginning and end of the audio files with librosa.effects.trim(), but the picture stays the same.

rafaelvalle commented 4 years ago

@kurbobo does the attention map always look like that? You might have to change from byte to bool: https://github.com/NVIDIA/flowtron/blob/master/flowtron.py#L33
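The byte-to-bool change looks roughly like this (a sketch of the mask helper around flowtron.py#L33; the actual function body may differ):

```python
import torch

def get_mask_from_lengths(lengths):
    # Sketch of the helper near flowtron.py#L33: build a padding mask from
    # per-sample lengths. On recent PyTorch versions, masks used with
    # masked_fill must be torch.bool; .byte() masks warn or error out.
    max_len = int(torch.max(lengths).item())
    ids = torch.arange(max_len, device=lengths.device)
    return (ids < lengths.unsqueeze(1)).bool()  # was .byte() in older code

print(get_mask_from_lengths(torch.tensor([2, 4])))
# tensor([[ True,  True, False, False],
#         [ True,  True,  True,  True]])
```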

rafaelvalle commented 4 years ago

@adrianastan There should be no silence at the beginning or at the end of an audio file.

zjFFFFFF commented 4 years ago

@rafaelvalle Can you tell me what "no silence" means exactly? If I use librosa.effects.trim(), what should top_db be set to? For my dataset, if I set top_db to 20, some speech gets cut off as well; if I set it a little higher, some audio files still seem to have silence at the beginning.

dmitrii-obukhov commented 4 years ago

@kurbobo The problem was solved when I used the encoder and embedding layers from a pretrained model (see the sketch below).

@zjFFFFFF In my case top_db = 30 works well enough
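A sketch of the partial warm-start described above (the layer-name prefixes `embedding.` and `encoder.`, the checkpoint path, and its layout are assumptions; inspect `model.state_dict().keys()` for your checkout):

```python
import json

import torch

from flowtron import Flowtron  # this repo's model

with open("config.json") as f:
    model_config = json.load(f)["model_config"]
model = Flowtron(**model_config)

# Pull only the embedding and encoder weights out of a pretrained checkpoint
# (e.g. the released LJS model); everything else keeps its fresh init.
ckpt = torch.load("models/flowtron_ljs.pt", map_location="cpu")
pretrained = ckpt["model"] if "model" in ckpt else ckpt  # layout may vary

wanted = {k: v for k, v in pretrained.items()
          if k.startswith(("embedding.", "encoder."))}
model.load_state_dict(wanted, strict=False)  # only matching layers are copied
```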

zjFFFFFF commented 4 years ago

@dLeos

In fact, I got the same plot as you (training from scratch). But the validation loss does not seem to affect the experimental results. Between iterations 800,000 and 950,000 (at 1,000,000 iterations I can't get acceptable results), the model can generate acceptable audio. So you can try different checkpoints one by one.

kurbobo commented 4 years ago

> @kurbobo does the attention map always look like that? You might have to change from byte to bool: https://github.com/NVIDIA/flowtron/blob/master/flowtron.py#L33

@rafaelvalle No, it doesn't always appear, and I had already fixed the byte/bool problem before training, but it still happens sometimes. I have one more question: am I right that Flowtron in this repo converts every sentence to an arpabet transcription and then trains to map the sequence of arpabet transcriptions to a sequence of frequency frames?


Liujingxiu23 commented 4 years ago

@kurbobo @rafaelvalle I tried mels: train with n_flows=1 first and then use that model to warm-start the n_flows=2 model. Both alignments are right, and the synthesized wavs are good. But with the LPC parameters used by the LPCNet vocoder, everything seems good at n_flows=1 (the loss is good, the alignment is right); however, when I train n_flows=2 warm-started from the trained n_flows=1 model, the second alignment fails and the loss just oscillates without descending.

rafaelvalle commented 4 years ago

@Liujingxiu23 please share training and validation losses and attention plots for the 1-flow and the 2-flow models.

rafaelvalle commented 4 years ago

Did you warm-start the 2-flow model from a 1-flow checkpoint around 200k iterations?