as-ideas / ForwardTacotron

⏩ Generating speech in a single forward pass without any attention!
https://as-ideas.github.io/ForwardTacotron/

Extracting Alignment from Tacotron - Cherry Pick? #52

Open sbuser opened 3 years ago

sbuser commented 3 years ago

If I'm following along correctly, it looks to me like the model in train_tacotron is only used to extract the alignments, which are then saved and used by train_forward's model.
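For illustration, here is a minimal sketch of the kind of duration extraction I mean: counting, per input phoneme, how many mel frames attend to it most strongly. The function name, shapes, and argmax-count approach are my own assumptions, not necessarily the repo's exact implementation.

```python
import numpy as np

def durations_from_attention(attn: np.ndarray) -> np.ndarray:
    """Derive per-phoneme durations from a Tacotron attention matrix.

    attn: attention weights of shape (n_mel_frames, n_phonemes),
          one row per decoder step.
    Returns an integer duration per phoneme; durations sum to n_mel_frames.
    """
    n_phonemes = attn.shape[1]
    # For each mel frame, find the phoneme it attends to most strongly.
    argmax_per_frame = attn.argmax(axis=1)
    # Duration of a phoneme = number of frames whose argmax landed on it.
    durations = np.bincount(argmax_per_frame, minlength=n_phonemes)
    return durations
```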

When using train_tacotron on a single-speaker dataset of ~100k English utterances, I'm seeing a divergence between Val/Loss and Val/Attention_Score around step 15,000 (batch size 22): Val/Loss keeps decreasing, but Val/Attention_Score starts to drop. This continues throughout my modified training schedule (which I created after seeing the same behavior with the original schedule).

It doesn't look to me like the alignments are cherry-picked from the checkpoint with the best Val/Attention_Score. Is there a downside to implementing that, or am I missing something?
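Something like the following is what I have in mind: keep a separate "best attention" checkpoint alongside the regular one, and extract durations from that model instead of the latest one. This is only a sketch with hypothetical names; it is not the repo's existing API.

```python
import torch

# Track the best validation attention score seen so far (hypothetical helper).
best_attention_score = float('-inf')

def maybe_save_best(model, step, val_attention_score,
                    path='checkpoints/taco_best_attn.pt'):
    """Save a separate checkpoint whenever Val/Attention_Score improves."""
    global best_attention_score
    if val_attention_score > best_attention_score:
        best_attention_score = val_attention_score
        torch.save({'model': model.state_dict(),
                    'step': step,
                    'val_attention_score': val_attention_score}, path)
```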

Was the original schedule, with changes every ~10k steps, based on the ~10k utterances in LJSpeech, and would you suggest I dramatically increase the step counts for my data? Or was the original schedule just the result of tuning/experimentation?

Any ideas what might be causing the divergence around step 15k? Thinking it was simple overfitting, I tried increasing dropout significantly, but I still see the same overall phenomenon.

sbuser commented 3 years ago

Further testing shows that Val/Attention_Score starts to deviate when r drops according to the schedule (e.g. from 5 down to 3). It seems like my dataset doesn't play well with the shifting outputs_per_step.
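To make the question concrete, this is roughly the kind of staged schedule I mean, with the r=5 stage held longer before stepping down to r=3. The tuple format and numbers here are purely illustrative assumptions; the actual schedule structure lives in the repo's hparams and may differ.

```python
# Hypothetical schedule sketch (format and values for illustration only):
# each stage = (reduction factor r, learning rate, train until this step, batch size).
tts_schedule = [
    (7, 1e-3,  30_000, 22),   # warm up with a coarse reduction factor
    (5, 3e-4,  90_000, 22),   # hold r=5 well past the point where attention was stable
    (3, 1e-4, 150_000, 22),   # only then step down to r=3
    (1, 1e-4, 350_000, 22),   # finish at r=1 with no attention breaks
]
```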

cschaefer26 commented 3 years ago

Hi, what's the exact schedule you are using? Usually I see a slight drop in the attention score over time, but not too much. It's also questionable whether a higher score is necessarily better; it just needs to be decent imo. In my experience it is just important to get the Tacotron down to r=1 reduction with no attention breaks. If you post your tensorboard plot I could probably give a hint about the schedule to use. Sometimes it requires a bit of tweaking (also check if your samples contain a lot of silences).
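On the silences point: one simple way to check and reduce leading/trailing silence is to trim it before preprocessing, e.g. with librosa. The function name, top_db value, and sample rate below are assumptions for illustration, not settings from this repo.

```python
import librosa

def trim_silences(wav_path: str, top_db: float = 40.0, sample_rate: int = 22050):
    """Load an utterance and trim leading/trailing silence.

    Long silences can hurt the attention alignment, since the decoder has to
    attend to "nothing" for many frames. Lower top_db trims more aggressively.
    """
    wav, sr = librosa.load(wav_path, sr=sample_rate)
    trimmed, _ = librosa.effects.trim(wav, top_db=top_db)
    return trimmed, sr
```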