Rayhane-mamah / Tacotron-2

DeepMind's Tacotron-2 Tensorflow implementation
MIT License

Guided Attention Loss #346

Open begeekmyfriend opened 5 years ago

begeekmyfriend commented 5 years ago

Hi guys, I have been trying guided attention, inspired by DC-TTS. After adding this attention loss the convergence speeds up, and at the same step the total loss is even lower than before, so I think it is worth trying for all of you. I have opened my Mandarin branch for you. Enjoy it. Unfortunately the guided attention mask has to be a fixed size for the alignment weights, always the max length of the text and of the mel frames, and therefore training is slowed down. Any better idea is welcome! For an illustration of guided attention, please refer to https://github.com/mozilla/TTS/issues/13#issuecomment-384276551
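For reference, a minimal NumPy sketch of the DC-TTS style penalty matrix this refers to (the function name and g=0.2 are illustrative, not taken from the branch):

```python
import numpy as np

def guided_attention_mask(max_text_len, max_mel_frames, g=0.2):
    """DC-TTS style penalty matrix: near zero along the diagonal path and
    approaching one far away from it (fixed size, as described above)."""
    W = np.zeros((max_text_len, max_mel_frames), dtype=np.float32)
    for n in range(max_text_len):
        for t in range(max_mel_frames):
            W[n, t] = 1.0 - np.exp(-((n / max_text_len - t / max_mel_frames) ** 2)
                                   / (2.0 * g * g))
    return W

# The extra loss term would then be the mean of the decoder alignments A
# (padded to the same fixed [max_text_len, max_mel_frames] shape) weighted by W:
#   attention_loss = np.mean(A * W)
```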

alexdemartos commented 5 years ago

This is a very interesting idea.

I do not quite understand why the guided attention mask needs to be a fixed size. As I understand it, there should be one guided attention loss per sample (since padding affects the attention loss).

One way to implement this might be to pass input/output lengths to the guided_attention loss function.

Another interesting idea would be to implement a decaying weight for this loss, since guiding the attention becomes less important as the model learns to align.
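A rough TensorFlow sketch of that suggestion, assuming alignments of shape [batch, text_len, mel_frames], per-sample input/output lengths, and the DC-TTS mask form from above (all names here are hypothetical, not from the fork):

```python
import tensorflow as tf

def guided_attention_loss(alignments, input_lengths, output_lengths, g=0.2):
    """Per-sample guided attention loss built from the true lengths,
    so padded positions do not contribute to the loss."""
    max_n = tf.shape(alignments)[1]                              # text axis
    max_t = tf.shape(alignments)[2]                              # frame axis
    n = tf.cast(tf.range(max_n), tf.float32)[None, :]            # [1, max_n]
    t = tf.cast(tf.range(max_t), tf.float32)[None, :]            # [1, max_t]
    N = tf.cast(tf.expand_dims(input_lengths, -1), tf.float32)   # [batch, 1]
    T = tf.cast(tf.expand_dims(output_lengths, -1), tf.float32)  # [batch, 1]
    n_grid = tf.expand_dims(n / N, -1)                           # [batch, max_n, 1]
    t_grid = tf.expand_dims(t / T, 1)                            # [batch, 1, max_t]
    W = 1.0 - tf.exp(-((n_grid - t_grid) ** 2) / (2.0 * g * g))  # [batch, max_n, max_t]
    # Zero out each sample's padded region before averaging.
    pad_mask = tf.expand_dims(tf.sequence_mask(input_lengths, max_n, tf.float32), -1) * \
               tf.expand_dims(tf.sequence_mask(output_lengths, max_t, tf.float32), 1)
    return tf.reduce_sum(alignments * W * pad_mask) / tf.reduce_sum(pad_mask)
```

The decaying weight from the second point could then just be a scalar applied to this term, for example driven by tf.train.exponential_decay on the global step, so the guidance fades as the model learns to align.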

begeekmyfriend commented 5 years ago

Your idea sounds brilliant, but it is too hard to build the alignment weights dynamically as tensors in TensorFlow and to do the alignment operations element-wise. It took me two days and I did not work it out, so I had to use NumPy instead, which requires a fixed size.

alexdemartos commented 5 years ago

I can imagine... I am not experienced in TensorFlow, so I can't be of much help here. Let's hope someone takes a closer look at this issue. Thanks anyway for sharing your efforts :)

begeekmyfriend commented 5 years ago

Before: step-136000-align. After: step-19000-align. (alignment plots)

shahuzi commented 5 years ago

@begeekmyfriend Hello, may I ask whether your attention_loss actually changes? I tried guided attention, and on my data the attention_loss barely changes at all. (attention_loss plot)

begeekmyfriend commented 5 years ago

Guided attention aims at quick alignment by supervising how the attention forms, and the total loss is lower than without it. By the way, please sync my latest commit fix https://github.com/begeekmyfriend/Tacotron-2/commit/4083fdfdf53aba3bad9f3f2d2bc561f4154fd294

begeekmyfriend commented 5 years ago

Here is my fastest convergence record: step-2000-align (alignment plot)

zhangyi02 commented 5 years ago

Thanks for your sharing. During my training I set tacotron_teacher_forcing_mode='scheduled', and the losses were very low, but without any alignment. As the teacher_forcing_ratio decreases, the loss grows quickly. I wonder how you set up your training?
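For context, a minimal sketch of what a scheduled teacher-forcing ratio can look like (the parameter names and schedule here are illustrative, not the fork's actual hparams): the ratio is held at 1.0 for a warm-up period and then decayed towards a final value.

```python
import tensorflow as tf

def teacher_forcing_ratio(global_step, start_decay=10000, decay_steps=40000,
                          init_ratio=1.0, final_ratio=0.0):
    # Steps elapsed since the decay started (0 during the warm-up phase).
    step = tf.cast(tf.maximum(global_step - start_decay, 0), tf.float32)
    # Cosine-decay the excess over final_ratio down to zero over decay_steps.
    decayed = tf.train.cosine_decay(init_ratio - final_ratio, step, decay_steps)
    return decayed + final_ratio
```

With a schedule like this, the loss climbing as the ratio drops is expected, since the decoder increasingly has to rely on its own previous predictions instead of ground-truth frames.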

begeekmyfriend commented 5 years ago

@zhangyi02 I am using the hyperparameters on my fork. Please check whether your dataset is good enough.

zhangyi02 commented 5 years ago

I checked your hparams, but I am even more confused about the result. During training, with or without teacher forcing I always get a much smaller loss than in your experiment, but no alignment at all. And the eval loss is very high, ~1. Overfitting doesn't make sense, since I got a training loss of ~0.3 at the very beginning. I suspect something must be wrong in my code. Do you have any idea how to help me out here? Also, what do you mean by checking the dataset? What should I do?


begeekmyfriend commented 5 years ago

A lower loss might mean overfitting; the scheduled teacher forcing constrains it. I think you need to check your dataset, or you can use LJSpeech or another open corpus to verify. Note that my branch is only for Mandarin Chinese, so you would need to change the dictionary.

joan126 commented 3 years ago

For the guided attention loss, should T and N represent the max_text_length and max_mel_frames of the overall dataset, or of one batch?

begeekmyfriend commented 3 years ago

No, it is the average value over all samples. You have to estimate it ahead of time.
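A rough sketch of what "estimate it ahead of time" could look like, assuming the text/mel training pairs are already loaded in memory (the data layout here is hypothetical):

```python
import numpy as np

def estimate_mask_dims(texts, mels):
    """Average text length N and mel-frame count T over the whole dataset,
    used to size a fixed guided-attention mask before training starts."""
    N = int(np.mean([len(t) for t in texts]))
    T = int(np.mean([m.shape[0] for m in mels]))   # mels assumed [frames, n_mels]
    return N, T
```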

joan126 commented 3 years ago

> No, it is the average value over all samples. You have to estimate it ahead of time.

In the espnet and mozilla/TTS repos, N and T represent the max_text_length and max_mel_frames of a batch.