begeekmyfriend opened this issue 5 years ago
This is a very interesting idea.
I do not quite understand why the guided attention mask needs to be a fixed size. As I understand it, there should be one guided attention loss per sample (since padding affects the attention loss). One way to implement this might be to pass the input/output lengths to the guided_attention loss function, along the lines of the sketch below.
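For example, something like this (a rough, untested sketch; the alignment tensor layout `[batch, mel_frames, text_length]` and the function/argument names are just assumptions, not this repo's actual API):

```python
import tensorflow as tf

def guided_attention_loss(alignments, input_lengths, output_lengths, g=0.2):
    """Hypothetical per-sample guided attention loss.

    Assumes alignments are shaped [batch, mel_frames, text_length] and that
    input_lengths / output_lengths hold the unpadded lengths per sample.
    """
    max_in = tf.shape(alignments)[2]   # padded text length
    max_out = tf.shape(alignments)[1]  # padded number of mel frames

    # Normalized positions n/N and t/T, computed per sample
    n = tf.cast(tf.range(max_in), tf.float32)
    t = tf.cast(tf.range(max_out), tf.float32)
    N = tf.cast(tf.expand_dims(input_lengths, -1), tf.float32)   # [batch, 1]
    T = tf.cast(tf.expand_dims(output_lengths, -1), tf.float32)  # [batch, 1]
    n_norm = tf.expand_dims(n, 0) / N   # [batch, text_length]
    t_norm = tf.expand_dims(t, 0) / T   # [batch, mel_frames]

    # W[b, t, n] = 1 - exp(-(n/N_b - t/T_b)^2 / (2 g^2))
    diff = tf.expand_dims(t_norm, 2) - tf.expand_dims(n_norm, 1)
    w = 1.0 - tf.exp(-(diff ** 2) / (2.0 * g * g))

    # Mask out padded positions so padding no longer affects the loss
    in_mask = tf.sequence_mask(input_lengths, max_in, dtype=tf.float32)
    out_mask = tf.sequence_mask(output_lengths, max_out, dtype=tf.float32)
    mask = tf.expand_dims(out_mask, 2) * tf.expand_dims(in_mask, 1)

    return tf.reduce_sum(w * alignments * mask) / tf.reduce_sum(mask)
```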
Another interesting idea would be to implement a decaying weight for this loss, since guiding the attention becomes less important as the model learns to align.
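For the decay, a minimal sketch could be something like this (the initial weight and half-life are only placeholders, not tuned hyper parameters):

```python
def guided_attention_weight(step, initial=10.0, halflife=10000):
    # Halve the weight of the attention term every `halflife` steps, so the
    # constraint fades out as the model learns to align on its own.
    return initial * 0.5 ** (step / halflife)

# e.g. loss = base_loss + guided_attention_weight(global_step) * attention_loss
```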
Your idea sounds brilliant, but it is too hard to implement dynamically weighted alignments element-wise on tensors in TensorFlow. I spent two days on it but could not work it out. I had to use numpy instead, which requires a fixed size.
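What I ended up doing is roughly this: precompute a fixed-size weight matrix in numpy from estimated maximum lengths and multiply it element-wise with the padded alignments (the names below are only illustrative, not exactly my code):

```python
import numpy as np

def guided_attention_weights(max_text_length, max_mel_frames, g=0.2):
    # Fixed-size DC-TTS style weights: small near the diagonal, large away from it.
    W = np.zeros((max_mel_frames, max_text_length), dtype=np.float32)
    for t in range(max_mel_frames):
        for n in range(max_text_length):
            W[t, n] = 1.0 - np.exp(
                -((n / max_text_length - t / max_mel_frames) ** 2) / (2.0 * g * g))
    return W

# The constant matrix is fed in as a tensor, and the extra loss term is simply
# the mean of W * alignments over the padded (fixed-size) batch.
```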
I can imagine... I am not experienced with TensorFlow, so I can't be of much help here. Let's hope someone takes a closer look at this issue. Thanks anyway for sharing your efforts :)
(Alignment plots: before / after guided attention)
@begeekmyfriend Hi, does your attention_loss actually change? I tried guide_attention, but on my data the attention_loss barely changes at all.
Guided attention aims at quick alignment by supervising how the attention forms, and the total loss is lower than without it. By the way, please sync my latest commit fix https://github.com/begeekmyfriend/Tacotron-2/commit/4083fdfdf53aba3bad9f3f2d2bc561f4154fd294
Here is my fastest convergence record
Thanks for sharing. During my training I set tacotron_teacher_forcing_mode='scheduled', and the losses were very low, but without any alignment. As the teacher_forcing_ratio decreased, the loss grew quickly. I wonder how you set up your training?
@zhangyi02 I am using the hyper parameters on my fork. Please check whether your dataset is good enough.
I checked your hparams, but I am even more confused about the result. During training, with or without teacher forcing, I always get a much smaller loss than in your experiment, but no alignment at all. And the eval loss is very high, ~1. It doesn't make sense as overfitting, since I got a training loss of ~0.3 at the very beginning. I suspect something must be wrong in my code. Do you have any idea how to get me out of this? And what do you mean by checking the dataset? What should I do?
A lower loss might mean overfitting; the scheduled teacher forcing constrains it. I think you need to check your dataset, or you can use LJSpeech or another open corpus to verify it. Note that my branch is only for Chinese Mandarin, so you need to change the dictionary.
For the guided attention loss, should T and N represent the max_text_length and max_mel_frames of the overall dataset or of one batch?
No, they are average values over all samples. You have to estimate them ahead of time, e.g. with one pass over the metadata (see the sketch below).
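A rough sketch of what I mean by estimating it ahead of time (assuming you can iterate over the preprocessed (text, mel) pairs; the function name is made up):

```python
import numpy as np

def estimate_guided_attention_size(samples):
    # samples: iterable of (text, mel) pairs from the preprocessed dataset
    text_lengths = [len(text) for text, mel in samples]
    mel_lengths = [mel.shape[0] for text, mel in samples]
    # Use the corpus-wide averages (or a generous percentile) as N and T.
    return int(np.mean(text_lengths)), int(np.mean(mel_lengths))
```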
Hi guys, I have been trying guided attention, inspired by DC-TTS. With this attention loss added, convergence speeds up, and at the same step the total loss is even lower than before. So I think it is worth trying for all of you, and I have opened my Mandarin branch for you. Enjoy it. Unfortunately, the guided attention weights have to be a fixed size, set to the max length of the text and of the mel frames, and therefore training is slowed down. Any better idea is welcome! For an illustration of guided attention, please refer to https://github.com/mozilla/TTS/issues/13#issuecomment-384276551
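For reference, the guided attention weights in the DC-TTS paper are defined (with $N$ the text length, $T$ the number of mel frames, and $g \approx 0.2$) as

$$
W_{nt} = 1 - \exp\!\left(-\frac{(n/N - t/T)^2}{2g^2}\right),
\qquad
\mathcal{L}_{\mathrm{att}} = \operatorname{mean}_{n,t}\big[A_{nt}\, W_{nt}\big],
$$

so attention mass far from the diagonal is penalized, which pushes the alignment towards a roughly monotonic reading order.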