THUNLP-MT / THUMT

An open-source neural machine translation toolkit developed by Tsinghua Natural Language Processing Group
BSD 3-Clause "New" or "Revised" License

MRT tends to deteriorate performance when fine-tuning a pre-trained Transformer. #78

Closed yongchanghao closed 4 years ago

yongchanghao commented 4 years ago

I suppose this phenomenon is reasonable, since I have not found any reports of MRT achieving better performance than a Transformer trained with MLE. Do you have any empirical evidence on this? Otherwise I will double-check my implementation. XD

Glaceon31 commented 4 years ago

I have seen a report that MRT outperforms MLE on Transformer. However, I cannot find that paper now.

In my attempt to apply MRT to the Transformer, the BLEU score increases slightly at first (+0.5 BLEU) and then decreases significantly. I am not sure my implementation is correct either. MRT is very sensitive to the learning rate, so tuning the learning rate for MRT is highly recommended.

To conclude, the usefulness of MRT on the Transformer is still in doubt. There are reports that MRT helps on the Transformer, but we cannot reproduce that result with our implementation.
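For context, MRT (Minimum Risk Training, Shen et al. 2016) minimizes the expected risk (e.g. 1 − sentence BLEU) under a distribution renormalized over a set of candidate translations sampled from the model. Below is a minimal PyTorch sketch of that objective for a single source sentence; the function name `mrt_loss` and the per-sentence batching are illustrative assumptions, not THUMT's actual API:

```python
import torch
import torch.nn.functional as F

def mrt_loss(sample_log_probs: torch.Tensor,
             risks: torch.Tensor,
             alpha: float = 5e-3) -> torch.Tensor:
    """Expected-risk (MRT) loss over sampled candidate translations.

    sample_log_probs: (num_samples,) sentence-level log P(y|x) for each
        candidate sampled from the model (carries gradients).
    risks: (num_samples,) risk per candidate, e.g. 1 - sentence BLEU
        against the reference (computed outside the graph, no gradient).
    alpha: sharpness of the renormalized distribution; like the learning
        rate, MRT tends to be sensitive to this hyperparameter.
    """
    # Renormalize over the sampled subspace:
    # Q(y|x) = P(y|x)^alpha / sum_{y'} P(y'|x)^alpha
    q = F.softmax(alpha * sample_log_probs, dim=-1)
    # Expected risk under Q; minimizing this is the MRT objective.
    return torch.sum(q * risks)

# Toy usage: 4 sampled candidates for one source sentence.
log_probs = torch.tensor([-12.3, -15.1, -13.8, -20.4], requires_grad=True)
risks = torch.tensor([0.35, 0.62, 0.41, 0.88])  # hypothetical 1 - BLEU values
loss = mrt_loss(log_probs, risks)
loss.backward()
```

In practice the risks are computed with a smoothed sentence-level BLEU and detached from the computation graph, so only `sample_log_probs` contributes gradients.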

yongchanghao commented 4 years ago

Thank you.