MRT tends to deteriorate the performance while fine tuning a pre-trained Transformer.

yongchanghao commented 4 years ago

I suppose this phenomenon is reasonable, since I have not found any reports of MRT achieving better performance than MLE on Transformers. Do you have any empirical evidence on this? Otherwise I will double-check my implementation. XD

Glaceon31 commented 4 years ago

I have seen a report that MRT outperforms MLE on Transformer. However, I cannot find that paper now.

In my attempt to apply MRT to Transformer, the BLEU score first increased slightly (+0.5 BLEU) and then decreased significantly. I am not sure my implementation is correct either. MRT is very sensitive to the learning rate, so tuning the learning rate for MRT is highly recommended.
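For anyone checking their own implementation, here is a minimal sketch of the expected-risk quantity that MRT minimizes over a set of sampled candidates. It is not THUMT's code: the function name `mrt_risk`, its arguments, and the default `alpha` are assumptions made for illustration. It assumes you already have, for each candidate, a model log-probability and a cost such as 1 minus sentence-level BLEU.

```python
import math


def mrt_risk(log_probs, costs, alpha=0.005):
    """Expected cost under the sharpened, renormalized candidate distribution.

    log_probs: model log-probabilities log p(y|x), one per sampled candidate.
    costs:     per-candidate costs, e.g. 1 - sentence-level BLEU (assumed).
    alpha:     sharpness hyperparameter (illustrative default, tune it).
    """
    if not log_probs or len(log_probs) != len(costs):
        raise ValueError("log_probs and costs must be non-empty and equal length")

    # Q(y) is proportional to p(y|x)^alpha, i.e. exp(alpha * log p(y|x)).
    scaled = [alpha * lp for lp in log_probs]

    # Subtract the max before exponentiating for numerical stability.
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    total = sum(weights)

    # Risk = sum over candidates of Q(y) * cost(y).
    return sum((w / total) * c for w, c in zip(weights, costs))
```

In a real training loop this would be computed on framework tensors so that gradients flow through the log-probabilities. The sketch only shows the arithmetic, which can help when comparing against an implementation that behaves unexpectedly.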

To conclude, the usefulness of MRT on Transformer is still in doubt. There are reports that MRT helps on Transformer, but we have not been able to reproduce those results with our implementation.

yongchanghao commented 4 years ago

Thank you.

THUNLP-MT / THUMT

MRT tends to deteriorate the performance while fine tuning a pre-trained Transformer. #78