yongchanghao closed this issue 4 years ago
I have seen a report that MRT outperforms MLE on the Transformer. However, I cannot find that paper now.
In my attempt to apply MRT to the Transformer, the BLEU score first increased slightly (+0.5 BLEU) and then decreased significantly. I am not sure that my implementation is correct, either. MRT is very sensitive to the learning rate, so tuning it is highly recommended.
To conclude, the usefulness of MRT on the Transformer is still in doubt. There are reports that MRT helps on the Transformer, but we could not reproduce that result with our implementation.
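For anyone checking their own implementation, here is a minimal sketch of the MRT objective as it is usually described: the expected cost over a set of sampled candidates, weighted by the model probabilities renormalized over that set and sharpened by `alpha`. This is not the code from this repository. The function name `mrt_risk`, its argument names, and the default `alpha` are assumptions made for illustration, and the cost is assumed to be something like 1 minus sentence-level BLEU. It only computes the loss value; in real training the same expression would be written in an autodiff framework so that gradients flow through the log-probabilities.

```python
import math


def mrt_risk(log_probs, costs, alpha=0.005):
    """Expected risk over a sampled candidate set (illustrative sketch).

    log_probs: model log-probabilities log P(y|x), one per candidate.
    costs:     task cost per candidate, e.g. 1 - sentence-level BLEU.
    alpha:     sharpness of the renormalized distribution Q
               (hypothetical default, tune it for your setup).
    """
    if len(log_probs) != len(costs):
        raise ValueError("log_probs and costs must have the same length")

    # Q(y|x) is proportional to P(y|x)^alpha, i.e. softmax(alpha * log P).
    scaled = [alpha * lp for lp in log_probs]

    # Subtract the maximum before exponentiating for numerical stability.
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    q = [e / z for e in exps]

    # Risk is the expectation of the cost under Q.
    return sum(qi * ci for qi, ci in zip(q, costs))
```

A quick sanity check when debugging: if all candidates have the same log-probability, the risk should equal the plain average of the costs, and if one candidate dominates the probability mass, the risk should be close to that candidate's cost.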
Thank you.
I suppose this phenomenon is reasonable, since I have not found any reported results in which MRT outperforms a Transformer trained with MLE. Do you have any empirical evidence for this? Otherwise I will double-check my implementation. XD