Closed by astonzhang 3 years ago
Hmm, I can't see the root cause here. If you check the stable version, it seems to be doing very well: http://d2l.ai/chapter_attention-mechanisms/transformer.html#training
Could it be just this run or the change of hyperparams? I'll try to investigate more.
In the stable version, both the MXNet and PyTorch versions perform poorly because of issues in the initial implementations. Those are now fixed, but the PyTorch version still doesn't perform as well as the MXNet one:
MXNet:
go . => va !, bleu 1.000
i lost . => j'ai perdu ., bleu 1.000
i'm home . => je suis chez moi ., bleu 1.000
he's calm . => il est malade ., bleu 0.658
PyTorch:
go . => c'est le
One hint is the final training loss: mx 0.030 vs. pt 0.016. These are quite different.
I checked again, and PyTorch is actually performing better than MX in this case: if you look closely, the BLEU score for the 1st example is perfect in PT, unlike MX.
MX
go . => va le chercher !, bleu 0.000
i lost . => j'ai perdu ., bleu 1.000
i'm home . => je suis chez moi ., bleu 1.000
he's calm . => il est bon ., bleu 0.658
PT
go . => va !, bleu 1.000
i lost . => j'ai perdu ., bleu 1.000
i'm home . => je suis <unk> ., bleu 0.512
he's calm . => il est mouillé ., bleu 0.658
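For anyone sanity-checking these numbers: they are consistent with a 2-gram BLEU with a brevity penalty, like the metric used in the book. A minimal sketch (my own reimplementation from the formula, not the d2l source):

```python
import collections
import math

def bleu(pred_seq, label_seq, k):
    """BLEU with up to k-gram precision and a brevity penalty."""
    pred_tokens, label_tokens = pred_seq.split(' '), label_seq.split(' ')
    len_pred, len_label = len(pred_tokens), len(label_tokens)
    # Brevity penalty: punish predictions shorter than the label.
    score = math.exp(min(0, 1 - len_label / len_pred))
    for n in range(1, k + 1):
        num_matches, label_subs = 0, collections.defaultdict(int)
        # Count the label's n-grams.
        for i in range(len_label - n + 1):
            label_subs[' '.join(label_tokens[i: i + n])] += 1
        # Count clipped matches in the prediction.
        for i in range(len_pred - n + 1):
            ngram = ' '.join(pred_tokens[i: i + n])
            if label_subs[ngram] > 0:
                num_matches += 1
                label_subs[ngram] -= 1
        score *= math.pow(num_matches / (len_pred - n + 1), math.pow(0.5, n))
    return score

print(f"{bleu('il est mouillé .', 'il est calme .', 2):.3f}")       # 0.658
print(f"{bleu('je suis <unk> .', 'je suis chez moi .', 2):.3f}")    # 0.512
```

This reproduces the scores above: three of four unigrams and one of three bigrams match for the last example, and the `<unk>` prediction additionally pays a brevity penalty.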
The loss difference is not a lot if you are hinting towards overfitting in PT.
Thanks. Could you run experiments for multiple runs to see if mx and pt results are comparable?
Run 1: loss 0.014, 5796.9 tokens/sec on cpu
go . => entrez !, bleu 0.000
i lost . => j'ai perdu ., bleu 1.000
i'm home . => j'en suis chez moi ., bleu 0.832
he's calm . => il est calme ., bleu 1.000
Run 2: loss 0.014, 5826.1 tokens/sec on cpu
go . => <unk> !, bleu 0.000
i lost . => je suis perdu ., bleu 0.537
i'm home . => je suis chez moi ., bleu 1.000
he's calm . => je suis calme ., bleu 0.537
Run 3: loss 0.016, 5568.3 tokens/sec on cpu
go . => <unk> !, bleu 0.000
i lost . => je sais perdu ., bleu 0.537
i'm home . => je suis chez moi ., bleu 1.000
he's calm . => il est paresseux ., bleu 0.658
Run 4: loss 0.018, 6071.4 tokens/sec on cpu
go . => vas-y !, bleu 0.000
i lost . => je <unk> ., bleu 0.000
i'm home . => je suis partie ., bleu 0.512
he's calm . => <unk> ., bleu 0.000
Run 5: loss 0.017, 4642.7 tokens/sec on cpu
go . => va !, bleu 1.000
i lost . => j'ai perdu ., bleu 1.000
i'm home . => je suis chez <unk> ., bleu 0.752
he's calm . => je suis calme !, bleu 0.000
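If it helps with debugging, pinning the seed makes an individual run reproducible, so a bad run can be re-inspected instead of lost. A rough sketch (the train loop is stubbed out with a random draw; it is not the chapter's actual code):

```python
import torch

def run_once(seed):
    """Stand-in for one training run: it just draws a fake 'loss' so the
    effect of seeding is visible; the real train/predict loop would go here."""
    torch.manual_seed(seed)
    return torch.rand(1).item()

losses = [run_once(seed) for seed in range(5)]  # five independent runs
assert run_once(0) == losses[0]  # re-running with the same seed reproduces a run
```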
I guess there is indeed some issue with the PT transformer chapter, since seq2seq has results comparable to mx but the transformer results are not very consistent. Thanks for flagging this @astonzhang, I'll try to find and fix the issue. @ypandya Would be nice if you can also take a look. Thanks!
Just one more example of inconsistent performance
First, torch.repeat should be used instead of torch.repeat_interleave. Someone pointed this out in the discussion at https://d2l.ai/chapter_attention-mechanisms/transformer.html Second, the PyTorch version predicts differently with the same input. I think it's weird. The MXNet version doesn't have this issue.
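For context on that first point: the two ops order the copies differently, which matters when expanding valid_lens to match the head-batched tensor layout in multi-head attention. A quick illustration (the tensor values here are made up):

```python
import torch

valid_lens = torch.tensor([2, 3])  # hypothetical per-example valid lengths
num_heads = 2

# repeat_interleave copies each element consecutively:
# example 0's length for all its heads, then example 1's.
print(torch.repeat_interleave(valid_lens, repeats=num_heads))  # tensor([2, 2, 3, 3])

# Tensor.repeat tiles the whole tensor: all examples once per head.
print(valid_lens.repeat(num_heads))  # tensor([2, 3, 2, 3])
```

Which one is correct depends on whether the reshape that merges heads into the batch dimension puts the heads of one example contiguously or interleaves them; a mismatch silently masks the wrong positions instead of raising an error.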
Please see my comments on #1528 for the bug fix and reasoning. Once the PR is merged, it will trigger a close on this issue. Thanks!
Now both look better to me!
http://preview.d2l.ai.s3-website-us-west-2.amazonaws.com/d2l-en/master/chapter_attention-mechanisms/transformer.html
PyTorch:
go . => !, bleu 0.000
i lost . => je l'ai vu ., bleu 0.000
so long . => , bleu 0.000
i'm home . => à la maison ., bleu 0.000
he's calm . => sois , bleu 0.000
MXNet:
go . => va !, bleu 1.000
i lost . => j'ai perdu ., bleu 1.000
so long . => !, bleu 0.000
i'm home . => je suis chez chez vous en , bleu 0.376
he's calm . => il calme ., bleu 0.658
Though the translation looks good in other sections: http://preview.d2l.ai.s3-website-us-west-2.amazonaws.com/d2l-en/master/chapter_recurrent-modern/seq2seq.html http://preview.d2l.ai.s3-website-us-west-2.amazonaws.com/d2l-en/master/chapter_attention-mechanisms/seq2seq-attention.html