d2l-ai / d2l-en

Interactive deep learning book with multi-framework code, math, and discussions. Adopted at 500 universities from 70 countries including Stanford, MIT, Harvard, and Cambridge.
https://D2L.ai

PT implementation of Transformer has very bad translation results #1484

Closed: astonzhang closed this issue 3 years ago

astonzhang commented 4 years ago

http://preview.d2l.ai.s3-website-us-west-2.amazonaws.com/d2l-en/master/chapter_attention-mechanisms/transformer.html

PyTorch:

go . => !, bleu 0.000
i lost . => je l'ai vu ., bleu 0.000
so long . => , bleu 0.000
i'm home . => à la maison ., bleu 0.000
he's calm . => sois , bleu 0.000

MXNet:

go . => va !, bleu 1.000
i lost . => j'ai perdu ., bleu 1.000
so long . => !, bleu 0.000
i'm home . => je suis chez chez vous en , bleu 0.376
he's calm . => il calme ., bleu 0.658

The translations look fine in other sections, though:
http://preview.d2l.ai.s3-website-us-west-2.amazonaws.com/d2l-en/master/chapter_recurrent-modern/seq2seq.html
http://preview.d2l.ai.s3-website-us-west-2.amazonaws.com/d2l-en/master/chapter_attention-mechanisms/seq2seq-attention.html
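For reference, the bleu numbers above come from the evaluation helper defined in the book's seq2seq chapter; a sketch of it (reconstructed here, so minor details may differ from the repo):

```python
import collections
import math

def bleu(pred_seq, label_seq, k):
    """n-gram precision up to k-grams, times a brevity penalty."""
    pred_tokens, label_tokens = pred_seq.split(' '), label_seq.split(' ')
    len_pred, len_label = len(pred_tokens), len(label_tokens)
    # Predictions shorter than the label are penalized exponentially
    score = math.exp(min(0, 1 - len_label / len_pred))
    for n in range(1, min(k, len_pred) + 1):
        num_matches, label_subs = 0, collections.defaultdict(int)
        for i in range(len_label - n + 1):
            label_subs[' '.join(label_tokens[i: i + n])] += 1
        for i in range(len_pred - n + 1):
            if label_subs[' '.join(pred_tokens[i: i + n])] > 0:
                num_matches += 1
                label_subs[' '.join(pred_tokens[i: i + n])] -= 1
        # The 0.5 ** n exponent rewards matching longer n-grams more
        score *= math.pow(num_matches / (len_pred - n + 1), math.pow(0.5, n))
    return score

print(bleu('va !', 'va !', k=2))            # 1.0: exact match
print(bleu('sois', 'il est calme .', k=2))  # 0.0: no overlap plus brevity penalty
```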

AnirudhDagar commented 4 years ago

Hmm, I can't see the root cause here. If you check the stable version, it seems to be doing very well: http://d2l.ai/chapter_attention-mechanisms/transformer.html#training

Could it be just this run, or a change of hyperparameters? I'll try to investigate more.

astonzhang commented 4 years ago

In the stable version, both mx & pt perform poorly because there are issues in the initial implementations. Those are now fixed, but the pt version still doesn't perform as well as the mx version:

http://preview.d2l.ai.s3-website-us-west-2.amazonaws.com/d2l-en/master/chapter_attention-mechanisms/transformer.html

MXNet:

go . => va !, bleu 1.000
i lost . => j'ai perdu ., bleu 1.000
i'm home . => je suis chez moi ., bleu 1.000
he's calm . => il est malade ., bleu 0.658

PyTorch:

go . => c'est le !, bleu 0.000
i lost . => je suis paresseuse ., bleu 0.000
i'm home . => je suis chez moi ., bleu 1.000
he's calm . => je suis calme ., bleu 0.537

astonzhang commented 4 years ago

One hint: the final training losses are quite different (mx: loss 0.030, pt: loss 0.016).

AnirudhDagar commented 4 years ago

I checked again, and PyTorch is actually performing better than MX in this case: if you look closely, the bleu score for the 1st example is perfect in PT, unlike MX.

MX

go . => va le chercher !, bleu 0.000
i lost . => j'ai perdu ., bleu 1.000
i'm home . => je suis chez moi ., bleu 1.000
he's calm . => il est bon ., bleu 0.658

PT

go . => va !, bleu 1.000
i lost . => j'ai perdu ., bleu 1.000
i'm home . => je suis <unk> ., bleu 0.512
he's calm . => il est mouillé ., bleu 0.658

The loss difference is not large, if you are hinting at overfitting in PT.

astonzhang commented 4 years ago

Thanks. Could you run experiments for multiple runs to see if mx and pt results are comparable?
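A rough harness for that comparison might look like the following; build_model is a hypothetical stand-in for the chapter's transformer encoder-decoder construction, and the d2l helper signatures follow the book's version, so treat this as a sketch:

```python
import torch
from d2l import torch as d2l

engs = ['go .', 'i lost .', "i'm home .", "he's calm ."]
fras = ['va !', "j'ai perdu .", 'je suis chez moi .', 'il est calme .']
num_steps, device = 10, d2l.try_gpu()

for run in range(5):
    train_iter, src_vocab, tgt_vocab = d2l.load_data_nmt(batch_size=64,
                                                         num_steps=num_steps)
    # Hypothetical helper: builds the chapter's transformer encoder-decoder
    net = build_model(src_vocab, tgt_vocab)
    d2l.train_seq2seq(net, train_iter, lr=0.005, num_epochs=200,
                      tgt_vocab=tgt_vocab, device=device)
    for eng, fra in zip(engs, fras):
        # predict_seq2seq returns (translation, attention weights) in the book
        translation, _ = d2l.predict_seq2seq(net, eng, src_vocab, tgt_vocab,
                                             num_steps, device)
        print(f'run {run}: {eng} => {translation}, '
              f'bleu {d2l.bleu(translation, fra, k=2):.3f}')
```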

AnirudhDagar commented 4 years ago

Run 1: loss 0.014, 5796.9 tokens/sec on cpu

go . => entrez !, bleu 0.000
i lost . => j'ai perdu ., bleu 1.000
i'm home . => j'en suis chez moi ., bleu 0.832
he's calm . => il est calme ., bleu 1.000

Run 2: loss 0.014, 5826.1 tokens/sec on cpu

go . => <unk> !, bleu 0.000
i lost . => je suis perdu ., bleu 0.537
i'm home . => je suis chez moi ., bleu 1.000
he's calm . => je suis calme ., bleu 0.537

Run 3: loss 0.016, 5568.3 tokens/sec on cpu

go . => <unk> !, bleu 0.000
i lost . => je sais perdu ., bleu 0.537
i'm home . => je suis chez moi ., bleu 1.000
he's calm . => il est paresseux ., bleu 0.658

Run 4: loss 0.018, 6071.4 tokens/sec on cpu

go . => vas-y !, bleu 0.000
i lost . => je <unk> ., bleu 0.000
i'm home . => je suis partie ., bleu 0.512
he's calm . => <unk> ., bleu 0.000

Run 5: loss 0.017, 4642.7 tokens/sec on cpu

go . => va !, bleu 1.000
i lost . => j'ai perdu ., bleu 1.000
i'm home . => je suis chez <unk> ., bleu 0.752
he's calm . => je suis calme !, bleu 0.000

I guess there is indeed some issue with the PT transformer chapter, since seq2seq has results comparable to mx but the transformer results are not very consistent. Thanks for flagging this @astonzhang, I'll try to find and fix the issue. @ypandya It would be nice if you could also take a look. Thanks!
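If it helps while debugging, one generic way (not part of the eventual fix) to make run-to-run comparisons less noisy is to pin all the seeds before each run:

```python
import random

import numpy as np
import torch

def seed_everything(seed):
    """Pin Python, NumPy, and PyTorch RNGs so runs are repeatable."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)  # seeds the CPU and CUDA RNGs
```

With a fixed seed, any remaining difference between the mx and pt chapters is more likely a real implementation difference rather than initialization noise.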

astonzhang commented 4 years ago

Just one more example of inconsistent performance

[Two screenshots, 2020-10-22, showing different translation results across runs]

315930399 commented 4 years ago

First, torch.repeat should be used instead of torch.repeat_interleave; someone pointed this out in the discussion at https://d2l.ai/chapter_attention-mechanisms/transformer.html. Second, the PyTorch version predicts differently with the same input. I think that's weird; the MXNet version doesn't have this issue.
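For context on the first point, the two ops behave quite differently; a minimal comparison:

```python
import torch

x = torch.tensor([1, 2, 3])
print(x.repeat(2))                    # tensor([1, 2, 3, 1, 2, 3]): tiles the whole tensor
print(torch.repeat_interleave(x, 2))  # tensor([1, 1, 2, 2, 3, 3]): repeats each element
```

Which one is needed depends on how the multi-head attention orders heads within the batch dimension, so mixing them up can silently mismatch masks and examples. On the second point, one common cause of a model predicting differently on identical input, offered as a guess rather than the confirmed bug here, is a dropout layer left in training mode at prediction time:

```python
import torch
from torch import nn

layer = nn.Dropout(0.5)
x = torch.ones(4)

layer.train()
print(layer(x))  # random: surviving entries scaled to 2.0, the rest zeroed
layer.eval()
print(layer(x))  # deterministic: dropout is a no-op in eval mode
```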

AnirudhDagar commented 3 years ago

Please see my comments on #1528 for the bug fix and reasoning. Once that PR is merged, this issue will be closed automatically. Thanks!

astonzhang commented 3 years ago

Now both look better to me!

[Two screenshots, 2020-11-20, showing improved MXNet and PyTorch translation results]