Kyubyong / transformer

A TensorFlow Implementation of the Transformer: Attention Is All You Need
Apache License 2.0

What Makes The Result Rise From 17.7 to 22.4 In Comparison with The Previous Version? #98

Open yaoyiran opened 5 years ago

yaoyiran commented 5 years ago

Do you know which key factors made the BLEU score rise from 17.7 (the previous version, in TF 1.2) to 22.5 (the current version on the master branch)? If I want to develop my model based on the previous version, how should I modify the code?

It seems that the author has clarified that the two major differences are (1) fixing known bugs (masking, positional encoding, ...) and (2) adding some missing components (BPE, shared weight matrix, ...). But I have checked, and the masking part seems to be the same as it was in the previous version. I believe that adding BPE is very helpful. Could anyone clarify what the other key factors behind the improvement are, if you have run experiments on this?
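
For concreteness, here is roughly what I understand by a key-padding-mask fix: masking on token ids rather than inferring padding from the embeddings themselves (if I recall correctly, the older code derived the mask from zero embedding sums, which breaks once embeddings are no longer exactly zero at pad positions). This is only a minimal sketch in TF 2 style; `PAD_ID`, `key_padding_bias`, and the shapes are my own illustrative names, not taken from this repo:

```python
import tensorflow as tf

PAD_ID = 0  # assumption for illustration: <pad> is token id 0

def key_padding_bias(token_ids):
    """Additive attention bias that hides <pad> keys.

    Returns 0.0 for real tokens and a large negative value for pad
    positions; added to the attention logits before the softmax so
    padded keys receive ~zero attention weight.
    """
    is_pad = tf.cast(tf.equal(token_ids, PAD_ID), tf.float32)
    # Shape [batch, 1, 1, key_len]; broadcasts over heads and queries.
    return is_pad[:, tf.newaxis, tf.newaxis, :] * -1e9

# Usage: attention_logits += key_padding_bias(src_token_ids)
```

And for the shared weight matrix, my understanding is the standard weight tying from the paper: one matrix serves as both the target embedding table and the pre-softmax output projection. Again a hedged sketch with made-up sizes, not the repo's actual code:

```python
vocab_size, d_model = 32000, 512  # hypothetical sizes

# Single matrix reused for embedding lookup and output projection.
shared = tf.Variable(
    tf.random.normal([vocab_size, d_model], stddev=d_model ** -0.5))

def embed(ids):
    # Scaled by sqrt(d_model), as in the paper, so embeddings and
    # positional encodings have comparable magnitudes.
    return tf.nn.embedding_lookup(shared, ids) * d_model ** 0.5

def output_logits(decoder_states):  # [batch, t, d_model]
    # Reuse the same matrix (transposed) as the output projection.
    return tf.einsum("btd,vd->btv", decoder_states, shared)
```

If the masking really is unchanged between the two versions, then BPE, weight tying, and the positional-encoding fix would seem to be the main candidates, but I would appreciate confirmation from anyone who has ablated them.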