Closed hfxunlp closed 6 years ago
This is an improvement coming from the reference implementation and used as default in their Transformer configuration. See:
Thank you. It really helps. @guillaumekln
Hi @guillaumekln, I would still like to know why it is better. Is there a mathematical explanation, or is it based on experimental results? Thanks!
I believe it is according to experimental results from the original authors of the paper.
According to line 69, the input to each TransformerEncoderLayer is normalized. But looking at lines 120 and 135, it seems the embeddings fed to the first TransformerEncoderLayer are also normalized. I am not sure whether this is right, since the original paper says nothing about this in section 5.4.
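For what it's worth, normalizing the first layer's input falls out naturally from the pre-norm residual pattern this thread is about: each sublayer normalizes its own input before applying the transformation, so the embeddings entering the first layer get normalized like any other layer input. A minimal numpy sketch of that pattern (the `toy_ffn` sublayer and shapes here are illustrative stand-ins, not the project's actual code):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize over the last (feature) dimension.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def pre_norm_block(x, sublayer):
    # Pre-norm residual: normalize the input *before* the sublayer,
    # then add the residual connection. For the first encoder layer,
    # x is the embedding, so the embedding itself gets normalized.
    return x + sublayer(layer_norm(x))

# Toy sublayer standing in for self-attention or the feed-forward network.
def toy_ffn(x):
    return 0.5 * np.maximum(0.0, x)

emb = np.random.randn(2, 4, 8)       # (batch, time, depth) embeddings
h = pre_norm_block(emb, toy_ffn)     # first layer normalizes the embeddings
out = layer_norm(h)                  # a final norm typically follows the stack
```

With pre-norm, the residual branch itself is never normalized, which is why implementations usually add one extra normalization after the last layer.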