Closed hfxunlp closed 6 years ago
This is an improvement coming from the reference implementation and used as default in their Transformer configuration. See:
Thank you. It really helps. @guillaumekln
Hi @guillaumekln, I would still like to know why it is better. Is there a mathematical explanation, or is it based on experimental results? Thanks!
I believe it is according to experimental results from the original authors of the paper.
According to line 69, the input to each TransformerEncoderLayer is normalized. But looking at lines 120 and 135, it seems the embeddings fed to the first TransformerEncoderLayer are also normalized. I am not sure whether this is right, since the original paper says nothing about this in section 5.4.
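For what it's worth, normalizing the first layer's input falls out naturally from the pre-norm residual pattern this thread is about: each sublayer normalizes its own input before applying the transformation, so the embeddings entering the first layer get normalized like any other layer input. A minimal numpy sketch of that pattern (the `toy_ffn` sublayer and shapes here are illustrative stand-ins, not the project's actual code):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize over the last (feature) dimension.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def pre_norm_block(x, sublayer):
    # Pre-norm residual: normalize the input *before* the sublayer,
    # then add the residual connection. For the first encoder layer,
    # x is the embedding, so the embedding itself gets normalized.
    return x + sublayer(layer_norm(x))

# Toy sublayer standing in for self-attention or the feed-forward network.
def toy_ffn(x):
    return 0.5 * np.maximum(0.0, x)

emb = np.random.randn(2, 4, 8)       # (batch, time, depth) embeddings
h = pre_norm_block(emb, toy_ffn)     # first layer normalizes the embeddings
out = layer_norm(h)                  # a final norm typically follows the stack
```

With pre-norm, the residual branch itself is never normalized, which is why implementations usually add one extra normalization after the last layer.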