Open · kaharjan opened this issue 4 years ago
In The Annotated Transformer, the residual connection is implemented like this:

`x + self.dropout(sublayer(self.norm(x)))`

However, in Section 9.3.3 it is implemented as `self.norm(self.dropout(Y) + X)`. I think it should be `X + self.norm(self.dropout(Y))`. Am I correct?

---

Hey @kaharjan! If we carefully read the paper (https://arxiv.org/abs/1706.03762), Section 3.1 states: "We employ a residual connection around each of the two sub-layers, followed by layer normalization." Hence, it should be an 'Add' followed by a 'Norm', i.e., `self.norm(self.dropout(Y) + X)`. There is no extra "X + ". Please ask further questions on our forum (https://discuss.d2l.ai/).
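For concreteness, here is a minimal PyTorch sketch contrasting the two orderings discussed above (the class names `AddNorm` and `PreNormResidual` are illustrative, not taken from either codebase):

```python
import torch
from torch import nn

class AddNorm(nn.Module):
    """Post-LN, as in the paper and Section 9.3.3: norm(dropout(Y) + X)."""
    def __init__(self, normalized_shape, dropout):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.norm = nn.LayerNorm(normalized_shape)

    def forward(self, X, Y):
        # Add the (dropped-out) sublayer output Y to the residual X, then normalize.
        return self.norm(self.dropout(Y) + X)

class PreNormResidual(nn.Module):
    """Pre-LN, as in The Annotated Transformer: X + dropout(sublayer(norm(X)))."""
    def __init__(self, normalized_shape, dropout):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.norm = nn.LayerNorm(normalized_shape)

    def forward(self, X, sublayer):
        # Normalize the input first, apply the sublayer, then add the residual.
        return X + self.dropout(sublayer(self.norm(X)))

# Quick shape check: both variants preserve the input shape.
X = torch.rand(2, 3, 4)
print(AddNorm(4, 0.5)(X, X).shape)                  # torch.Size([2, 3, 4])
print(PreNormResidual(4, 0.5)(X, nn.ReLU()).shape)  # torch.Size([2, 3, 4])
```

Note that The Annotated Transformer normalizes the sublayer *input* (pre-LN) rather than the output, so it is not equivalent to `X + self.norm(self.dropout(Y))` either; the book follows the paper's post-LN ordering.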