d2l-ai / d2l-en


Why is AddNorm() in the Transformer (Section 9.3.3, line `self.norm(self.dropout(Y) + X)`) different from the original one (line `x + self.dropout(sublayer(self.norm(x)))`)? #803

Open · kaharjan opened 4 years ago

kaharjan commented 4 years ago

In The Annotated Transformer, the residual connection is implemented as `x + self.dropout(sublayer(self.norm(x)))`. However, in Section 9.3.3 it is implemented as `self.norm(self.dropout(Y) + X)`. I think it should be `X + self.norm(self.dropout(Y))`. Am I correct?
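
For reference, a minimal sketch (not from the thread) contrasting the two orderings being compared here; `sublayer`, `norm`, and `dropout` are illustrative stand-ins for the corresponding modules:

```python
import torch
from torch import nn

def post_norm(x, sublayer, norm, dropout):
    # Section 9.3.3 / the original paper: residual add first, THEN normalize.
    return norm(x + dropout(sublayer(x)))

def pre_norm(x, sublayer, norm, dropout):
    # The Annotated Transformer: normalize the input first, residual add last.
    return x + dropout(sublayer(norm(x)))

# Toy check that both variants preserve the input shape.
norm = nn.LayerNorm(4)
dropout = nn.Dropout(0.1)
sublayer = nn.Linear(4, 4)
x = torch.ones(2, 3, 4)
print(post_norm(x, sublayer, norm, dropout).shape)  # torch.Size([2, 3, 4])
print(pre_norm(x, sublayer, norm, dropout).shape)   # torch.Size([2, 3, 4])
```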

goldmermaid commented 4 years ago

Hey @kaharjan! If we carefully read the paper (https://arxiv.org/abs/1706.03762), Section 3.1 states: "We employ a residual connection around each of the two sub-layers, followed by layer normalization". Hence, it should be an 'Add' followed by a 'Norm', i.e., `self.norm(self.dropout(Y) + X)`.

There is no extra `X +` outside the normalization. Please ask your question on our forum (https://discuss.d2l.ai/).
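
For concreteness, here is a minimal PyTorch sketch of such an 'Add' then 'Norm' block. The `forward` expression is the one quoted from Section 9.3.3; the class wrapper and constructor parameter names are illustrative, not quoted from the book:

```python
import torch
from torch import nn

class AddNorm(nn.Module):
    """Residual connection followed by layer normalization (post-norm)."""
    def __init__(self, normalized_shape, dropout):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.norm = nn.LayerNorm(normalized_shape)

    def forward(self, X, Y):
        # 'Add' first (residual input X plus the sub-layer output Y, with
        # dropout applied to Y), then 'Norm' over the sum.
        # There is no extra X + outside the normalization.
        return self.norm(self.dropout(Y) + X)

add_norm = AddNorm(normalized_shape=4, dropout=0.5)
X = torch.ones((2, 3, 4))
print(add_norm(X, X).shape)  # torch.Size([2, 3, 4])
```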