@brightmart, thanks for providing all this code, I found it very helpful in understanding the position-wise feedforward networks of the "Attention Is All You Need" paper.
The paper says that a 2-layer MLP is applied to each token of the sequence separately. So, from their equation, W1 should be d_model x d_ff and W2 should be d_ff x d_model. Thinking about it in terms of convolutions, this means d_ff kernels of size [1, d_model] followed by d_model kernels of size [1, d_ff]. Your code, instead, does not make any use of d_ff, so it can't be correct.
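To make the shapes concrete, here is a minimal NumPy sketch of the position-wise FFN from the paper, FFN(x) = max(0, xW1 + b1)W2 + b2. The dimensions below are toy values I picked for illustration (the paper uses d_model=512, d_ff=2048), and the weights are random placeholders, not anything from the repo:

```python
import numpy as np

# Toy dimensions for illustration; the paper uses d_model=512, d_ff=2048.
d_model, d_ff, seq_len = 8, 32, 5

rng = np.random.default_rng(0)
x = rng.standard_normal((seq_len, d_model))  # one token per row

# W1 projects d_model -> d_ff, W2 projects d_ff -> d_model,
# matching Eq. (2) of "Attention Is All You Need".
W1 = rng.standard_normal((d_model, d_ff))
b1 = np.zeros(d_ff)
W2 = rng.standard_normal((d_ff, d_model))
b2 = np.zeros(d_model)

def position_wise_ffn(x):
    # FFN(x) = max(0, x W1 + b1) W2 + b2,
    # applied to each position (row of x) independently.
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

out = position_wise_ffn(x)
print(out.shape)  # (5, 8): each token stays d_model-dimensional
```

Because the same weights are applied to every row independently, this is exactly equivalent to the two convolutions described above: d_ff kernels of size [1, d_model], then d_model kernels of size [1, d_ff].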
Please check out my amendments.