brightmart / text_classification

all kinds of text classification models and more with deep learning

Remove bug in position-wise feedforward layer. #86

Closed jannisborn closed 5 years ago

jannisborn commented 5 years ago

@brightmart, thanks for providing all this code; I found it very helpful in understanding the position-wise feed-forward networks from the "Attention Is All You Need" paper.

The paper says that a 2-layer MLP is applied to each position of the sequence independently. So, from their equation, W1 should have shape d_model × d_ff and W2 shape d_ff × d_model. Thinking about it in terms of convolutions, this means d_ff kernels of size [1, d_model] followed by d_model kernels of size [1, d_ff]. Your code, however, does not make any use of d_ff, so it cannot be correct.
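For concreteness, here is a minimal NumPy sketch of the layer the paper describes (the names `d_model`, `d_ff`, and `position_wise_ffn` follow the paper's notation and are only illustrative; this is not the repository's TensorFlow code):

```python
import numpy as np

def position_wise_ffn(x, d_ff=2048):
    """Position-wise feed-forward network from "Attention Is All You Need":
    FFN(x) = max(0, x W1 + b1) W2 + b2, applied to every position independently.

    x: array of shape [batch, seq_len, d_model]
    """
    d_model = x.shape[-1]
    rng = np.random.default_rng(0)

    # W1 projects d_model -> d_ff; W2 projects d_ff back to d_model.
    W1 = rng.standard_normal((d_model, d_ff)) * 0.02
    b1 = np.zeros(d_ff)
    W2 = rng.standard_normal((d_ff, d_model)) * 0.02
    b2 = np.zeros(d_model)

    hidden = np.maximum(0.0, x @ W1 + b1)  # [batch, seq_len, d_ff]
    return hidden @ W2 + b2                # [batch, seq_len, d_model]

# Example: d_ff only appears in the hidden layer; the output keeps d_model.
x = np.random.randn(2, 10, 512)            # batch=2, seq_len=10, d_model=512
out = position_wise_ffn(x)
assert out.shape == (2, 10, 512)
```

Equivalently, this can be implemented as two convolutions with kernel size 1, with d_ff filters in the first and d_model filters in the second, which matches the kernel shapes above.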

Please check out my amendments.