@brightmart, thanks for providing all this code, I found it very helpful in understanding the position-wise feedforward networks of the "Attention Is All You Need" paper.
The paper says that a 2-layer MLP is applied to each token of the sequence separately. So, from their equation, W1 should be d_model x d_ff and W2 should be d_ff x d_model. Thinking about it in terms of convolutions, this means d_ff kernels of size [1, d_model] followed by d_model kernels of size [1, d_ff]. Your code, instead, does not make any use of d_ff, so it can't be correct.
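To make the shapes concrete, here is a minimal NumPy sketch of the position-wise FFN from the paper, FFN(x) = max(0, xW1 + b1)W2 + b2. The dimensions below are toy values I picked for illustration (the paper uses d_model=512, d_ff=2048), and the weights are random placeholders, not anything from the repo:

```python
import numpy as np

# Toy dimensions for illustration; the paper uses d_model=512, d_ff=2048.
d_model, d_ff, seq_len = 8, 32, 5

rng = np.random.default_rng(0)
x = rng.standard_normal((seq_len, d_model))  # one token per row

# W1 projects d_model -> d_ff, W2 projects d_ff -> d_model,
# matching Eq. (2) of "Attention Is All You Need".
W1 = rng.standard_normal((d_model, d_ff))
b1 = np.zeros(d_ff)
W2 = rng.standard_normal((d_ff, d_model))
b2 = np.zeros(d_model)

def position_wise_ffn(x):
    # FFN(x) = max(0, x W1 + b1) W2 + b2,
    # applied to each position (row of x) independently.
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

out = position_wise_ffn(x)
print(out.shape)  # (5, 8): each token stays d_model-dimensional
```

Because the same weights are applied to every row independently, this is exactly equivalent to the two convolutions described above: d_ff kernels of size [1, d_model], then d_model kernels of size [1, d_ff].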
Please check out my amendments.