Levstyle closed this issue 5 years ago
I don't agree. The key contribution of R-Net is self attention; the only difference between my model and MS's is the attention mechanism, which is not that important. Different attention variants tend to achieve similar results. It's also a memory issue: the additive attention in R-Net will triple the memory use. Also, it's not hard to train the original R-Net. My original implementation can easily reach 77% or above (with batch size 32), though it may converge more slowly. There may be errors in your implementation. Training RNNs is very tricky, so I suggest you read my code first if you insist on the original R-Net.
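The memory gap between the two attention variants comes from the size of the largest intermediate tensor. A minimal NumPy sketch (shapes and names are illustrative, not taken from the repo): dot-product attention's biggest intermediate is the score matrix itself, while additive (Bahdanau-style) attention must materialize a broadcast sum of queries and keys that is `hidden` times larger before reducing it to scores.

```python
import numpy as np

batch, len_q, len_k, hidden = 2, 5, 7, 4
q = np.random.randn(batch, len_q, hidden)   # queries
k = np.random.randn(batch, len_k, hidden)   # keys

# Dot-product attention: scores come straight from a matmul,
# so the largest intermediate is [batch, len_q, len_k].
dot_scores = q @ k.transpose(0, 2, 1)

# Additive attention: q and k are broadcast-added before the
# tanh + projection, materializing a [batch, len_q, len_k, hidden]
# tensor -- a factor of `hidden` more memory than the score matrix.
v = np.random.randn(hidden)
add_hidden = np.tanh(q[:, :, None, :] + k[:, None, :, :])
add_scores = add_hidden @ v

assert dot_scores.shape == (batch, len_q, len_k)
assert add_hidden.shape == (batch, len_q, len_k, hidden)
assert add_scores.shape == dot_scores.shape
```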
Thanks for your suggestions! The additive attention makes R-Net converge slowly!
Another difference might be this: in the original R-Net paper, the gating after the attention is used only in the query-passage attention layer, not in the self attention layer.
This implementation uses it in both layers, since it re-uses the same attention function. I am not sure what effect this reuse has on the results.
The paper uses gated attention in both layers: "An additional gate as in gated attention-based recurrent networks is applied to [vPt, ct] to adaptively control the input of RNN." As for the second question, the two attentions are in different variable scopes, so there is no reuse.
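The gate quoted above is easy to sketch. A minimal NumPy version, assuming the formulation from the R-Net paper (the variable names and shapes here are illustrative, not the repo's): a sigmoid gate is computed from the concatenation [v_t, c_t] and multiplied element-wise into it before the result is fed to the GRU.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

d = 4                        # hidden size (illustrative)
v_t = np.random.randn(d)     # current word representation
c_t = np.random.randn(d)     # attention-pooled context vector

x = np.concatenate([v_t, c_t])       # [v_t, c_t], shape (2d,)
W_g = np.random.randn(2 * d, 2 * d)  # gate weights (hypothetical name)

g = sigmoid(W_g @ x)   # element-wise gate, each entry in (0, 1)
rnn_input = g * x      # gated [v_t, c_t] fed to the GRU step

assert rnn_input.shape == (2 * d,)
```

Because g is in (0, 1) element-wise, the gate can suppress parts of either the word representation or the attended context before the recurrence sees them.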
I am also confused by the gate mechanism. In the paper 'Attention Is All You Need', a normalization layer is added after the multi-head attention layer, while in R-Net a gate is added after the attention layer, followed by another GRU network. Why is that?
In my opinion, the difference between the RNN gate and the Transformer residual is like the difference between a highway network and a residual connection: a highway network uses trainable parameters as a gate, while a residual connection just adds the layer result.
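The analogy above can be made concrete. A hedged NumPy sketch (all names and shapes are illustrative): a residual connection combines input and layer output with no parameters, while a highway connection learns a gate that decides, per element, how much of each to pass.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

d = 4
x = np.random.randn(d)
W_f = np.random.randn(d, d)
layer_out = np.tanh(W_f @ x)   # stand-in for the layer's output f(x)

# Residual connection: fixed, parameter-free combination.
residual = x + layer_out

# Highway connection: a learned transform gate t blends the
# layer output with the untouched input.
W_t = np.random.randn(d, d)    # gate weights (hypothetical name)
t = sigmoid(W_t @ x)
highway = t * layer_out + (1.0 - t) * x

assert residual.shape == highway.shape == (d,)
```

With t fixed at 1 the highway reduces to the plain layer, and with t fixed at 0 it passes the input through unchanged; the residual form hard-codes an equal mix instead of learning it.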
There is an essential difference between the model you proposed and the one proposed by MS.
The key components of R-Net are missing from your model, even though your model is effective.
It's really hard to train the original R-Net model.