Levstyle closed this issue 5 years ago
I don't agree. The key contribution of R-Net is self attention; the only difference between my model and MS's is the attention mechanism, which is not that important. Different attention variants tend to achieve similar results. It's also a memory issue: the additive attention in R-Net will triple the memory use. Also, it's not hard to train the original R-Net. My original implementation can easily reach 77% or above (with batch size 32), though it may converge more slowly. There may be errors in your implementation. Training RNNs is very tricky, so I suggest you read my code first if you insist on the original R-Net.
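The memory gap between the two attention variants comes from the size of the largest intermediate tensor. A minimal NumPy sketch (shapes and names are illustrative, not taken from the repo): dot-product attention's biggest intermediate is the score matrix itself, while additive (Bahdanau-style) attention must materialize a broadcast sum of queries and keys that is `hidden` times larger before reducing it to scores.

```python
import numpy as np

batch, len_q, len_k, hidden = 2, 5, 7, 4
q = np.random.randn(batch, len_q, hidden)   # queries
k = np.random.randn(batch, len_k, hidden)   # keys

# Dot-product attention: scores come straight from a matmul,
# so the largest intermediate is [batch, len_q, len_k].
dot_scores = q @ k.transpose(0, 2, 1)

# Additive attention: q and k are broadcast-added before the
# tanh + projection, materializing a [batch, len_q, len_k, hidden]
# tensor -- a factor of `hidden` more memory than the score matrix.
v = np.random.randn(hidden)
add_hidden = np.tanh(q[:, :, None, :] + k[:, None, :, :])
add_scores = add_hidden @ v

assert dot_scores.shape == (batch, len_q, len_k)
assert add_hidden.shape == (batch, len_q, len_k, hidden)
assert add_scores.shape == dot_scores.shape
```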
Thanks for your suggestions! The additive attention makes R-Net converge slowly!
Another difference might be this: in the original R-Net paper, the gating after the attention is used only in the query-passage attention layer, not in the self attention layer.
This implementation uses it in both layers, since it re-uses the same attention function. I am not sure what effect this reuse has on the results.
The paper uses gated attention in both layers: "An additional gate as in gated attention-based recurrent networks is applied to [vPt, ct] to adaptively control the input of RNN." As for the second question, the two attentions are in different variable scopes, so there is no reuse.
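The gate quoted above is easy to sketch. A minimal NumPy version, assuming the formulation from the R-Net paper (the variable names and shapes here are illustrative, not the repo's): a sigmoid gate is computed from the concatenation [v_t, c_t] and multiplied element-wise into it before the result is fed to the GRU.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

d = 4                        # hidden size (illustrative)
v_t = np.random.randn(d)     # current word representation
c_t = np.random.randn(d)     # attention-pooled context vector

x = np.concatenate([v_t, c_t])       # [v_t, c_t], shape (2d,)
W_g = np.random.randn(2 * d, 2 * d)  # gate weights (hypothetical name)

g = sigmoid(W_g @ x)   # element-wise gate, each entry in (0, 1)
rnn_input = g * x      # gated [v_t, c_t] fed to the GRU step

assert rnn_input.shape == (2 * d,)
```

Because g is in (0, 1) element-wise, the gate can suppress parts of either the word representation or the attended context before the recurrence sees them.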
I am also confused by the gate mechanism. In the paper 'Attention Is All You Need', a normalization layer is added after the multi-head attention layer, while in R-Net a gate is added after the attention layer, followed by another GRU network. Why is that?
In my opinion, the difference between the RNN gate and the Transformer residual is like the difference between a highway network and a residual connection: a highway network uses trainable parameters as a gate, while a residual connection just adds the layer result.
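The analogy above can be made concrete. A hedged NumPy sketch (all names and shapes are illustrative): a residual connection combines input and layer output with no parameters, while a highway connection learns a gate that decides, per element, how much of each to pass.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

d = 4
x = np.random.randn(d)
W_f = np.random.randn(d, d)
layer_out = np.tanh(W_f @ x)   # stand-in for the layer's output f(x)

# Residual connection: fixed, parameter-free combination.
residual = x + layer_out

# Highway connection: a learned transform gate t blends the
# layer output with the untouched input.
W_t = np.random.randn(d, d)    # gate weights (hypothetical name)
t = sigmoid(W_t @ x)
highway = t * layer_out + (1.0 - t) * x

assert residual.shape == highway.shape == (d,)
```

With t fixed at 1 the highway reduces to the plain layer, and with t fixed at 0 it passes the input through unchanged; the residual form hard-codes an equal mix instead of learning it.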
There is an essential difference between the model you proposed and the one proposed by MS.
The key components of R-Net are missing from your model, even though your model is effective.
It's really hard to train the original R-Net model.