Element-Research / rnn

Recurrent Neural Network library for Torch7's nn
BSD 3-Clause "New" or "Revised" License

Implementation request: Attention GRU #387

Open juesato opened 7 years ago

juesato commented 7 years ago

@nicholas-leonard I'm probably going to write a GRU with attention. I'm curious to get your input on the best way to do this. I'm also happy to contribute it here if you want.

The first option is to modify the current GRU implementation so that every time step takes three inputs rather than two, {x_t, h_t-1, enc}, where enc is the set of encodings being attended over.
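Concretely, I'm imagining something like the sketch below, assuming additive (Bahdanau-style) scoring; the shapes are non-batched and every name is a placeholder, not the library's GRU code:

```lua
-- Hypothetical three-input GRU step for option 1: forward({x_t, h_tm1, enc}).
-- Non-batched for clarity: x_t is inputSize, h_tm1 is hiddenSize,
-- enc is seqLen x hiddenSize.
require 'nn'
require 'nngraph'

local function attentionGRUStep(inputSize, hiddenSize, seqLen)
   local x, hPrev, enc = nn.Identity()(), nn.Identity()(), nn.Identity()()

   -- additive attention: score_i = v^T tanh(W * enc_i + U * h_tm1)
   local Wenc   = nn.Linear(hiddenSize, hiddenSize)(enc)              -- seqLen x hiddenSize
   local Uh     = nn.Replicate(seqLen)(nn.Linear(hiddenSize, hiddenSize)(hPrev))
   local scores = nn.Linear(hiddenSize, 1)(nn.Tanh()(nn.CAddTable()({Wenc, Uh})))
   local alpha  = nn.SoftMax()(nn.View(-1)(scores))                   -- seqLen
   local c      = nn.MV()({nn.Transpose({1, 2})(enc), alpha})         -- context, hiddenSize

   -- standard GRU equations over the concatenated input {x_t, c_t}
   local xc = nn.JoinTable(1)({x, c})
   local function gate(nonlinearity, hInput)
      return nonlinearity(nn.CAddTable()({
         nn.Linear(inputSize + hiddenSize, hiddenSize)(xc),
         nn.Linear(hiddenSize, hiddenSize)(hInput)
      }))
   end
   local r = gate(nn.Sigmoid(), hPrev)
   local z = gate(nn.Sigmoid(), hPrev)
   local hTilde = gate(nn.Tanh(), nn.CMulTable()({r, hPrev}))
   -- h_t = (1 - z) .* h_tm1 + z .* hTilde
   local h = nn.CAddTable()({
      nn.CMulTable()({nn.AddConstant(1)(nn.MulConstant(-1)(z)), hPrev}),
      nn.CMulTable()({z, hTilde})
   })
   return nn.gModule({x, hPrev, enc}, {h})
end
```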

I'm not particularly satisfied with this, since it means a lot of nearly duplicated code in handling the forward / backward passes. That by itself isn't too bad, since the same duplication already exists between GRU and LSTM. But it would also be slower, and it wouldn't be possible to use the trick of multiplying the encoder states by the attention projection only once for the whole sequence (marked in the fused sketch below) instead of once per decoder step. I'm not sure how important this is, since anything that's really speed-critical needs to be written at a lower level anyway.

The other option would be to write it as a single fused module, something like SeqGRUAttention, sketched below. This seems to involve a lot of code redundancy for similar reasons, but it wouldn't have to worry about playing nicely with Sequencer or about repeating the boilerplate in GRU.lua. I think the major disadvantage of this approach is that it's less transparent what's going on, since the gradients are computed by hand.
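A forward-only sketch of what that could look like, with the hand-written backward omitted; the parameter layout, names, and non-batched shapes are all assumptions for illustration:

```lua
-- Hypothetical fused module for option 2: forward({x, enc}) where
-- x is T x inputSize (decoder inputs) and enc is L x hiddenSize.
require 'nn'

local SeqGRUAttention, parent = torch.class('nn.SeqGRUAttention', 'nn.Module')

function SeqGRUAttention:__init(inputSize, hiddenSize)
   parent.__init(self)
   self.inputSize, self.hiddenSize = inputSize, hiddenSize
   local H = hiddenSize
   -- (a real module would register these so :parameters() sees them,
   -- and allocate matching gradient tensors)
   self.Wa = torch.randn(H, H)                 -- encoder projection (attention)
   self.Ua = torch.randn(H, H)                 -- hidden projection (attention)
   self.va = torch.randn(H)                    -- attention score vector
   self.Wg = torch.randn(2 * H, inputSize + H) -- r, z gates from [x; c]
   self.Ug = torch.randn(2 * H, H)             -- r, z gates from h
   self.Wc = torch.randn(H, inputSize + H)     -- candidate from [x; c]
   self.Uc = torch.randn(H, H)                 -- candidate from r .* h
end

function SeqGRUAttention:updateOutput(input)
   local x, enc = input[1], input[2]
   local T, H = x:size(1), self.hiddenSize
   -- the one-time trick: project the encoder states once for all T steps
   local encProj = enc * self.Wa:t()           -- L x H
   local h = torch.zeros(H)
   self.output = torch.zeros(T, H)
   for t = 1, T do
      -- attention weights from encProj and the previous hidden state
      local e = torch.tanh(encProj + (self.Ua * h):view(1, H):expandAs(encProj)) * self.va
      local alpha = torch.exp(e - e:max())
      alpha:div(alpha:sum())
      local c = enc:t() * alpha                -- context vector, size H
      -- one GRU step on the concatenation {x[t], c}
      local xc = torch.cat(x[t], c)
      local gates = torch.sigmoid(self.Wg * xc + self.Ug * h)
      local r, z = gates:narrow(1, 1, H), gates:narrow(1, H + 1, H)
      local hTilde = torch.tanh(self.Wc * xc + self.Uc * torch.cmul(r, h))
      h = h + torch.cmul(z, hTilde - h)        -- (1 - z) .* h + z .* hTilde
      self.output[t]:copy(h)
   end
   return self.output
end
```

The module consumes the whole decoder sequence at once, which is why it wouldn't need Sequencer, but updateGradInput and accGradParameters would have to be written by hand over the same loop.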

I'm slightly leaning towards the second.

gaosh commented 7 years ago

I implemented the temporal attention model with an LSTM from Describing Videos by Exploiting Temporal Structure. I wrote it as a SeqLSTMAttention module; you can also take a look at this post.
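In case it helps the comparison, here is a guess at the shape of a weights-only module like that (not gaosh's actual code; the additive scoring and all names are assumptions):

```lua
-- Hypothetical weights-only attention: forward({enc, h}) returns soft
-- weights over the encoder time steps; combining them with the LSTM/GRU
-- is left to the caller. enc is seqLen x hiddenSize, h is hiddenSize.
require 'nn'
require 'nngraph'

local function temporalAttention(hiddenSize, seqLen)
   local enc, h = nn.Identity()(), nn.Identity()()
   local scores = nn.Linear(hiddenSize, 1)(nn.Tanh()(nn.CAddTable()({
      nn.Linear(hiddenSize, hiddenSize)(enc),
      nn.Replicate(seqLen)(nn.Linear(hiddenSize, hiddenSize)(h))
   })))
   return nn.gModule({enc, h}, {nn.SoftMax()(nn.View(-1)(scores))})
end
```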

juesato commented 7 years ago

This doesn't seem to be the same thing, unless I'm missing something. I want attention integrated into the internal dynamics of the GRU, whereas this module takes an encoding and a hidden state and gives you weights, so it would still need to be wired into a GRU. If there's a clean way to do that, I'd be interested.

gaosh commented 7 years ago

Can you show me exactly which paper you want to implement?

juesato commented 7 years ago

Sure, either Bahdanau et al. 2014 (Neural Machine Translation by Jointly Learning to Align and Translate) or Luong et al. 2015 (Effective Approaches to Attention-based Neural Machine Translation) would do.
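For reference, the standard score functions from those two papers (the relevant difference for this issue: Bahdanau computes the context from h_{t-1} before the recurrent step, so attention sits inside the recurrence, while Luong applies attention to h_t after the step):

```latex
% \bar{h}_i are the encoder states.
% Bahdanau et al. 2014 (additive):
e_{t,i} = v_a^\top \tanh(W_a h_{t-1} + U_a \bar{h}_i)
% Luong et al. 2015 (the "general" form):
e_{t,i} = h_t^\top W_a \bar{h}_i
% In both cases the weights and context are
\alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_j \exp(e_{t,j})}, \qquad
c_t = \sum_i \alpha_{t,i} \, \bar{h}_i
```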