Hi Chris,
I have been looking at your code for the attention layer in the classifier; thank you for the high-quality code.
I have a question, though. It looks like you are computing attention between each output in the sequence and the last hidden state. However, I believe the hidden state hn is also the last element of the output, output[-1] (I think PyTorch and Keras use the same definition here):

import torch
import torch.nn as nn

# Single-layer, unidirectional GRU: input size 10, hidden size 20
rnn = nn.GRU(10, 20, 1, bidirectional=False)
input = torch.randn(5, 3, 10)    # (seq_len, batch, input_size)
h0 = torch.randn(1 * 1, 3, 20)   # (num_layers * num_directions, batch, hidden_size)
output, hn = rnn(input, h0)
hn = hn.view(1, 1, 3, 20)[-1].squeeze(0)  # last layer's hidden state: (3, 20)
print(hn)
print(output[-1])  # prints the same values as hn
Therefore, I am not sure about the logic of the attention here. I see no problem in computing attention between hn and output[0:-1], but my main question is: why compute an attention weight between hn and output[-1]? That is essentially hn.dot(hn), and that similarity should in general be the highest.

I am not an expert on this, but this is my doubt. Thank you.
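To make the worry concrete, here is a minimal sketch (plain Python with made-up numbers, not your actual layer) of dot-product attention where the query is the last output itself; the self-score hn.dot(hn) = ||hn||^2 tends to dominate, so the softmax weight on output[-1] comes out largest:

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

# Hypothetical toy sequence outputs; the last row plays the role of
# hn == output[-1] (for a single-layer, unidirectional RNN)
outputs = [
    [0.2, -0.1, 0.4],
    [0.5, 0.3, -0.2],
    [0.9, 0.8, 0.7],  # stands in for hn
]
hn = outputs[-1]

# Dot-product attention scores between hn and every output, itself included
scores = [sum(h * o for h, o in zip(hn, row)) for row in outputs]
weights = softmax(scores)

# The self-score is hn . hn, which here is the largest of the three,
# so output[-1] receives the biggest attention weight
assert max(weights) == weights[-1]
print(weights)
```

With these numbers the self-score is 1.94 against 0.38 and 0.55 for the other positions, so most of the attention mass lands on output[-1]; of course, with learned score functions (rather than a raw dot product) this need not always hold.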
Regards,