Hi Chris,
I have been looking at your code for the attention layer in the classifier; thank you for the high-quality code.
I have a question, though. It looks like you are computing attention between each output in the sequence and the last hidden state. However, I believe the hidden state hn is also the last element of the output, output[-1] (I think PyTorch and Keras use the same definition here):

import torch
import torch.nn as nn

# Single-layer, unidirectional GRU: input size 10, hidden size 20
rnn = nn.GRU(10, 20, 1, bidirectional=False)
input = torch.randn(5, 3, 10)    # (seq_len, batch, input_size)
h0 = torch.randn(1 * 1, 3, 20)   # (num_layers * num_directions, batch, hidden_size)
output, hn = rnn(input, h0)
hn = hn.view(1, 1, 3, 20)[-1].squeeze(0)  # last layer's hidden state: (3, 20)
print(hn)
print(output[-1])  # prints the same values as hn
Therefore, I am not sure about the logic of the attention here. I see no problem in computing attention between hn and output[0:-1], but my main question is: why compute an attention weight between hn and output[-1]? That is essentially hn.dot(hn), and that similarity should in general be the highest.

I am not an expert on this, but this is my doubt. Thank you.
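To make the worry concrete, here is a minimal sketch (plain Python with made-up numbers, not your actual layer) of dot-product attention where the query is the last output itself; the self-score hn.dot(hn) = ||hn||^2 tends to dominate, so the softmax weight on output[-1] comes out largest:

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

# Hypothetical toy sequence outputs; the last row plays the role of
# hn == output[-1] (for a single-layer, unidirectional RNN)
outputs = [
    [0.2, -0.1, 0.4],
    [0.5, 0.3, -0.2],
    [0.9, 0.8, 0.7],  # stands in for hn
]
hn = outputs[-1]

# Dot-product attention scores between hn and every output, itself included
scores = [sum(h * o for h, o in zip(hn, row)) for row in outputs]
weights = softmax(scores)

# The self-score is hn . hn, which here is the largest of the three,
# so output[-1] receives the biggest attention weight
assert max(weights) == weights[-1]
print(weights)
```

With these numbers the self-score is 1.94 against 0.38 and 0.55 for the other positions, so most of the attention mass lands on output[-1]; of course, with learned score functions (rather than a raw dot product) this need not always hold.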
Regards,