Hi, I found that the weights computed by nn.Attention are always 1 in this example:
atten = nn.Attention(3, attention_type='dot')
a = Variable(torch.randn(1, 1, 3))
b = Variable(torch.randn(1, 2, 3))
output, weights = atten(a, b)
and the output is:
Variable containing:
(0 ,.,.) =
  1  1
[torch.FloatTensor of size 1x1x2]
The weights are always 1 no matter what I feed into the attention layer with shape [1, n, m], that is, whenever the batch size is 1.
Then I checked the code and found the nn.Softmax(dims=0). In my opinion, this operator applies softmax along the first dimension (for a tensor of shape [z, y, x], it is applied along the z dimension). But the first dimension here is batch_size * output_len, so the softmax runs in the wrong direction; it should be applied along the last dimension, query_len, which holds the actual scores for every context position in an instance.
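As a quick sanity check (a standalone sketch with plain tensors and the current PyTorch argument name dim, not the library's own code), softmax along the first dimension normalizes over the z axis of a [z, y, x] tensor:

import torch
import torch.nn as nn

x = torch.randn(4, 2, 3)    # shape [z, y, x]
out = nn.Softmax(dim=0)(x)  # normalize along the z dimension
print(out.sum(dim=0))       # all ones: each (y, x) position sums to 1 over z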
E.g. given a query of shape [1, 1, 3] and a context of shape [1, 2, 3] (i.e. the context is a 2-row, 3-column matrix), the context is transposed to [1, 3, 2] and the scoring step produces scores of shape [1, 1, 2]. In your implementation, the scores are then reshaped to [1, 2] and softmax is applied along the first dimension, which has size 1, so the resulting weights are always 1.
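Here is a minimal sketch of that walk-through (plain tensors and torch.nn.functional, not the library's Attention module), showing why the softmax over the first dimension of the reshaped [1, 2] scores is trivially all ones:

import torch
import torch.nn.functional as F

query = torch.randn(1, 1, 3)    # [batch, output_len, dim]
context = torch.randn(1, 2, 3)  # [batch, query_len, dim]

# dot scoring: [1, 1, 3] x [1, 3, 2] -> [1, 1, 2]
scores = torch.bmm(query, context.transpose(1, 2))

# reshape to [batch * output_len, query_len] = [1, 2]
flat = scores.view(-1, scores.size(-1))

# softmax along dim 0, which has size 1: every weight comes out as 1
print(F.softmax(flat, dim=0))   # tensor([[1., 1.]])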
I changed it to nn.Softmax(dims=-1), which operates along the last dimension (query_len), and it worked:
Variable containing:
(0 ,.,.) =
  0.1426  0.8574
[torch.FloatTensor of size 1x1x2]
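For completeness, a standalone check (plain tensors, illustrative values only) that softmax over the last dimension of the reshaped scores yields a proper distribution:

import torch
import torch.nn.functional as F

flat = torch.randn(1, 2)           # reshaped scores: [batch * output_len, query_len]
weights = F.softmax(flat, dim=-1)  # normalize over query_len
print(weights)                     # e.g. tensor([[0.1426, 0.8574]]); values vary
print(weights.sum(dim=-1))         # each row sums to 1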
Thx for your great work -Chen