Closed APCunha closed 1 year ago
Hi @APCunha, thanks for the question, good point. We used softmax=T as the default for get_attention, as this also works and is a bit more efficient; it is indeed not the same as what you get if calling get_instance_attention directly. I suggest you open a PR with your change, as it nicely covers both use cases (and removes the need for manual calls to the forward_attention method; such calls are currently needed to replicate the quoted formula, as you perhaps observed).
Done! However, do you think it would be better to do something like this:
```python
def forward_attention(self, input_space):
    # Accumulators for the attended outputs and the attention matrices.
    placeholder = torch.zeros(input_space.shape).to(self.device)
    placeholder2 = torch.zeros(input_space.shape).to(self.device)
    for k in range(len(self.multi_head)):
        # Softmax is applied per head, before averaging.
        attention_matrix = self.softmax3(self.multi_head[k](input_space))
        attended_matrix = attention_matrix * input_space
        placeholder = torch.add(placeholder, attended_matrix)
        placeholder2 = torch.add(placeholder2, attention_matrix)
    # Average over the heads.
    placeholder /= len(self.multi_head)
    placeholder2 /= len(self.multi_head)
    out = placeholder
    return out, placeholder2  # return the averaged attention in addition to out
```
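For completeness, here is a self-contained NumPy sketch of the same averaging logic (not the actual torch module; the head weights, shapes, and helper names are illustrative assumptions): softmax is applied per head first, and the attended outputs and attention matrices are then averaged across heads.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def forward_attention(input_space, head_weights):
    # Mirrors the proposed method: per-head softmax, then averaging.
    out = np.zeros_like(input_space)
    attn = np.zeros_like(input_space)
    for W in head_weights:
        attention_matrix = softmax(input_space @ W)  # softmax per head
        out += attention_matrix * input_space        # attended output
        attn += attention_matrix
    return out / len(head_weights), attn / len(head_weights)

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 4))                          # 5 instances, 4 features
heads = [rng.normal(size=(4, 4)) for _ in range(3)]  # 3 hypothetical heads
out, attn = forward_attention(x, heads)
# Each averaged attention row still sums to 1 (an average of distributions).
assert np.allclose(attn.sum(axis=1), 1.0)
```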
Merged the first one for now; I don't see an apparent need for this, but feel free to open another PR and we can test it out.
Hello, I have a question regarding the implementation of SAN, especially the "Instance-level aggregations" as you call them in the paper. I assume we get those values when calling the get_attention function with return_softmax = True. In the forward_attention function, though (https://github.com/SkBlaz/san/blob/4ab2397570c9131c9983eb53475760929a05a343/san/__init__.py#L67C1-L80C19), when return_softmax = True we get a separate matrix for each head, add them up via placeholder, divide by the number of heads (performing the average), and only then apply the softmax function. Is this correct? If so, my question is: aren't you actually first doing the average and only then performing the softmax on the average, rather than the other way around of first applying the softmax to each head and then averaging, as seen in the paper:
Shouldn't it be something like this? (This way we get the instance attention matrix with return_softmax = False, as well as the output.)
It's like factoring out the X in this expression: Am I overlooking something, or is this a stupid question? Sorry if I phrased it in a confusing manner.
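To see why the order matters: softmax and averaging do not commute, so averaging the per-head scores and then applying softmax generally yields a different matrix than averaging the per-head softmaxes. A small NumPy check (the score values are arbitrary):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Raw (pre-softmax) attention scores from two hypothetical heads.
h1 = np.array([[0.0, 2.0]])
h2 = np.array([[1.0, 0.0]])

avg_then_softmax = softmax((h1 + h2) / 2)           # average first, softmax after
softmax_then_avg = (softmax(h1) + softmax(h2)) / 2  # softmax per head, then average

# Both are valid distributions (rows sum to 1)...
assert np.allclose(avg_then_softmax.sum(axis=1), 1.0)
assert np.allclose(softmax_then_avg.sum(axis=1), 1.0)
# ...but they are not the same matrix.
assert not np.allclose(avg_then_softmax, softmax_then_avg)
```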