jialeli1 opened this issue 3 years ago
Hi, good question. I do not think it is wrong. Please pay attention to the dimension along which the normalization is applied, which is different from the original self-attention.
I think @jialeli1 is right. If you don't transpose the attention matrix before the matrix product, the product makes no sense (pay attention to what each dimension means). My guess is that because the author didn't transpose the attention matrix, he needed the normalization proposed in the paper. However, if you do transpose the attention matrix and apply the normalization from the original attention paper, you will find that the proposed normalization is not necessary. I have re-implemented the segmentation code in PyTorch and got quite a good result.
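For what it's worth, here is a minimal PyTorch sketch of the two options discussed above. This is not the repository's actual code: the tensor names and the shapes `B`, `C`, `N` are my own assumptions, and the column-wise softmax in Option 2 is only a stand-in for the normalization proposed in the paper.

```python
import torch

B, C, N = 2, 64, 1024        # assumed: batch size, channels, number of points
q = torch.randn(B, N, C)     # queries, one row per output point
k = torch.randn(B, C, N)     # keys
v = torch.randn(B, C, N)     # values

energy = torch.bmm(q, k)     # (B, N, N); energy[b, i, j] relates query i to key j

# Option 1: standard softmax over the key dimension, then transpose before the
# product, so each output point becomes a convex combination of value vectors.
attn_rows = torch.softmax(energy, dim=-1)                   # each row sums to 1
out_transposed = torch.bmm(v, attn_rows.transpose(1, 2))    # (B, C, N)

# Option 2: keep the matrix orientation but normalize over dim=1 instead, so
# the columns that bmm consumes as weights already sum to 1 (a stand-in for
# the paper's normalization, not an exact reproduction of it).
attn_cols = torch.softmax(energy, dim=1)                    # each column sums to 1
out_untransposed = torch.bmm(v, attn_cols)                  # (B, C, N)

# The two options are not numerically identical, but in both cases every
# output point is weighted over the input points with weights summing to 1.
print(out_transposed.shape, out_untransposed.shape)
```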
@JunweiZheng93 Could you share your implementation code? Thank you so much
Hi.
As shown here, the attention matrix should be transposed before the matrix product, if I understand it correctly.
Here is my draft calculation of the dimensions in the matrix product.
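In case the attached draft does not render, here is a rough version of the same dimension bookkeeping as a PyTorch snippet (the shapes `B`, `C`, `N` are assumed, not taken from the repository):

```python
import torch

B, C, N = 2, 64, 1024                 # assumed batch size, channels, points
v = torch.randn(B, C, N)              # values:           (B, C, N)
attention = torch.rand(B, N, N)       # attention matrix: (B, N, N)

# Without the transpose: out[:, :, j] = sum_i v[:, :, i] * attention[:, i, j],
# i.e. output point j is weighted by *column* j of the attention matrix.
out = torch.bmm(v, attention)                       # (B, C, N) x (B, N, N) -> (B, C, N)

# With the transpose: out_t[:, :, j] = sum_i v[:, :, i] * attention[:, j, i],
# i.e. output point j is weighted by *row* j, which is the axis a standard
# softmax(dim=-1) normalizes to 1.
out_t = torch.bmm(v, attention.transpose(1, 2))     # also (B, C, N)
```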