Closed by 123456789asdfjkl 1 year ago
I'm also puzzled by this point: applying softmax over the last dimension of A makes the attention scores all 1.
I think so.
Hello Guys,
Thank you for your note.
We are verifying it and will update you ASAP.
Best regards, Abdelrahman.
Hello,
I reproduced your issue. Yes, with dim=-1 the attention weights are all ones. However, changing the dimension to 1 reduces the performance. The way I solved this issue, while matching the reported performance for all models, is to replace the softmax with a plain normalization function, as follows:
A = torch.nn.functional.normalize(A, dim=1)
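For concreteness, here is a minimal sketch (with an assumed batch size and token count; the shape [B, N, 1] comes from the thread) showing why softmax over the size-1 last dimension collapses to all ones, and what the suggested normalize-based replacement produces instead:

```python
import torch

# Attention logits with the shape discussed in the thread: [B, N, 1]
A = torch.randn(2, 4, 1)

# Softmax over the last dimension (size 1): each softmax is taken over a
# single element, so every attention weight becomes exactly 1.
weights = torch.softmax(A, dim=-1)
print(torch.allclose(weights, torch.ones_like(A)))  # True

# The suggested replacement: L2-normalize along the token dimension (dim=1)
# instead of applying softmax, which preserves relative differences across tokens.
A_norm = torch.nn.functional.normalize(A, dim=1)
print(A_norm.shape)  # torch.Size([2, 4, 1])
```

Note that `normalize` divides by the L2 norm along the given dimension rather than producing a probability distribution, so the resulting weights can be negative; per the thread, this matched the originally reported performance.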
We will update the code and models with this fix soon.
Best regards, Abdelrahman.
https://github.com/Amshaker/SwiftFormer/blob/2d331149678472c72b712fd72edfc94453870798/models/swiftformer.py#L175 Hi! Thank you for your great work! A's shape is [B, N, 1], so I think the softmax should be applied over dim=1, don't you think?