Amshaker / SwiftFormer

[ICCV'23] Official repository of the paper "SwiftFormer: Efficient Additive Attention for Transformer-based Real-time Mobile Vision Applications"

some problems about code #2

Closed (123456789asdfjkl closed this issue 1 year ago)

123456789asdfjkl commented 1 year ago

https://github.com/Amshaker/SwiftFormer/blob/2d331149678472c72b712fd72edfc94453870798/models/swiftformer.py#L175 Hi! Thank you for your great work! A's shape is [B, N, 1], so I think it should be dim=1, don't you think?

sunny2109 commented 1 year ago

I'm also puzzled by this point: applying softmax on the last dimension of A makes the attention scores all equal to 1.
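For example (a minimal check, assuming A has shape [B, N, 1] as in the linked line), softmax over the size-1 last dimension returns all ones, while dim=1 gives a proper distribution over the N tokens:

```python
import torch

A = torch.randn(2, 4, 1)  # stand-in for the [B, N, 1] attention scores

print(torch.softmax(A, dim=-1))             # every entry is exactly 1.0 (softmax over a size-1 dim)
print(torch.softmax(A, dim=1).sum(dim=1))   # sums to 1 over the N tokens, as intended
```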

cslvjt commented 1 year ago

I think so.

Amshaker commented 1 year ago

Hello guys,

Thank you for your note.

We are verifying it and will update you ASAP.

Best regards, Abdelrahman.

Amshaker commented 1 year ago

Hello,

I reproduced your issue. Yes, with the softmax applied over the last dimension (dim=-1), the attention weights are all ones. However, changing the softmax dimension to 1 reduces the performance. The way I solved this issue and reproduced the same performance for all models is to replace the softmax with a traditional normalization function as follows:

A = torch.nn.functional.normalize(A, dim=1)
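For context, here is a minimal sketch of how this replacement sits in the query-weighting step; it is simplified and uses illustrative names, not the exact repository code:

```python
import torch
import torch.nn as nn

class QueryWeighting(nn.Module):
    """Simplified sketch of the query-weighting step in efficient additive attention."""

    def __init__(self, dim):
        super().__init__()
        self.w_g = nn.Parameter(torch.randn(dim, 1))  # learnable global query vector
        self.scale_factor = dim ** -0.5

    def forward(self, query):                 # query: [B, N, dim]
        A = query @ self.w_g                  # [B, N, 1] score per token
        A = A * self.scale_factor
        # Previously: A = A.softmax(dim=-1)   -> all ones over the size-1 last dim
        A = torch.nn.functional.normalize(A, dim=1)  # normalize over the N tokens instead
        G = torch.sum(A * query, dim=1)       # [B, dim] global query descriptor
        return G
```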

We will update the code and the released models with this fix soon.

Best regards, Abdelrahman.