Firstly, thank you to all the authors for their impressive work. SegFormer has indeed demonstrated extraordinary performance.
While studying the code carefully, I noticed that the article says the Efficient Self Attention module reshapes the $K$ matrix and uses a linear projection to reduce the attention cost by a factor of $R$, with a different reduction ratio at each stage ($[64, 16, 4, 1]$ from stage-1 to stage-4). I assume this part of the code lives in the `Attention` class, but after reading it multiple times, it looks to me like that block only implements ordinary multi-head self-attention; I could not find any code matching the Efficient Self Attention module described in the article.
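For concreteness, below is a minimal sketch of the sequence-reduction step I expected to find, written in PyTorch. This is my own reconstruction from the paper, not the authors' code: the class name, the `sr`/`sr_ratio` names, and the use of a strided `Conv2d` for the reduction are assumptions on my part (a stride of $[8, 4, 2, 1]$ per stage would shrink the $K$/$V$ sequence length by $R = [64, 16, 4, 1]$).

```python
import torch
import torch.nn as nn

class EfficientSelfAttention(nn.Module):
    """Sketch of SegFormer-style efficient self-attention (my reading of the paper).

    K and V are computed from a spatially reduced sequence: tokens are reshaped
    back into a feature map and downsampled with a strided Conv2d, shrinking the
    K/V sequence length (and the attention cost) by sr_ratio**2 == R.
    """

    def __init__(self, dim, num_heads=8, sr_ratio=8):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5

        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, dim * 2)
        self.proj = nn.Linear(dim, dim)

        self.sr_ratio = sr_ratio
        if sr_ratio > 1:
            # Sequence reduction: kernel == stride == sr_ratio, so H*W tokens
            # become (H/sr_ratio)*(W/sr_ratio) tokens, i.e. N / sr_ratio**2.
            self.sr = nn.Conv2d(dim, dim, kernel_size=sr_ratio, stride=sr_ratio)
            self.norm = nn.LayerNorm(dim)

    def forward(self, x, H, W):
        B, N, C = x.shape  # N == H * W
        q = self.q(x).reshape(B, N, self.num_heads, C // self.num_heads).transpose(1, 2)

        if self.sr_ratio > 1:
            # Reshape tokens to a feature map, downsample, then flatten back.
            x_ = x.transpose(1, 2).reshape(B, C, H, W)
            x_ = self.sr(x_).reshape(B, C, -1).transpose(1, 2)
            x_ = self.norm(x_)
        else:
            x_ = x

        # K and V come from the reduced sequence, so attention is (N x N/R).
        kv = self.kv(x_).reshape(B, -1, 2, self.num_heads, C // self.num_heads)
        kv = kv.permute(2, 0, 3, 1, 4)
        k, v = kv[0], kv[1]

        attn = (q @ k.transpose(-2, -1)) * self.scale
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)
```

If something along these lines exists elsewhere in the repo (for example, folded into the K/V projection), a pointer to the relevant lines would clear this up for me.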
Did I misunderstand something, or is this part missing from the code? I would really appreciate an answer, as this is important to me. Thanks!