Firstly, thank you to all the authors for their impressive work. SegFormer has indeed demonstrated extraordinary performance.
While studying the code carefully, I noticed that the article says the Efficient Self Attention module reshapes the $K$ matrix and uses a linear projection to reduce the attention cost by a factor of $R$, with a different reduction ratio at each stage ($[64, 16, 4, 1]$ from stage-1 to stage-4). I assume this part of the code lives in the `Attention` class, but after reading it multiple times, it looks to me like that block only implements ordinary multi-head self-attention; I could not find any code matching the Efficient Self Attention module described in the article.
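For concreteness, below is a minimal sketch of the sequence-reduction step I expected to find, written in PyTorch. This is my own reconstruction from the paper, not the authors' code: the class name, the `sr`/`sr_ratio` names, and the use of a strided `Conv2d` for the reduction are assumptions on my part (a stride of $[8, 4, 2, 1]$ per stage would shrink the $K$/$V$ sequence length by $R = [64, 16, 4, 1]$).

```python
import torch
import torch.nn as nn

class EfficientSelfAttention(nn.Module):
    """Sketch of SegFormer-style efficient self-attention (my reading of the paper).

    K and V are computed from a spatially reduced sequence: tokens are reshaped
    back into a feature map and downsampled with a strided Conv2d, shrinking the
    K/V sequence length (and the attention cost) by sr_ratio**2 == R.
    """

    def __init__(self, dim, num_heads=8, sr_ratio=8):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5

        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, dim * 2)
        self.proj = nn.Linear(dim, dim)

        self.sr_ratio = sr_ratio
        if sr_ratio > 1:
            # Sequence reduction: kernel == stride == sr_ratio, so H*W tokens
            # become (H/sr_ratio)*(W/sr_ratio) tokens, i.e. N / sr_ratio**2.
            self.sr = nn.Conv2d(dim, dim, kernel_size=sr_ratio, stride=sr_ratio)
            self.norm = nn.LayerNorm(dim)

    def forward(self, x, H, W):
        B, N, C = x.shape  # N == H * W
        q = self.q(x).reshape(B, N, self.num_heads, C // self.num_heads).transpose(1, 2)

        if self.sr_ratio > 1:
            # Reshape tokens to a feature map, downsample, then flatten back.
            x_ = x.transpose(1, 2).reshape(B, C, H, W)
            x_ = self.sr(x_).reshape(B, C, -1).transpose(1, 2)
            x_ = self.norm(x_)
        else:
            x_ = x

        # K and V come from the reduced sequence, so attention is (N x N/R).
        kv = self.kv(x_).reshape(B, -1, 2, self.num_heads, C // self.num_heads)
        kv = kv.permute(2, 0, 3, 1, 4)
        k, v = kv[0], kv[1]

        attn = (q @ k.transpose(-2, -1)) * self.scale
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)
```

If something along these lines exists elsewhere in the repo (for example, folded into the K/V projection), a pointer to the relevant lines would clear this up for me.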
Did I misunderstand something, or is this part missing from the code? I would really appreciate an answer, as this is important to me. Thanks!