kv = (k.transpose(-2, -1) * (N ** -0.5)) @ (v * (N ** -0.5))
I understand that N here is for scaling the dot-product attention. But the shapes of k and v have been changed by sr_ratio to (B, N1, C), where N1 = N / sr_ratio**2. So shouldn't there be an `if self.sr_ratio > 1:` branch for this part that replaces N with N1? Since this isn't mentioned in the paper, it's just my personal understanding, and I hope you will be able to explain it. Thank you very much.
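To make the question concrete, here is a minimal sketch (with hypothetical sizes B, N, C and sr_ratio, not the actual model configuration) showing that k and v indeed carry N1 = N / sr_ratio**2 tokens after spatial reduction, and that because the scale is just a scalar constant, replacing N with N1 would only rescale kv by a fixed factor N / N1 rather than change its shape:

```python
import torch

# Hypothetical sizes for illustration only.
B, N, C = 2, 64, 32       # batch, query tokens, channels
sr_ratio = 2
N1 = N // sr_ratio ** 2   # token count of k and v after spatial reduction

k = torch.randn(B, N1, C)
v = torch.randn(B, N1, C)

# The line in question: both factors are scaled by N ** -0.5,
# even though k and v now have only N1 tokens.
kv = (k.transpose(-2, -1) * (N ** -0.5)) @ (v * (N ** -0.5))
print(kv.shape)  # (B, C, C): the scale affects magnitude, not shape

# Using N1 instead of N changes kv only by the constant factor N / N1.
alt = (k.transpose(-2, -1) * (N1 ** -0.5)) @ (v * (N1 ** -0.5))
assert torch.allclose(kv * (N / N1), alt)
```

So the choice between N and N1 does not affect any tensor shapes; it only determines the normalization constant applied to the k-v product.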