Question about SwinTransformer adopted in your decoder

I have been struggling to reproduce your results. According to your paper, SwinTransformer is adopted in each layer of your spatial-temporal decoder. There appear to be four layers, each corresponding to one of four BEV query maps (of sizes 200x200, 100x100, 50x50, and 25x25). The paper also states that the window size for SwinTransformer is set to 4x4. However, I don't think this works for every layer: 50 and 25 are not divisible by 4, so the original SwinTransformer cannot split the 50x50 and 25x25 BEV query maps into 4x4 windows. I therefore assume you must have resized the BEV query maps before feeding them into SwinTransformer. Could you give me some hints about this? I have implemented your model based on the paper so far, but I couldn't even get close to your results.
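To make the mismatch concrete, here is a quick check of the four map sizes against a 4x4 window. It is only a sketch of the arithmetic: the helper name `window_padding` is my own, and padding each side up to the next multiple of the window size is just one workaround I am assuming, not something I know your implementation does.

```python
def window_padding(size: int, window: int) -> int:
    """Return how many cells must be appended so `size` becomes a multiple of `window`."""
    return (window - size % window) % window


WINDOW = 4                      # window size stated in the paper
BEV_SIZES = [200, 100, 50, 25]  # side lengths of the four BEV query maps

for size in BEV_SIZES:
    pad = window_padding(size, WINDOW)  # hypothetical padding-based workaround
    padded = size + pad
    print(f"{size}x{size}: remainder {size % WINDOW}, "
          f"pad {pad} -> {padded}x{padded}")
```

Under this assumption the 200x200 and 100x100 maps need no change, while 50x50 would have to become 52x52 and 25x25 would have to become 28x28 (or be resized some other way) before window partitioning.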

MediaBrain-SJTU / TBP-Former
