Thanks for sharing the great work!
Have you considered deformable attention? I believe in the paper you compare queries at each map location to keys at each pixel across all six perspective views, right?
Great point - you're correct in describing our method.
Deformable attention also makes a lot of sense, and in fact it is used by BEVFormer, which is the second-best-performing camera-only method on nuScenes detection.
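For readers following along, here is a rough NumPy sketch of the dense comparison described above: every map-location query is scored against every pixel key from all six views. This is an illustration only, not the paper's implementation. The function name `dense_cross_view_attention` and the toy sizes are assumptions, and it leaves out learned projections, multiple heads, and any camera or positional embeddings.

```python
import numpy as np

def dense_cross_view_attention(queries, keys, values):
    """Hypothetical helper: softmax attention over all pixels of all views.

    queries: (num_map_cells, d)
    keys, values: (num_views, H, W, d)
    """
    d = queries.shape[-1]
    k = keys.reshape(-1, d)                    # flatten views and pixels
    v = values.reshape(-1, values.shape[-1])
    scores = queries @ k.T / np.sqrt(d)        # (num_map_cells, num_views*H*W)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

# Toy sizes, chosen only to keep the example small.
rng = np.random.default_rng(0)
num_views, H, W, d = 6, 4, 8, 16
num_map_cells = 25
queries = rng.normal(size=(num_map_cells, d))
keys = rng.normal(size=(num_views, H, W, d))
values = rng.normal(size=(num_views, H, W, d))

out, weights = dense_cross_view_attention(queries, keys, values)
print(out.shape, weights.shape)
```

Each query here is compared with all `num_views * H * W` keys, so the cost grows with image resolution. Deformable attention, by contrast, has each query attend to a small set of sampled locations rather than every pixel.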