DYZhang09 opened 10 months ago
Hi, thanks for your fantastic work! I have a question about the implementation of the voxel self-attention.
The paper writes: "These sampling points share the same height $z_k$, but with different learnable offsets for $(x_i^m, y_j^m)$. This encourages the voxel queries to interact in the BEV plane." To my understanding, this operation is equivalent to splitting the voxel features into BEV slices along the height dimension and performing deformable attention separately for each BEV slice. Is that right? I also wonder why 3D deformable attention is not used directly.
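To make sure I describe my reading correctly, here is a minimal sketch of it (not the repo's actual code): each height slice is treated as a single-level 2D feature map and attended within its own BEV plane, so the offsets only move the sampling points in $(x, y)$. I use mmcv's pure-PyTorch fallback `multi_scale_deformable_attn_pytorch` in place of the CUDA op, and random tensors stand in for the learned offsets and weights.

```python
# Sketch of "split into BEV slices, 2D deformable attention per slice" (not the repo's code).
import torch
from mmcv.ops.multi_scale_deform_attn import multi_scale_deformable_attn_pytorch

B, H, W, Z, C = 1, 50, 50, 16, 64
num_heads, num_points = 4, 8

voxel_feat = torch.rand(B, H, W, Z, C)          # dense voxel features (random stand-in)

slice_outputs = []
for z in range(Z):                               # one BEV slice per height z_k
    value_z = voxel_feat[:, :, :, z].reshape(B, H * W, num_heads, C // num_heads)
    spatial_shapes = torch.tensor([[H, W]])      # a single 2D level per slice
    # the (x, y) offsets would normally be predicted from the query by a linear
    # layer; random values in [0, 1] here just keep the sketch runnable
    sampling_locations = torch.rand(B, H * W, num_heads, 1, num_points, 2)
    attention_weights = torch.rand(B, H * W, num_heads, 1, num_points).softmax(-1)
    slice_outputs.append(multi_scale_deformable_attn_pytorch(
        value_z, spatial_shapes, sampling_locations, attention_weights))

out = torch.stack(slice_outputs, dim=2)          # (B, H*W, Z, C): per-slice BEV attention output
```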
By the way, I think the implementation of the voxel self-attention may also have some problems.

https://github.com/Robertwyq/PanoOcc/blob/898b2a457af45bcef8bdbbe8cff10e3eef26485f/projects/mmdet3d_plugin/bevformer/modules/occ_temporal_attention.py#L244-L254

The `spatial_shapes` here is `[50, 50, 16]`, which is a 3-dimensional vector. However, the implementation of `MultiScaleDeformableAttnFunction_fp32` seems to accept only 2-dimensional spatial shapes (lines 276 & 277), which means it can only attend to the first 50x50=2500 queries.

https://github.com/Robertwyq/PanoOcc/blob/898b2a457af45bcef8bdbbe8cff10e3eef26485f/ops/src/cuda/ms_deform_im2col_cuda.cuh#L272-L294

Is there something wrong with this implementation? Or maybe there are some details that I didn't notice.

---

I have the same question. The input to `MultiScaleDeformableAttnFunction` requires `value` and `sampling_locations` to have consistent dimensions, but `spatial_shapes` only carries the (height, width) dimensions, which should cause problems.
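For what it's worth, here is a small sketch of the shape contract as I understand it, following the mmcv / Deformable-DETR convention rather than anything specific to this repo: `spatial_shapes` holds one `(H_l, W_l)` pair per level, the per-level areas must sum to the number of keys in `value`, and `sampling_locations` has one entry per level. The helper below is hypothetical, just to make the constraint concrete.

```python
# Hypothetical shape checks for the multi-scale deformable attention inputs
# (mmcv / Deformable-DETR convention; not the repo's code).
import torch

def check_msda_shapes(value, spatial_shapes, sampling_locations):
    bs, num_keys, num_heads, head_dim = value.shape
    # each level is described by exactly (H_l, W_l)
    assert spatial_shapes.dim() == 2 and spatial_shapes.size(1) == 2
    # the per-level areas must cover every key in `value`
    assert (spatial_shapes[:, 0] * spatial_shapes[:, 1]).sum().item() == num_keys
    # one sampling-location entry per level
    assert sampling_locations.size(3) == spatial_shapes.size(0)

H, W, Z = 50, 50, 16
value = torch.rand(1, H * W * Z, 4, 16)

# A raw [[50, 50, 16]] row breaks the (num_levels, 2) contract; folding Z into
# one of the two spatial axes keeps the contract and covers all 50*50*16 keys:
check_msda_shapes(value,
                  torch.tensor([[H, W * Z]]),
                  torch.rand(1, H * W * Z, 4, 1, 8, 2))
```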