Robertwyq / PanoOcc

[CVPR 2024] PanoOcc: Unified Occupancy Representation for Camera-based 3D Panoptic Segmentation
GNU General Public License v3.0
153 stars 8 forks

Question about the implementation of voxel self-attention #9

Open DYZhang09 opened 10 months ago

DYZhang09 commented 10 months ago

Hi, thanks for your fantastic work! I have a question about the implementation of the voxel self-attention.

The paper writes: "These sampling points share the same height $z_k$, but with different learnable offsets for $(x_i^m, y_j^m)$. This encourages the voxel queries to interact in the BEV plane." To my understanding, this operation is equivalent to splitting the voxel features into BEV slices along the height dimension and performing deformable attention separately for each slice. Is this right? I also wonder why 3D deformable attention is not used directly.
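A minimal sketch of my reading (shapes and the `(H, W, Z)` flattening order are my assumptions, and `deformable_attn_2d` is just a placeholder, not the repo's actual op):

```python
import torch

# Hypothetical shapes for illustration, matching the 50x50x16 grid
# discussed below: B batch, C channels, H x W BEV plane, Z heights.
B, C, H, W, Z = 1, 256, 50, 50, 16

# Voxel queries flattened to (B, H*W*Z, C); the (H, W, Z) ordering
# here is an assumption about how the repo flattens the grid.
voxel_queries = torch.randn(B, H * W * Z, C)

def deformable_attn_2d(bev_slice):
    """Placeholder for 2D deformable attention over one H x W BEV
    plane; the real op would also take sampling offsets, reference
    points, attention weights, etc."""
    return bev_slice

# My reading of the paper: split along height into Z BEV slices and
# let each slice's queries interact only within its own BEV plane.
slices = voxel_queries.view(B, H * W, Z, C).unbind(dim=2)
out = torch.stack([deformable_attn_2d(s) for s in slices], dim=2)
out = out.reshape(B, H * W * Z, C)
```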

By the way, I think the implementation of voxel self-attention may also have a problem.

https://github.com/Robertwyq/PanoOcc/blob/898b2a457af45bcef8bdbbe8cff10e3eef26485f/projects/mmdet3d_plugin/bevformer/modules/occ_temporal_attention.py#L244-L254

The spatial_shapes here is [50, 50, 16], a 3-element vector. However, according to the implementation of MultiScaleDeformableAttnFunction_fp32, the kernel only accepts 2-dimensional spatial shapes of (h, w) per level (lines 276 & 277), which means it can only attend to the first 50x50 = 2500 queries.

https://github.com/Robertwyq/PanoOcc/blob/898b2a457af45bcef8bdbbe8cff10e3eef26485f/ops/src/cuda/ms_deform_im2col_cuda.cuh#L272-L294
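To make the mismatch concrete, here is a small check against mmcv's PyTorch reference `multi_scale_deformable_attn_pytorch`, which splits value into one chunk of H_l * W_l keys per level (the tensor shapes are my assumptions for illustration):

```python
import torch

# (B, num_keys, num_heads, head_dim): 50*50*16 = 40000 voxel keys.
value = torch.randn(1, 50 * 50 * 16, 8, 32)
spatial_shapes = torch.tensor([[50, 50, 16]])  # what the repo passes

# mmcv's reference implementation does:
#   value.split([H_ * W_ for H_, W_ in spatial_shapes], dim=1)
# A 3-entry row cannot be unpacked into (H_, W_), so this raises;
# and if only (50, 50) were read, as I understand the CUDA kernel
# does, only 2500 of the 40000 keys would ever be attended.
try:
    chunks = value.split([H_ * W_ for H_, W_ in spatial_shapes], dim=1)
except ValueError as e:
    print(e)  # too many values to unpack (expected 2)
```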

Is there something wrong with this implementation? Or maybe there is some detail that I didn't notice.

code88881234 commented 3 months ago

I have the same question. The input to MultiScaleDeformableAttnFunction requires value and sampling_locations to have consistent dimensions, but spatial_shapes can only carry the (h, w) dimensions per level. This seems like it would cause problems.
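One way I could imagine making the shapes consistent with the standard interface (purely my assumption, not necessarily the authors' intended fix) is to expose the Z height slices as Z separate "levels", so every spatial-shape row is 2-D as the op expects:

```python
import torch

B, C, H, W, Z, heads = 1, 256, 50, 50, 16, 8

# Assumes value is laid out with the Z slices contiguous along the
# key dimension, i.e. slice 0's H*W keys first, then slice 1's, etc.
value = torch.randn(B, H * W * Z, heads, C // heads)

# One pseudo-level per height slice: (Z, 2) instead of a 3-vector.
spatial_shapes = torch.tensor([[H, W]] * Z)
level_start_index = torch.cat(
    (spatial_shapes.new_zeros(1),
     spatial_shapes.prod(1).cumsum(0)[:-1]))

# sampling_locations and attention_weights would then carry a level
# dimension of size Z, which also matches the per-BEV-slice sampling
# described in the paper.
print(spatial_shapes.shape, level_start_index)
```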