DYZhang09 opened 10 months ago
Hi, thanks for your fantastic work! I have a question about the implementation of the voxel self-attention.
The paper writes: "These sampling points share the same height $z_k$, but with different learnable offsets for $(x_i^m, y_j^m)$. This encourages the voxel queries to interact in the BEV plane." To my understanding, this operation is equivalent to splitting the voxel features into BEV slices along the height dimension and performing deformable attention separately for each BEV slice. Is that right? I also wonder why 3D deformable attention is not used directly.
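To make sure I describe my reading correctly, here is a minimal sketch of it (not the repo's actual code): each height slice is treated as a single-level 2D feature map and attended within its own BEV plane, so the offsets only move the sampling points in $(x, y)$. I use mmcv's pure-PyTorch fallback `multi_scale_deformable_attn_pytorch` in place of the CUDA op, and random tensors stand in for the learned offsets and weights.

```python
# Sketch of "split into BEV slices, 2D deformable attention per slice" (not the repo's code).
import torch
from mmcv.ops.multi_scale_deform_attn import multi_scale_deformable_attn_pytorch

B, H, W, Z, C = 1, 50, 50, 16, 64
num_heads, num_points = 4, 8

voxel_feat = torch.rand(B, H, W, Z, C)          # dense voxel features (random stand-in)

slice_outputs = []
for z in range(Z):                               # one BEV slice per height z_k
    value_z = voxel_feat[:, :, :, z].reshape(B, H * W, num_heads, C // num_heads)
    spatial_shapes = torch.tensor([[H, W]])      # a single 2D level per slice
    # the (x, y) offsets would normally be predicted from the query by a linear
    # layer; random values in [0, 1] here just keep the sketch runnable
    sampling_locations = torch.rand(B, H * W, num_heads, 1, num_points, 2)
    attention_weights = torch.rand(B, H * W, num_heads, 1, num_points).softmax(-1)
    slice_outputs.append(multi_scale_deformable_attn_pytorch(
        value_z, spatial_shapes, sampling_locations, attention_weights))

out = torch.stack(slice_outputs, dim=2)          # (B, H*W, Z, C): per-slice BEV attention output
```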
By the way, I think the implementation of the voxel self-attention may also have some problems.

https://github.com/Robertwyq/PanoOcc/blob/898b2a457af45bcef8bdbbe8cff10e3eef26485f/projects/mmdet3d_plugin/bevformer/modules/occ_temporal_attention.py#L244-L254

The `spatial_shapes` here is `[50, 50, 16]`, which is a 3-dimensional vector. However, the implementation of `MultiScaleDeformableAttnFunction_fp32` seems to accept only 2-dimensional spatial shapes (lines 276 & 277), which means it can only attend to the first 50x50=2500 queries.

https://github.com/Robertwyq/PanoOcc/blob/898b2a457af45bcef8bdbbe8cff10e3eef26485f/ops/src/cuda/ms_deform_im2col_cuda.cuh#L272-L294

Is there something wrong with this implementation? Or maybe there are some details that I didn't notice.

---

I have the same question. The input to `MultiScaleDeformableAttnFunction` requires `value` and `sampling_locations` to have consistent dimensions, but `spatial_shapes` only carries the (height, width) dimensions, which should cause problems.
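For what it's worth, here is a small sketch of the shape contract as I understand it, following the mmcv / Deformable-DETR convention rather than anything specific to this repo: `spatial_shapes` holds one `(H_l, W_l)` pair per level, the per-level areas must sum to the number of keys in `value`, and `sampling_locations` has one entry per level. The helper below is hypothetical, just to make the constraint concrete.

```python
# Hypothetical shape checks for the multi-scale deformable attention inputs
# (mmcv / Deformable-DETR convention; not the repo's code).
import torch

def check_msda_shapes(value, spatial_shapes, sampling_locations):
    bs, num_keys, num_heads, head_dim = value.shape
    # each level is described by exactly (H_l, W_l)
    assert spatial_shapes.dim() == 2 and spatial_shapes.size(1) == 2
    # the per-level areas must cover every key in `value`
    assert (spatial_shapes[:, 0] * spatial_shapes[:, 1]).sum().item() == num_keys
    # one sampling-location entry per level
    assert sampling_locations.size(3) == spatial_shapes.size(0)

H, W, Z = 50, 50, 16
value = torch.rand(1, H * W * Z, 4, 16)

# A raw [[50, 50, 16]] row breaks the (num_levels, 2) contract; folding Z into
# one of the two spatial axes keeps the contract and covers all 50*50*16 keys:
check_msda_shapes(value,
                  torch.tensor([[H, W * Z]]),
                  torch.rand(1, H * W * Z, 4, 1, 8, 2))
```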