Question about the Voxel Features

TRAILab / CaDDN

Categorical Depth Distribution Network for Monocular 3D Object Detection (CVPR 2021 Oral)

Apache License 2.0


Question about the Voxel Features #77

Closed taylover-pei closed 3 years ago

taylover-pei commented 3 years ago

Congratulations on your great work!

I have read your paper and have several questions that bother me:

In your work,

Firstly, the voxel grid is generated.
Secondly, the grid_to_lidar, lidar_to_cam, and cam_to_img transformations are used to find the correspondence between grid coordinates and image coordinates.
Thirdly, grid_sample is used to sample features from the frustum into the voxels.
Finally, the voxel features are collapsed to BEV features.
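The coordinate chain in the second step can be sketched in NumPy as follows. This is only an illustration of the idea, not the repository's code: the function names `make_grid_to_lidar` and `project_voxels`, the axis convention in `LIDAR_TO_CAM`, and the intrinsics in `CAM_TO_IMG` are all assumptions made up for the example. The sampling step (grid_sample) is left out; the sketch stops at the pixel location and depth where each voxel would sample.

```python
import numpy as np

# Hypothetical extrinsics: LiDAR (x forward, y left, z up) to
# camera (x right, y down, z forward), with zero translation.
LIDAR_TO_CAM = np.array([
    [0.0, -1.0, 0.0, 0.0],
    [0.0, 0.0, -1.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
])

# Hypothetical 3x4 projection: focal length 700 px, principal point (600, 180).
CAM_TO_IMG = np.array([
    [700.0, 0.0, 600.0, 0.0],
    [0.0, 700.0, 180.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
])


def make_grid_to_lidar(pc_min, voxel_size):
    """4x4 affine map from grid coordinates (cell units) to LiDAR metres."""
    transform = np.eye(4)
    transform[:3, :3] = np.diag(voxel_size)
    transform[:3, 3] = pc_min
    return transform


def project_voxels(grid_pts, grid_to_lidar, lidar_to_cam, cam_to_img):
    """Map (N, 3) grid coordinates to (N, 2) pixel coordinates and (N,) depths."""
    n = grid_pts.shape[0]
    hom = np.hstack([grid_pts, np.ones((n, 1))])   # (N, 4) homogeneous points
    cam = lidar_to_cam @ grid_to_lidar @ hom.T     # (4, N) camera-frame points
    img = cam_to_img @ cam                         # (3, N) unnormalised pixels
    depth = img[2]
    uv = (img[:2] / depth).T                       # perspective divide
    return uv, depth


# Voxel centres sit at index + 0.5 in grid coordinates (assumed convention).
grid_to_lidar = make_grid_to_lidar(pc_min=[9.5, -0.5, -0.5], voxel_size=[1.0, 1.0, 1.0])
centres = np.array([[0.5, 0.5, 0.5], [0.5, 1.5, 0.5]])
uv, depth = project_voxels(centres, grid_to_lidar, LIDAR_TO_CAM, CAM_TO_IMG)
```

The resulting (u, v, depth) triples are what a sampler would use to look up frustum features for each voxel.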

In my opinion, the BEV features already represent world coordinates. So my question is: why not use the BEV features directly to generate a 'BEV grid' that represents real-world (lidar) coordinates? Then the grid_to_lidar step could be omitted. Am I right?

I am also still confused about the 'Voxel Features'. I don't know what they are used for.

Thank you very much, looking forward to your reply!

codyreading commented 3 years ago

Hi and thanks for the interest!

So to answer your second question, voxel_features refers to the 3D voxel feature grid, which is referred to in the paper as V. We generate this as an intermediate 3D representation before collapsing it to a BEV feature grid, bev_features.

For both voxel_features and bev_features, the coordinates aren't real-world coordinates but rather what I refer to as grid coordinates, where each coordinate is a grid cell index. This means the coordinates range from 0 to R, where R is the number of cells along a specific axis. Real-world coordinates are in meters, over the range shown here. You need the grid_to_lidar transformation to convert from grid indices to real-world coordinates in meters.
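As a rough illustration of the difference between grid indices and metres, here is a minimal sketch. The point cloud range and voxel size below are illustrative assumptions (check the repository config for the actual values), the helper name `index_to_metres` is hypothetical, and the sketch assumes each cell is represented by its centre.

```python
import numpy as np

# Illustrative values (assumed): [x_min, y_min, z_min] and [x_max, y_max, z_max] in metres.
PC_MIN = np.array([2.0, -30.08, -3.0])
PC_MAX = np.array([46.8, 30.08, 1.0])
VOXEL_SIZE = np.array([0.16, 0.16, 0.16])

# Number of cells R along each axis; grid indices run over 0 .. R - 1.
GRID_SHAPE = np.round((PC_MAX - PC_MIN) / VOXEL_SIZE).astype(int)


def index_to_metres(index):
    """Convert an integer (x, y, z) cell index to the cell centre in LiDAR metres."""
    centre = np.asarray(index, dtype=float) + 0.5  # move to the middle of the cell
    return PC_MIN + centre * VOXEL_SIZE


first_cell = index_to_metres([0, 0, 0])
last_cell = index_to_metres(GRID_SHAPE - 1)
```

With these assumed values, the grid indices cover the whole metric range, but an index on its own says nothing about metres until the scale and offset are applied.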

taylover-pei commented 3 years ago

Thanks for your reply!

There exists another question:

Is it possible to directly transform the Frustum Features to BEV features without using the Voxel Features?

Thank you very much, looking forward to your reply!

codyreading commented 3 years ago

Yes, it would be possible if you use the same strategy as PointPillars. Essentially, you construct your voxel grid such that it only has one height layer (voxel_size_z = 4 for KITTI). This results in voxel_features being equivalent to bev_features, which can then be used directly in the 3D object detection stage. An issue I foresee with this is that you only have one sampling point for each "pillar" (the center of the pillar in CaDDN), whereas the pillar feature should include information from all points within the pillar. This is why we construct the voxel grid first and then collapse it to BEV, so that the BEV feature includes information from all points within the pillar.
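To make the contrast concrete, here is a small NumPy sketch of one way a collapse could work. It assumes the collapse stacks the height layers along the channel axis and mixes them with a 1x1 convolution (written here as a plain matrix product, with normalisation and activation omitted). The function name `collapse_to_bev`, the shapes, and the random weights are all hypothetical.

```python
import numpy as np


def collapse_to_bev(voxel_features, weight):
    """Collapse (C, Z, Y, X) voxel features to (C_out, Y, X) BEV features.

    Height layers are stacked into the channel axis, then mixed per BEV cell
    by a (C_out, C * Z) weight matrix, which acts like a 1x1 convolution.
    """
    c, z, y, x = voxel_features.shape
    stacked = voxel_features.reshape(c * z, y, x)
    return np.einsum("oc,cyx->oyx", weight, stacked)


rng = np.random.default_rng(0)
C, Z, Y, X = 4, 25, 6, 8

# Multi-layer grid: every BEV cell mixes Z samples taken along its pillar.
voxels = rng.normal(size=(C, Z, Y, X))
weight = rng.normal(size=(C, C * Z))
bev = collapse_to_bev(voxels, weight)

# Changing a single height layer changes the BEV output for that pillar.
perturbed = voxels.copy()
perturbed[:, 10, 0, 0] += 1.0
bev_perturbed = collapse_to_bev(perturbed, weight)

# Single-layer grid: the reshape is a no-op, so each pillar has one sample only.
pillars = rng.normal(size=(C, 1, Y, X))
bev_single = collapse_to_bev(pillars, np.eye(C))
```

In the single-layer case the "collapse" just passes through the one sample per pillar, which is the limitation described above.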

taylover-pei commented 3 years ago

Thank you very much. I have got it! It really helps me a lot.