taylover-pei closed this issue 3 years ago
Hi and thanks for the interest!
To answer your second question: `voxel_features` refers to the 3D voxel feature grid, which is referred to in the paper as V. We generate it as an intermediate 3D representation before collapsing it to the BEV feature grid `bev_features`.
For both `voxel_features` and `bev_features`, the coordinates are not real-world coordinates but what I refer to as `grid` coordinates, where each coordinate is a grid cell index. That is, coordinates along an axis range from 0 to R, where R is the maximum number of cells along that axis. Real-world coordinates are measured in metres, over the range shown here. You need the `grid_to_lidar` transformation to convert from grid indices to real-world coordinates in metres.
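As a rough illustration of the idea (not the repository's actual code), a grid-to-lidar mapping can be written as an affine transform that scales cell indices by the voxel size and shifts them by the lower bound of the point cloud range. The range, voxel size, and the helper names `make_grid_to_lidar` and `grid_to_lidar_points` below are assumptions made up for this sketch; the real transformation may use a different axis order or offset convention.

```python
import numpy as np

# Hypothetical values for illustration only; read the real ones from your config.
pc_range_min = np.array([2.0, -30.0, -3.0])  # (x, y, z) lower bound in metres
voxel_size = np.array([0.5, 0.5, 0.5])       # cell size in metres per axis


def make_grid_to_lidar(pc_min, voxel_size):
    """Build a 4x4 affine matrix mapping grid indices to lidar coordinates (metres)."""
    T = np.eye(4)
    T[:3, :3] = np.diag(voxel_size)  # scale cell indices to metres
    T[:3, 3] = pc_min                # shift to the lower bound of the range
    return T


def grid_to_lidar_points(T, idx):
    """Apply T to an (N, 3) array of grid coordinates."""
    idx = np.asarray(idx, dtype=float)
    homo = np.concatenate([idx, np.ones((idx.shape[0], 1))], axis=1)
    return (homo @ T.T)[:, :3]


T = make_grid_to_lidar(pc_range_min, voxel_size)
# Adding 0.5 to an index selects the cell centre rather than its corner.
centres = grid_to_lidar_points(T, np.array([[0, 0, 0], [10, 20, 4]]) + 0.5)
print(centres)
```

The grid index itself carries no units; only after this kind of transform do the values become metres in the lidar frame.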
Thanks for your reply!
I have another question:
Is it possible to directly transform the `Frustum Features` to `BEV features` without using the `Voxel Features`?
Thank you very much, looking forward to your reply!
Yes, it would be possible if you use the same strategy as PointPillars. Essentially, you construct your voxel grid so that it has only one height layer (voxel_size_z = 4 for KITTI). This makes `voxel_features` equivalent to `bev_features`, so it can be used directly in the 3D object detection stage. An issue I foresee with this is that you only have one sampling point for each "pillar" (the center of the pillar in CaDDN), whereas the pillar feature should include information from all points within the pillar. This is why we construct the voxel grid first and then collapse it to BEV, so that the BEV feature includes information from all points within the pillar.
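To make the difference concrete, here is a minimal sketch of one way a voxel grid can be collapsed to BEV. The reply above does not spell out the collapse operation, so the sketch assumes the height axis is folded into the channel axis and then mixed by a per-cell linear layer (equivalent to a 1x1 convolution). The sizes, the random weights, and the function name `collapse_to_bev` are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: C channels, Z height layers, Y x X ground-plane cells.
C, Z, Y, X = 8, 5, 6, 7
voxel_features = rng.standard_normal((C, Z, Y, X))


def collapse_to_bev(voxels, weight):
    """Fold the height axis into channels, then mix back down to C_out channels.

    voxels: (C, Z, Y, X); weight: (C_out, C * Z), acting like a 1x1 convolution.
    """
    c, z, y, x = voxels.shape
    stacked = voxels.reshape(c * z, y, x)             # every height layer is kept
    return np.einsum("oc,cyx->oyx", weight, stacked)  # per-cell linear mix


weight = rng.standard_normal((C, C * Z))  # stand-in for learned weights
bev_features = collapse_to_bev(voxel_features, weight)
print(bev_features.shape)

# With a single height layer, the voxel grid already has the BEV layout,
# but each pillar is then described by only one sample along the height.
single_layer = rng.standard_normal((C, 1, Y, X))
bev_direct = single_layer[:, 0]
print(bev_direct.shape)
```

In the multi-layer case each BEV cell is computed from all Z samples in its pillar, which is the property the single-layer shortcut gives up.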
Thank you very much. I have got it! It really helps me a lot.
Congratulations on your great work!
I have read your paper and have several questions that bother me:
In your work:
- The `grid_to_lidar`, `lidar_to_cam`, and `cam_to_img` transformations are used to find the correspondence between the grid coordinates and the image coordinates.
- `grid_sample` is used to sample features from `Frustum` to `Voxel`.
- `Voxel` is collapsed to `BEV features`.
1. In my opinion, the `BEV features` represent the world coordinates. So why not just use the `BEV features` to generate a 'BEV grid' that represents the real-world (lidar) coordinates? Then the `grid_to_lidar` step could be omitted. Am I right?
2. I am still confused about the 'Voxel Features'. I don't know what they are used for.
Thank you very much, looking forward to your reply!