dvlab-research / DSGN

DSGN: Deep Stereo Geometry Network for 3D Object Detection (CVPR 2020)
MIT License

A question about demo video BEV point cloud #1

Closed gujiaqivadin closed 4 years ago

gujiaqivadin commented 4 years ago

Hello, chenyilun95! Thanks for your great work on stereo 3D object detection. After watching your demo, I am confused about the bottom-right BEV point cloud: is it the original Velodyne point cloud, or a pseudo-LiDAR point cloud generated from the depth estimation results of your network?

chenyilun95 commented 4 years ago

We show the BEV detection results on the ground-truth point cloud. I will add a note about it. Thanks for the reminder!
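
For reference, this is a minimal sketch of how such a BEV image can be rendered from a ground-truth KITTI Velodyne scan. The helper name and the ranges/resolution here are illustrative, not taken from this repo:

```python
import numpy as np

def velodyne_to_bev(bin_path, x_range=(0.0, 70.4), y_range=(-40.0, 40.0), res=0.1):
    """Rasterize a KITTI Velodyne scan into a top-down (BEV) occupancy image.

    Illustrative helper, not part of the DSGN codebase.
    """
    pts = np.fromfile(bin_path, dtype=np.float32).reshape(-1, 4)  # x, y, z, reflectance
    mask = ((pts[:, 0] >= x_range[0]) & (pts[:, 0] < x_range[1]) &
            (pts[:, 1] >= y_range[0]) & (pts[:, 1] < y_range[1]))
    pts = pts[mask]
    # Map metric coordinates to pixel indices.
    xs = ((pts[:, 0] - x_range[0]) / res).astype(np.int32)
    ys = ((pts[:, 1] - y_range[0]) / res).astype(np.int32)
    h = int((x_range[1] - x_range[0]) / res)
    w = int((y_range[1] - y_range[0]) / res)
    bev = np.zeros((h, w), dtype=np.uint8)
    bev[xs, ys] = 255  # mark cells hit by at least one LiDAR point
    return bev
```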

gujiaqivadin commented 4 years ago

Also, after watching your demo video, I noticed that your depth map is a grayscale image, but many areas such as the sky and nearby cars appear white (which suggests there is no value at those pixels). The output of a depth estimation network is an all-pixel-dense map (every pixel has a depth value > 0). How do you get zero values in the sky and other areas? Thanks!

chenyilun95 commented 4 years ago

As stated in the new version of the arXiv paper, the noise observed in the predicted depth map is mainly caused by implementation details. (1) Noise in the near and far parts: the 3D volumes are constructed within [2, 40.4] meters. (2) Noise and the large white zone in the higher region (> 3 m): the stereo branch is trained with a sparse GT depth map (64 scan lines covering roughly [-1, 3] meters along the gravitational z-axis, captured by a 64-beam LiDAR), which is quite different from a fully dense depth map.
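
To make the sparsity concrete, here is a sketch of how a KITTI-style sparse GT depth map is built by projecting the 64-beam LiDAR points into the left image. Pixels that no ray hits stay 0, which is why the high region gets no supervision. The 4x4 `velo_to_rect` matrix (R0_rect @ Tr_velo_to_cam, padded) and the 3x4 `P2` come from the calibration files; this is not the exact DSGN preprocessing:

```python
import numpy as np

def sparse_gt_depth(velo_pts, P2, velo_to_rect, img_h, img_w):
    """Project LiDAR points into the left image to build a sparse GT depth map."""
    n = velo_pts.shape[0]
    pts = np.hstack([velo_pts[:, :3], np.ones((n, 1))])  # (n, 4) homogeneous
    cam = (velo_to_rect @ pts.T).T                       # rectified camera coordinates
    cam = cam[cam[:, 2] > 0]                             # keep points in front of the camera
    uvw = (P2 @ cam.T).T                                 # project onto the image plane
    u = (uvw[:, 0] / uvw[:, 2]).astype(np.int32)
    v = (uvw[:, 1] / uvw[:, 2]).astype(np.int32)
    depth = cam[:, 2]
    valid = (u >= 0) & (u < img_w) & (v >= 0) & (v < img_h)
    gt = np.zeros((img_h, img_w), dtype=np.float32)      # 0 = no supervision
    gt[v[valid], u[valid]] = depth[valid]                # only ~64 scan lines get values
    return gt
```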

gujiaqivadin commented 4 years ago

Yes. I now see in your paper that the ground-truth depth maps in other (pseudo-LiDAR) papers tend to be dense, because they are generated from depth estimation. A fully dense depth map means every pixel has a depth value (maybe 0.xx, 80.xx, or anything in between), and the network output is always fully dense when it is supervised by a GT depth map. I want to know whether you delete some points in your predicted depth map or point cloud to make certain areas white, because I think it is hard for the network to output exactly 0 at some pixels.

chenyilun95 commented 4 years ago

As you said, the output depth map is dense. Only the LiDAR points inside the predefined range are used for training. Pixels outside the sparse 64 LiDAR lines are ignored in the loss function.
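
The masking idea can be summarized in a few lines. This is a sketch of a masked depth loss assuming a smooth L1 criterion, not DSGN's exact loss; the point is only that unsupervised pixels contribute nothing:

```python
import torch
import torch.nn.functional as F

def masked_depth_loss(pred, gt, min_depth=2.0, max_depth=40.4):
    """Depth loss over supervised pixels only.

    `gt` is the sparse GT depth map (0 where no LiDAR point projects),
    so pixels outside the LiDAR lines or the predefined depth range
    are excluded from the loss.
    """
    valid = (gt > min_depth) & (gt < max_depth)
    if valid.sum() == 0:
        return pred.new_zeros(())
    return F.smooth_l1_loss(pred[valid], gt[valid])
```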

gujiaqivadin commented 4 years ago

Yes. In depth prediction tasks we only supervise the final depth map at valid pixels (those that have a depth value from the LiDAR points). But no matter how the network is supervised, the output depth image is still fully dense (no pixel has a 0 depth value) because of the CNNs in the network. My question is whether the white areas are 0 values, and how you obtain them in a fully dense output depth map. Thanks a lot!

chenyilun95 commented 4 years ago

The white region is the farthest depth. The region above 3 m never gets trained on this dataset, so the network can output anything there without increasing the loss. As for why the network predicts the farthest depth there (with some noise) instead of other values, I have not figured out a good reason. The clear 3-meter-high boundary might be related to the subsequent detection network, since only the [-1, 3] m region is converted into the 3D geometric volume, which affects the training of the stereo network. Overall, the output there does not affect the detection result, and the background depth is meaningless in real applications. Thanks!
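
So the white is a visualization effect, not a 0 value: when the depth map is normalized to grayscale for display, pixels predicted at (or beyond) the farthest depth plane saturate to 255 and render as white. A minimal illustrative sketch, assuming a 40.4 m far plane:

```python
import numpy as np

def depth_to_gray(depth, max_depth=40.4):
    """Normalize a predicted depth map to an 8-bit grayscale image.

    Untrained regions (e.g. the sky) that the network pushes to the
    farthest depth clip to 255 and appear white in the demo video.
    """
    gray = np.clip(depth / max_depth, 0.0, 1.0) * 255.0
    return gray.astype(np.uint8)
```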

chenyilun95 commented 4 years ago

Closed the question.