Owen-Liuyuxuan / visualDet3D

Official Repo for Ground-aware Monocular 3D Object Detection for Autonomous Driving / YOLOStereo3D: A Step Back to 2D for Efficient Stereo 3D Detection
https://owen-liuyuxuan.github.io/papers_reading_sharing.github.io/3dDetection/GroundAwareConvultion/
Apache License 2.0

Did you try to directly regress the x, y, z of the 3D bounding center? #23

Closed yilinliu77 closed 3 years ago

yilinliu77 commented 3 years ago

Thanks for sharing the excellent work and the relevant code!

My question is:

After my own experiments and some slight changes, I found that the performance (3D IoU) of the model was pretty bad when I directly regressed the coordinates of the 3D bbox center. However, the 2D IoU was better and reached 90% within the first few epochs. I am really confused by this result. Have you tried directly regressing these parameters? Do you have any idea why this happens?

These results were generated with a ResNet-18 backbone on the Chen split.

Evaluation result of regressing alpha:
Car AP(Average Precision)@0.70, 0.70, 0.70:
bbox AP:88.38, 71.46, 64.18
bev  AP:8.49, 5.02, 4.39
3d   AP:2.49, 1.81, 1.51
aos  AP:88.33, 71.40, 64.11
Car AP(Average Precision)@0.70, 0.50, 0.50:
bbox AP:88.38, 71.46, 64.18
bev  AP:51.12, 37.06, 34.06
3d   AP:31.00, 20.17, 19.07
aos  AP:88.33, 71.40, 64.11

Evaluation result of regressing x, y, z and theta:
Car AP(Average Precision)@0.70, 0.70, 0.70:
bbox AP:96.50, 80.83, 63.48
bev  AP:0.75, 0.43, 0.38
3d   AP:0.22, 0.21, 0.08
Car AP(Average Precision)@0.70, 0.50, 0.50:
bbox AP:96.50, 80.83, 63.48
bev  AP:6.38, 4.43, 3.37
3d   AP:4.52, 3.50, 2.40

Also, I noticed that the results on the validation set are pretty good with the full model, but there seems to be a huge performance gap between the validation set and the test set on the KITTI server. Is it right that overfitting is currently the main problem?

Owen-Liuyuxuan commented 3 years ago

Could you elaborate on how you "directly regress the coordinate of the 3D bbox center"? Do you mean the (x, y, z) coordinates of the object in the 3D world? That is basically impossible to regress directly. At the very least you need to regress (cx, cy, z), where (cx, cy) is the 3D object center projected onto the image (regressed based on the prior of the anchors' positions).
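For reference, a minimal sketch of the decoding this implies, assuming a plain 3x3 pinhole intrinsic matrix K (KITTI's P2 additionally carries a small translation term that a full implementation has to handle); this is illustrative, not the repo's actual code:

```python
import numpy as np

def backproject_center(u, v, z, K):
    """Recover the 3D center in camera coordinates from the projected
    center (u, v) in pixels and the predicted depth z in meters."""
    fx, fy = K[0, 0], K[1, 1]
    u0, v0 = K[0, 2], K[1, 2]
    x = (u - u0) * z / fx
    y = (v - v0) * z / fy
    return np.array([x, y, z], dtype=np.float32)

# Example with KITTI-like intrinsics: a point 20 m ahead, right of the principal point.
K = np.array([[721.5, 0.0, 609.6],
              [0.0, 721.5, 172.9],
              [0.0, 0.0, 1.0]])
print(backproject_center(650.0, 180.0, 20.0, K))
```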

Some code snippets would help.

Moreover, 2D bounding boxes are easier to train.

yilinliu77 commented 3 years ago

However, you directly regress the z attribute of an anchor box, which is a 3D attribute that is lost in the 2D image. Then why can't we add two more attributes to the prediction? The estimated attributes would become delta_x2d, delta_y2d, delta_w, delta_h, x_3d, y_3d, z_3d, w, h, l, theta, from which we can also recover a 3D bbox.
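For concreteness, a minimal sketch (a hypothetical helper, not taken from this repo) of how a 3D box could be recovered from such a direct (x_3d, y_3d, z_3d, w, h, l, theta) prediction, following the KITTI convention where (x, y, z) is the bottom center of the box and theta is the rotation around the camera's y-axis:

```python
import numpy as np

def box3d_corners(x, y, z, w, h, l, theta):
    """Return the eight corners of a 3D box, shape (8, 3), in camera coordinates."""
    # Axis-aligned corners around the bottom center (x: length, y: height up is -y, z: width).
    x_c = np.array([ l/2,  l/2, -l/2, -l/2,  l/2,  l/2, -l/2, -l/2])
    y_c = np.array([ 0.0,  0.0,  0.0,  0.0,   -h,   -h,   -h,   -h])
    z_c = np.array([ w/2, -w/2, -w/2,  w/2,  w/2, -w/2, -w/2,  w/2])
    # Rotate around the y-axis by theta, then translate to (x, y, z).
    R = np.array([[ np.cos(theta), 0.0, np.sin(theta)],
                  [ 0.0,           1.0, 0.0          ],
                  [-np.sin(theta), 0.0, np.cos(theta)]])
    corners = R @ np.stack([x_c, y_c, z_c])          # (3, 8)
    return (corners + np.array([[x], [y], [z]])).T   # (8, 3)
```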

Owen-Liuyuxuan commented 3 years ago

Because z_3d is related to the apparent size of an object, the network is able to infer it from local semantic information. I also inject strong priors about z_3d into each anchor (the mean and variance are collected on the training set and vary across the 36 anchors), so that z_3d becomes easier to learn.
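A minimal sketch of that idea, with illustrative names and made-up statistics rather than the repo's actual variables: the depth target is normalized per anchor, so the head only predicts a small residual instead of an absolute depth.

```python
import torch

def encode_z(z_gt, anchor_z_mean, anchor_z_std):
    """Normalize ground-truth depths with per-anchor statistics; all tensors share shape (num_anchors,)."""
    return (z_gt - anchor_z_mean) / anchor_z_std

def decode_z(z_pred, anchor_z_mean, anchor_z_std):
    """Undo the normalization at inference time."""
    return z_pred * anchor_z_std + anchor_z_mean

# Example: three anchors whose matched objects typically sit at roughly 10 / 25 / 45 m.
anchor_z_mean = torch.tensor([10.0, 25.0, 45.0])
anchor_z_std  = torch.tensor([ 3.0,  6.0, 10.0])
z_gt = torch.tensor([12.5, 22.0, 50.0])
print(encode_z(z_gt, anchor_z_mean, anchor_z_std))  # small, roughly unit-scale targets
```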

But as the object shifts in the x-direction, its appearance does not change, so the network can only infer the shift from the object's "location" in the image, which is difficult for a CNN to learn.

yilinliu77 commented 3 years ago

Ok, I get it! Thanks for your patience!

Also, do you have any idea about the overfitting problem? The model works quite well on the validation set (Chen split), but the performance drops on KITTI's test server. So it seems that overfitting is the main cause of the performance gap.

However, the 2D IoU holds up on both sets even though the model has never seen the test data. In my opinion, the prediction of the 3D attributes causes severe overfitting to the training data. Maybe that is a potential direction for improvement in the future. Do you agree?

Owen-Liuyuxuan commented 3 years ago

We generally agree that the images in the test set of the KITTI3D benchmark are more "different" from the training images, and they are probably more difficult.

I agree that the 2D predictions actually generalize quite well while the 3D predictions do not. More research/analysis would be really helpful :)

yilinliu77 commented 3 years ago

Thanks for your quick reply! It is really helpful!