There may be some errors in class

xudh1991 commented 10 months ago

(image: rect_to_img)

Using your project, I trained the model on my own data and found that some targets were never used in training. I traced the problem to the rect_to_img function shown in the image above. (Screenshots of the values of pts_rect, P2, and the projected results were attached here.) My image size is 2560*1150, but the projected 3D coordinates fall outside the image boundary, so these targets are excluded from training. According to the coordinate transformation rules (image: 微信截图_20231226093442), I think this line should be

pts_img = (pts_2d_hom[:, 0:2].T / pts_2d_hom[:, 2]).T

rather than

pts_img = (pts_2d_hom[:, 0:2].T / pts_rect_hom[:, 2]).T

However, after making this change, I still have not obtained the results I expected. Is my reasoning correct? If it is wrong, please point out the error. If it is correct, are there other related places that also need to be changed, and why might I still not be getting the desired result? I would sincerely appreciate any help. Thank you very much.

abhi1kumar commented 10 months ago

Hi @xudh1991, thank you for your interest in DEVIANT.

> I used my own training data to train the model and found that some targets could never be trained. The projected 3D coordinates exceeded the boundary, and so the target is not used in training.

You are correct. Projected 3D coordinates that fall outside the image mean that the camera does not see that particular 3D box. Detecting such 3D boxes is therefore impossible with any image-based detector, so those targets are not used in training.

Note that the datasets obtain and annotate 3D boxes using LiDAR or stereo images, which have a wider field of view (FoV) than a monocular camera. As a result, some 3D boxes usually lie outside the camera's FoV. The following figure (Courtesy: webgl) illustrates this point in the bird's-eye view. The LiDAR sees the top 3D box (rectangle), and therefore the 3D box appears in the annotated labels. The camera cannot see this 3D box, and the DEVIANT codebase excludes such 3D boxes from training.

(figure: side_view_frustum)
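The visibility check described above can be sketched in a few lines of NumPy. This is only an illustration of the idea, not the DEVIANT implementation: the function name `in_image_mask`, its arguments, and the sample points are hypothetical, and the 2560*1150 image size is taken from the question.

```python
import numpy as np

def in_image_mask(pts_img, depth, img_w, img_h):
    """Return a boolean mask of projected points that land inside the image.

    pts_img : (N, 2) array of pixel coordinates (u, v)
    depth   : (N,) array of depths in the camera frame
    """
    u, v = pts_img[:, 0], pts_img[:, 1]
    # A point is visible only if it is in front of the camera
    # and its projection lies inside the image frame.
    return (depth > 0) & (u >= 0) & (u < img_w) & (v >= 0) & (v < img_h)

# Hypothetical projected points for a 2560 x 1150 image.
pts_img = np.array([[1280.0, 575.0],   # inside the image
                    [3000.0, 575.0],   # to the right of the image
                    [-5.0, 100.0]])    # to the left of the image
depth = np.array([20.0, 20.0, 20.0])
mask = in_image_mask(pts_img, depth, img_w=2560, img_h=1150)
```

Only the first point passes the check; the other two would be treated as outside the camera's FoV.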

> I think this position should be
> pts_img = (pts_2d_hom[:, 0:2].T / pts_2d_hom[:, 2]).T

Thank you for noticing this bug. This code comes from the GUPNet codebase, on which the DEVIANT codebase is based.
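For reference, a standalone sketch of the projection with the proposed divisor might look like the following. It is not a copy of the GUPNet or DEVIANT code: the free-function signature `rect_to_img(pts_rect, P2)` and the homogeneous-coordinate construction are assumptions made for illustration, and only the division by `pts_2d_hom[:, 2]` reflects the fix discussed in this issue.

```python
import numpy as np

def rect_to_img(pts_rect, P2):
    """Project points from the rectified camera frame to pixel coordinates.

    pts_rect : (N, 3) array of 3D points in the rectified camera frame
    P2       : (3, 4) camera projection matrix
    """
    # Append a column of ones to get homogeneous 3D coordinates, shape (N, 4).
    ones = np.ones((pts_rect.shape[0], 1))
    pts_rect_hom = np.hstack([pts_rect, ones])
    # Homogeneous image coordinates, shape (N, 3).
    pts_2d_hom = pts_rect_hom @ P2.T
    # Normalize by the homogeneous image coordinate (the proposed fix),
    # not by the z-coordinate of the input point.
    pts_img = pts_2d_hom[:, 0:2] / pts_2d_hom[:, 2:3]
    return pts_img, pts_2d_hom[:, 2]

# P2 values copied from the nuScenes sample in this thread.
P2 = np.array([[1266.417203047, 0.0, 816.2670197448, 0.0],
               [0.0, 1266.417203047, 491.5070657929, 0.0],
               [0.0, 0.0, 1.0, 0.0]])
# A point on the optical axis projects to the principal point.
pts_img, hom_z = rect_to_img(np.array([[0.0, 0.0, 10.0]]), P2)
```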

> After changing this position, I still haven't achieved ideal results. May I ask if my thinking is correct?? If there is an error, please point it out. If it is correct, are there any other relevant positions that need to be changed? Why did I not achieve the desired result

Your thinking is absolutely spot on. The bug does NOT affect the KITTI, Waymo and nuScenes datasets because the P2 calibration matrices of all these datasets have the second row (row index starts from zero) equal to [0, 0, 1, 0]. (In the KITTI sample below, the last entry of this row is about 2.7e-03 rather than exactly zero, which is negligible next to typical object depths.) This means the z-coordinate of pts_rect_hom equals the z-coordinate of pts_2d_hom, and therefore dividing by either of them leads to the same result. As an example, consider sample calib matrices from the validation set of these three datasets:

KITTI

P0: 7.215377000000e+02 0.000000000000e+00 6.095593000000e+02 0.000000000000e+00 0.000000000000e+00 7.215377000000e+02 1.728540000000e+02 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 1.000000000000e+00 0.000000000000e+00
P1: 7.215377000000e+02 0.000000000000e+00 6.095593000000e+02 -3.875744000000e+02 0.000000000000e+00 7.215377000000e+02 1.728540000000e+02 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 1.000000000000e+00 0.000000000000e+00
P2: 7.215377000000e+02 0.000000000000e+00 6.095593000000e+02 4.485728000000e+01 0.000000000000e+00 7.215377000000e+02 1.728540000000e+02 2.163791000000e-01 0.000000000000e+00 0.000000000000e+00 1.000000000000e+00 2.745884000000e-03
P3: 7.215377000000e+02 0.000000000000e+00 6.095593000000e+02 -3.395242000000e+02 0.000000000000e+00 7.215377000000e+02 1.728540000000e+02 2.199936000000e+00 0.000000000000e+00 0.000000000000e+00 1.000000000000e+00 2.729905000000e-03
R0_rect: 9.999239000000e-01 9.837760000000e-03 -7.445048000000e-03 -9.869795000000e-03 9.999421000000e-01 -4.278459000000e-03 7.402527000000e-03 4.351614000000e-03 9.999631000000e-01
Tr_velo_to_cam: 7.533745000000e-03 -9.999714000000e-01 -6.166020000000e-04 -4.069766000000e-03 1.480249000000e-02 7.280733000000e-04 -9.998902000000e-01 -7.631618000000e-02 9.998621000000e-01 7.523790000000e-03 1.480755000000e-02 -2.717806000000e-01
Tr_imu_to_velo: 9.999976000000e-01 7.553071000000e-04 -2.035826000000e-03 -8.086759000000e-01 -7.854027000000e-04 9.998898000000e-01 -1.482298000000e-02 3.195559000000e-01 2.024406000000e-03 1.482454000000e-02 9.998881000000e-01 -7.997231000000e-01

Waymo

P0: 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0
P1: 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0
P2: 2087.4761030684967 0.0 942.9076736705708 0.0 0.0 2087.4761030684967 651.2327433418823 0.0 0.0 0.0 1.0 0.0
P3: 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0
R0_rect: 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0
Tr_velo_to_cam: 0.00894667425022156 -0.9999292084527203 0.007844431335463198 -0.053659159542238585 0.005803043371535652 -0.007792694782273371 -0.9999527981838234 2.1063602105556676 0.999943139237171 0.00899177352621294 0.005732913863368047 -1.555960999390042

nuScenes

P0: 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00
P1: 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00
P2: 1.266417203047e+03 0.000000000000e+00 8.162670197448e+02 0.000000000000e+00 0.000000000000e+00 1.266417203047e+03 4.915070657929e+02 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 1.000000000000e+00 0.000000000000e+00
P3: 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00
R0_rect: 1.000000000000e+00 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 1.000000000000e+00 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 1.000000000000e+00
Tr_velo_to_cam: 3.487968666398e-03 -9.999708566009e-01 6.791172464157e-03 1.190663537703e-02 1.859214393651e-02 -6.725192192724e-03 -9.998045328832e-01 -3.249862680961e-01 9.998210671207e-01 3.613549339171e-03 1.856814483859e-02 -7.590020378669e-01
Tr_imu_to_velo: 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00
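The equivalence can be checked numerically. The sketch below uses the Waymo P2 values listed above; the helper `project_both_ways`, the test points, and the rescaled matrix `P2_scaled` are hypothetical and only serve to show a case where the two divisors disagree. Scaling a projection matrix by a non-zero constant describes the same camera, so the correct projection should not change.

```python
import numpy as np

def project_both_ways(pts_rect, P2):
    """Return pixel coordinates using the original and the proposed divisor."""
    pts_rect_hom = np.hstack([pts_rect, np.ones((len(pts_rect), 1))])
    pts_2d_hom = pts_rect_hom @ P2.T
    original = pts_2d_hom[:, 0:2] / pts_rect_hom[:, 2:3]  # divide by input z
    proposed = pts_2d_hom[:, 0:2] / pts_2d_hom[:, 2:3]    # divide by projected z
    return original, proposed

# Waymo P2 from the sample above; row index 2 is [0, 0, 1, 0].
P2_waymo = np.array([[2087.4761030684967, 0.0, 942.9076736705708, 0.0],
                     [0.0, 2087.4761030684967, 651.2327433418823, 0.0],
                     [0.0, 0.0, 1.0, 0.0]])
pts = np.array([[1.0, 0.5, 20.0],
                [-3.0, 1.0, 45.0]])

original, proposed = project_both_ways(pts, P2_waymo)
same_for_waymo = np.allclose(original, proposed)

# Hypothetical matrix: the same camera, scaled by 0.5,
# so row index 2 becomes [0, 0, 0.5, 0].
P2_scaled = 0.5 * P2_waymo
original_s, proposed_s = project_both_ways(pts, P2_scaled)
same_for_scaled = np.allclose(original_s, proposed_s)
```

With the Waymo matrix both divisors give identical pixels. With the rescaled matrix the proposed divisor still returns the same pixels as before, while the original divisor returns coordinates that are half as large.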

Feel free to raise a PR for this issue. Also, feel free to post more questions and we will be happy to clarify further.

xudh1991 commented 10 months ago

Thank you very much for your reply. Your answer is very detailed, and I think I know how to solve this problem. I would also like to ask another question. It comes up in many monocular 3D object detection methods, but I have never fully understood it. If you feel the question is too basic, feel free not to answer it, and I will close this issue.

(image: 微信截图_20231226155524)

The total loss in the algorithm is obtained by adding up multiple loss terms. However, adding the aa loss term during training may result in negative values, as shown in the figure above. Will this simple summation have a negative impact on the total loss?

(image: loss)

abhi1kumar commented 10 months ago

Your new question is unrelated to the current issue. Would you mind opening a new issue for it? I will answer the question there.

PS: No question is too basic.

abhi1kumar / DEVIANT

There may be some errors in class #23