autonomousvision / kitti360Scripts

This repository contains utility scripts for the KITTI-360 dataset.
MIT License

Projection of 3D box does not give tight 2D box #80

Closed abhi1kumar closed 1 year ago

abhi1kumar commented 1 year ago

Hi @yiyiliao, thank you for the great work. I wanted to get tight 2D boxes that are consistent with the camera projection equation. Hence, I visualized some 3D bounding boxes on the images, but the projection of the 3D box to the image plane does not give me tight 2D boxes.

e.g. Consider data_2d_raw/2013_05_28_drive_0004_sync/image_00/data_rect/0000003799.png with the 3D box.

[figure: 3d_labelling — two subfigures comparing the projected 3D box with the expected 2D box]

Subfigure 1: Visualization from our custom viewer.
- Pink: projected 3D box
- Orange: bounds of the projected 3D box
- Green: expected tight 2D box

Subfigure 2: Visualization from kitti360viewer.py script for a sanity check.

I expected the orange 2D box to be the same as the green 2D box. However, there are huge differences, especially for nearby objects.

I wanted to ask whether this is caused by noisy annotations or something else, and also whether it is possible to detect and remove or correct such cases.
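For reference, the "orange" bounds above can be obtained by projecting the eight box corners and taking the axis-aligned extremes. The following is a minimal sketch, not the repository's own code; the intrinsics and the cube are purely illustrative, and it assumes the box corners are already in the rectified camera frame with the whole box in front of the camera.

```python
import itertools

import numpy as np


def project_box_bounds(corners_cam, K):
    """Project the 8 corners of a 3D box (3x8, rectified camera coords)
    with intrinsics K (3x3) and return the axis-aligned image bounds
    (u_min, v_min, u_max, v_max) -- the bounds of the projected 3D box."""
    # Corners behind the camera (z <= 0) would need clipping first;
    # this sketch assumes the whole box is in front of the camera.
    pts = K @ corners_cam            # 3x8 homogeneous image points
    uv = pts[:2] / pts[2]            # perspective divide
    return uv[0].min(), uv[1].min(), uv[0].max(), uv[1].max()


# Illustrative intrinsics (not the real KITTI-360 calibration) and a
# 2x2x2 m cube centred 10 m in front of the camera.
K = np.array([[552.0, 0.0, 682.0],
              [0.0, 552.0, 238.0],
              [0.0, 0.0, 1.0]])
corners = np.array(list(itertools.product([-1, 1], [-1, 1], [9, 11]))).T
u0, v0, u1, v1 = project_box_bounds(corners, K)
```

Because the extremes are taken over all eight projected corners, this axis-aligned box is generally looser than the silhouette of the car itself, which is consistent with the gap between the orange and green boxes above.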

yiyiliao commented 1 year ago

Hi, thank you for your interest. It seems that the 3D bounding box of this car is indeed not as tight as it could be. However, even with a perfect 3D bounding box, it is not feasible to get the tight 2D bounding box you expected (the green one) from the projected 3D bounding boxes. I don't think there is a good way to solve this. You may use the 2D instance map to obtain a tighter 2D bounding box, or run a 2D detection network for this purpose.
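The instance-map route suggested above can be sketched as follows. This is a hypothetical helper, not part of kitti360Scripts; it only assumes a per-pixel integer instance map like the ones in the 2D instance annotations.

```python
import numpy as np


def bbox_from_instance_mask(instance_map, instance_id):
    """Tight 2D box (u_min, v_min, u_max, v_max) for one object, computed
    from a per-pixel instance map (HxW integer array)."""
    vs, us = np.nonzero(instance_map == instance_id)
    if us.size == 0:
        return None  # the instance is not visible in this frame
    return int(us.min()), int(vs.min()), int(us.max()), int(vs.max())


# Toy 4x5 instance map with object id 7 occupying rows 1-2, columns 2-3.
m = np.zeros((4, 5), dtype=np.int32)
m[1:3, 2:4] = 7
box = bbox_from_instance_mask(m, 7)  # -> (2, 1, 3, 2)
```

Unlike the projected-corner bounds, this box hugs the visible silhouette of the object, which is what the green box in the figure expects.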

abhi1kumar commented 1 year ago

Thank you, @yiyiliao, for the quick clarification. I completely understand.

However, I suspect that the 3D boxes in KITTI-360 are bigger than those in the KITTI dataset, and that the bigger car sizes in KITTI-360 lead to a not-so-tight projected 2D box. To verify, I computed the average 3D dimensions of the car boxes in the training splits of the two datasets:

| Dimension (m) | KITTI | KITTI-360 (Mean) | KITTI-360 (Median) |
|---------------|-------|------------------|--------------------|
| h3d           | 1.53  | 1.64             | 1.57               |
| w3d           | 1.62  | 2.07             | 2.03               |
| l3d           | 3.88  | 4.56             | 4.53               |

I expected the two datasets to have very similar car sizes since both were collected in the same geographical location. It would be great if you could confirm whether my guess is correct.
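The per-dimension statistics above amount to a column-wise mean and median over the (h3d, w3d, l3d) triplets of all car boxes. A minimal sketch with made-up numbers (in practice the dimensions would be read from the KITTI label files and the KITTI-360 3D bounding box annotations):

```python
import numpy as np

# Hypothetical (h3d, w3d, l3d) car dimensions in metres, one row per box.
dims = np.array([
    [1.55, 2.00, 4.40],
    [1.70, 2.10, 4.70],
    [1.60, 2.05, 4.50],
])

mean_hwl = dims.mean(axis=0)      # column-wise mean over all boxes
median_hwl = np.median(dims, axis=0)  # column-wise median, robust to outliers
```

Reporting the median alongside the mean, as in the table above, helps separate a genuine size shift from a handful of over-sized annotations.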

yiyiliao commented 1 year ago

Thank you for digging further into the problem! I agree that the average car sizes should be similar for KITTI and KITTI-360. I believe the 3D bounding boxes are larger in KITTI-360 for the following reasons:

1. The datasets are annotated using different annotation tools. I don't know the annotation details of the KITTI dataset, but I believe it was annotated using per-frame point clouds. In KITTI-360, the cars are annotated using the accumulated point cloud, which means the observations are more complete. For example, the width can be larger when the rear-view mirrors are included:

[figure: accumulated point cloud of a car, with the rear-view mirrors included in the box]

2. There could be some noise in the annotation process. Our annotators are instructed to annotate the cars as tightly as possible, but there could be a few cases where the bounding boxes are not that tight.

abhi1kumar commented 1 year ago

Thank you, @yiyiliao, for being spot on. I agree with you.

> 1. The datasets are annotated using different annotation tools. I don't know the annotation details of the KITTI dataset, but I believe it was annotated using per-frame point clouds. In KITTI-360, the cars are annotated using the accumulated point cloud, which means the observations are more complete. For example, the width can be larger when the rear-view mirrors are included.

I agree the width would change because of the rear-view mirrors. However, the difference in l3d is a bit too big to be explained this way.

> 2. There could be some noise in the annotation process. Our annotators are instructed to annotate the cars as tightly as possible, but there could be a few cases where the bounding boxes are not that tight.

I guess both point cloud accumulation errors and annotator errors give rise to the differences we see.