Closed — johannes-tum closed this issue 11 months ago
Hi @johannes-tum Thank you for your interest in DEVIANT and for your great questions.
Here are the labels for the two cars on the right that are not even visible anymore
The conclusion of MonoDLE (CVPR'21) is that discarding distant samples from the training set stops them from distracting the detector. KITTI samples carry occlusion and truncation labels; Waymo samples do not. Therefore, we use the number of lidar points as a proxy for the Hard (heavily truncated/distant) Waymo samples and filter those samples out during training. In other words, we keep samples with enough lidar points (car: 100, pedestrian/cyclist: 50). Please note that this is a simplistic handcrafted rule, and more sophisticated rules are definitely possible.
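The lidar-point proxy rule could be sketched roughly like this (the thresholds are the ones quoted above; the function and field names are illustrative, not the repo's actual API):

```python
# Minimum lidar points per class, as stated in the thread.
MIN_LIDAR_POINTS = {"Car": 100, "Pedestrian": 50, "Cyclist": 50}

def is_hard_waymo_sample(category, num_lidar_points):
    """Return True if the sample counts as Hard and should be dropped in training."""
    threshold = MIN_LIDAR_POINTS.get(category)
    if threshold is None:
        return True  # category not trained on: drop it as well
    return num_lidar_points < threshold
```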
Do you do more filtering that I am not aware of at the moment?
Yes, we use more filtering rules during training. Please check out the waymo dataloader for all the rules used in training.
do you also filter the ground truth labels in the same way for evaluation as for training?
We do NOT filter out any ground truth samples/labels during evaluation. In other words, the waymo evaluation scripts at 0.7 and 0.5 IoU3D thresholds use all samples in evaluation without filtering.
If not, what is the difference?
Filtering out hard samples during training results in a better, or at least less lousy, detector.
I hope that answers your questions. Feel free to ask for more clarification and we will do our best to clarify further.
PS: We would be super happy if you could support our repo by starring it.
Hi @abhi1kumar, Thank you very much for your elaborate answer! A few follow-up questions:
Since you don't filter the ground-truth labels in any way: the two cars on the right side would probably not be detected by the detector at test time. So this means it is basically impossible to achieve maximum performance? The two cars have a high number of lidar points (1094 and 7286), so you also don't filter the ground-truth labels by the number of lidar points then?
You mention in the paper that you use only every third frame for training. Where do you do that in the code?
Could you check whether I missed something in the filtering at training time:
Thanks a lot! I will support you.
Johannes
Since you don't filter the ground-truth labels in any way: the two cars on the right side would probably not be detected by the detector at test time. So this means it is basically impossible to achieve maximum performance? The two cars have a high number of lidar points (1094 and 7286).
Yes, they might be missed. It is in any case impossible to achieve ideal performance with monocular detectors because of the scale-depth ambiguity.
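The scale-depth ambiguity can be seen in a two-line pinhole-camera calculation: scaling an object's size and depth by the same factor leaves its projection unchanged, so a single image cannot tell the two hypotheses apart. The focal length here is arbitrary and purely illustrative.

```python
F = 1000.0  # focal length in pixels (illustrative value)

def project(X, Z):
    """Pinhole projection of a lateral offset X (meters) at depth Z (meters)."""
    return F * X / Z

k = 2.0
u_near = project(X=1.5, Z=10.0)         # a 1.5 m offset at 10 m depth
u_far = project(X=1.5 * k, Z=10.0 * k)  # twice as large, twice as far
# u_near == u_far: both hypotheses explain the exact same pixel.
```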
You mention in the paper that you use only every third frame for training. Where do you do that in the code?
The subsampling takes place during training-set preparation when you run setup_split.py. This file uses train_org for the training split. Please take a look at the Waymo train_org file, which contains one out of every three frames.
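The "one in every third frame" subsampling amounts to a stride-3 slice over the frame list. This is only a sketch of the idea under that assumption; the actual logic in setup_split.py may differ in details.

```python
def subsample_frames(frame_ids, stride=3):
    """Keep every `stride`-th frame for the training split."""
    return frame_ids[::stride]

train_org = subsample_frames(list(range(10)))  # → [0, 3, 6, 9]
```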
Could you check whether I missed something in the filtering at training time:
- You only learn car, pedestrian, and cyclist here
- You only consider objects at a depth of at least 2 meters. The unknown category is only relevant for KITTI, I assume
- The width and height of the 2D bounding box must be positive
- You only consider objects whose 3D center lies inside the image. This means the two cars in the image above would not be considered during training
- You consider the number of lidar points hitting the object
- For KITTI, truncation must be below 50% and occlusion must be at most 2
You are correct.
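Johannes's checklist above could be sketched compactly as one predicate per object. This is a hedged illustration, not the repo's dataloader: the dict keys (`depth`, `bbox2d`, `center3d_proj`, `num_lidar_points`) are made-up names, and the lidar thresholds are the ones quoted in the thread.

```python
MIN_LIDAR_POINTS = {"Car": 100, "Pedestrian": 50, "Cyclist": 50}

def keep_for_training(obj, img_w, img_h):
    """Apply the training-time filters from the checklist to one labeled object."""
    if obj["category"] not in MIN_LIDAR_POINTS:
        return False                          # only car / pedestrian / cyclist
    if obj["depth"] < 2.0:
        return False                          # at least 2 m depth
    x1, y1, x2, y2 = obj["bbox2d"]
    if x2 - x1 <= 0 or y2 - y1 <= 0:
        return False                          # positive 2D box width and height
    u, v = obj["center3d_proj"]               # projected 3D center in pixels
    if not (0 <= u < img_w and 0 <= v < img_h):
        return False                          # 3D center must lie inside the image
    if obj["num_lidar_points"] < MIN_LIDAR_POINTS[obj["category"]]:
        return False                          # enough lidar hits on the object
    return True
```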
I am closing this issue, but feel free to open if you have more queries.
Hi, congratulations on your very nice paper!
I have a question regarding Waymo. In the paper you mention that you filter out objects with depth <= 2m and objects with too few lidar points (car: 100, pedestrian/cyclist: 50). In general, I think it makes sense to do that.
I wonder whether that is even strict enough. Here is an example where I used your approach for data generation and plotted the results. Here are the labels for the two cars on the right that are not even visible anymore:
Car 0 0 -10 1858.27 625.86 1920.0 872.42 1.84 2.14 4.75 9.95 1.78 17.5 1.52 1094
Car 0 0 -10 1769.02 728.22 1920.0 1280.0 1.8 2.16 4.86 4.17 2.15 5.16 -1.62 7286
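For readers parsing lines like the two above: the first 15 fields follow the standard KITTI label layout (type, truncation, occlusion, alpha, 2D box, 3D dimensions, 3D location, rotation_y), and the trailing 16th field here appears to be the lidar-point count appended by the Waymo conversion. A minimal reader, under that assumption:

```python
def parse_label(line):
    """Parse one KITTI-style label line with a trailing lidar-point count."""
    f = line.split()
    return {
        "type": f[0],
        "bbox2d": tuple(map(float, f[4:8])),       # left, top, right, bottom
        "dims_hwl": tuple(map(float, f[8:11])),    # height, width, length (m)
        "location": tuple(map(float, f[11:14])),   # x, y, z in camera frame (m)
        "rotation_y": float(f[14]),
        "num_lidar_points": int(f[15]),
    }

car = parse_label(
    "Car 0 0 -10 1858.27 625.86 1920.0 872.42 "
    "1.84 2.14 4.75 9.95 1.78 17.5 1.52 1094"
)
# car["num_lidar_points"] → 1094; car["location"][2] → 17.5 (depth in meters)
```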
Do you do more filtering that I am not aware of at the moment? And do you also filter the ground truth labels in the same way for evaluation as for training? If not, what is the difference?
Best wishes
Johannes