Add deep learning based 3D bounding box shape estimation algorithm

kaancolak commented 1 month ago

Checklist

[X] I've read the contribution guidelines.
[X] I've searched other issues and no duplicate issues were found.
[X] I've agreed with the maintainers that I can plan this task.

Description

In the current version of Autoware, the L-shape fitting algorithm is used to find 3D bounding boxes for vehicles from object clusters. However, this algorithm has limitations in estimating these bounding boxes accurately. If the vehicle's point cloud includes side mirrors or antennas, or only shows the back of the vehicle (forming an I-shape), the algorithm often fails. It can also struggle when only a few points are available.
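To make the I-shape limitation concrete, here is a minimal sketch of a search-based rectangle fit that scores each candidate heading with a closeness-style criterion (points near the rectangle edges score higher). This is an illustration only, not the code in shape_estimation; the function names, the 1 degree step, the `d0` clamp, and the synthetic clusters are all assumptions.

```python
import math

def closeness(u, v, d0=0.01):
    """Score a candidate heading: larger when points hug the rectangle edges."""
    def edge_dist(c):
        lo, hi = min(c), max(c)
        to_lo = [a - lo for a in c]
        to_hi = [hi - a for a in c]
        # Keep distances to whichever boundary the cluster is closer to overall.
        if sum(d * d for d in to_lo) <= sum(d * d for d in to_hi):
            return to_lo
        return to_hi
    du, dv = edge_dist(u), edge_dist(v)
    # Clamp with d0 so points lying exactly on an edge do not divide by zero.
    return sum(1.0 / max(min(a, b), d0) for a, b in zip(du, dv))

def fit_rectangle(points, step_deg=1.0):
    """Try headings in [0, 90) degrees; return (theta_deg, extent_u, extent_v)."""
    best = None
    for i in range(int(90.0 / step_deg)):
        theta = math.radians(i * step_deg)
        c, s = math.cos(theta), math.sin(theta)
        # Project every point onto the two candidate box axes.
        u = [x * c + y * s for x, y in points]
        v = [-x * s + y * c for x, y in points]
        score = closeness(u, v)
        if best is None or score > best[0]:
            best = (score, i * step_deg, max(u) - min(u), max(v) - min(v))
    return best[1:]

def rotate(points, deg):
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [(x * c - y * s, x * s + y * c) for x, y in points]

# Synthetic clusters (hypothetical): a 1.8 m rear face and a 4.5 m side face.
rear = [(0.0, 0.1 * i) for i in range(19)]
side = [(0.1 * i, 0.0) for i in range(46)]

l_fit = fit_rectangle(rotate(rear + side, 30.0))  # L-shape: rear and side visible
i_fit = fit_rectangle(rotate(rear, 30.0))         # I-shape: only the rear visible
print(l_fit)
print(i_fit)
```

On the L-shaped cluster the search recovers the 30 degree heading and both extents. On the I-shaped cluster one extent collapses to roughly zero, so the vehicle length cannot be recovered from geometry alone, which is where a learned prior should help.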

The shape-fitting algorithm plays a crucial role, especially when using camera-lidar fusion or Apollo instance segmentation as the object detector.

L-shape fitting results:

https://github.com/autowarefoundation/autoware.universe/assets/12658936/fef4f1ae-b364-473e-aaa4-b55e6eb69f30

Purpose

Add a neural network-based 3D bounding box shape estimation algorithm. This could be an optional feature in the shape_estimation package, allowing users to choose it if desired.

Possible approaches

A point-based method can be used for estimating object bounding boxes. Typically, networks like PointNet/PointNet++ are used for classifying point clouds or performing point-wise instance segmentation, but they can be easily adapted to estimate 3D bounding boxes of objects.

Definition of done

[ ] Train and evaluate prototype model
[ ] Prepare the TensorRT inference engine and update shape_estimation package
[ ] Add model to the model zoo and prepare documentation about how to train

kaancolak commented 1 month ago

I developed two prototype models using PointNet and PointNet++. Each model is a combination of two sub-models: a regression transform net to estimate the object's center and a PointNet-based network to predict bounding box dimensions and yaw angles. The bounding box estimation part from the Frustum-PointNet paper was used as a reference.
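For readers unfamiliar with the Frustum-PointNet box head, the sketch below shows how such outputs are typically decoded into a box. It assumes the Frustum PointNets conventions (heading split into a bin class plus a residual, center expressed as an offset from the cluster centroid). Whether the prototypes use exactly this parameterization is not stated here, and `NUM_HEADING_BINS`, `encode_yaw`, `decode_yaw`, and `decode_box` are hypothetical names.

```python
import math

NUM_HEADING_BINS = 12  # value used in Frustum PointNets; assumed here
BIN_WIDTH = 2.0 * math.pi / NUM_HEADING_BINS

def encode_yaw(yaw):
    """Split a yaw angle (rad) into (bin index, residual from the bin centre)."""
    two_pi = 2.0 * math.pi
    # Shift by half a bin so bin 0 is centred on yaw = 0.
    shifted = (yaw % two_pi + BIN_WIDTH / 2.0) % two_pi
    bin_id = min(int(shifted / BIN_WIDTH), NUM_HEADING_BINS - 1)
    residual = shifted - (bin_id * BIN_WIDTH + BIN_WIDTH / 2.0)
    return bin_id, residual

def decode_yaw(bin_id, residual):
    """Inverse of encode_yaw; the result is wrapped to [-pi, pi]."""
    yaw = bin_id * BIN_WIDTH + residual
    return math.atan2(math.sin(yaw), math.cos(yaw))

def decode_box(points, center_offset, bin_id, yaw_residual, size):
    """Assemble a box from (hypothetical) network outputs.

    points:        list of (x, y, z) cluster points
    center_offset: offset predicted by the center regression net,
                   relative to the cluster centroid (an assumption)
    size:          predicted (length, width, height)
    """
    n = len(points)
    centroid = [sum(p[k] for p in points) / n for k in range(3)]
    center = [centroid[k] + center_offset[k] for k in range(3)]
    return {"center": center, "size": list(size),
            "yaw": decode_yaw(bin_id, yaw_residual)}

# Usage with made-up values standing in for network outputs.
bin_id, res = encode_yaw(math.radians(100.0))
box = decode_box([(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)],
                 center_offset=(0.5, 0.0, 0.0),
                 bin_id=bin_id, yaw_residual=res,
                 size=(4.5, 1.8, 1.5))
print(box)
```

Treating heading as classification plus a small regression avoids regressing a periodic angle directly, which is the motivation given in the Frustum PointNets paper.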

The models take point clouds and labels as inputs.

The PointNet-based model can process around 2,000 objects per second using pure Python on a 3060 mobile GPU. With TensorRT quantization and a more powerful GPU, it can process over 4,000 objects per second without performance issues. It looks applicable to Autoware.

Although the PointNet++ model is slightly more accurate, it is only about half as fast: the pure Python version processes ~900 objects per second.

I will share the base models used in the experiments after a small refactoring. I used NuScenes 3D labels for training and validation.

The evaluation results of the base model are below:

Test metrics results:
BEV Intersection over Union 2D: 0.8705733457868103
Intersection over Union 3D: 0.7107460938907424
Intersection over Union 3D > 0.7: 0.6023422951582867
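For reference, metrics like these can be computed for yaw-rotated boxes roughly as follows. This is a sketch, not the evaluation script used for the numbers above: it assumes boxes rotate only around the z axis, uses Sutherland-Hodgman polygon clipping for the BEV overlap, and the tuple layouts and function names are hypothetical.

```python
import math

def box_corners(cx, cy, length, width, yaw):
    """Counter-clockwise BEV corners of a rotated rectangle."""
    c, s = math.cos(yaw), math.sin(yaw)
    half = [(length / 2, width / 2), (-length / 2, width / 2),
            (-length / 2, -width / 2), (length / 2, -width / 2)]
    return [(cx + dx * c - dy * s, cy + dx * s + dy * c) for dx, dy in half]

def polygon_area(poly):
    """Shoelace formula."""
    pairs = zip(poly, poly[1:] + poly[:1])
    return 0.5 * abs(sum(x1 * y2 - x2 * y1 for (x1, y1), (x2, y2) in pairs))

def clip(subject, clipper):
    """Sutherland-Hodgman: clip `subject` against a convex CCW `clipper`."""
    output = subject
    for (ax, ay), (bx, by) in zip(clipper, clipper[1:] + clipper[:1]):
        if not output:
            break
        inputs, output = output, []

        def side(p):
            # Positive when p lies to the left of the directed edge a -> b.
            return (bx - ax) * (p[1] - ay) - (by - ay) * (p[0] - ax)

        for p, q in zip(inputs, inputs[1:] + inputs[:1]):
            sp, sq = side(p), side(q)
            if sp >= 0:
                output.append(p)
            if (sp > 0 and sq < 0) or (sp < 0 and sq > 0):
                t = sp / (sp - sq)  # where segment p -> q crosses the edge
                output.append((p[0] + t * (q[0] - p[0]),
                               p[1] + t * (q[1] - p[1])))
    return output

def bev_intersection(a, b):
    """a, b: (cx, cy, length, width, yaw). Returns the overlap area."""
    poly = clip(box_corners(*a), box_corners(*b))
    return polygon_area(poly) if len(poly) >= 3 else 0.0

def bev_iou(a, b):
    inter = bev_intersection(a, b)
    return inter / (a[2] * a[3] + b[2] * b[3] - inter)

def iou_3d(a, b):
    """a, b: (cx, cy, cz, length, width, height, yaw)."""
    inter_area = bev_intersection((a[0], a[1], a[3], a[4], a[6]),
                                  (b[0], b[1], b[3], b[4], b[6]))
    z_top = min(a[2] + a[5] / 2, b[2] + b[5] / 2)
    z_bottom = max(a[2] - a[5] / 2, b[2] - b[5] / 2)
    inter_vol = inter_area * max(0.0, z_top - z_bottom)
    union = a[3] * a[4] * a[5] + b[3] * b[4] * b[5] - inter_vol
    return inter_vol / union

print(bev_iou((0, 0, 4, 2, 0.0), (2, 0, 4, 2, 0.0)))
print(iou_3d((0, 0, 0, 4, 2, 2, 0.0), (2, 0, 1, 4, 2, 2, 0.0)))
```

A thresholded score such as "Intersection over Union 3D > 0.7" would then be the fraction of test objects whose `iou_3d` exceeds 0.7, assuming that is how the metric above is defined.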

Initial results:

https://github.com/autowarefoundation/autoware.universe/assets/12658936/918b71a7-637d-46c8-afc8-34c06a687344

FYI: @xmfcx @miursh @shmpwk @OsamuSekino

kaancolak commented 1 month ago

These are the up-to-date results of the PointNet++-based method. The model was trained on the NuScenes dataset and tested on bus data. I'm planning to add custom data to the training set so that the model generalizes better.

https://github.com/autowarefoundation/autoware.universe/assets/12658936/1aa8bb97-c5eb-47c2-b2d0-8362190c4899

In a few weeks, I am planning to create a TensorRT converter and an inference node.

xmfcx commented 1 month ago

Could you retake this video with a top-down orthographic projection? These bounding boxes look wrong at the 0:17 timestamp: the box extends past the front of the objects even though the front is visible to the lidar. But it's hard to tell from the projection.

kaancolak commented 1 month ago

Thank you @xmfcx, I re-recorded it. As you said, it looks misaligned in the previous video, but there is no actual issue with the boxes. I also added a side view to clarify.

Blue boxes are the "shape_estimation" output and the green ones are the model predictions.

https://github.com/autowarefoundation/autoware.universe/assets/12658936/ba04b5df-bfbb-44f3-aceb-692ad50f2bf1

I think it looks better than the current algorithm but needs to be generalized using more data. Please feel free to share your ideas.
