SxJyJay / MSMDFusion

[CVPR 2023] MSMDFusion: Fusing LiDAR and Camera at Multiple Scales with Multi-Depth Seeds for 3D Object Detection
Apache License 2.0

MSMDFusion

Official implementation of our CVPR 2023 paper "MSMDFusion: Fusing LiDAR and Camera at Multiple Scales with Multi-Depth Seeds for 3D Object Detection", by Yang Jiao, Zequn Jie, Shaoxiang Chen, Jingjing Chen, Lin Ma, and Yu-Gang Jiang.

[Figure: MSMDFusion framework]

Introduction

Fusing LiDAR and camera information is essential for achieving accurate and reliable 3D object detection in autonomous driving systems. This is challenging because it requires combining multi-granularity geometric and semantic features from two drastically different modalities. Recent approaches exploit the semantic density of camera features by lifting points in 2D camera images (referred to as seeds) into 3D space, and then incorporate 2D semantics via cross-modal interaction or fusion techniques. However, these approaches under-investigate depth information when lifting points into 3D space, so 2D semantics cannot be reliably fused with 3D points. Moreover, their multi-modal fusion strategy, implemented as concatenation or attention, either cannot effectively fuse 2D and 3D information or is unable to perform fine-grained interactions in the voxel space. To this end, we propose a novel framework called MSMDFusion to tackle the above problems.
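
To make the idea of lifting a 2D seed into 3D space concrete, here is a minimal sketch that back-projects one pixel at several candidate depths. It assumes a simple pinhole camera; the function name `lift_seed`, the intrinsics, and the depth values are hypothetical and are not taken from this repository, so this should not be read as the MSMDFusion implementation.

```python
from typing import List, Tuple


def lift_seed(u: float, v: float, depths: List[float],
              fx: float, fy: float, cx: float, cy: float
              ) -> List[Tuple[float, float, float]]:
    """Back-project pixel (u, v) into the camera frame, one 3D point per depth.

    Assumes a pinhole model with focal lengths (fx, fy) and principal
    point (cx, cy); all values used below are illustrative only.
    """
    points = []
    for d in depths:
        x = (u - cx) * d / fx  # horizontal offset scales with depth
        y = (v - cy) * d / fy  # vertical offset scales with depth
        points.append((x, y, d))
    return points


# One seed pixel lifted at two hypothetical candidate depths (in meters).
seeds_3d = lift_seed(900.0, 450.0, [10.0, 20.0],
                     fx=1000.0, fy=1000.0, cx=800.0, cy=450.0)
print(seeds_3d)  # [(1.0, 0.0, 10.0), (2.0, 0.0, 20.0)]
```

The same pixel lands at very different 3D locations depending on the depth assigned to it, which is why the quality of the depth estimate matters when fusing 2D semantics with 3D points.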

Getting Started

Installation

For basic installation, please refer to getting_started.md.

Notice:

Data Preparation

Step 1: Please refer to the official site to prepare the nuScenes data. After data preparation, you should see the following directory structure:

mmdetection3d
├── mmdet3d
├── tools
├── configs
├── data
│   ├── nuscenes
│   │   ├── maps
│   │   ├── samples
│   │   ├── sweeps
│   │   ├── v1.0-test
│   │   ├── v1.0-trainval
│   │   ├── nuscenes_database
│   │   ├── nuscenes_infos_train.pkl
│   │   ├── nuscenes_infos_val.pkl
│   │   ├── nuscenes_infos_test.pkl
│   │   ├── nuscenes_dbinfos_train.pkl
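
A quick way to confirm that data preparation produced this layout is to check for each expected entry. The following is a minimal sketch, not part of the repository; the helper name `missing_entries` is hypothetical, and it assumes the script is run from the mmdetection3d root so that `data/nuscenes` resolves correctly.

```python
from pathlib import Path
from typing import List

# Entries listed under data/nuscenes in the directory tree above.
EXPECTED = [
    "maps", "samples", "sweeps", "v1.0-test", "v1.0-trainval",
    "nuscenes_database",
    "nuscenes_infos_train.pkl", "nuscenes_infos_val.pkl",
    "nuscenes_infos_test.pkl", "nuscenes_dbinfos_train.pkl",
]


def missing_entries(root: str) -> List[str]:
    """Return the expected files/folders that do not exist under root."""
    base = Path(root)
    return [name for name in EXPECTED if not (base / name).exists()]


if __name__ == "__main__":
    missing = missing_entries("data/nuscenes")
    print("layout OK" if not missing else f"missing: {missing}")
```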

Step 2: Download the preprocessed virtual-point data for samples (extraction code: 9xcb) and sweeps (extraction code: 2eg1). Place them under the samples and sweeps folders shown above, respectively, and rename each one to FOREGROUND_MIXED_6NN_WITH_DEPTH.
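
The move-and-rename in Step 2 could be scripted roughly as follows. This is only a sketch: the helper `place_virtual_points` and the example source paths are hypothetical, since the name of the extracted download folder is not specified here.

```python
import shutil
from pathlib import Path

TARGET_NAME = "FOREGROUND_MIXED_6NN_WITH_DEPTH"


def place_virtual_points(extracted_dir: str, nuscenes_root: str, split: str) -> Path:
    """Move an extracted virtual-point folder to <nuscenes_root>/<split>/TARGET_NAME."""
    if split not in ("samples", "sweeps"):
        raise ValueError(f"split must be 'samples' or 'sweeps', got {split!r}")
    dest = Path(nuscenes_root) / split / TARGET_NAME
    if dest.exists():
        raise FileExistsError(f"{dest} already exists")
    dest.parent.mkdir(parents=True, exist_ok=True)
    # The destination does not exist yet, so this renames the folder.
    shutil.move(extracted_dir, str(dest))
    return dest


# Example (the source paths are placeholders for wherever you extracted the archives):
# place_virtual_points("/path/to/extracted_samples", "data/nuscenes", "samples")
# place_virtual_points("/path/to/extracted_sweeps", "data/nuscenes", "sweeps")
```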

Training and Evaluation

For training, first train a LiDAR-only backbone such as TransFusion-L. Then merge the checkpoints of the pretrained TransFusion-L and ResNet-50 as suggested here. We also provide a merged first-stage checkpoint here (extraction code: 69i7).

# Stage 1: train the LiDAR-only TransFusion-L (8 = number of GPUs)
sh ./tools/dist_train.sh ./configs/transfusion_nusc_voxel_L.py 8
# Stage 2: train the LiDAR-camera MSMDFusion model (8 = number of GPUs)
sh ./tools/dist_train.sh ./configs/MSMDFusion_nusc_voxel_LC.py 8

Notice: When training the first stage (TransFusion-L), please follow the copy-and-paste fade strategy as suggested here.
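
The checkpoint merge between the two stages amounts to combining two weight dictionaries into one. The sketch below illustrates this with plain dictionaries so that it runs without PyTorch. The function `merge_state_dicts` and the `img_backbone.` prefix are assumptions made for illustration; the key names the fusion config actually expects should be taken from the linked merging instructions.

```python
from typing import Any, Dict


def merge_state_dicts(lidar_state: Dict[str, Any],
                      image_state: Dict[str, Any],
                      image_prefix: str = "img_backbone.") -> Dict[str, Any]:
    """Combine LiDAR-branch weights with image-backbone weights.

    Image keys are renamed with image_prefix (an assumed naming scheme)
    so they do not collide with the LiDAR-branch keys.
    """
    merged = dict(lidar_state)  # copy so the input is left untouched
    for key, value in image_state.items():
        new_key = image_prefix + key
        if new_key in merged:
            raise KeyError(f"duplicate key after renaming: {new_key}")
        merged[new_key] = value
    return merged


# Toy stand-ins for real weights. With PyTorch, the dictionaries would
# typically come from torch.load(path, map_location="cpu") and the result
# would be written back with torch.save.
lidar = {"pts_backbone.conv.weight": 1}
image = {"layer1.conv.weight": 2}
print(sorted(merge_state_dicts(lidar, image)))
```

If you use the merged first-stage checkpoint provided above, this step is unnecessary.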

For evaluation, you can use the following command:

# Evaluation (set ckpt_path to your checkpoint file; 8 = number of GPUs)
sh ./tools/dist_test.sh ./configs/MSMDFusion_nusc_voxel_LC.py $ckpt_path 8 --eval bbox

For testing and making a submission to the leaderboard, please refer to the official site.

Results

3D Object Detection on nuScenes

Model            Set    mAP     NDS     Result Files
MSMDFusion       val    69.27   72.05   checkpoints
MSMDFusion       test   71.49   73.96   predictions
MSMDFusion-TTA   test   73.28   75.09   predictions

3D Object Tracking on nuScenes

Model        Set    AMOTA   AMOTP   Recall   Result Files
MSMDFusion   test   73.98   54.87   76.30    predictions

Citation

If you find our paper useful, please cite:

@InProceedings{Jiao_2023_CVPR,
    author    = {Jiao, Yang and Jie, Zequn and Chen, Shaoxiang and Chen, Jingjing and Ma, Lin and Jiang, Yu-Gang},
    title     = {MSMDFusion: Fusing LiDAR and Camera at Multiple Scales With Multi-Depth Seeds for 3D Object Detection},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2023},
    pages     = {21643-21652}
}

Acknowledgement

We sincerely thank the authors of mmdetection3d, CenterPoint, TransFusion, MVP, and the two BEVFusion projects for open-sourcing their methods.