
RoboBEV: Towards Robust Bird's Eye View Perception under Common Corruption and Domain Shift
https://daniel-xsy.github.io/robobev/

English | 简体中文

Benchmarking and Improving Bird's Eye View Perception Robustness
in Autonomous Driving

Shaoyuan Xie<sup>1</sup>   Lingdong Kong<sup>2,3</sup>   Wenwei Zhang<sup>2,4</sup>   Jiawei Ren<sup>4</sup>   Liang Pan<sup>2</sup>   Kai Chen<sup>2</sup>   Ziwei Liu<sup>4</sup>
<sup>1</sup>University of California, Irvine   <sup>2</sup>Shanghai AI Laboratory   <sup>3</sup>National University of Singapore   <sup>4</sup>S-Lab, Nanyang Technological University

About

RoboBEV is the first robustness evaluation benchmark tailored for camera-based bird's eye view (BEV) perception under natural data corruptions and domain shifts, both of which are likely to occur in real-world deployments.

[Common Corruption] - We investigate eight data corruption types that are likely to appear in driving scenarios, spanning four categories: 1) sensor failure, 2) motion & data processing, 3) lighting conditions, and 4) weather conditions (see the illustrative sketch after this list).

[Domain Shift] - We benchmark the adaptation performance of BEV models from three aspects: 1) city-to-city, 2) day-to-night, and 3) dry-to-rain.
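
To make the corruption protocol concrete, here is a minimal, illustrative sketch of severity-controlled corruptions applied to the six camera views. It is not the official nuScenes-C generation code; the function names and the five-level severity schedule are hypothetical (see CREATE.md for the actual pipeline):

```python
import numpy as np

def brightness(img: np.ndarray, severity: int = 3) -> np.ndarray:
    """Lighting-condition corruption: add a severity-dependent brightness offset."""
    c = [0.1, 0.2, 0.3, 0.4, 0.5][severity - 1]  # hypothetical 5-level schedule
    out = img.astype(np.float32) / 255.0 + c
    return (np.clip(out, 0.0, 1.0) * 255.0).astype(np.uint8)

def camera_crash(views, n_drop=2, seed=0):
    """Sensor-failure corruption: black out n_drop of the six camera views."""
    rng = np.random.default_rng(seed)
    dropped = set(rng.choice(len(views), size=n_drop, replace=False).tolist())
    return [np.zeros_like(v) if i in dropped else v for i, v in enumerate(views)]

# Example on a dummy 6-view sample at nuScenes resolution (1600x900).
views = [np.full((900, 1600, 3), 128, dtype=np.uint8) for _ in range(6)]
corrupted = camera_crash([brightness(v, severity=3) for v in views])
```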

*Figure: Example camera inputs under clean and corrupted conditions across the six views: FRONT_LEFT, FRONT, FRONT_RIGHT, BACK_LEFT, BACK, BACK_RIGHT.*

Visit our project page to explore more examples. :blue_car:

Updates

Outline

Installation

Kindly refer to INSTALL.md for the installation details.

Data Preparation

Our datasets are hosted by OpenDataLab.


OpenDataLab is a pioneering open data platform for the large AI model era, making datasets accessible. Through OpenDataLab, researchers can obtain formatted datasets across various fields free of charge.

Kindly refer to DATA_PREPARE.md for details on preparing the nuScenes and nuScenes-C datasets.

Getting Started

Kindly refer to GET_STARTED.md to learn more about the usage of this codebase.

Model Zoo

**Camera-Only BEV Detection**

- [ ] **[Fast-BEV](https://arxiv.org/abs/2301.12511), arXiv 2023.** [**`[Code]`**](https://github.com/Sense-GVT/Fast-BEV)
- [ ] **[AeDet](https://arxiv.org/abs/2211.12501), CVPR 2023.** [**`[Code]`**](https://github.com/fcjian/AeDet)
- [x] **[SOLOFusion](https://arxiv.org/abs/2210.02443), ICLR 2023.** [**`[Code]`**](https://github.com/Divadi/SOLOFusion)
- [x] **[PolarFormer](https://arxiv.org/abs/2206.15398), AAAI 2023.** [**`[Code]`**](https://github.com/fudan-zvg/PolarFormer)
- [x] **[BEVStereo](https://arxiv.org/abs/2209.10248), AAAI 2023.** [**`[Code]`**](https://github.com/Megvii-BaseDetection/BEVStereo)
- [x] **[BEVDepth](https://arxiv.org/abs/2206.10092), AAAI 2023.** [**`[Code]`**](https://github.com/Megvii-BaseDetection/BEVDepth)
- [ ] **[MatrixVT](https://arxiv.org/abs/2211.10593), arXiv 2022.** [**`[Code]`**](https://github.com/Megvii-BaseDetection/BEVDepth)
- [x] **[Sparse4D](https://arxiv.org/abs/2211.10581), arXiv 2022.** [**`[Code]`**](https://github.com/linxuewu/Sparse4D)
- [ ] **[CrossDTR](https://arxiv.org/abs/2209.13507), arXiv 2022.** [**`[Code]`**](https://github.com/sty61010/CrossDTR)
- [x] **[SRCN3D](https://arxiv.org/abs/2206.14451), arXiv 2022.** [**`[Code]`**](https://github.com/synsin0/SRCN3D)
- [ ] **[PolarDETR](https://arxiv.org/abs/2206.10965), arXiv 2022.** [**`[Code]`**](https://github.com/hustvl/PolarDETR)
- [x] **[BEVerse](https://arxiv.org/abs/2205.09743), arXiv 2022.** [**`[Code]`**](https://github.com/zhangyp15/BEVerse)
- [ ] **[M^2BEV](https://arxiv.org/abs/2204.05088), arXiv 2022.** [**`[Code]`**](https://nvlabs.github.io/M2BEV/)
- [x] **[ORA3D](https://arxiv.org/abs/2207.00865), BMVC 2022.** [**`[Code]`**](https://github.com/anonymous2776/ora3d)
- [ ] **[Graph-DETR3D](https://arxiv.org/abs/2204.11582), ACM MM 2022.** [**`[Code]`**](https://github.com/zehuichen123/Graph-DETR3D)
- [ ] **[SpatialDETR](https://markus-enzweiler.de/downloads/publications/ECCV2022-spatial_detr.pdf), ECCV 2022.** [**`[Code]`**](https://github.com/cgtuebingen/SpatialDETR)
- [x] **[PETR](https://arxiv.org/abs/2203.05625), ECCV 2022.** [**`[Code]`**](https://github.com/megvii-research/PETR)
- [x] **[BEVFormer](https://arxiv.org/abs/2203.17270), ECCV 2022.** [**`[Code]`**](https://github.com/fundamentalvision/BEVFormer)
- [x] **[BEVDet](https://arxiv.org/abs/2112.11790), arXiv 2021.** [**`[Code]`**](https://github.com/HuangJunJie2017/BEVDet)
- [x] **[DETR3D](https://arxiv.org/abs/2110.06922), CoRL 2021.** [**`[Code]`**](https://github.com/WangYueFt/detr3d)

**Camera-Only Monocular 3D Detection**

- [x] **[FCOS3D](https://openaccess.thecvf.com/content/ICCV2021W/3DODI/html/Wang_FCOS3D_Fully_Convolutional_One-Stage_Monocular_3D_Object_Detection_ICCVW_2021_paper.html), ICCVW 2021.** [**`[Code]`**](https://github.com/open-mmlab/mmdetection3d)

**LiDAR-Camera Fusion BEV Detection**

- [ ] **[BEVDistill](https://arxiv.org/abs/2211.09386), ICLR 2023.** [**`[Code]`**](https://github.com/zehuichen123/BEVDistill)
- [x] **[BEVFusion](https://arxiv.org/abs/2205.13542), ICRA 2023.** [**`[Code]`**](https://github.com/mit-han-lab/bevfusion)
- [ ] **[BEVFusion](https://arxiv.org/abs/2205.13790), NeurIPS 2022.** [**`[Code]`**](https://github.com/ADLab-AutoDrive/BEVFusion)
- [x] **[TransFusion](https://openaccess.thecvf.com/content/CVPR2022/papers/Bai_TransFusion_Robust_LiDAR-Camera_Fusion_for_3D_Object_Detection_With_Transformers_CVPR_2022_paper.pdf), CVPR 2022.** [**`[Code]`**](https://github.com/XuyangBai/TransFusion)
- [x] **[AutoAlignV2](https://arxiv.org/abs/2207.10316), ECCV 2022.** [**`[Code]`**](https://github.com/zehuichen123/AutoAlignV2)

**Camera-Only BEV Map Segmentation**

- [ ] **[LaRa](https://arxiv.org/abs/2206.13294), CoRL 2022.** [**`[Code]`**](https://github.com/valeoai/LaRa)
- [x] **[CVT](https://arxiv.org/abs/2205.02833), CVPR 2022.** [**`[Code]`**](https://github.com/bradyz/cross_view_transformers)

**Multi-Camera Depth Estimation**

- [x] **[SurroundDepth](https://arxiv.org/abs/2204.03636), CoRL 2022.** [**`[Code]`**](https://github.com/weiyithu/SurroundDepth)

**Multi-Camera Semantic Occupancy Prediction**

- [x] **[SurroundOcc](), arXiv 2023.** [**`[Code]`**](https://github.com/weiyithu/SurroundOcc)
- [x] **[TPVFormer](https://arxiv.org/abs/2302.07817), CVPR 2023.** [**`[Code]`**](https://github.com/wzzheng/TPVFormer)

Robustness Benchmark

:triangular_ruler: Metrics: The nuScenes Detection Score (NDS) is consistently used as the primary indicator of model performance in our benchmark. The following two metrics are adopted to compare models' robustness: the mean Corruption Error (mCE, lower is better) and the mean Resilience Rate (mRR, higher is better).
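
As a reference, here is a minimal formulation of the two metrics, assuming $N$ corruption types evaluated at three severity levels each, with the :star:-marked model as the mCE baseline (this sketch follows common corruption-benchmark practice and is consistent with the tables below):

$$
\mathrm{mCE} = \frac{1}{N}\sum_{i=1}^{N} \frac{\sum_{l=1}^{3}\bigl(1-\mathrm{NDS}_{i,l}\bigr)}{\sum_{l=1}^{3}\bigl(1-\mathrm{NDS}_{i,l}^{\text{baseline}}\bigr)} \times 100\%, \qquad
\mathrm{mRR} = \frac{1}{N}\sum_{i=1}^{N} \frac{\sum_{l=1}^{3}\mathrm{NDS}_{i,l}}{3 \times \mathrm{NDS}_{\text{clean}}} \times 100\%.
$$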

:gear: Notation: The symbol :star: denotes the baseline model adopted in the mCE calculation. For more detailed experimental results, please refer to RESULTS.md.

BEV Detection

| Model | mCE (%) $\downarrow$ | mRR (%) $\uparrow$ | Clean | Cam Crash | Frame Lost | Color Quant | Motion Blur | Bright | Low Light | Fog | Snow |
| :- | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: |
| DETR3D :star: | 100.00 | 70.77 | 0.4224 | 0.2859 | 0.2604 | 0.3177 | 0.2661 | 0.4002 | 0.2786 | 0.3912 | 0.1913 |
| DETR3D<sub>CBGS</sub> | 99.21 | 70.02 | 0.4341 | 0.2991 | 0.2685 | 0.3235 | 0.2542 | 0.4154 | 0.2766 | 0.4020 | 0.1925 |
| BEVFormer<sub>Small</sub> | 101.23 | 59.07 | 0.4787 | 0.2771 | 0.2459 | 0.3275 | 0.2570 | 0.3741 | 0.2413 | 0.3583 | 0.1809 |
| BEVFormer<sub>Base</sub> | 97.97 | 60.40 | 0.5174 | 0.3154 | 0.3017 | 0.3509 | 0.2695 | 0.4184 | 0.2515 | 0.4069 | 0.1857 |
| PETR<sub>R50-p4</sub> | 111.01 | 61.26 | 0.3665 | 0.2320 | 0.2166 | 0.2472 | 0.2299 | 0.2841 | 0.1571 | 0.2876 | 0.1417 |
| PETR<sub>VoV-p4</sub> | 100.69 | 65.03 | 0.4550 | 0.2924 | 0.2792 | 0.2968 | 0.2490 | 0.3858 | 0.2305 | 0.3703 | 0.2632 |
| ORA3D | 99.17 | 68.63 | 0.4436 | 0.3055 | 0.2750 | 0.3360 | 0.2647 | 0.4075 | 0.2613 | 0.3959 | 0.1898 |
| BEVDet<sub>R50</sub> | 115.12 | 51.83 | 0.3770 | 0.2486 | 0.1924 | 0.2408 | 0.2061 | 0.2565 | 0.1102 | 0.2461 | 0.0625 |
| BEVDet<sub>R101</sub> | 113.68 | 53.12 | 0.3877 | 0.2622 | 0.2065 | 0.2546 | 0.2265 | 0.2554 | 0.1118 | 0.2495 | 0.0810 |
| BEVDet<sub>R101-pt</sub> | 112.80 | 56.35 | 0.3780 | 0.2442 | 0.1962 | 0.3041 | 0.2590 | 0.2599 | 0.1398 | 0.2073 | 0.0939 |
| BEVDet<sub>SwinT</sub> | 116.48 | 46.26 | 0.4037 | 0.2609 | 0.2115 | 0.2278 | 0.2128 | 0.2191 | 0.0490 | 0.2450 | 0.0680 |
| BEVDepth<sub>R50</sub> | 110.02 | 56.82 | 0.4058 | 0.2638 | 0.2141 | 0.2751 | 0.2513 | 0.2879 | 0.1757 | 0.2903 | 0.0863 |
| BEVerse<sub>SwinT</sub> | 110.67 | 48.60 | 0.4665 | 0.3181 | 0.3037 | 0.2600 | 0.2647 | 0.2656 | 0.0593 | 0.2781 | 0.0644 |
| BEVerse<sub>SwinS</sub> | 117.82 | 49.57 | 0.4951 | 0.3364 | 0.2485 | 0.2807 | 0.2632 | 0.3394 | 0.1118 | 0.2849 | 0.0985 |
| PolarFormer<sub>R101</sub> | 96.06 | 70.88 | 0.4602 | 0.3133 | 0.2808 | 0.3509 | 0.3221 | 0.4304 | 0.2554 | 0.4262 | 0.2304 |
| PolarFormer<sub>VoV</sub> | 98.75 | 67.51 | 0.4558 | 0.3135 | 0.2811 | 0.3076 | 0.2344 | 0.4280 | 0.2441 | 0.4061 | 0.2468 |
| SRCN3D<sub>R101</sub> | 99.67 | 70.23 | 0.4286 | 0.2947 | 0.2681 | 0.3318 | 0.2609 | 0.4074 | 0.2590 | 0.3940 | 0.1920 |
| SRCN3D<sub>VoV</sub> | 102.04 | 67.95 | 0.4205 | 0.2875 | 0.2579 | 0.2827 | 0.2143 | 0.3886 | 0.2274 | 0.3774 | 0.2499 |
| Sparse4D<sub>R101</sub> | 100.01 | 55.04 | 0.5438 | 0.2873 | 0.2611 | 0.3310 | 0.2514 | 0.3984 | 0.2510 | 0.3884 | 0.2259 |
| SOLOFusion<sub>short</sub> | 108.68 | 61.45 | 0.3907 | 0.2541 | 0.2195 | 0.2804 | 0.2603 | 0.2966 | 0.2033 | 0.2998 | 0.1066 |
| SOLOFusion<sub>long</sub> | 97.99 | 64.42 | 0.4850 | 0.3159 | 0.2490 | 0.3598 | 0.3460 | 0.4002 | 0.2814 | 0.3991 | 0.1480 |
| SOLOFusion<sub>fusion</sub> | 92.86 | 64.53 | 0.5381 | 0.3806 | 0.3464 | 0.4058 | 0.3642 | 0.4329 | 0.2626 | 0.4480 | 0.1376 |
| FCOS3D<sub>finetune</sub> | 107.82 | 62.09 | 0.3949 | 0.2849 | 0.2479 | 0.2574 | 0.2570 | 0.3218 | 0.1468 | 0.3321 | 0.1136 |
| BEVFusion<sub>Cam</sub> | 109.02 | 57.81 | 0.4121 | 0.2777 | 0.2255 | 0.2763 | 0.2788 | 0.2902 | 0.1076 | 0.3041 | 0.1461 |
| BEVFusion<sub>LiDAR</sub> | - | - | 0.6928 | - | - | - | - | - | - | - | - |
| BEVFusion<sub>C+L</sub> | 43.80 | 97.41 | 0.7138 | 0.6963 | 0.6931 | 0.7044 | 0.6977 | 0.7018 | 0.6787 | - | - |
| TransFusion | - | - | 0.6887 | 0.6843 | 0.6447 | 0.6819 | 0.6749 | 0.6843 | 0.6663 | - | - |
| AutoAlignV2 | - | - | 0.6139 | 0.5849 | 0.5832 | 0.6006 | 0.5901 | 0.6076 | 0.5770 | - | - |
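
As a quick sanity check of the mRR definition above, averaging the eight per-corruption NDS scores in the DETR3D row and dividing by its clean NDS reproduces the reported 70.77% (values taken verbatim from the table):

```python
# mRR = mean(NDS under each corruption) / NDS(clean) * 100, DETR3D row.
clean = 0.4224
corrupted = [0.2859, 0.2604, 0.3177, 0.2661, 0.4002, 0.2786, 0.3912, 0.1913]
mrr = sum(corrupted) / len(corrupted) / clean * 100
print(f"mRR = {mrr:.2f}%")  # -> mRR = 70.77%
```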

Multi-Camera Depth Estimation

| Model | Metric | Clean | Cam Crash | Frame Lost | Color Quant | Motion Blur | Bright | Low Light | Fog | Snow |
| :- | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: |
| SurroundDepth | Abs Rel $\downarrow$ | 0.280 | 0.485 | 0.497 | 0.334 | 0.338 | 0.339 | 0.354 | 0.320 | 0.423 |
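
For reference, Abs Rel is the standard absolute relative depth error (lower is better), computed over the set $D$ of valid ground-truth depths $d^{*}$ with predictions $\hat{d}$:

$$
\text{Abs Rel} = \frac{1}{|D|}\sum_{d^{*} \in D} \frac{\lvert \hat{d} - d^{*} \rvert}{d^{*}}.
$$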

Multi-Camera Semantic Occupancy Prediction

| Model | Metric | Clean | Cam Crash | Frame Lost | Color Quant | Motion Blur | Bright | Low Light | Fog | Snow |
| :- | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: |
| TPVFormer | mIoU<sub>vox</sub> $\uparrow$ | 52.06 | 27.39 | 22.85 | 38.16 | 38.64 | 49.00 | 37.38 | 46.69 | 19.39 |
| SurroundOcc | SC mIoU $\uparrow$ | 20.30 | 11.60 | 10.00 | 14.03 | 12.41 | 19.18 | 12.15 | 18.42 | 7.39 |
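
Both metrics are mean intersection-over-union scores averaged over the $C$ semantic classes (higher is better); in standard form, with per-class true positives, false positives, and false negatives:

$$
\text{mIoU} = \frac{1}{C}\sum_{c=1}^{C} \frac{\mathrm{TP}_c}{\mathrm{TP}_c + \mathrm{FP}_c + \mathrm{FN}_c}.
$$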

BEV Model Calibration

| Model | Pretrain | Temporal | Depth | CBGS | Backbone | Encoder<sub>BEV</sub> | Input Size | mCE (%) | mRR (%) | NDS |
| :- | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: |
| DETR3D | | | | | ResNet | Attention | 1600×900 | 100.00 | 70.77 | 0.4224 |
| DETR3D<sub>CBGS</sub> | | | | | ResNet | Attention | 1600×900 | 99.21 | 70.02 | 0.4341 |
| BEVFormer<sub>Small</sub> | | | | | ResNet | Attention | 1280×720 | 101.23 | 59.07 | 0.4787 |
| BEVFormer<sub>Base</sub> | | | | | ResNet | Attention | 1600×900 | 97.97 | 60.40 | 0.5174 |
| PETR<sub>R50-p4</sub> | | | | | ResNet | Attention | 1408×512 | 111.01 | 61.26 | 0.3665 |
| PETR<sub>VoV-p4</sub> | | | | | VoVNetV2 | Attention | 1600×900 | 100.69 | 65.03 | 0.4550 |
| ORA3D | | | | | ResNet | Attention | 1600×900 | 99.17 | 68.63 | 0.4436 |
| PolarFormer<sub>R101</sub> | | | | | ResNet | Attention | 1600×900 | 96.06 | 70.88 | 0.4602 |
| PolarFormer<sub>VoV</sub> | | | | | VoVNetV2 | Attention | 1600×900 | 98.75 | 67.51 | 0.4558 |
| SRCN3D<sub>R101</sub> | | | | | ResNet | CNN+Attn. | 1600×900 | 99.67 | 70.23 | 0.4286 |
| SRCN3D<sub>VoV</sub> | | | | | VoVNetV2 | CNN+Attn. | 1600×900 | 102.04 | 67.95 | 0.4205 |
| Sparse4D<sub>R101</sub> | | | | | ResNet | CNN+Attn. | 1600×900 | 100.01 | 55.04 | 0.5438 |
| BEVDet<sub>R50</sub> | | | | | ResNet | CNN | 704×256 | 115.12 | 51.83 | 0.3770 |
| BEVDet<sub>R101</sub> | | | | | ResNet | CNN | 704×256 | 113.68 | 53.12 | 0.3877 |
| BEVDet<sub>R101-pt</sub> | | | | | ResNet | CNN | 704×256 | 112.80 | 56.35 | 0.3780 |
| BEVDet<sub>SwinT</sub> | | | | | Swin | CNN | 704×256 | 116.48 | 46.26 | 0.4037 |
| BEVDepth<sub>R50</sub> | | | | | ResNet | CNN | 704×256 | 110.02 | 56.82 | 0.4058 |
| BEVerse<sub>SwinT</sub> | | | | | Swin | CNN | 704×256 | 137.25 | 28.24 | 0.1603 |
| BEVerse<sub>SwinT</sub> | | | | | Swin | CNN | 704×256 | 110.67 | 48.60 | 0.4665 |
| BEVerse<sub>SwinS</sub> | | | | | Swin | CNN | 1408×512 | 132.13 | 29.54 | 0.2682 |
| BEVerse<sub>SwinS</sub> | | | | | Swin | CNN | 1408×512 | 117.82 | 49.57 | 0.4951 |
| SOLOFusion<sub>short</sub> | | | | | ResNet | CNN | 704×256 | 108.68 | 61.45 | 0.3907 |
| SOLOFusion<sub>long</sub> | | | | | ResNet | CNN | 704×256 | 97.99 | 64.42 | 0.4850 |
| SOLOFusion<sub>fusion</sub> | | | | | ResNet | CNN | 704×256 | 92.86 | 64.53 | 0.5381 |

Note: Pretrain denotes models initialized from a FCOS3D checkpoint. Temporal indicates whether temporal information is used. Depth denotes models with an explicit depth-estimation branch. CBGS indicates models trained with the class-balanced group-sampling strategy.

Create Corruption Set

You can create your own "RoboBEV" corruption sets! Follow the instructions listed in CREATE.md.

TODO List

Citation

If you find this work helpful, please kindly consider citing the following:

@article{xie2024benchmarking,
    title = {Benchmarking and Improving Bird's Eye View Perception Robustness in Autonomous Driving},
    author = {Xie, Shaoyuan and Kong, Lingdong and Zhang, Wenwei and Ren, Jiawei and Pan, Liang and Chen, Kai and Liu, Ziwei},
    journal = {arXiv preprint arXiv:2405.17426}, 
    year = {2024}
}
@article{xie2023robobev,
    title = {RoboBEV: Towards Robust Bird's Eye View Perception under Corruptions},
    author = {Xie, Shaoyuan and Kong, Lingdong and Zhang, Wenwei and Ren, Jiawei and Pan, Liang and Chen, Kai and Liu, Ziwei},
    journal = {arXiv preprint arXiv:2304.06719}, 
    year = {2023}
}
@misc{xie2023robobev_codebase,
    title = {The RoboBEV Benchmark for Robust Bird's Eye View Detection under Common Corruption and Domain Shift},
    author = {Xie, Shaoyuan and Kong, Lingdong and Zhang, Wenwei and Ren, Jiawei and Pan, Liang and Chen, Kai and Liu, Ziwei},
    howpublished = {\url{https://github.com/Daniel-xsy/RoboBEV}},
    year = {2023}
}

License

Creative Commons License
This work is under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License, while certain operations in this codebase may be under other licenses. Please refer to LICENSE.md for a careful check if you intend to use our code for commercial purposes.

Acknowledgements

This work is developed based on the MMDetection3D codebase.


MMDetection3D is an open source object detection toolbox based on PyTorch, towards the next-generation platform for general 3D detection. It is a part of the OpenMMLab project developed by MMLab.

:heart: We thank Jiangmiao Pang and Tai Wang for their insightful discussions and feedback. We thank the OpenDataLab platform for hosting our datasets.