Haiyang-W / UniTR

[ICCV2023] Official Implementation of "UniTR: A Unified and Efficient Multi-Modal Transformer for Bird's-Eye-View Representation"
https://arxiv.org/abs/2308.07732
Apache License 2.0

UniTR: The First Unified Multi-modal Transformer Backbone for 3D Perception

This repo is the official implementation of the ICCV 2023 paper UniTR: A Unified and Efficient Multi-Modal Transformer for Bird's-Eye-View Representation, as well as its follow-ups. UniTR achieves state-of-the-art performance on the nuScenes dataset with a truly unified, weight-sharing multi-modal (e.g., cameras and LiDARs) backbone. UniTR is built upon the codebase of DSVT. We have made every effort to ensure that the codebase is clean, concise, easily readable, state-of-the-art, and reliant on only minimal dependencies.

UniTR: A Unified and Efficient Multi-Modal Transformer for Bird's-Eye-View Representation

Haiyang Wang*, Hao Tang*, Shaoshuai Shi $^\dagger$, Aoxue Li, Zhenguo Li, Bernt Schiele, Liwei Wang $^\dagger$

Contact: Haiyang Wang (wanghaiyang6@stu.pku.edu.cn), Hao Tang (tanghao@stu.pku.edu.cn), Shaoshuai Shi (shaoshuaics@gmail.com)

🚀 Gratitude to Tang Hao for extensive code refactoring and noteworthy contributions to open-source initiatives. His invaluable efforts were pivotal in ensuring the seamless completion of UniTR.

🔥 👀 Honestly, the partition in UniTR is slow and takes about 40% of the total time, but better strategies or some engineering effort could reduce this cost substantially, possibly to near zero, so there is still huge room for speed optimization. We are not HPC experts, but if anyone in the industry wants to improve this, we believe it could be halved. Importantly, this part doesn't scale with model size, making it friendly for larger models.

📘 I am going to share my understanding of, and future plans for, a general 3D perception foundation model without reservation. Please refer to 🔥 Potential Research 🔥. If you find it useful or inspiring for your research, feel free to join me in building this blueprint.

Interpretive Articles: [CVer] [自动驾驶之心] [ReadPaper] [知乎] [CSDN] [TechBeat (将门创投)]

Introduction

Jointly processing information from multiple sensors is crucial to achieving accurate and robust perception for reliable autonomous driving systems. However, current 3D perception research follows a modality-specific paradigm, leading to additional computation overheads and inefficient collaboration between different sensor data.

In this paper, we present an efficient multi-modal backbone for outdoor 3D perception, which processes a variety of modalities with unified modeling and shared parameters. It is a fundamentally task-agnostic backbone that naturally supports different 3D perception tasks. It sets a new state of the art on the nuScenes benchmark, improving 3D object detection by +1.1 NDS and BEV map segmentation by +12.0 mIoU, with lower inference latency.

Main results

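3D Object Detection (on nuScenes validation)

| Model | NDS | mAP | mATE | mASE | mAOE | mAVE | mAAE | ckpt | Log |
|---------|---------|--------|---------|---------|--------|---------|--------|--------|--------|
| UniTR | 73.0 | 70.1 | 26.3 | 24.7 | 26.8 | 24.6 | 17.9 | ckpt | Log |
| UniTR+LSS | 73.3 | 70.5 | 26.0 | 24.4 | 26.8 | 24.8 | 18.7 | ckpt | Log |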

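3D Object Detection (on nuScenes test)

| Model | NDS | mAP | mATE | mASE | mAOE | mAVE | mAAE |
|---------|---------|--------|---------|---------|--------|---------|--------|
| UniTR | 74.1 | 70.5 | 24.4 | 23.3 | 25.7 | 24.1 | 13.0 |
| UniTR+LSS | 74.5 | 70.9 | 24.1 | 22.9 | 25.6 | 24.0 | 13.1 |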
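
For readers unfamiliar with the NDS column, the nuScenes detection score combines mAP with the five true-positive error metrics. The sketch below follows the nuScenes benchmark definition, NDS = (5 * mAP + sum(1 - min(1, err))) / 10, and assumes the table values are the raw metrics multiplied by 100. The function name `nds_from_row` is ours, for illustration only; it is not part of this repo or of nuscenes-devkit.

```python
def nds_from_row(m_ap, tp_errors):
    """Recompute NDS from one table row.

    All inputs are assumed to be on the table's 0-100 scale
    (raw nuScenes metrics multiplied by 100).
    """
    # Each true-positive error is clipped at 100 and turned into a score.
    tp_scores = [100.0 - min(100.0, err) for err in tp_errors]
    # mAP carries weight 5, each of the five TP scores carries weight 1.
    return (5.0 * m_ap + sum(tp_scores)) / 10.0


# UniTR row of the validation table: mAP, then mATE, mASE, mAOE, mAVE, mAAE.
unitr_val = nds_from_row(70.1, [26.3, 24.7, 26.8, 24.6, 17.9])
print(f"{unitr_val:.1f}")  # 73.0
```

Because the published columns are rounded, values recomputed this way for other rows can differ from the NDS column by about 0.1.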

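BEV Map Segmentation (on nuScenes validation)

| Model | mIoU | Drivable | Ped.Cross. | Walkway | StopLine | Carpark | Divider | ckpt | Log |
|---------|---------|--------|---------|---------|--------|---------|--------|--------|--------|
| UniTR | 73.2 | 90.4 | 73.1 | 78.2 | 66.6 | 67.3 | 63.8 | ckpt | Log |
| UniTR+LSS | 74.7 | 90.7 | 74.0 | 79.3 | 68.2 | 72.9 | 64.2 | ckpt | Log |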

What's new here?

🔥 Beats previous SOTAs of outdoor multi-modal 3D Object Detection and BEV Segmentation

Our approach has achieved the best performance on multiple tasks (e.g., 3D Object Detection and BEV Map Segmentation), and it is highly versatile, requiring only the replacement of the backbone.

3D Object Detection
BEV Map Segmentation

🔥 Weight-Sharing among all modalities

We introduce a modality-agnostic transformer encoder to handle these view-discrepant sensor data for parallel modal-wise representation learning and automatic cross-modal interaction without additional fusion steps.

🔥 Prerequisite for 3D vision foundation models

A weight-shared unified multimodal encoder is a prerequisite for foundation models, especially in the context of 3D perception, unifying information from both images and LiDAR data. This is the first truly multimodal fusion backbone, seamlessly connecting to any 3D detection head.

Quick Start

Installation

conda create -n unitr python=3.8
# Install torch, we only test it in pytorch 1.10
pip install torch==1.10.1+cu113 torchvision==0.11.2+cu113 -f https://download.pytorch.org/whl/torch_stable.html

git clone https://github.com/Haiyang-W/UniTR
cd UniTR

# Install extra dependency
pip install -r requirements.txt

# Install nuscenes-devkit
pip install nuscenes-devkit==1.0.5

# Develop
python setup.py develop

Dataset Preparation

OpenPCDet
├── data
│   ├── nuscenes
│   │   │── v1.0-trainval (or v1.0-mini if you use mini)
│   │   │   │── samples
│   │   │   │── sweeps
│   │   │   │── maps
│   │   │   │── v1.0-trainval
├── pcdet
├── tools

# Create dataset info file, lidar and image gt database
python -m pcdet.datasets.nuscenes.nuscenes_dataset --func create_nuscenes_infos \
    --cfg_file tools/cfgs/dataset_configs/nuscenes_dataset.yaml \
    --version v1.0-trainval \
    --with_cam \
    --with_cam_gt \
    # --share_memory # optional: use shared memory for lidar and image gt sampling (about 24G+143G or 12G+72G)
# Shared memory greatly improves training speed, but needs about 150G or 75G of extra cache memory.
# NOTE: all the experiments used shared memory. Shared memory does not affect performance.
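
Before running the info-generation command, it can help to confirm that the folders from the directory tree above are in place. The following is a minimal sketch, not part of the repo: the helper name `missing_nuscenes_dirs` is hypothetical, and it only checks the folder names shown in the tree.

```python
from pathlib import Path

# Sub-folders listed in the directory tree above.
EXPECTED_SUBDIRS = ("samples", "sweeps", "maps")


def missing_nuscenes_dirs(repo_root, version="v1.0-trainval"):
    """Return the expected nuScenes folders that are absent under repo_root."""
    base = Path(repo_root) / "data" / "nuscenes" / version
    expected = [base / name for name in EXPECTED_SUBDIRS]
    # The metadata folder carries the same name as the version.
    expected.append(base / version)
    return [str(p) for p in expected if not p.is_dir()]


if __name__ == "__main__":
    # Run from the repository root; prints nothing when the layout is complete.
    for path in missing_nuscenes_dirs("."):
        print(f"missing: {path}")
```

Pass `version="v1.0-mini"` if you use the mini split.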

Training

Please download the pretrained checkpoint unitr_pretrain.pth and copy the file to the root folder, e.g., UniTR/unitr_pretrain.pth. This file contains the weights from pretraining DSVT on the ImageNet and nuImages datasets.

3D object detection:

# multi-gpu training
## normal
cd tools
bash scripts/dist_train.sh 8 --cfg_file ./cfgs/nuscenes_models/unitr.yaml --sync_bn --pretrained_model ../unitr_pretrain.pth --logger_iter_interval 1000

## add lss
cd tools
bash scripts/dist_train.sh 8 --cfg_file ./cfgs/nuscenes_models/unitr+lss.yaml --sync_bn --pretrained_model ../unitr_pretrain.pth --logger_iter_interval 1000

BEV Map Segmentation:

# multi-gpu training
# note that we don't use image pretrain in BEV Map Segmentation
## normal
cd tools
bash scripts/dist_train.sh 8 --cfg_file ./cfgs/nuscenes_models/unitr_map.yaml --sync_bn --eval_map --logger_iter_interval 1000

## add lss
cd tools
bash scripts/dist_train.sh 8 --cfg_file ./cfgs/nuscenes_models/unitr_map+lss.yaml --sync_bn --eval_map --logger_iter_interval 1000

Testing

3D object detection:

# multi-gpu testing
## normal
cd tools
bash scripts/dist_test.sh 8 --cfg_file ./cfgs/nuscenes_models/unitr.yaml --ckpt <CHECKPOINT_FILE>

## add LSS
cd tools
bash scripts/dist_test.sh 8 --cfg_file ./cfgs/nuscenes_models/unitr+lss.yaml --ckpt <CHECKPOINT_FILE>

BEV Map Segmentation

# multi-gpu testing
## normal
cd tools
bash scripts/dist_test.sh 8 --cfg_file ./cfgs/nuscenes_models/unitr_map.yaml --ckpt <CHECKPOINT_FILE> --eval_map

## add LSS
cd tools
bash scripts/dist_test.sh 8 --cfg_file ./cfgs/nuscenes_models/unitr_map+lss.yaml --ckpt <CHECKPOINT_FILE> --eval_map
# NOTE: evaluation results will not be logged in *.log; they are only printed in the terminal

Cache Testing

## add LSS
### cache the mapping computation of multi-modal backbone
cd tools
bash scripts/dist_test.sh 8 --cfg_file ./cfgs/nuscenes_models/unitr+lss_cache.yaml --ckpt <CHECKPOINT_FILE> --batch_size 8

## add LSS
### cache the mapping computation of multi-modal backbone and LSS
cd tools
bash scripts/dist_test.sh 8 --cfg_file ./cfgs/nuscenes_models/unitr+lss_cache_plus.yaml --ckpt <CHECKPOINT_FILE> --batch_size 8

#### Performance of cache testing on NuScenes validation (some variations in camera parameters)
|  Model  | NDS | mAP |mATE | mASE | mAOE | mAVE| mAAE |
|---------|---------|--------|---------|---------|--------|---------|--------|
|  [UniTR (Cache Backbone)](https://github.com/Haiyang-W/UniTR/blob/main/tools/cfgs/nuscenes_models/unitr_cache.yaml) | 72.6(-0.4) | 69.4(-0.7) | 26.9 | 24.8 | 26.3 | 24.6 | 18.2 |
|  [UniTR+LSS (Cache Backbone)](https://github.com/Haiyang-W/UniTR/blob/main/tools/cfgs/nuscenes_models/unitr%2Blss_cache.yaml) | 73.1(-0.2) | 70.2(-0.3) | 25.8 | 24.4 | 26.0 | 25.3 | 18.2 | 
|  [UniTR+LSS (Cache Backbone and LSS)](https://github.com/Haiyang-W/UniTR/blob/main/tools/cfgs/nuscenes_models/unitr%2Blss_cache_plus.yaml) | 72.6(-0.7) | 69.3(-1.2) | 26.7 | 24.3 | 25.9 | 25.3 | 18.2 | 
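
The idea behind cache mode is to reuse the camera-lidar mapping instead of recomputing it for every sample. The sketch below illustrates that idea only, under the assumption that the mapping depends solely on the calibration. `MappingCache` and `compute_mapping` are hypothetical names, and this is not the repo's implementation.

```python
class MappingCache:
    """Memoize an expensive mapping keyed by the calibration it depends on."""

    def __init__(self, compute_fn):
        self._compute_fn = compute_fn  # hypothetical expensive mapping function
        self._store = {}
        self.misses = 0                # number of times the mapping was recomputed

    def get(self, calibration):
        # Flatten the nested calibration matrix into a hashable key.
        key = tuple(tuple(row) for row in calibration)
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._compute_fn(calibration)
        return self._store[key]


def compute_mapping(calibration):
    # Placeholder for the real camera-lidar mapping computation.
    return [sum(row) for row in calibration]


cache = MappingCache(compute_mapping)
fixed = [[1.0, 0.0], [0.0, 1.0]]
for _ in range(3):
    cache.get(fixed)                 # same calibration: computed once, reused twice
cache.get([[1.0, 0.1], [0.0, 1.0]])  # changed calibration: recomputed
print(cache.misses)  # 2
```

An exact-match key like this never returns a stale mapping, so it only saves time when calibrations repeat. The small drops in the table above are presumably the cost of reusing a mapping even though nuScenes camera parameters vary slightly between samples.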

## Potential Research
* **Infrastructure of 3D Vision Foundation Model.**
  An efficient network design is crucial for large models: with a reliable model structure, the development of large models can be advanced. An open question is how to make a general multimodal backbone more efficient and easier to deploy. Honestly, the partition in UniTR is slow and takes about 40% of the total time, but better `partition strategies` or `some engineering efforts` could reduce this cost substantially, possibly to near zero, so there is still huge room for speed optimization. We are not HPC experts, but if anyone in the industry wants to improve this, we believe it could be halved. Importantly, this part doesn't scale with model size, making it friendly for larger models. 
* **Multi-Modal Self-supervised Learning based on Image-Lidar pair and UniTR.**
  Please refer to the following figure. The images and point clouds both describe the same 3D scene; they complement each other in terms of highly informative correspondence. This allows for the unsupervised learning of more generic scene representation with shared parameters.
* **Single-Modal Pretraining.** Our model is almost the same as ViT (except for some position embedding strategies). If we adjust the position embedding appropriately, DSVT and UniTR can directly load the pretrained parameters of ViT. This is beneficial for better integration with the 2D community.
* **Unified Modeling of 3D Vision.**
  Please refer to the following figure. 
<div align="center">
  <img src="https://github.com/Haiyang-W/UniTR/raw/main/assets/Figure6.png" width="800"/>
</div>
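
To put the partition numbers from the first item in perspective, Amdahl's law gives the end-to-end speedup when only one stage gets faster. The sketch below is a back-of-the-envelope calculation, not a measurement. It takes the quoted 40% share at face value and reads "halved" as halving the partition time, which is our assumption.

```python
def overall_speedup(partition_fraction, partition_speedup):
    """Amdahl's law: end-to-end speedup when only the partition step gets faster."""
    remaining = (1.0 - partition_fraction) + partition_fraction / partition_speedup
    return 1.0 / remaining


# Partition takes about 40% of total time (figure quoted above).
print(f"{overall_speedup(0.4, 2.0):.2f}x")           # 1.25x if partition time is halved
print(f"{overall_speedup(0.4, float('inf')):.2f}x")  # 1.67x if it is removed entirely
```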

## Possible Issues
* If gradients become NaN during fp16 training, note that fp16 training is not supported.
* If you can't find a solution, search the open and closed issues on our GitHub issues page [here](https://github.com/Haiyang-W/UniTR/issues).
* We enable the torch checkpoint option [here](https://github.com/Haiyang-W/UniTR/blob/3f75dc1a362fe8f325dabd2e878ac57df2ab7323/tools/cfgs/nuscenes_models/unitr.yaml#L125) by default in the training stage, which saves about 50% of CUDA memory.
* Samples in nuScenes have some variations in camera parameters, so during training every sample recalculates the camera-lidar mapping, which significantly slows down training (~40%). If the extrinsic parameters in your dataset are consistent, I recommend caching this computation during training.
* If you still have no luck, open a new issue on our GitHub. Our turnaround is usually a couple of days.

## Citation
Please consider citing our work as follows if it is helpful.

    @inproceedings{wang2023unitr,
        title={UniTR: A Unified and Efficient Multi-Modal Transformer for Bird's-Eye-View Representation},
        author={Haiyang Wang and Hao Tang and Shaoshuai Shi and Aoxue Li and Zhenguo Li and Bernt Schiele and Liwei Wang},
        booktitle={ICCV},
        year={2023}
    }



## Acknowledgments
UniTR uses code from a few open source repositories. Without the efforts of these folks (and their willingness to release their implementations), UniTR would not be possible. We thank these authors for their efforts!
* Shaoshuai Shi: [OpenPCDet](https://github.com/open-mmlab/OpenPCDet)
* Chen Shi: [DSVT](https://github.com/Haiyang-W/DSVT)
* Zhijian Liu: [BevFusion](https://github.com/mit-han-lab/bevfusion)