MzeroMiko / VMamba

VMamba: Visual State Space Models. The code is based on Mamba.
MIT License

VMamba

VMamba: Visual State Space Model

[Yue Liu](https://github.com/MzeroMiko)1, [Yunjie Tian](https://sunsmarterjie.github.io/)1, [Yuzhong Zhao](https://scholar.google.com.hk/citations?user=tStQNm4AAAAJ&hl=zh-CN&oi=ao)1, [Hongtian Yu](https://github.com/yuhongtian17)1, [Lingxi Xie](https://scholar.google.com.hk/citations?user=EEMm7hwAAAAJ&hl=zh-CN&oi=ao)2, [Yaowei Wang](https://scholar.google.com.hk/citations?user=o_DllmIAAAAJ&hl=zh-CN&oi=ao)3, [Qixiang Ye](https://scholar.google.com.hk/citations?user=tjEfgsEAAAAJ&hl=zh-CN&oi=ao)1, [Yunfan Liu](https://scholar.google.com.hk/citations?user=YPL33G0AAAAJ&hl=zh-CN&oi=ao)1

1 University of Chinese Academy of Sciences, 2 HUAWEI Inc., 3 PengCheng Lab.

Paper: [arXiv 2401.10166](https://arxiv.org/abs/2401.10166)

:white_check_mark: Updates

For details, see detailed_updates.md.

Abstract

Designing computationally efficient network architectures persists as an ongoing necessity in computer vision. In this paper, we transplant Mamba, a state-space language model, into VMamba, a vision backbone that works in linear time complexity. At the core of VMamba lies a stack of Visual State-Space (VSS) blocks with the 2D Selective Scan (SS2D) module. By traversing along four scanning routes, SS2D helps bridge the gap between the ordered nature of 1D selective scan and the non-sequential structure of 2D vision data, which facilitates the gathering of contextual information from various sources and perspectives. Based on the VSS blocks, we develop a family of VMamba architectures and accelerate them through a succession of architectural and implementation enhancements. Extensive experiments showcase VMamba’s promising performance across diverse visual perception tasks, highlighting its advantages in input scaling efficiency compared to existing benchmark models.
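As a rough illustration of the four scanning routes, the sketch below unfolds a small 2D grid into four 1D sequences and merges them back. It assumes the routes are row-major, column-major, and the reverse of each; the abstract only states that there are four routes, so treat this ordering as an assumption. The function names `four_scan_routes` and `merge_routes` are hypothetical and are not taken from the VMamba code. The selective scan that would process each sequence between the two steps is omitted.

```python
from typing import List, Sequence


def four_scan_routes(grid: Sequence[Sequence[float]]) -> List[List[float]]:
    """Flatten a 2D grid into four 1D sequences (assumed routes, not the repo's API)."""
    rows = [list(r) for r in grid]
    # Route 1: left-to-right within a row, rows taken top-to-bottom.
    row_major = [v for r in rows for v in r]
    # Route 2: top-to-bottom within a column, columns taken left-to-right.
    col_major = [v for c in zip(*rows) for v in c]
    # Routes 3 and 4: the same two traversals in reverse order.
    return [row_major, col_major, row_major[::-1], col_major[::-1]]


def merge_routes(routes: List[List[float]], height: int, width: int) -> List[List[float]]:
    """Map each route back to its grid positions and sum the four contributions."""
    row_major, col_major, row_rev, col_rev = routes
    # Undo the reversal so that indices line up with the forward routes.
    row_back = row_rev[::-1]
    col_back = col_rev[::-1]
    out = [[0.0] * width for _ in range(height)]
    for i in range(height):
        for j in range(width):
            r = i * width + j   # position of (i, j) in a row-major sequence
            c = j * height + i  # position of (i, j) in a column-major sequence
            out[i][j] = row_major[r] + row_back[r] + col_major[c] + col_back[c]
    return out


if __name__ == "__main__":
    grid = [[1, 2, 3], [4, 5, 6]]
    routes = four_scan_routes(grid)
    for route in routes:
        print(route)
    # With no per-route processing in between, merging just sums four copies.
    print(merge_routes(routes, height=2, width=3))
```

In a real block each of the four sequences would be transformed by a 1D selective scan before merging, so every position would receive context from four different traversal directions.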

Overview

[Figures: architecture, arch, erf]

Main Results

:book: For details, see performance.md.

Classification on ImageNet-1K

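| name | pretrain | resolution | acc@1 | #params | FLOPs | TP. (throughput) | Train TP. (train throughput) | configs/logs/ckpts |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Swin-T | ImageNet-1K | 224x224 | 81.2 | 28M | 4.5G | 1244 | 987 | -- |
| Swin-S | ImageNet-1K | 224x224 | 83.2 | 50M | 8.7G | 718 | 642 | -- |
| Swin-B | ImageNet-1K | 224x224 | 83.5 | 88M | 15.4G | 458 | 496 | -- |
| VMamba-S[s2l15] | ImageNet-1K | 224x224 | 83.6 | 50M | 8.7G | 877 | 314 | config/log/ckpt |
| VMamba-B[s2l15] | ImageNet-1K | 224x224 | 83.9 | 89M | 15.4G | 646 | 247 | config/log/ckpt |
| VMamba-T[s1l8] | ImageNet-1K | 224x224 | 82.6 | 30M | 4.9G | 1686 | 571 | config/log/ckpt |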

Object Detection on COCO

| Backbone | #params | FLOPs | Detector | bboxAP | bboxAP50 | bboxAP75 | segmAP | segmAP50 | segmAP75 | configs/logs/ckpts |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Swin-T | 48M | 267G | MaskRCNN@1x | 42.7 | 65.2 | 46.8 | 39.3 | 62.2 | 42.2 | -- |
| Swin-S | 69M | 354G | MaskRCNN@1x | 44.8 | 66.6 | 48.9 | 40.9 | 63.4 | 44.2 | -- |
| Swin-B | 107M | 496G | MaskRCNN@1x | 46.9 | -- | -- | 42.3 | -- | -- | -- |
| VMamba-S[s2l15] | 70M | 384G | MaskRCNN@1x | 48.7 | 70.0 | 53.4 | 43.7 | 67.3 | 47.0 | config/log/ckpt |
| VMamba-B[s2l15] | 108M | 485G | MaskRCNN@1x | 49.2 | 71.4 | 54.0 | 44.1 | 68.3 | 47.7 | config/log/ckpt |
| VMamba-B[s2l15] | 108M | 485G | MaskRCNN@1x[bs8] | 49.2 | 70.9 | 53.9 | 43.9 | 67.7 | 47.6 | config/log/ckpt |
| VMamba-T[s1l8] | 50M | 271G | MaskRCNN@1x | 47.3 | 69.3 | 52.0 | 42.7 | 66.4 | 45.9 | config/log/ckpt |
| Swin-T | 48M | 267G | MaskRCNN@3x | 46.0 | 68.1 | 50.3 | 41.6 | 65.1 | 44.9 | -- |
| Swin-S | 69M | 354G | MaskRCNN@3x | 48.2 | 69.8 | 52.8 | 43.2 | 67.0 | 46.1 | -- |
| VMamba-S[s2l15] | 70M | 384G | MaskRCNN@3x | 49.9 | 70.9 | 54.7 | 44.2 | 68.2 | 47.7 | config/log/ckpt |
| VMamba-T[s1l8] | 50M | 271G | MaskRCNN@3x | 48.8 | 70.4 | 53.5 | 43.7 | 67.4 | 47.0 | config/log/ckpt |

Semantic Segmentation on ADE20K

| Backbone | Input | #params | FLOPs | Segmentor | mIoU(SS) | mIoU(MS) | configs/logs/logs(ms)/ckpts |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Swin-T | 512x512 | 60M | 945G | UperNet@160k | 44.4 | 45.8 | -- |
| Swin-S | 512x512 | 81M | 1039G | UperNet@160k | 47.6 | 49.5 | -- |
| Swin-B | 512x512 | 121M | 1188G | UperNet@160k | 48.1 | 49.7 | -- |
| VMamba-S[s2l15] | 512x512 | 82M | 1028G | UperNet@160k | 50.6 | 51.2 | config/log/log(ms)/ckpt |
| VMamba-B[s2l15] | 512x512 | 122M | 1170G | UperNet@160k | 51.0 | 51.6 | config/log/log(ms)/ckpt |
| VMamba-T[s1l8] | 512x512 | 62M | 949G | UperNet@160k | 47.9 | 48.8 | config/log/log(ms)/ckpt |

Getting Started

Installation

Step 1: Clone the VMamba repository:

To get started, first clone the VMamba repository and navigate to the project directory:

git clone https://github.com/MzeroMiko/VMamba.git
cd VMamba

Step 2: Environment Setup:

We recommend setting up a conda environment and installing dependencies via pip. We also recommend pytorch>=2.0 and cuda>=11.8, although lower versions of PyTorch and CUDA are also supported. Use the following commands to set up your environment:

Create and activate a new conda environment

conda create -n vmamba
conda activate vmamba

Install Dependencies

pip install -r requirements.txt
cd kernels/selective_scan && pip install .
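After installing, it can help to confirm that the environment matches the recommended versions. The following is a minimal sketch, not part of the repository: the helper names `parse_version`, `meets_minimum`, and `environment_report` are hypothetical, and the check only compares major and minor version numbers. It reads the installed `torch` version from package metadata, and only imports `torch` to report its CUDA build if the package is present.

```python
from importlib import metadata
from typing import List, Tuple


def parse_version(text: str) -> Tuple[int, int]:
    """Extract (major, minor) from strings such as '2.1.0+cu118' or '11.8'."""
    parts = text.split("+")[0].split(".")
    major = int(parts[0])
    # Fall back to 0 when the minor part is missing or not purely numeric.
    minor = int(parts[1]) if len(parts) > 1 and parts[1].isdigit() else 0
    return major, minor


def meets_minimum(found: str, minimum: str) -> bool:
    """Return True if `found` is at least `minimum` (major.minor comparison only)."""
    return parse_version(found) >= parse_version(minimum)


def environment_report() -> List[str]:
    """Describe the installed torch version against the recommended minimums."""
    try:
        torch_version = metadata.version("torch")
    except metadata.PackageNotFoundError:
        return ["torch is not installed in this environment"]
    status = "meets" if meets_minimum(torch_version, "2.0") else "is below"
    lines = [f"torch {torch_version} {status} the recommended >=2.0"]
    try:
        import torch  # imported lazily so the helpers above work without it
    except ImportError:
        return lines + ["torch metadata found, but the import failed"]
    cuda = torch.version.cuda  # None for CPU-only builds
    if cuda is None:
        lines.append("this torch build has no CUDA support")
    else:
        status = "meets" if meets_minimum(cuda, "11.8") else "is below"
        lines.append(f"CUDA {cuda} {status} the recommended >=11.8")
    return lines


if __name__ == "__main__":
    print("\n".join(environment_report()))
```

Because lower versions are also supported, a "below" result is informational rather than an error.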

Check Selective Scan (optional)

Dependencies for Detection and Segmentation (optional)

pip install mmengine==0.10.1 mmcv==2.1.0 opencv-python-headless ftfy regex
pip install mmdet==3.3.0 mmsegmentation==1.2.2 mmpretrain==1.2.0

Model Training and Inference

Classification

To train VMamba models for classification on ImageNet, use the following commands for different configurations:

python -m torch.distributed.launch --nnodes=1 --node_rank=0 --nproc_per_node=8 --master_addr="127.0.0.1" --master_port=29501 main.py --cfg </path/to/config> --batch-size 128 --data-path </path/of/dataset> --output /tmp

If you only want to test the performance (together with the parameter count and FLOPs):

python -m torch.distributed.launch --nnodes=1 --node_rank=0 --nproc_per_node=1 --master_addr="127.0.0.1" --master_port=29501 main.py --cfg </path/to/config> --batch-size 128 --data-path </path/of/dataset> --output /tmp --pretrained </path/of/checkpoint>

Please refer to the modelcard for more details.
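The training and evaluation commands above differ only in the number of processes and the optional `--pretrained` flag. As a convenience, the sketch below assembles either command as an argument list from Python. It uses only the flags shown above; the function name `build_launch_command` and the example paths are hypothetical placeholders.

```python
import shlex
from typing import List, Optional


def build_launch_command(
    cfg: str,
    data_path: str,
    nproc_per_node: int = 8,
    batch_size: int = 128,
    output: str = "/tmp",
    pretrained: Optional[str] = None,
    master_port: int = 29501,
) -> List[str]:
    """Assemble the single-node launch command as a list of arguments."""
    cmd = [
        "python", "-m", "torch.distributed.launch",
        "--nnodes=1", "--node_rank=0",
        f"--nproc_per_node={nproc_per_node}",
        "--master_addr=127.0.0.1",
        f"--master_port={master_port}",
        "main.py",
        "--cfg", cfg,
        "--batch-size", str(batch_size),
        "--data-path", data_path,
        "--output", output,
    ]
    # Passing a checkpoint corresponds to the evaluation-only command above.
    if pretrained is not None:
        cmd += ["--pretrained", pretrained]
    return cmd


if __name__ == "__main__":
    # Placeholder paths; replace them with a real config, dataset, and checkpoint.
    train = build_launch_command("path/to/config.yaml", "path/to/imagenet")
    test = build_launch_command(
        "path/to/config.yaml", "path/to/imagenet",
        nproc_per_node=1, pretrained="path/to/checkpoint.pth",
    )
    print(shlex.join(train))
    print(shlex.join(test))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the repository root, which avoids shell quoting problems with paths that contain spaces.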

Detection and Segmentation

To evaluate with mmdetection or mmsegmentation:

bash ./tools/dist_test.sh </path/to/config> </path/to/checkpoint> 1

Use --tta to get the mIoU(MS) in segmentation.

To train with mmdetection or mmsegmentation:

bash ./tools/dist_train.sh </path/to/config> 8

For more information about detection and segmentation tasks, please refer to the manual of mmdetection and mmsegmentation. Remember to use the appropriate backbone configurations in the configs directory.

Analysis Tools

VMamba includes tools for visualizing Mamba "attention" and the effective receptive field, and for analysing throughput and train throughput. Use the following commands to perform the analysis:

# Visualize Mamba "Attention"
CUDA_VISIBLE_DEVICES=0 python analyze/attnmap.py

# Analyze the effective receptive field
CUDA_VISIBLE_DEVICES=0 python analyze/erf.py

# Analyze the throughput and train throughput
CUDA_VISIBLE_DEVICES=0 python analyze/tp.py

We also include other analysis tools that we may use in this project. Thanks to all who have contributed to these tools.
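For readers unfamiliar with throughput measurement, the sketch below shows one common way to do it: run a few warm-up batches, time a fixed number of batches, and divide the number of processed items by the elapsed time. This is a generic illustration and not the logic of analyze/tp.py. The function name `measure_throughput` and its parameters are hypothetical, and reporting items per second is an assumption.

```python
import time
from typing import Callable


def measure_throughput(
    step: Callable[[], None],
    batch_size: int,
    warmup: int = 3,
    iters: int = 10,
) -> float:
    """Return items processed per second, where `step` handles one batch."""
    # Warm-up runs are excluded from timing (caches, lazy initialization).
    for _ in range(warmup):
        step()
    start = time.perf_counter()
    for _ in range(iters):
        step()
    # On a GPU, `step` should synchronize the device before returning,
    # otherwise asynchronous kernels would make the timing too optimistic.
    elapsed = time.perf_counter() - start
    return batch_size * iters / max(elapsed, 1e-12)


if __name__ == "__main__":
    # Stand-in workload; a real measurement would run a model forward pass.
    def dummy_step() -> None:
        sum(i * i for i in range(10_000))

    print(f"{measure_throughput(dummy_step, batch_size=128):.1f} items/s")
```

For train throughput, `step` would cover a full forward pass, backward pass, and optimizer update instead of the forward pass alone.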

Citation

@article{liu2024vmamba,
  title={VMamba: Visual State Space Model},
  author={Liu, Yue and Tian, Yunjie and Zhao, Yuzhong and Yu, Hongtian and Xie, Lingxi and Wang, Yaowei and Ye, Qixiang and Liu, Yunfan},
  journal={arXiv preprint arXiv:2401.10166},
  year={2024}
}

Acknowledgment

This project is based on Mamba (paper, code), Swin-Transformer (paper, code), ConvNeXt (paper, code), and OpenMMLab; analyze/get_erf.py is adopted from RepLKNet. Thanks for their excellent work.