MzeroMiko / VMamba

VMamba: Visual State Space Models; code is based on Mamba
MIT License

What's the core difference between the selective_scan package v1 and v2? Compared with the original mamba_ssm, is v2 faster, or is the difference something else? #96

Open 924973292 opened 7 months ago

MzeroMiko commented 7 months ago

Yes, you are right. We are still updating the code, and it will get faster.

:rocket: The History of Speed Up

Time is measured on 1xA100 with batch_size 128 for training; the config file is vssm1/vssm_tiny_224_0220.yaml. GPU memory is taken from the training log.

The experiments (arXiv 2401.10166) done before #20240119 used mamba-ssm + group-parallel.

The experiments done since #20240201 use sscore + fused cross scan + fused cross merge. We plan to use ssoflex + fused cross scan + fused cross merge + input16output32 in the future.
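
For readers unfamiliar with the terms, here is a minimal pure-Python sketch of what cross scan and cross merge compute. It assumes the four traversal orders used by VMamba (row-major, column-major, and the reverse of each). The function names `cross_scan` and `cross_merge` are illustrative only, and this is not the fused kernel from the repository, which (as far as the name suggests) performs the same data movement in a single operation on tensors.

```python
def cross_scan(grid):
    """Flatten an H x W grid into four 1-D sequences (illustrative only)."""
    h, w = len(grid), len(grid[0])
    row_major = [grid[i][j] for i in range(h) for j in range(w)]
    col_major = [grid[i][j] for j in range(w) for i in range(h)]
    # The last two sequences are the first two read backwards.
    return [row_major, col_major, row_major[::-1], col_major[::-1]]


def cross_merge(seqs, h, w):
    """Map the four sequences back onto the H x W grid and sum them."""
    row_major, col_major, row_rev, col_rev = seqs
    row_back = row_rev[::-1]  # undo the reversal
    col_back = col_rev[::-1]
    out = [[0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            out[i][j] = (row_major[i * w + j] + row_back[i * w + j]
                         + col_major[j * h + i] + col_back[j * h + i])
    return out


grid = [[1, 2, 3],
        [4, 5, 6]]
seqs = cross_scan(grid)
merged = cross_merge(seqs, 2, 3)
```

In this toy case nothing is applied to the sequences between scan and merge, so every cell of `merged` is simply four times the original value. In the real model a selective scan runs over each of the four sequences before they are merged.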

| Name | GPU memory | Time (s/iter) |
| --- | --- | --- |
| mamba-ssm + sequence scan | 25927M | 0.6585s |
| mamba-ssm + group parallel | 25672M | 0.4860s |
| mamba-ssm + float16 | 20439M | 0.4195s |
| mamba-ssm + fused cross scan | 25675M | 0.4820s |
| mamba-ssm + fused csm | 25596M | 0.4020s |
| sscore + fused csm | 24984M | 0.3930s |
| sscore + fused csm + forward nrow | 24984M | 0.4090s |
| sscore + fused csm + backward nrow | 24984M | 0.4490s |
| sscore + fused csm + forward nrow + backward nrow | 24984M | 0.4640s |
| ssoflex + fused csm | 24986M | 0.3940s |
| ssoflex + fused csm + i16o32 | 19842M | 0.3650s |
| ssoflex + csm in triton + i16o32 | 19888M | 0.3610s |
| ssoflex + csm in triton + i16o32 + v4 | 19500M | 0.2970s |

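
To put the figures above in relative terms, the short sketch below derives speedup, throughput, and memory savings from three of the listed rows. The helper names (`speedup`, `images_per_second`, `memory_saving`) are illustrative and not part of the repository. The throughput figure assumes that each iteration processes exactly one batch of 128 images, as stated for the timing setup.

```python
BATCH_SIZE = 128  # training batch size used for the timings

# (seconds per iteration, GPU memory in MB), copied from the list above
results = {
    "mamba-ssm + sequence scan": (0.6585, 25927),
    "mamba-ssm + group parallel": (0.4860, 25672),
    "ssoflex + csm in triton + i16o32 + v4": (0.2970, 19500),
}


def speedup(baseline, current):
    """How many times faster `current` is than `baseline` (by s/iter)."""
    return results[baseline][0] / results[current][0]


def images_per_second(name):
    """Training throughput, assuming one batch per iteration."""
    return BATCH_SIZE / results[name][0]


def memory_saving(baseline, current):
    """Fraction of GPU memory saved relative to `baseline`."""
    base_mem = results[baseline][1]
    return (base_mem - results[current][1]) / base_mem


first = "mamba-ssm + sequence scan"
latest = "ssoflex + csm in triton + i16o32 + v4"

print(f"speedup:       {speedup(first, latest):.2f}x")
print(f"throughput:    {images_per_second(latest):.0f} images/s")
print(f"memory saving: {memory_saving(first, latest):.1%}")
```

Under these assumptions, the latest configuration is roughly 2.2 times faster than the first one and uses about a quarter less GPU memory.
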
924973292 commented 7 months ago

Thanks for your detailed reply!