Yes, you are right. And we are still updating our code, which will get faster.

:rocket: The History of Speed-Up

Time is measured on 1x A100 with batch_size 128 during training; the config file is `vssm1/vssm_tiny_224_0220.yaml`. GPU memory is taken from the training log.

The experiments (arXiv 2401.10166) done before #20240119 used `mamba-ssm` + group-parallel. The experiments done since #20240201 use `sscore` + fused cross scan + fused cross merge. We plan to use `ssoflex` + fused cross scan + fused cross merge + input16output32 in the future.
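Roughly, the input16output32 (i16o32) idea is that the kernel consumes fp16 activations, which halves input memory traffic, but writes its result in fp32 so downstream computation stays numerically stable. A minimal NumPy sketch of that dtype contract (illustrative only; the actual implementation is a fused CUDA kernel):

```python
import numpy as np

rng = np.random.default_rng(0)

# fp16 inputs: cheaper to store and move around.
x = rng.standard_normal((128, 96)).astype(np.float16)
w = rng.standard_normal((96, 96)).astype(np.float16)

# Inside the kernel, accumulate and emit in fp32 ("input fp16 + output fp32").
y = x.astype(np.float32) @ w.astype(np.float32)
```

Here the upcast is explicit for clarity; a fused kernel would read fp16 and accumulate in fp32 registers without materializing fp32 copies of the inputs.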
| setting | GPU memory | time per iteration |
| --- | --- | --- |
| mamba-ssm + group parallel | 25672M | 0.4860s |
| sscore + fused csm | 24984M | 0.3930s |
| ssoflex + fused csm + i16o32 | 19842M | 0.3650s |
| ssoflex + csm in triton + i16o32 + v4 | 19500M | 0.2970s |
- `mamba-ssm`: `mamba_ssm-1.1.3.post1+cu122torch2.2cxx11abiFALSE-cp310-cp310-linux_x86_64.whl`
- `sscore`: `selective_scan_cuda_core`
- `ssoflex`: `selective_scan_cuda_oflex` (`oflex` means output flexible)
- `csm`: cross scan and cross merge
- `i16o32`: input fp16 + output fp32
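As a rough illustration of what "cross scan" and "cross merge" refer to: a 2D feature map is unfolded into four 1-D scan orders (row-major, column-major, and their reverses), and the four scanned sequences are later folded back and summed. A hypothetical NumPy sketch of the unfused logic (the repo's versions are fused CUDA/Triton kernels; function names here are illustrative):

```python
import numpy as np

def cross_scan(x):
    """Unfold a feature map x of shape (C, H, W) into 4 scan orders:
    row-major, column-major, and the reverses of both. Returns (4, C, H*W)."""
    C, H, W = x.shape
    rows = x.reshape(C, H * W)                      # row-major order
    cols = x.transpose(0, 2, 1).reshape(C, H * W)   # column-major order
    return np.stack([rows, cols, rows[:, ::-1], cols[:, ::-1]])

def cross_merge(xs, H, W):
    """Inverse of cross_scan: map each of the 4 sequences back to the
    original spatial order and sum them. Returns shape (C, H*W)."""
    rows = xs[0] + xs[2][:, ::-1]                   # undo the reversal
    cols = xs[1] + xs[3][:, ::-1]
    C = rows.shape[0]
    # Undo the column-major flattening, then add to the row-major part.
    return rows + cols.reshape(C, W, H).transpose(0, 2, 1).reshape(C, H * W)

# Usage: merging the 4 unmodified scans recovers 4x the input,
# since every element appears once in each of the four orders.
x = np.arange(24.0).reshape(2, 3, 4)
xs = cross_scan(x)               # (4, 2, 12)
merged = cross_merge(xs, 3, 4)   # equals 4 * x.reshape(2, 12)
```

In the real model a selective scan runs over each of the four sequences between `cross_scan` and `cross_merge`; fusing these reorderings into the scan kernel is what the "fused csm" entries in the table refer to.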