Yes, you are right. And we are still updating our code, which will get faster.

:rocket: The History of Speed-Up

Time is measured on 1x A100 with batch_size 128 during training; the config file is `vssm1/vssm_tiny_224_0220.yaml`. GPU memory is taken from the training log.

The experiments (arXiv 2401.10166) done before #20240119 used `mamba-ssm` + group-parallel. The experiments done since #20240201 use `sscore` + fused cross scan + fused cross merge. We plan to use `ssoflex` + fused cross scan + fused cross merge + input16output32 in the future.
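Roughly, the input16output32 (i16o32) idea is that the kernel consumes fp16 activations, which halves input memory traffic, but writes its result in fp32 so downstream computation stays numerically stable. A minimal NumPy sketch of that dtype contract (illustrative only; the actual implementation is a fused CUDA kernel):

```python
import numpy as np

rng = np.random.default_rng(0)

# fp16 inputs: cheaper to store and move around.
x = rng.standard_normal((128, 96)).astype(np.float16)
w = rng.standard_normal((96, 96)).astype(np.float16)

# Inside the kernel, accumulate and emit in fp32 ("input fp16 + output fp32").
y = x.astype(np.float32) @ w.astype(np.float32)
```

Here the upcast is explicit for clarity; a fused kernel would read fp16 and accumulate in fp32 registers without materializing fp32 copies of the inputs.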
| setting | GPU memory | time per iteration |
| --- | --- | --- |
| mamba-ssm + group parallel | 25672M | 0.4860s |
| sscore + fused csm | 24984M | 0.3930s |
| ssoflex + fused csm + i16o32 | 19842M | 0.3650s |
| ssoflex + csm in triton + i16o32 + v4 | 19500M | 0.2970s |
- `mamba-ssm`: `mamba_ssm-1.1.3.post1+cu122torch2.2cxx11abiFALSE-cp310-cp310-linux_x86_64.whl`
- `sscore`: `selective_scan_cuda_core`
- `ssoflex`: `selective_scan_cuda_oflex` (`oflex` means output flexible)
- `csm`: cross scan and cross merge
- `i16o32`: input fp16 + output fp32
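As a rough illustration of what "cross scan" and "cross merge" refer to: a 2D feature map is unfolded into four 1-D scan orders (row-major, column-major, and their reverses), and the four scanned sequences are later folded back and summed. A hypothetical NumPy sketch of the unfused logic (the repo's versions are fused CUDA/Triton kernels; function names here are illustrative):

```python
import numpy as np

def cross_scan(x):
    """Unfold a feature map x of shape (C, H, W) into 4 scan orders:
    row-major, column-major, and the reverses of both. Returns (4, C, H*W)."""
    C, H, W = x.shape
    rows = x.reshape(C, H * W)                      # row-major order
    cols = x.transpose(0, 2, 1).reshape(C, H * W)   # column-major order
    return np.stack([rows, cols, rows[:, ::-1], cols[:, ::-1]])

def cross_merge(xs, H, W):
    """Inverse of cross_scan: map each of the 4 sequences back to the
    original spatial order and sum them. Returns shape (C, H*W)."""
    rows = xs[0] + xs[2][:, ::-1]                   # undo the reversal
    cols = xs[1] + xs[3][:, ::-1]
    C = rows.shape[0]
    # Undo the column-major flattening, then add to the row-major part.
    return rows + cols.reshape(C, W, H).transpose(0, 2, 1).reshape(C, H * W)

# Usage: merging the 4 unmodified scans recovers 4x the input,
# since every element appears once in each of the four orders.
x = np.arange(24.0).reshape(2, 3, 4)
xs = cross_scan(x)               # (4, 2, 12)
merged = cross_merge(xs, 3, 4)   # equals 4 * x.reshape(2, 12)
```

In the real model a selective scan runs over each of the four sequences between `cross_scan` and `cross_merge`; fusing these reorderings into the scan kernel is what the "fused csm" entries in the table refer to.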