MzeroMiko / VMamba

VMamba: Visual State Space Models; code is based on Mamba
MIT License

About the inference speed #228

Open aifeixingdelv opened 3 weeks ago

aifeixingdelv commented 3 weeks ago

Thanks for your nice contribution!! When I try to replace the Transformer block in a model with a VSS encoder (the Transformer uses factorized self-attention for its linear complexity, as in CoaT, "Co-Scale Conv-Attentional Image Transformers"), I find that at a similar parameter count the model with the VSS encoder has higher FLOPs, e.g. Params: 5.843736M (ViT) vs 5.910544M (VMamba); FLOPs: 1.654163812G (ViT) vs 2.754769032G (VMamba). So I would like to know: what are the advantages of VMamba over other Transformer models with near-linear complexity? And can VMamba achieve faster inference than such models?
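
For reference, a comparison like the one above can be produced with a short profiling sketch (hedged: the constructors `build_vit_like()` / `build_vss_like()` are placeholders for the actual models, and fvcore skips custom ops such as the selective scan unless a handler is registered):

```python
# Minimal sketch: comparing parameter count and FLOPs of two backbones.
import torch
from fvcore.nn import FlopCountAnalysis, parameter_count

def profile(model, input_size=(1, 3, 224, 224)):
    """Return (params in M, FLOPs in G) for one forward pass."""
    model.eval()
    x = torch.randn(*input_size)
    flops = FlopCountAnalysis(model, x).total()   # ops unknown to fvcore are skipped
    params = parameter_count(model)[""]           # the "" key holds the model-wide total
    return params / 1e6, flops / 1e9

# Example usage with placeholder constructors:
# print(profile(build_vit_like()))   # e.g. (5.84, 1.65)
# print(profile(build_vss_like()))   # e.g. (5.91, 2.75)
```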

aifeixingdelv commented 3 weeks ago

I ran an inference-speed experiment on the two kinds of models above. The model using the VSS encoder has a similar parameter count but larger GFLOPs than the model using the Transformer encoder variant. However, their inference speeds (Hz) are very close. I am curious: shouldn't smaller GFLOPs result in faster inference?
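
For context, the timing loop behind such a throughput (Hz) measurement typically looks like the sketch below (batch size and input resolution are illustrative assumptions):

```python
# Minimal sketch: measuring inference throughput (images/s) on GPU.
import time
import torch

@torch.no_grad()
def throughput(model, batch_size=64, image_size=224, iters=50, warmup=10):
    model = model.cuda().eval()
    x = torch.randn(batch_size, 3, image_size, image_size, device="cuda")
    for _ in range(warmup):          # warm up kernels / autotuning
        model(x)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        model(x)
    torch.cuda.synchronize()         # wait for all kernels before stopping the clock
    return iters * batch_size / (time.time() - start)
```

Wall-clock throughput measured this way is bounded by memory traffic and kernel efficiency as well as arithmetic, which is one reason two models with different GFLOPs can end up with nearly the same speed.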

MzeroMiko commented 3 weeks ago

Compared to linear transformers, Mamba-based models may have better interpretability and better performance, but they have no advantage in FLOPs or inference speed.

Inference speed also depends on how the model is implemented. For example, the FLOPs of Mamba could be smaller if a vanilla for loop were used for the state transfer, but the authors of Mamba ultimately chose to double the FLOPs of that procedure in order to implement it in a more parallel and more efficient manner.
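
To illustrate that trade-off, here is a hedged sketch of the vanilla-for-loop version of the state transfer (a simplified, diagonal recurrence, not the actual kernel in this repo):

```python
# Sketch: the state recurrence h_t = A_t * h_{t-1} + B_t * x_t computed with a
# plain loop over time. The FLOPs are minimal, but every step depends on the
# previous one, so the GPU cannot parallelize across the sequence.
import torch

def selective_scan_loop(A, B, x):
    # A, B, x: (batch, length, dim) -- simplified diagonal-state version
    batch, length, dim = x.shape
    h = torch.zeros(batch, dim, device=x.device, dtype=x.dtype)
    ys = []
    for t in range(length):                     # strictly sequential
        h = A[:, t] * h + B[:, t] * x[:, t]
        ys.append(h)
    return torch.stack(ys, dim=1)               # (batch, length, dim)
```

A parallel (associative) scan computes the same recurrence with only O(log L) sequential steps at the cost of extra arithmetic, which is generally the better trade on a GPU.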

aifeixingdelv commented 3 weeks ago

Thanks for your reply!!

aifeixingdelv commented 3 weeks ago


Do you think VMamba will occupy more GPU memory than ResNet or a Transformer during training?
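
If it helps, peak training memory for different backbones can be compared directly with a sketch like this (model, batch size, and number of classes are placeholders, and a classification head is assumed):

```python
# Sketch: comparing peak GPU memory during one training step.
import torch

def peak_training_memory_mb(model, batch_size=32, image_size=224, num_classes=1000):
    model = model.cuda().train()
    optimizer = torch.optim.AdamW(model.parameters())
    x = torch.randn(batch_size, 3, image_size, image_size, device="cuda")
    target = torch.randint(0, num_classes, (batch_size,), device="cuda")
    torch.cuda.reset_peak_memory_stats()
    loss = torch.nn.functional.cross_entropy(model(x), target)
    loss.backward()                     # activations + gradients dominate peak usage
    optimizer.step()
    torch.cuda.synchronize()
    return torch.cuda.max_memory_allocated() / 2**20
```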