MzeroMiko / VMamba

VMamba: Visual State Space Models; code is based on Mamba

Three questions about VMamba #122

Open · MDD-0928 opened 3 months ago

MDD-0928 commented 3 months ago

Dear authors:

  Thanks for your continued work! I would like to ask you three questions.

  First, I am glad to see that you have updated VMamba (VM) for higher **throughput** using "v4" & "LN2D". I tested the latest version of VM against ViT-B/16 (ViT) on **throughput** and found that VM-Tiny is much faster, reaching about 1.5~1.7x the throughput of ViT. Unfortunately, VM-Small is slightly slower than ViT, and VM-Base is slower than ViT by nearly 35%~40%. Is there any further way to speed up VMamba and improve its throughput, making it a truly high-accuracy & high-throughput foundation model for CV tasks? That would be awesome 👍
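  For reference, here is the kind of minimal throughput harness I used (the batch size, warmup, and iteration counts are illustrative; a VMamba model built from this repo's configs would be timed the same way):

```python
import time
import torch
import timm

@torch.no_grad()
def throughput(model, batch_size=128, img_size=224, warmup=10, iters=50):
    """Measure images/second on random data (CUDA assumed)."""
    model = model.cuda().eval()
    x = torch.randn(batch_size, 3, img_size, img_size, device="cuda")
    for _ in range(warmup):      # warm up kernels / autotuning
        model(x)
    torch.cuda.synchronize()
    t0 = time.time()
    for _ in range(iters):
        model(x)
    torch.cuda.synchronize()     # wait for all queued kernels to finish
    return iters * batch_size / (time.time() - t0)

# ViT-B/16 baseline from timm; pass a VMamba instance through the same function.
vit = timm.create_model("vit_base_patch16_224")
print(f"ViT-B/16: {throughput(vit):.1f} img/s")
```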


  Second, would you consider pretraining on ImageNet-21k, testing the model's accuracy on ImageNet-1k, and releasing the checkpoint? :)

  Third, since Mamba is better suited to long sequences, I wonder whether increasing EMBED_DIM would boost performance while incurring only a minor impact on throughput, e.g. increasing VM-Tiny's EMBED_DIM from 96 to 128? A rough sketch of the cost is below.
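  As a back-of-the-envelope estimate, most width-dependent parameters sit in linear layers, so parameter count and per-token compute grow roughly quadratically with EMBED_DIM (this ignores depthwise convs, norms, and the classifier head):

```python
# Rough estimate only: linear-layer parameters and per-token FLOPs scale
# with the square of the width, which also hints at the throughput cost.
base_dim = 96
for d in (96, 128):
    print(f"EMBED_DIM={d}: ~{(d / base_dim) ** 2:.2f}x VM-Tiny's linear-layer params")
# EMBED_DIM=96:  ~1.00x
# EMBED_DIM=128: ~1.78x
```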
MzeroMiko commented 3 months ago
  1. The hierarchical structure is certainly slower than ViT, but we do achieve throughput comparable to Swin under torch 2. We are still working on ways to make the model faster.

  2. We have no plans to pretrain on ImageNet-21k at present due to limited resources; we may do it in the future.

  3. To be honest, it is hard to say. In this implementation of the selective scan, raising the dimension is nearly equivalent to raising the batch size, and it is unrelated to the sequence length. Changing the embed dim may still improve performance, but it seems tricky. See the sketch below for why the dimension behaves like the batch size here.
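  For intuition, here is a simplified reference form of the selective scan (shapes follow the Mamba reference code; the real CUDA kernel fuses all of this, and the delta-softplus and D skip terms are omitted). The only sequential loop is over seqlen, while batch and dim are flat parallel dimensions, which is why raising dim costs about the same as raising the batch size:

```python
import torch

def selective_scan_ref(u, delta, A, B, C):
    # u, delta: (batch, dim, seqlen)   A: (dim, state)   B, C: (batch, state, seqlen)
    batch, dim, seqlen = u.shape
    state = A.shape[1]
    x = torch.zeros(batch, dim, state, device=u.device, dtype=u.dtype)
    ys = []
    for t in range(seqlen):                        # the only sequential dependency
        dA = torch.exp(delta[:, :, t, None] * A)   # discretized A: (batch, dim, state)
        dBu = delta[:, :, t, None] * B[:, None, :, t] * u[:, :, t, None]
        x = dA * x + dBu                           # recurrent state update
        ys.append((x * C[:, None, :, t]).sum(-1))  # readout: (batch, dim)
    return torch.stack(ys, dim=-1)                 # (batch, dim, seqlen)
```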