MzeroMiko / VMamba

VMamba: Visual State Space Models; the code is based on Mamba
MIT License

Some differences between the code and the paper #116

Open · ydhongHIT opened this issue 6 months ago

ydhongHIT commented 6 months ago

Hi, thanks for your great work! I have two small questions about the differences between the code and the paper. First, the base model is inferior to the small model in your arXiv paper. How did you solve that problem? Second, I see an MLP used in your code, but it does not belong to the original Mamba block (or to the modified block in the paper). What is the effect of using the MLP?

MzeroMiko commented 6 months ago
  1. We just used a different drop_path rate to raise the performance. (We will try more hyperparameter settings in the future.)

  2. We used the MLP mainly because it is one of the most efficient operations (the highest throughput-to-FLOPs ratio).
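The throughput-per-FLOPs point can be made concrete with back-of-envelope arithmetic; the numbers below are my own illustrative assumptions (Swin-like stage-1 dimensions), not the repo's exact configuration. An MLP's cost is two dense matmuls, which GPUs execute near peak utilization, whereas a selective scan is a length-wise recurrence that tends to be memory-bound:

```python
# Back-of-envelope FLOP count for a transformer-style MLP.
# All dimensions are assumptions for illustration, not VMamba's
# exact configuration.

d = 96            # channel dim at stage 1 (assumed)
ratio = 4         # MLP hidden expansion (assumed)
tokens = 56 * 56  # tokens at stage 1 for a 224x224 input (assumed)

# FLOPs per token for a 2-layer MLP (d -> ratio*d -> d),
# counting 2 FLOPs per multiply-accumulate:
mlp_flops_per_token = 2 * d * (ratio * d) * 2
mlp_flops_per_image = mlp_flops_per_token * tokens

print(f"MLP FLOPs/token: {mlp_flops_per_token:,}")
print(f"MLP FLOPs/image (stage 1): {mlp_flops_per_image:,}")
```

Every one of these FLOPs lands in a dense GEMM, which is why the MLP's measured throughput tracks its FLOP count so closely.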

ydhongHIT commented 6 months ago
> 1. We just used a different drop_path rate to raise the performance. (We will try more hyperparameter settings in the future.)
> 2. We used the MLP mainly because it is one of the most efficient operations (the highest throughput-to-FLOPs ratio).

Thank you for your reply. Regarding the second question: when you use an MLP, the number of Mamba blocks should be halved to keep the parameter count unchanged. I'm curious what effect replacing half of the Mamba blocks with MLPs has on accuracy.
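The parameter bookkeeping behind this question can be sketched. The counts below are my own rough estimates under assumed expansion factors, with the small conv/SSM tensors ignored; they are not the repo's exact numbers:

```python
# Rough per-block parameter counts at channel width d (all values
# assumed for illustration). A Mamba-style mixer is dominated by
# in_proj (d -> 2*expand*d) and out_proj (expand*d -> d); an MLP
# is d -> mlp_ratio*d -> d.

d = 96         # channel dim (assumed)
expand = 2     # Mamba expansion factor (assumed)
mlp_ratio = 4  # MLP hidden ratio (assumed)

mixer_params = d * (2 * expand * d) + (expand * d) * d  # ~6*d^2
mlp_params = 2 * d * (mlp_ratio * d)                    # ~8*d^2

# One mixer+MLP block vs. two mixer-only blocks at the same width:
with_mlp = mixer_params + mlp_params  # ~14*d^2
two_mixers = 2 * mixer_params         # ~12*d^2

print("mixer + MLP block:", with_mlp)
print("two mixer blocks :", two_mixers)
```

Under these assumptions the two layouts land in the same ballpark, which is why halving the depth when adding MLPs keeps the total parameter count roughly constant; what that trade costs in accuracy is the open question here.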

MzeroMiko commented 6 months ago

The accuracy certainly falls when we just reduce the number of layers, so it is hard to strike a balance between speed and performance.

ydhongHIT commented 6 months ago

> The accuracy certainly falls when we just reduce the number of layers, so it is hard to strike a balance between speed and performance.

Hi, have you compared the training efficiency of Vision Mamba and a Vision Transformer of the same size? I think VMamba is several times less efficient than ViT. Do you think so too? Is there any solution?