Open ydhongHIT opened 6 months ago
We simply used a different drop_path setting to improve performance. (We will try more hyperparameter settings in the future.)
We used the MLP mainly because it is one of the most efficient operations (the highest throughput per FLOP).
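For context, the drop_path mentioned above refers to stochastic depth: randomly zeroing out whole residual branches per sample during training and rescaling the survivors. A minimal NumPy sketch of the idea (the function name and rescaling convention follow the common timm-style implementation; this is an illustration, not this repo's actual code):

```python
import numpy as np

def drop_path(x, drop_prob, training=True, rng=None):
    """Stochastic depth: drop entire residual branches per sample.

    Survivors are rescaled by 1 / keep_prob so the expected value of
    the output matches the input.
    """
    if not training or drop_prob == 0.0:
        return x
    rng = rng or np.random.default_rng()
    keep_prob = 1.0 - drop_prob
    # One Bernoulli draw per sample, broadcast over the remaining dims.
    mask_shape = (x.shape[0],) + (1,) * (x.ndim - 1)
    mask = rng.binomial(1, keep_prob, size=mask_shape).astype(x.dtype)
    return x / keep_prob * mask
```

In a residual block this is applied to the branch output, e.g. `x = x + drop_path(block(x), p)`, so a dropped sample passes through the identity path untouched.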
Thank you for your reply. Regarding the second question: when you use an MLP, the number of Mamba blocks should be halved to keep the parameter count unchanged. I'm curious what effect replacing half of the Mamba blocks with MLPs has on accuracy?
The accuracy certainly drops if we simply reduce the number of layers, so it is hard to strike a balance between speed and performance.
Hi, have you compared the training efficiency of Vision Mamba and a Vision Transformer of the same size? I think Vmamba is several times less efficient than ViT. Do you think so too? Is there any solution?
Hi, thanks for your great work! I have two small questions about differences between the code and the paper. First, the base model is inferior to the small model in your arXiv paper. How did you solve that problem? Second, I see the usage of an MLP in your code, but it does not belong to the original Mamba block (or the modified block in the paper). What are the effects of using the MLP?