MDD-0928 opened this issue 6 months ago
By the way, I didn't change the batch size or image resolution; I only used a different version of VMamba.
Can you share your configs with me? I limited the memory of an A100 to ~25 GB with torch.cuda.set_per_process_memory_fraction(0.3)
and found an OOM when running the code with batch_size 128 on the branch https://github.com/MzeroMiko/VMamba/tree/20240128-achieve.
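For reference, the memory cap mentioned above can be reproduced like this (a minimal sketch; the device index 0 is an assumption):

```python
import torch

# Hedged sketch: cap this process's usable GPU memory to 30% of the device
# total, which is how an 80 GB A100 was limited to roughly 24 GB above.
# Allocations beyond the fraction then raise the same OOM you would see
# on a smaller card.
if torch.cuda.is_available():
    torch.cuda.set_per_process_memory_fraction(0.3, device=0)
    total_gib = torch.cuda.get_device_properties(0).total_memory / 2**30
    print(f"usable memory capped at ~{0.3 * total_gib:.1f} GiB")
```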
I use the config in "/classification/configs/vssm/vssm_base_224.yaml", with batch_size=32 and resolution 384*192.
I am curious why the OOM only occurs with the Feb-22nd version of the code, while "20240128-achieve" runs smoothly, haha.
Did you change the code? I remember that the parameter DATA.IMG_SIZE
only supports input of type int
, not tuple
. Can you share your modifications so that I can reproduce what you observed?
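For context, non-square inputs like 384*192 usually require normalizing the image-size config into an (H, W) pair. A minimal sketch of such a helper (the name `to_2tuple` is an assumption here; timm ships a similar utility, but this is not the repo's code):

```python
# Hedged sketch: accept either an int (224) or an (H, W) pair (384, 192)
# for an image-size setting like DATA.IMG_SIZE.
def to_2tuple(x):
    """Return (H, W) whether x is a single int or an (H, W) pair."""
    if isinstance(x, (tuple, list)):
        assert len(x) == 2, "expected (H, W)"
        return tuple(x)
    return (x, x)

print(to_2tuple(224))         # (224, 224)
print(to_2tuple((384, 192)))  # (384, 192)
```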
In my experiments, I found the memory occupation is slightly lower compared to the "20240128-achieve" code.
I am testing VMamba on ReID tasks, so I changed the image size to 384*192, which is more reasonable for person images. I didn't change the VMamba backbone; I only added a BatchNorm1d before the last "linear" layer of the original VMamba classification model.
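The head change described above can be sketched like this (a minimal sketch; `feat_dim` and `num_ids` are placeholder values, not the repo's config, and inserting BN before the classifier is the common ReID "BNNeck" pattern):

```python
import torch
import torch.nn as nn

# Hedged sketch: keep the VMamba backbone untouched and insert a
# BatchNorm1d between the pooled features and the final linear classifier.
feat_dim, num_ids = 1024, 751  # placeholder dimensions

head = nn.Sequential(
    nn.BatchNorm1d(feat_dim),      # added BN before the original linear layer
    nn.Linear(feat_dim, num_ids),
)

feats = torch.randn(8, feat_dim)   # pooled backbone features (B, C)
logits = head(feats)
print(logits.shape)                # torch.Size([8, 751])
```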
I'm sorry, I made a mistake. Actually, both situations mentioned above occurred when I used the "main" branch. Let me restate it: the "20240128-achieve" code can't handle training the "base" model on one 4090; it OOMs, that's for sure. The "main" branch code after your first change on Jan 28th can handle "base" model training on one 4090 with a GPU memory cost of nearly 20000 MB. But with the latest "main" branch, I hit an OOM doing the same training on one 4090.
I don't keep a backup of my everyday work, so I'm afraid it is hard to find the reason.
But why not use config/vssm1/vssm_tiny_224_0220.yaml
instead? It offers better performance, lower memory occupation (for base (128 + (2,2,12,2))
with batch_size 128 and resolution 224, only 48 GB of memory is needed), faster speed (~2x), and a tiny model has also been released.
I just want to explore how VMamba performs on ReID tasks, and I would like to pursue the best record for this excellent work on downstream ReID tasks. Thanks for your reply!!! Looking forward to your continued work!!!
@MDD-0928 I am also applying VMamba to person ReID, but after loading the pretrained weights the training results look the same as if they hadn't been loaded. May I ask how you loaded the weights?
My loading code is the same as yours.
Oh no, in my tests VMamba doesn't perform that well on ReID. I suspect I didn't load the weights correctly.
Same here.
I just tried resizing the images to 224*244, and the results are the same with or without the pretrained weights. Tears.
Could there be a problem in some other part of the code?
@MDD-0928 I think the low-memory version you mentioned may be the version of forward_core
that does not force float32.
After that, we observed NaN losses
when training from scratch with torch.cuda.amp
, and then forcefully converted all parameters related to selective_scan
to float32.
For inference, you can use the amp version of selective_scan
by changing force_fp32
to False in cross_selective_scan
. It may work for fine-tuning too (meaning the training will not collapse), but I am not absolutely sure about it.
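The force_fp32 behaviour described above can be sketched as follows (a toy sketch, not the repo's actual code; `scan_stub` stands in for the real selective_scan kernel):

```python
import torch

# Hedged sketch: with force_fp32=True, the scan inputs are cast to float32
# before the scan (the fix for NaN losses under torch.cuda.amp); with
# force_fp32=False, the scan stays in the amp dtype (e.g. float16), which
# saves memory at inference time.
def scan_stub(x):
    return x * 2.0  # placeholder for the real scan computation

def cross_selective_scan(x, force_fp32=True):
    dtype_in = x.dtype
    if force_fp32:
        x = x.float()                  # forced float32 path (stable amp training)
    y = scan_stub(x)                   # scan runs in x.dtype here
    print("scan computed in", x.dtype)
    return y.to(dtype_in)              # result returned in the caller's dtype

x = torch.randn(2, 4, dtype=torch.float16)
cross_selective_scan(x, force_fp32=True)   # scan computed in torch.float32
cross_selective_scan(x, force_fp32=False)  # scan computed in torch.float16
```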
Thanks for your continuous follow-up! I will check it soon!
First, thanks for your great work! I find that the Jan 28th code can run the training of the "base" model on one 4090 GPU, costing about 20000 MB of memory, but with the Feb 22nd code I can't train the "base" model on one 4090 GPU; it needs two GPUs, each costing about 15000 MB of memory. I would like to know the reason for this issue.
THANKS A LOT!!