MDD-0928 opened this issue 6 months ago
By the way, I didn't change the batch size or image resolution; I only used a different version of VMamba.
Can you share your configs with me? I limited the memory of an A100 to ~25 GB with torch.cuda.set_per_process_memory_fraction(0.3)
and found an OOM when running the code with batch_size 128 on the branch https://github.com/MzeroMiko/VMamba/tree/20240128-achieve.
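For reference, the memory cap mentioned above can be reproduced like this (a minimal sketch; the device index 0 is an assumption):

```python
import torch

# Hedged sketch: cap this process's usable GPU memory to 30% of the device
# total, which is how an 80 GB A100 was limited to roughly 24 GB above.
# Allocations beyond the fraction then raise the same OOM you would see
# on a smaller card.
if torch.cuda.is_available():
    torch.cuda.set_per_process_memory_fraction(0.3, device=0)
    total_gib = torch.cuda.get_device_properties(0).total_memory / 2**30
    print(f"usable memory capped at ~{0.3 * total_gib:.1f} GiB")
```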
I use the config in "/classification/configs/vssm/vssm_base_224.yaml", with batch_size=32 and resolution 384*192.
I am curious why the OOM only occurs with the Feb-22nd version of the code, while "20240128-achieve" runs smoothly, haha.
Did you change the code? I remember that the parameter DATA.IMG_SIZE
only supports input of type int
, not tuple
. Can you share your modifications so that I can reproduce what you observed?
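For context, non-square inputs like 384*192 usually require normalizing the image-size config into an (H, W) pair. A minimal sketch of such a helper (the name `to_2tuple` is an assumption here; timm ships a similar utility, but this is not the repo's code):

```python
# Hedged sketch: accept either an int (224) or an (H, W) pair (384, 192)
# for an image-size setting like DATA.IMG_SIZE.
def to_2tuple(x):
    """Return (H, W) whether x is a single int or an (H, W) pair."""
    if isinstance(x, (tuple, list)):
        assert len(x) == 2, "expected (H, W)"
        return tuple(x)
    return (x, x)

print(to_2tuple(224))         # (224, 224)
print(to_2tuple((384, 192)))  # (384, 192)
```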
In my experiments, I found the memory occupation is slightly lower compared to the "20240128-achieve" code.
I am testing VMamba on ReID tasks, so I changed the image size to 384*192, which is more reasonable for person images. I didn't change the VMamba backbone; I only added a BatchNorm1d before the last "linear" layer of the original VMamba classification model.
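The head change described above can be sketched like this (a minimal sketch; `feat_dim` and `num_ids` are placeholder values, not the repo's config, and inserting BN before the classifier is the common ReID "BNNeck" pattern):

```python
import torch
import torch.nn as nn

# Hedged sketch: keep the VMamba backbone untouched and insert a
# BatchNorm1d between the pooled features and the final linear classifier.
feat_dim, num_ids = 1024, 751  # placeholder dimensions

head = nn.Sequential(
    nn.BatchNorm1d(feat_dim),      # added BN before the original linear layer
    nn.Linear(feat_dim, num_ids),
)

feats = torch.randn(8, feat_dim)   # pooled backbone features (B, C)
logits = head(feats)
print(logits.shape)                # torch.Size([8, 751])
```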
I'm sorry, I made a mistake. Actually, both situations mentioned above occurred when I used the "main" branch. Let me restate it: the "20240128-achieve" code can't handle training the "base" model on one 4090; it OOMs, that's for sure. The "main" branch code after your first change on Jan 28th can handle "base" model training on one 4090 with a GPU memory cost of nearly 20000 MB. But with the latest "main" branch, I hit an OOM doing the same training on one 4090.
I don't keep a backup of my everyday work, so I'm afraid it is hard to find the reason.
But why not use config/vssm1/vssm_tiny_224_0220.yaml
instead? It offers better performance, lower memory occupation (for base (128 + (2,2,12,2))
with batch_size 128 and resolution 224, only 48 GB of memory is needed), faster speed (~2x), and a tiny model has also been released.
I just want to explore how VMamba performs on ReID tasks, and I would like to pursue the best record for this excellent work on downstream ReID tasks. Thanks for your reply!!! Looking forward to your continued work!!!
@MDD-0928 I am also applying VMamba to person ReID, but after loading the pretrained weights the training results look the same as if they hadn't been loaded. May I ask how you loaded the weights?
My loading code is the same as yours.
Oh no, in my tests VMamba doesn't perform that well on ReID. I suspect I didn't load the weights correctly.
Same here.
I just tried resizing the images to 224*244, and the results are the same with or without the pretrained weights. Tears.
Could there be a problem in some other part of the code?
@MDD-0928 I think the low-memory version you mentioned may be the version of forward_core
that does not force float32.
After that, we observed NaN losses
when training from scratch with torch.cuda.amp
, and then forcefully converted all parameters related to selective_scan
to float32.
For inference, you can use the amp version of selective_scan
by changing force_fp32
to False in cross_selective_scan
. It may work for fine-tuning too (meaning the training will not collapse), but I am not absolutely sure about it.
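The force_fp32 behaviour described above can be sketched as follows (a toy sketch, not the repo's actual code; `scan_stub` stands in for the real selective_scan kernel):

```python
import torch

# Hedged sketch: with force_fp32=True, the scan inputs are cast to float32
# before the scan (the fix for NaN losses under torch.cuda.amp); with
# force_fp32=False, the scan stays in the amp dtype (e.g. float16), which
# saves memory at inference time.
def scan_stub(x):
    return x * 2.0  # placeholder for the real scan computation

def cross_selective_scan(x, force_fp32=True):
    dtype_in = x.dtype
    if force_fp32:
        x = x.float()                  # forced float32 path (stable amp training)
    y = scan_stub(x)                   # scan runs in x.dtype here
    print("scan computed in", x.dtype)
    return y.to(dtype_in)              # result returned in the caller's dtype

x = torch.randn(2, 4, dtype=torch.float16)
cross_selective_scan(x, force_fp32=True)   # scan computed in torch.float32
cross_selective_scan(x, force_fp32=False)  # scan computed in torch.float16
```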
Thanks for your continuous follow-up! I will check it soon!
First, thanks for your great work! I find that the Jan 28th code can run the training of the "base" model on one 4090 GPU, costing about 20000 MB of memory, but with the Feb 22nd code I can't train the "base" model on one 4090 GPU; it needs two GPUs, each costing about 15000 MB of memory. I would like to know the reason for this issue.
THANKS A LOT!!