OpenBMB / MiniCPM-V

MiniCPM-Llama3-V 2.5: A GPT-4V Level Multimodal LLM on Your Phone

Finetuning Configuration #195

Open WAILMAGHRANE opened 1 month ago

WAILMAGHRANE commented 1 month ago

Hi, could you let me know whether anyone has successfully fine-tuned the model? I also have a question about GPU requirements: is 31.2 GB needed per GPU, or is that split between two GPUs? Also, I noticed that Kaggle offers two T4 GPUs; are these sufficient for fine-tuning the model on a custom dataset? Thanks!

[Attached screenshot: Screenshot 2024-06-01 013114]

YuzaChongyi commented 1 month ago

The 31.2 GB per GPU was measured with two A100 GPUs; you can use ZeRO-3 + offload to minimize memory usage. With the DeepSpeed ZeRO strategy, the more GPUs you have, the lower the memory usage on each GPU. The final memory usage also depends on the maximum input length and the image resolution.

If you have two T4 GPUs, you can try it by setting use_lora=true, tune_vision=false, batch_size=1, a suitable model_max_length, and a ZeRO-3 config.
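For a rough sense of the model-state part of that memory, DeepSpeed ships an estimator that can be run offline (a hedged sketch, not from this repo; it counts parameters, gradients, and optimizer states only, so activation memory driven by model_max_length and image resolution is not included):

```python
import torch
from transformers import AutoModel
from deepspeed.runtime.zero.stage3 import estimate_zero3_model_states_mem_needs_all_live

# Load on CPU just to count parameters; trust_remote_code is required for MiniCPM-V.
model = AutoModel.from_pretrained(
    "openbmb/MiniCPM-Llama3-V-2_5",
    trust_remote_code=True,
    torch_dtype=torch.float16,
)

# Print estimated per-GPU/CPU memory for model states under ZeRO-3,
# with and without offload, for 2 GPUs vs. 1 GPU.
estimate_zero3_model_states_mem_needs_all_live(model, num_gpus_per_node=2, num_nodes=1)
estimate_zero3_model_states_mem_needs_all_live(model, num_gpus_per_node=1, num_nodes=1)
```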

whyiug commented 4 weeks ago

Hi @YuzaChongyi, can we fine-tune this model with one A100 (40 GB)?

YuzaChongyi commented 4 weeks ago

@whyiug If you have only one GPU, you can't reduce memory with ZeRO sharding, but you can still reduce GPU memory with ZeRO-Offload. That is the minimum-memory configuration; you can try it.

whyiug commented 4 weeks ago

Yeah, I only have one A100 card (40 GB). After running finetune_lora.sh with these settings:

--model_max_length 1024
--per_device_train_batch_size 1
--deepspeed ds_config_zero3.json

It reports an error:

RuntimeError: "erfinv_cuda" not implemented for 'BFloat16'

Maybe it comes from this line: https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5/blob/20aecf8831d1d7a3da19bd62f44d1aea82df7fee/resampler.py#L85

Could you tell me how to fix this quickly by changing the code or configuration? Thanks for your quick reply :) @YuzaChongyi

YuzaChongyi commented 4 weeks ago

I haven't encountered this error yet; it may be caused by certain PyTorch versions or other reasons. If the error occurs during the resampler initialization step, you can comment out that line, because the checkpoint will be loaded afterwards and reset the model state_dict. Alternatively, you can use --fp16 true instead of bf16.
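For context, a hedged sketch of what appears to be failing and one possible local workaround (not the repo's code verbatim; the tensor shape is only illustrative). trunc_normal_ computes its bounds via torch.erfinv, which has no BFloat16 CUDA kernel in some PyTorch builds, so running the initialization in float32 and copying the values back avoids the missing kernel:

```python
import torch
from torch.nn.init import trunc_normal_

# A bf16 tensor on the GPU, standing in for the resampler query (shape is illustrative).
query = torch.zeros(96, 4096, dtype=torch.bfloat16, device="cuda")

# Initializing the bf16 CUDA tensor directly can raise:
#   RuntimeError: "erfinv_cuda" not implemented for 'BFloat16'
# trunc_normal_(query, std=0.02)

# Workaround sketch: run the init in float32, then copy the values back as bf16.
tmp = query.float()
trunc_normal_(tmp, std=0.02)
query.copy_(tmp)
```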

qyc-98 commented 4 weeks ago

Please set:

--bf16 false
--bf16_full_eval false
--fp16 true
--fp16_full_eval true

This is because ZeRO-3 is not compatible with bf16; please use fp16.

qyc-98 commented 4 weeks ago

If you only have one A100, change ds_config_zero3.json as follows to offload the parameters and optimizer to CPU and save memory:

"zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
        "device": "cpu",
        "pin_memory": true
    },
    "offload_param": {
        "device": "cpu",
        "pin_memory": true
    }
}

whyiug commented 4 weeks ago

Yeah, I already did that.

zhu-j-faceonlive commented 3 weeks ago

I set --bf16 false, --bf16_full_eval false, --fp16 true, and --fp16_full_eval true as suggested, but I still get the following error:

File "/home/paperspace/miniconda3/envs/MiniCPM-V/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 2117, in unscale_and_clip_grads
    self.fp32_partitioned_groups_flat[sub_group_id].grad.mul_(1. / combined_scale)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

shituo123456 commented 3 weeks ago

I made the same change on two 4090 cards and got the same error. Has this been resolved? Thanks!