MiniCPM-V 2.6: A GPT-4V Level MLLM for Single Image, Multi Image and Video on Your Phone
GPU memory grows uncontrollably with swift #649

Open 2013358072 opened 1 month ago

2013358072 commented 1 month ago

Is there an existing issue / discussion for this?

Is there an existing answer for this in FAQ?

Current Behavior

When fine-tuning with the swift framework, GPU memory usage reached 245 GB.
Training arguments: --sft_type lora --gradient_accumulation_steps 4 --tuner_backend peft --target_modules DEFUALT
Launch environment variables: HIP_VISIBLE_DEVICES=0,1,2,4 PYTORCH_HIP_ALLOC_CONF=max_split_size_mb:4096

Expected Behavior

Steps To Reproduce

HIP_VISIBLE_DEVICES=0,1,2,4 PYTORCH_HIP_ALLOC_CONF=max_split_size_mb:4096 swift sft --model_type minicpm-v-v2_6-chat --dataset data1.jsonl --dataset_test_ratio 0.1 --sft_type lora --learning_rate 1e-4 --num_train_epochs 5 --model_id_or_path MiniCPM-V-2_6/ --gradient_accumulation_steps 4 --tuner_backend peft

Environment

Anything else?

2013358072 commented 1 month ago

It runs out of memory (OOM) at the fifth step:
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
torch.cuda.OutOfMemoryError: HIP out of memory. Tried to allocate 9.28 GiB. GPU 3 has a total capacty of 63.98 GiB of which 7.13 GiB is free. Of the allocated memory 45.88 GiB is allocated by PyTorch, and 7.92 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_HIP_ALLOC_CON
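Editor's sketch: the numbers in the error message above can be checked with a few lines of Python. This is a minimal illustration, not part of the original report. The helper names `parse_oom` and `summarize` are hypothetical, the message text is copied from the comment, and reading the remainder as "memory held outside the PyTorch allocator" is an assumption about how the figures add up.

```python
import re

# Error text copied from the comment above (truncated tail omitted).
OOM_MESSAGE = (
    "torch.cuda.OutOfMemoryError: HIP out of memory. Tried to allocate 9.28 GiB. "
    "GPU 3 has a total capacty of 63.98 GiB of which 7.13 GiB is free. "
    "Of the allocated memory 45.88 GiB is allocated by PyTorch, and 7.92 GiB "
    "is reserved by PyTorch but unallocated."
)

# One regex per figure; "capac\w*" tolerates the "capacty" spelling in the log.
PATTERNS = {
    "requested": r"Tried to allocate ([\d.]+) GiB",
    "total": r"total capac\w* of ([\d.]+) GiB",
    "free": r"of which ([\d.]+) GiB is free",
    "allocated": r"([\d.]+) GiB is allocated by PyTorch",
    "reserved_unallocated": r"([\d.]+) GiB\s+is reserved by PyTorch",
}


def parse_oom(message):
    """Extract the GiB figures from an out-of-memory message (hypothetical helper)."""
    stats = {}
    for name, pattern in PATTERNS.items():
        match = re.search(pattern, message)
        if match:
            stats[name] = float(match.group(1))
    return stats


def summarize(stats):
    """Derive how far short the device was and what PyTorch does not account for."""
    shortfall = stats["requested"] - stats["free"]
    # Assumption: whatever is neither free nor tracked by PyTorch is held elsewhere
    # (runtime context, other processes, and so on).
    outside_pytorch = (
        stats["total"] - stats["free"]
        - stats["allocated"] - stats["reserved_unallocated"]
    )
    return {
        "shortfall": round(shortfall, 2),
        "outside_pytorch": round(outside_pytorch, 2),
    }


if __name__ == "__main__":
    stats = parse_oom(OOM_MESSAGE)
    print(stats)
    print(summarize(stats))
```

Under these assumptions, the 9.28 GiB request exceeds the 7.13 GiB that is free on GPU 3, and the 7.92 GiB reserved but unallocated is the portion the message suggests may be affected by fragmentation.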

zhaoyangwei123 commented 2 weeks ago

It runs out of memory (OOM) at the fifth step:
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
torch.cuda.OutOfMemoryError: HIP out of memory. Tried to allocate 9.28 GiB. GPU 3 has a total capacty of 63.98 GiB of which 7.13 GiB is free. Of the allocated memory 45.88 GiB is allocated by PyTorch, and 7.92 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_HIP_ALLOC_CON

Hello, have you solved this? I also ran out of GPU memory, using 8 RTX 4090 cards with LoRA on the int4 model.

LDLINGLINGLING commented 1 week ago
