OOM at the fifth training step:
Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
torch.cuda.OutOfMemoryError: HIP out of memory. Tried to allocate 9.28 GiB. GPU 3 has a total capacity of 63.98 GiB of which 7.13 GiB is free. Of the allocated memory 45.88 GiB is allocated by PyTorch, and 7.92 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_HIP_ALLOC_CONF.
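For reference, a minimal sketch of the allocator hint from the error message, assuming a ROCm build of PyTorch (hence the `_HIP_` variable; CUDA builds use `PYTORCH_CUDA_ALLOC_CONF`). The 128 value is an illustrative assumption, not a verified fix:

```bash
# Blocks larger than max_split_size_mb are never split by the caching
# allocator, so a small value such as 128 keeps large contiguous blocks
# intact for big requests like the 9.28 GiB allocation in the traceback.
# The 4096 used in the repro still lets blocks of up to 4 GiB be split,
# which does little against fragmentation.
export PYTORCH_HIP_ALLOC_CONF=max_split_size_mb:128
```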
Hi, have you solved this? I hit OOM as well with 8× 4090 GPUs, LoRA, and an int4 model.
Hi, swift is not maintained by us directly; we recommend using the official fine-tuning code instead.
Is there an existing issue / discussion for this?
Is there an existing answer for this in the FAQ?
Current Behavior
When fine-tuning with the swift framework, total GPU memory usage reached 245 GB. Training arguments: --sft_type lora --gradient_accumulation_steps 4 --tuner_backend peft --target_modules DEFAULT. Launch environment: HIP_VISIBLE_DEVICES=0,1,2,4 PYTORCH_HIP_ALLOC_CONF=max_split_size_mb:4096
Expected Behavior
No response
Steps To Reproduce
HIP_VISIBLE_DEVICES=0,1,2,4 PYTORCH_HIP_ALLOC_CONF=max_split_size_mb:4096 swift sft --model_type minicpm-v-v2_6-chat --dataset data1.jsonl --dataset_test_ratio 0.1 --sft_type lora --learning_rate 1e-4 --num_train_epochs 5 --model_id_or_path MiniCPM-V-2_6/ --gradient_accumulation_steps 4 --tuner_backend peft
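If memory is the bottleneck, a lower-memory variant of the same launch might look like the sketch below. Flags copied from the repro are unchanged; --batch_size, --max_length, and --gradient_checkpointing are assumptions about the ms-swift CLI and should be checked against the installed version:

```bash
# Sketch only: smaller allocator split threshold, explicit batch size of 1,
# a capped sequence length, and gradient checkpointing to trade compute
# for activation memory. Verify flag names with `swift sft --help`.
HIP_VISIBLE_DEVICES=0,1,2,4 \
PYTORCH_HIP_ALLOC_CONF=max_split_size_mb:128 \
swift sft \
  --model_type minicpm-v-v2_6-chat \
  --model_id_or_path MiniCPM-V-2_6/ \
  --dataset data1.jsonl \
  --dataset_test_ratio 0.1 \
  --sft_type lora \
  --learning_rate 1e-4 \
  --num_train_epochs 5 \
  --gradient_accumulation_steps 4 \
  --tuner_backend peft \
  --batch_size 1 \
  --max_length 2048 \
  --gradient_checkpointing true
```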
Environment
Anything else?
No response