Facico / Chinese-Vicuna

Chinese-Vicuna: A Chinese Instruction-following LLaMA-based Model (a low-resource Chinese llama+lora solution, structure modeled on alpaca)
https://github.com/Facico/Chinese-Vicuna
Apache License 2.0

finetune_deepspeed.py won't run on a single A100 80G, out of GPU memory #166

Closed. greatewei closed this issue 1 year ago.

greatewei commented 1 year ago

How much hardware does DeepSpeed need to run, and how much faster does training get?

Facico commented 1 year ago

Which model are you running? The finetune_deepspeed script we wrote is meant to save GPU memory; it won't speed things up much.
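For reference, this memory-for-speed trade-off usually comes from ZeRO-style offloading. Below is a minimal sketch of such a DeepSpeed config, an assumption for illustration rather than this repo's actual finetune_deepspeed settings:

```python
# A minimal sketch, NOT the repo's actual config: a DeepSpeed ZeRO stage-2
# setup that offloads optimizer state to CPU RAM. The offload is what cuts
# GPU memory use, and the extra CPU<->GPU traffic is also why it brings
# little or no speedup.
ds_config = {
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu"},  # optimizer state kept in host memory
    },
}

# With the HuggingFace Trainer, this dict (or a JSON file with the same content)
# is passed via TrainingArguments(..., deepspeed=ds_config), and the script is
# started with the `deepspeed` launcher, e.g. `deepspeed finetune_deepspeed.py ...`.
```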

greatewei commented 1 year ago

> Which model are you running? The finetune_deepspeed script we wrote is meant to save GPU memory; it won't speed things up much.

deepspeed finetune_deepspeed.py \
    --data_path /data/chat/Chinese-Vicuna/data/test.json \
    --output_path /data/chat/models/llama_lora/llama-7b-base-lora/ \
    --model_path /data/chat/models/llama_base/llama-7b-hf \
    --eval_steps 100 \
    --save_steps 100 \
    --test_size 100 \
    --deepspeed

OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 GiB (GPU 0; 79.35 GiB total capacity; 43.78 GiB already allocated; 1.82 GiB free; 45.45 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
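The last line of the error points at the standard fragmentation workaround: setting PYTORCH_CUDA_ALLOC_CONF. A hedged sketch follows; the 128 MB value is only an example, not a recommendation from this repo:

```python
# Sketch of the workaround named in the error message: cap the CUDA caching
# allocator's split size to reduce fragmentation. It must take effect before
# torch initializes CUDA, e.g. at the very top of finetune_deepspeed.py;
# exporting the variable in the shell before running `deepspeed` works too.
import os

os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")

import torch  # import torch only after the allocator option is in place
```

Note that this only helps when the failure is caused by fragmentation (reserved memory much larger than allocated memory); it cannot create capacity that isn't there.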

Facico commented 1 year ago

Are you doing full-parameter fine-tuning? Or does it run normally if you drop the --deepspeed flag?

greatewei commented 1 year ago

> Are you doing full-parameter fine-tuning? Or does it run normally if you drop the --deepspeed flag?

That doesn't work either. But since it won't speed up training anyway, I'll stop trying here 😁

Facico commented 1 year ago

The finetune_chat script we provide is a bit faster than the original finetune code.

reverse-2020 commented 1 year ago

Same here. I'm on a 3090: finetune.py trains normally and produces results, but finetune_deepspeed reports OOM.

Facico commented 1 year ago

> Same here. I'm on a 3090: finetune.py trains normally and produces results, but finetune_deepspeed reports OOM.

finetune_deepspeed doesn't enable 8-bit, so it definitely needs more GPU memory than finetune; we only wrote that script because some cards can't use 8-bit. With the same settings it does run on a 3090/4090, though.
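To make the 8-bit point concrete, here is a rough sketch of the two loading paths. It is illustrative only: the LoRA hyperparameters and target modules are assumptions, not this repo's exact finetune code, and it assumes a peft version of that era that still ships prepare_model_for_int8_training.

```python
import torch
from transformers import LlamaForCausalLM
from peft import LoraConfig, get_peft_model, prepare_model_for_int8_training

MODEL_PATH = "/data/chat/models/llama_base/llama-7b-hf"  # path taken from the command above
USE_8BIT = True  # the finetune.py-style route; set False for a no-8-bit (DeepSpeed) run

if USE_8BIT:
    # int8 weights: roughly 7 GB for a 7B model, but requires bitsandbytes support on the card
    model = LlamaForCausalLM.from_pretrained(MODEL_PATH, load_in_8bit=True, device_map="auto")
    model = prepare_model_for_int8_training(model)
else:
    # fp16 weights are about twice the size, which is why OOM shows up sooner without 8-bit
    model = LlamaForCausalLM.from_pretrained(MODEL_PATH, torch_dtype=torch.float16)

# LoRA adapters on top; only these small matrices are trainable in either case
lora = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"],
                  lora_dropout=0.05, task_type="CAUSAL_LM")
model = get_peft_model(model, lora)
model.print_trainable_parameters()
```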

reverse-2020 commented 1 year ago

> Same here. I'm on a 3090: finetune.py trains normally and produces results, but finetune_deepspeed reports OOM.

> finetune_deepspeed doesn't enable 8-bit, so it definitely needs more GPU memory than finetune; we only wrote that script because some cards can't use 8-bit. With the same settings it does run on a 3090/4090, though.

Thanks for the reply, I'll give it another try.