Alpha-VLLM / LLaMA2-Accessory

An Open-source Toolkit for LLM Development
https://llama2-accessory.readthedocs.io/

Error : NotImplementedError: offload_to_cpu=True and NO_SHARD is not supported yet #76

Closed WeiXuanLi-1024 closed 10 months ago

WeiXuanLi-1024 commented 11 months ago

I fine-tuned the model with alpacaLlava_llamaQformerv2Peft_QF_13B.sh. After one epoch of iteration finished, an error occurred while saving the model. I am running on a single GPU. The script contents are shown in the screenshot below.

File "/home/liwx/anaconda3/envs/accessory/lib/python3.10/contextlib.py", line 135, in __enter__
    return next(self.gen)
  File "/home/liwx/anaconda3/envs/accessory/lib/python3.10/site-packages/torch/distributed/fsdp/_unshard_param_utils.py", line 171, in _unshard_fsdp_state_params
    _validate_unshard_params_args(
  File "/home/liwx/anaconda3/envs/accessory/lib/python3.10/site-packages/torch/distributed/fsdp/_unshard_param_utils.py", line 140, in _validate_unshard_params_args
    raise NotImplementedError(
NotImplementedError: offload_to_cpu=True and NO_SHARD is not supported yet

kriskrisliu commented 11 months ago

The script is correct. Please show us the complete error logs.

WeiXuanLi-1024 commented 11 months ago

(screenshot of the error logs)

kriskrisliu commented 11 months ago

Did you modify any codes?

WeiXuanLi-1024 commented 11 months ago

> Did you modify any codes?

No, I haven't modified any code.

WeiXuanLi-1024 commented 11 months ago

I am using a single GPU.

WeiXuanLi-1024 commented 11 months ago

This issue is resolved; see https://github.com/huggingface/accelerate/pull/1745/commits/940ae8dfff504ce7e7e3015bcbbd17e3e8cd3157

Cause: FSDP raises an error when a single GPU is used with offload_to_cpu=True for FULL_STATE_DICT,

so only enable it when num_processes > 1.

In other words, when using a single GPU, the two parameters at misc.py line 348 in this repository must be set to rank0_only=False, offload_to_cpu=False.

Result after the change: the model saves normally (see screenshot).
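The fix from the accelerate commit can be sketched as a small helper (the function name and structure here are illustrative, not the actual misc.py code): enable offload_to_cpu and rank0_only only when more than one process is running, since FSDP on a single GPU is effectively NO_SHARD and rejects offload_to_cpu=True.

```python
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import StateDictType, FullStateDictConfig


def gather_full_state_dict(model):
    """Collect a FULL_STATE_DICT without tripping the single-GPU error.

    FSDP raises NotImplementedError for offload_to_cpu=True when the model
    is NO_SHARD (single process), so both flags are enabled only when the
    process group spans more than one rank.
    """
    multi_gpu = dist.is_initialized() and dist.get_world_size() > 1
    cfg = FullStateDictConfig(offload_to_cpu=multi_gpu, rank0_only=multi_gpu)
    with FSDP.state_dict_type(model, StateDictType.FULL_STATE_DICT, cfg):
        return model.state_dict()
```

With this guard, the same checkpointing code path works for both single-GPU and multi-GPU runs instead of hard-coding the flags.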

funkyyyyyy commented 11 months ago


Just clarifying: if using a single GPU, set offload_to_cpu=False at https://github.com/Alpha-VLLM/LLaMA2-Accessory/blob/afa3947742d6262255ba8ba1808ff999246e4abd/accessory/util/misc.py#L347C29-L347C29

thank you @WeiXuanLi-1024