QwenLM / Qwen

The official repo of Qwen (通义千问) chat & pretrained large language model proposed by Alibaba Cloud.
Apache License 2.0

With qwen-72b-chat as the base model, what configuration would make training possible on a 4090 machine? #1224

Closed: taishan1994 closed this issue 5 months ago

taishan1994 commented 5 months ago

Is there an existing issue / discussion for this?

Is there an existing answer for this in the FAQ?

Current Behavior

No response

Expected Behavior

No response

Steps To Reproduce

No response

Environment

- OS:
- Python:
- Transformers:
- PyTorch:
- CUDA (`python -c 'import torch; print(torch.version.cuda)'`):

Anything else?

No response

jklj077 commented 5 months ago

https://github.com/QwenLM/Qwen/tree/main/recipes/finetune/deepspeed#settings-and-gpu-requirements

Not possible for 24GB * 4.

taishan1994 commented 5 months ago

It's 8 GPUs, not 4. The docs show that Q-LoRA on a single 80GB GPU with a sequence length of 4096 needs 68.0GB for training, so I wanted to know whether Q-LoRA could fine-tune qwen-72b-chat at a sequence length of 4096 on an 8×24GB machine. I tried multi-GPU fine-tuning with Q-LoRA + ZeRO-2 and hit OOM as soon as the model started loading. With ZeRO-3 and both model offload and optimizer offload enabled, the model loads fine, but Q-LoRA and ZeRO-3 cannot be used together. So I'd like to ask whether there is any way to do what I'm trying to do.
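
For reference, a minimal sketch of the kind of DeepSpeed ZeRO-3 config described above, with parameter ("model") offload and optimizer offload to CPU enabled. The key names follow DeepSpeed's public config schema; the batch-size and precision values are illustrative assumptions, not settings taken from the Qwen recipes.

```python
# Illustrative DeepSpeed ZeRO-3 config with CPU offload, written as a plain
# Python dict (the same structure can be saved as a JSON file and passed to
# the launcher). Everything outside zero_optimization is a placeholder.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,   # assumed, not from the recipes
    "gradient_accumulation_steps": 16,     # assumed, not from the recipes
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu", "pin_memory": True},      # "model offload"
        "offload_optimizer": {"device": "cpu", "pin_memory": True},  # "optimizer offload"
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
}
```

ZeRO-3 partitions the base weights across ranks and gathers them on demand, which is precisely what clashed with Q-LoRA's frozen 4-bit-quantized weights at the time of this issue, hence the incompatibility reported above.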

jklj077 commented 5 months ago

For ZeRO Stage 2, having each GPU capable of holding the entire model is a bare minimum requirement. Given that Qwen-72B-Chat-Int4 exceeds 40GB, trying to finetune a model of this scale using Q-LoRA with GPUs that only have 24GB (or even 48GB) of memory simply won't cut it.
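
A back-of-envelope check of that constraint (the per-parameter figures below are rough assumptions, not measurements):

```python
# Under ZeRO-2, only optimizer state and gradients are partitioned; every GPU
# still keeps a full replica of the model weights, so the weights alone must
# fit in per-GPU memory before activations or anything else is counted.
PARAMS = 72e9          # Qwen-72B parameter count
BYTES_PER_PARAM = 0.5  # Int4 quantization: ~4 bits per parameter

weights_gib = PARAMS * BYTES_PER_PARAM / 1024**3
print(f"Bare Int4 weights: ~{weights_gib:.0f} GiB")  # ~34 GiB

# A 24 GiB card (RTX 4090) cannot hold even the bare weights; a 48 GiB card
# can, but the >40GB loaded footprint plus activations, quantization metadata,
# and LoRA optimizer state leaves no practical headroom, matching the comment
# above.
for gpu_gib in (24, 48, 80):
    print(f"{gpu_gib} GiB card fits bare Int4 weights: {gpu_gib >= weights_gib}")
```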

taishan1994 commented 5 months ago

Understood, thank you for the answer.