THUDM / ChatGLM-6B

ChatGLM-6B: An Open Bilingual Dialogue Language Model | 开源双语对话语言模型

[Help] With zero_stage=3 and offload disabled, ChatGLM-6B on a single machine with 4 GPUs seems to need ~28 GB of GPU memory per card; this seems not to use any model-parallel capability. Why is that? #1442

Open tubaobao3 opened 6 months ago

tubaobao3 commented 6 months ago

Is there an existing issue for this?

Current Behavior

```
(venv) [app@vm_0_1_centos projects]$ python ds_estimate.py
Loading checkpoint shards: 100%|██████████| 8/8 [00:12<00:00, 1.59s/it]
Estimated memory needed for params, optim states and gradients for a:
HW: Setup with 1 node, 4 GPUs per node.
SW: Model with 6173M total params, 534M largest layer params.
  per CPU  |  per GPU |   Options
  155.23GB |   1.99GB | offload_param=cpu , offload_optimizer=cpu , zero_init=1
  155.23GB |   1.99GB | offload_param=cpu , offload_optimizer=cpu , zero_init=0
  137.98GB |   4.87GB | offload_param=none, offload_optimizer=cpu , zero_init=1
  137.98GB |   4.87GB | offload_param=none, offload_optimizer=cpu , zero_init=0
   11.95GB |  27.86GB | offload_param=none, offload_optimizer=none, zero_init=1
  137.98GB |  27.86GB | offload_param=none, offload_optimizer=none, zero_init=0
```

ds_estimate.py:

```python
from transformers import AutoModel
from deepspeed.runtime.zero.stage3 import estimate_zero3_model_states_mem_needs_all_live

model = AutoModel.from_pretrained('/data/projects/ChatGLM-6B', trust_remote_code=True)
estimate_zero3_model_states_mem_needs_all_live(model, num_gpus_per_node=4, num_nodes=1)
```
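As a sanity check on those numbers, the 27.86 GB per-GPU figure in the no-offload rows can be reproduced by hand. DeepSpeed's ZeRO-3 estimator appears to budget about 18 bytes per parameter (fp16 parameters and gradients plus fp32 Adam states), sharded across all GPUs, plus 4 bytes per parameter of the largest layer, which must be gathered on each GPU. A minimal sketch of that arithmetic, assuming those byte factors:

```python
# Reproducing the estimator's no-offload per-GPU figure.
# Assumption: 18 bytes/param = 2 (fp16 param) + 2 (fp16 grad)
# + 12 (fp32 master param, Adam momentum, Adam variance), sharded
# across GPUs; plus 4 bytes/param for the largest gathered layer.
total_params = 6173e6          # from the estimator output
largest_layer_params = 534e6   # from the estimator output
num_gpus = 4
GiB = 1024 ** 3

sharded_states = 18 * total_params / num_gpus / GiB   # ~25.87 GiB
gathered_layer = 4 * largest_layer_params / GiB       # ~1.99 GiB
print(f"per-GPU: {sharded_states + gathered_layer:.2f} GiB")  # -> 27.86
```

In other words, ZeRO-3 does partition the training states across the 4 GPUs (the per-GPU share is total/4); the figure is large simply because full training states for 6.17B parameters come to roughly 110 GB before sharding.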

Expected Behavior

No response

Steps To Reproduce

Same as above: run `python ds_estimate.py` with the ds_estimate.py script shown under Current Behavior.

Environment

- OS: CentOS 7
- Python: 3.8
- Transformers: 4.29.1
- PyTorch: 2.0.8
- CUDA Support (`python -c "import torch; print(torch.cuda.is_available())"`) : True

Anything else?

No response

tubaobao3 commented 6 months ago

I am fine-tuning a model based on ChatGLM-6B, and it fails right at the DeepSpeed initialization stage. My setup is a single machine with 4 GPUs, each NVIDIA card with 15 GB of memory. During DS initialization, memory usage on all 4 GPUs climbs to 12 GB; card 3 already had 3 GB in use beforehand, so when it requested more memory the program crashed with an OOM. Does this mean that at the DS initialization stage alone, with stage=3 and offload disabled, on a single machine with 4 GPUs, the model cannot fit in 12 GB per card?
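For what it's worth, the estimator output above suggests this is expected: the no-offload rows need ~27.86 GB per GPU, well beyond 15 GB cards, while the offload rows fit in 1.99 to 4.87 GB per GPU at the cost of roughly 138 to 155 GB of host RAM. A minimal, untested sketch of a ZeRO-3 config with CPU offload enabled (standard DeepSpeed config keys; the batch-size and fp16 values are placeholder assumptions, not taken from this issue):

```python
import deepspeed

# Sketch only: ZeRO-3 with parameter and optimizer offload to CPU,
# matching the "offload_param=cpu, offload_optimizer=cpu" estimator row
# (~1.99 GB per GPU, ~155 GB host RAM). Batch size and fp16 settings
# are placeholders.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu", "pin_memory": True},
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
}

# model and its parameters come from the fine-tuning script:
# engine, optimizer, _, _ = deepspeed.initialize(
#     model=model, model_parameters=model.parameters(), config=ds_config)
```

The same settings can equally be written to a JSON file and passed via `--deepspeed ds_config.json`; the key point is that without `offload_param`/`offload_optimizer`, the per-GPU share of the states alone exceeds a 15 GB card.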