tubaobao3 opened this issue 6 months ago (status: Open)
- OS: CentOS 7
- Python: 3.8
- Transformers: 4.29.1
- PyTorch: 2.0.8
- CUDA Support (`python -c "import torch; print(torch.cuda.is_available())"`): True
I am fine-tuning a model based on ChatGLM-6B, and it fails as early as the DeepSpeed initialization stage. My setup is a single node with 4 GPUs, each NVIDIA card with 15 GB of VRAM. During DeepSpeed initialization, memory usage on all 4 GPUs climbs to about 12 GB; card 3 already had 3 GB in use, so when it tried to allocate more memory the program crashed with an OOM. Does this mean that, just at the DeepSpeed initialization stage with stage=3 and no offload enabled, 12 GB on each of 4 cards is not enough to hold this model?
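For scale, a back-of-envelope check (my arithmetic, not output from the estimator): the fp16 weights of a 6173M-parameter model alone are about 11.5 GiB, so any rank that briefly materializes the full, unpartitioned model during initialization can exceed a 15 GB card that already has ~3 GB in use, even though the ideal per-GPU ZeRO-3 shard is far smaller.

```python
# Back-of-envelope check. Assumption: parameters stored as fp16, 2 bytes each.
total_params = 6173e6                        # "6173M total params" from the estimator output
fp16_weights_gib = total_params * 2 / 2**30  # full, unpartitioned fp16 copy
per_gpu_shard_gib = fp16_weights_gib / 4     # ideal ZeRO-3 shard across 4 GPUs

print(f"full fp16 weights  : {fp16_weights_gib:.2f} GiB")   # ~11.50 GiB
print(f"per-GPU ZeRO-3 shard: {per_gpu_shard_gib:.2f} GiB") # ~2.87 GiB
```

This is consistent with the observed ~12 GB spike per GPU during initialization: the full model transiently exceeds what a partially occupied 15 GB card can absorb.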
### Is there an existing issue for this?

### Current Behavior
```
(venv) [app@vm_0_1_centos projects]$ python ds_estimate.py
Loading checkpoint shards: 100%|██████████| 8/8 [00:12<00:00,  1.59s/it]
Estimated memory needed for params, optim states and gradients for a:
HW: Setup with 1 node, 4 GPUs per node.
SW: Model with 6173M total params, 534M largest layer params.
  per CPU  |  per GPU |   Options
  155.23GB |   1.99GB | offload_param=cpu , offload_optimizer=cpu , zero_init=1
  155.23GB |   1.99GB | offload_param=cpu , offload_optimizer=cpu , zero_init=0
  137.98GB |   4.87GB | offload_param=none, offload_optimizer=cpu , zero_init=1
  137.98GB |   4.87GB | offload_param=none, offload_optimizer=cpu , zero_init=0
   11.95GB |  27.86GB | offload_param=none, offload_optimizer=none, zero_init=1
  137.98GB |  27.86GB | offload_param=none, offload_optimizer=none, zero_init=0
```

ds_estimate.py:

```python
from transformers import AutoModel
from deepspeed.runtime.zero.stage3 import estimate_zero3_model_states_mem_needs_all_live

model = AutoModel.from_pretrained('/data/projects/ChatGLM-6B', trust_remote_code=True)
estimate_zero3_model_states_mem_needs_all_live(model, num_gpus_per_node=4, num_nodes=1)
```
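Two rows of the table can be reproduced by hand. This sketch follows my reading of DeepSpeed's `estimate_zero3_model_states_mem_needs` (a transient gather buffer of ~4 bytes per parameter of the largest layer, plus 18 bytes per parameter for fp16 params, fp16 grads, and fp32 Adam states, sharded across all GPUs); the exact formula may differ between DeepSpeed versions.

```python
# Reproducing two rows of the estimator table by hand. Assumption: the
# formulas below reflect DeepSpeed's internal estimate and may vary by version.
total_params = 6173e6   # "6173M total params" from the output
largest_layer = 534e6   # "534M largest layer params"
total_gpus = 4          # 1 node x 4 GPUs

# offload_param=cpu, offload_optimizer=cpu: only the transient gather buffer
# for the largest layer (~4 bytes/param) stays on each GPU.
offload_row = 4 * largest_layer
print(f"{offload_row / 2**30:.2f}GB per GPU")     # 1.99GB, matches the table

# offload_param=none, offload_optimizer=none: the gather buffer plus
# 18 bytes/param (fp16 params + fp16 grads + fp32 Adam states), sharded.
no_offload_row = 4 * largest_layer + 18 * total_params / total_gpus
print(f"{no_offload_row / 2**30:.2f}GB per GPU")  # 27.86GB, matches the table
```

Note that the 27.86GB no-offload figure already exceeds a 15 GB card, which is consistent with the OOM described above.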
### Expected Behavior
No response
### Steps To Reproduce
Run the same estimator script:

```
(venv) [app@vm_0_1_centos projects]$ python ds_estimate.py
```

The script and its full output are identical to those pasted under Current Behavior.
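Given the estimate table, only the CPU-offload rows (~1.99GB per GPU) fit comfortably on 15 GB cards that already have other memory in use. A minimal ZeRO-3 `ds_config` sketch with both offloads enabled; the key names follow DeepSpeed's ZeRO-3 JSON schema, but the batch-size and fp16 values here are placeholders, not settings taken from this issue:

```json
{
  "train_micro_batch_size_per_gpu": 1,
  "fp16": { "enabled": true },
  "zero_optimization": {
    "stage": 3,
    "offload_param":     { "device": "cpu", "pin_memory": true },
    "offload_optimizer": { "device": "cpu", "pin_memory": true }
  }
}
```

Per the table, this trades ~155GB of host RAM for the ~26GB of GPU memory saved per card.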
### Environment
### Anything else?
No response