THUDM / ChatGLM2-6B

ChatGLM2-6B: An Open Bilingual Chat LLM | 开源双语对话语言模型

[BUG/Help] Running chatglm2 on a single A800 with 80 GB of VRAM works fine, but on two A40s (48 GB each) it fails with torch.cuda.OutOfMemoryError: CUDA out of memory. #621

Open zhengdacheng opened 10 months ago

zhengdacheng commented 10 months ago

### Is there an existing issue for this?

### Current Behavior

```shell
deepspeed --num_gpus=2 --master_port $MASTER_PORT test02.py \
    --deepspeed deepspeed.json \
    --do_train \
    --train_file /data/train.json \
    --test_file /data/test.json \
    --prompt_column prompt \
    --response_column response \
    --history_column history \
    --overwrite_cache \
    --model_name_or_path ../model \
    --output_dir ./output/dataclass-$PROM_TYPE-chatglm2-6b-ft-$LR \
    --overwrite_output_dir \
    --max_source_length 512 \
    --max_target_length 16 \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 2 \
    --predict_with_generate \
    --max_steps 8000 \
    --logging_steps 10 \
    --save_steps 4000 \
    --learning_rate $LR \
    --fp16
```

```python
import os
from typing import Dict, Optional, Union

from torch.nn import Module
from transformers import AutoModel


def auto_configure_device_map(num_gpus: int) -> Dict[str, int]:
    # transformer.word_embeddings takes 1 layer
    # transformer.final_layernorm and lm_head take 1 layer
    # transformer.layers takes 28 layers
    # 30 layers in total, distributed across num_gpus GPUs
    num_trans_layers = 28
    per_gpu_layers = 30 / num_gpus

    # bugfix: on Linux, torch.embedding raises a RuntimeError when the weight and
    # input passed to it are not on the same device.
    # On Windows, model.device is set to transformer.word_embeddings.device;
    # on Linux, model.device is set to lm_head.device.
    # When chat or stream_chat is called, input_ids is moved to model.device,
    # so if transformer.word_embeddings.device and model.device differ, a RuntimeError is raised.
    # Therefore transformer.word_embeddings, transformer.final_layernorm and lm_head
    # are all placed on the first GPU here.
    # This file comes from https://github.com/THUDM/ChatGLM-6B/blob/main/utils.py,
    # with only minor changes to support ChatGLM2.
    device_map = {
        'transformer.embedding.word_embeddings': 0,
        'transformer.encoder.final_layernorm': 0,
        'transformer.output_layer': 0,
        'transformer.rotary_pos_emb': 0,
        'lm_head': 0
    }

    used = 2
    gpu_target = 0
    for i in range(num_trans_layers):
        if used >= per_gpu_layers:
            gpu_target += 1
            used = 0
        assert gpu_target < num_gpus
        device_map[f'transformer.encoder.layers.{i}'] = gpu_target
        used += 1

    return device_map


def load_model_on_gpus(checkpoint_path: Union[str, os.PathLike], num_gpus: int = 2,
                       device_map: Optional[Dict[str, int]] = None, **kwargs) -> Module:
    if num_gpus < 2 and device_map is None:
        model = AutoModel.from_pretrained(checkpoint_path, trust_remote_code=True, **kwargs).half().cuda()
    else:
        from accelerate import dispatch_model

        model = AutoModel.from_pretrained(checkpoint_path, trust_remote_code=True, **kwargs).half()

        if device_map is None:
            device_map = auto_configure_device_map(num_gpus)

        model = dispatch_model(model, device_map=device_map)

    return model
```
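For reference, a minimal sketch (not part of the original script) of how the helper above splits the 30 logical layers over two GPUs: with `per_gpu_layers = 15` and `used` starting at 2, encoder layers 0-12 land on GPU 0 and layers 13-27 on GPU 1. The `"../model"` path is assumed to be the checkpoint directory from the launch command above.

```python
# Sketch: appended to the helper file above as a quick sanity check on 2 GPUs.
if __name__ == "__main__":
    dm = auto_configure_device_map(2)
    for name, gpu in dm.items():
        print(f"{name} -> cuda:{gpu}")

    # load_model_on_gpus dispatches the fp16 weights across both cards;
    # "../model" is assumed to hold the ChatGLM2-6B checkpoint used above.
    model = load_model_on_gpus("../model", num_gpus=2).eval()
```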
- For p-tuning, in `main.py` under the ptuning folder, replace
`model = AutoModel.from_pretrained(model_args.model_name_or_path, config=config, trust_remote_code=True)` with `model = load_model_on_gpus(model_args.model_name_or_path, num_gpus=2)`, as sketched below.
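Concretely, that edit in `ptuning/main.py` would look roughly like this (a sketch; it assumes the helper above is saved as `utils.py` next to `main.py`):

```python
# ptuning/main.py (sketch of the edit described above)
from utils import load_model_on_gpus  # helper defined above, assumed saved as utils.py

# before:
# model = AutoModel.from_pretrained(model_args.model_name_or_path, config=config, trust_remote_code=True)

# after: load the model sharded across both A40s
model = load_model_on_gpus(model_args.model_name_or_path, num_gpus=2)
```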

### Expected Behavior

Two A40s together actually have more VRAM than a single A800 (2 × 48 GB = 96 GB vs. 80 GB), so the multi-GPU run should not hit an OOM error.

### Steps To Reproduce

As shown under Current Behavior above.

### Environment

```markdown
- OS: Centos 8
- Python:3.11.5
- Transformers:2.2.2
- PyTorch:2.1.0
- CUDA Support (`python -c "import torch; print(torch.cuda.is_available())"`) :

```

### Anything else?

No response

LeeQuan1 commented 4 months ago

Have you solved this problem?