THUDM / ChatGLM3

ChatGLM3 series: Open Bilingual Chat LLMs

chatglm3-6b-32k runs out of memory when processing long text #1294

Open · SXxinxiaosong opened 4 months ago

SXxinxiaosong commented 4 months ago

### System Info / 系統信息

CUDA 11.7, transformers 4.37.2, Python 3.10

### Who can help? / 谁可以帮助到您?

No response

### Information / 问题信息

### Reproduction / 复现过程

1. GPU: DGX-A800-80G
2. `export CUDA_VISIBLE_DEVICES=1,2`
3. Run the following script:

```python
import torch
from transformers import AutoTokenizer, AutoModel

# model_path points to the local chatglm3-6b-32k checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_path,
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.float16,
)
model.eval()

query = "balabala"
ids = tokenizer.encode(query, add_special_tokens=True)
print(len(ids))  # roughly 30k tokens
input_ids = torch.LongTensor([ids]).to(model.device)  # move the prompt to the model's first device

generated_ids = model.generate(
    input_ids=input_ids,
    max_new_tokens=16,
    # min_new_tokens=len(target_new_id),
    do_sample=False,
    pad_token_id=tokenizer.eos_token_id,
)
# Strip the prompt tokens from the output before decoding
generated_ids = [
    output_ids[len(in_ids):]
    for in_ids, output_ids in zip(input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```
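Since `device_map="auto"` relies on Accelerate to shard the weights, it may be worth confirming that both visible GPUs actually received layers. A minimal check, using the `hf_device_map` attribute that `from_pretrained` sets when a device map is in use:

```python
# Show which device each module was dispatched to by Accelerate;
# if everything landed on one GPU, the second card is not helping.
print(model.hf_device_map)
```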

The following error is raised:

```
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.18 GiB (GPU 0; 79.35 GiB total capacity; 46.59 GiB already allocated; 11.25 GiB free; 66.82 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
```
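The traceback itself suggests tuning the caching allocator via `PYTORCH_CUDA_ALLOC_CONF`. A minimal sketch of how that could be applied (the 128 MiB split size is an illustrative guess, not a value from this thread):

```python
import os

# Must be set before torch makes its first CUDA allocation;
# max_split_size_mb:128 is an assumed example value.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # imported only after the allocator config is in place
```

Note that this only mitigates fragmentation of reserved memory; it does not shrink the activation memory that attention over a ~30k-token prompt actually requires.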

### Expected behavior / 期待表现

The input is about 30k tokens long, and the model is already loaded across two GPUs, yet it still reports out-of-memory.
Asking for help here ~ thanks!
zRzRzRzRzRzRzR commented 2 months ago

The longer the input, the more GPU memory it uses. How much memory do your two cards have in total?
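For reference, per-GPU capacity and current usage can be dumped with standard `torch.cuda` calls; a small sketch:

```python
import torch

# Report capacity and current allocator state for each visible GPU.
for i in range(torch.cuda.device_count()):
    total = torch.cuda.get_device_properties(i).total_memory / 2**30
    allocated = torch.cuda.memory_allocated(i) / 2**30
    reserved = torch.cuda.memory_reserved(i) / 2**30
    print(f"GPU {i}: {total:.1f} GiB total, "
          f"{allocated:.1f} GiB allocated, {reserved:.1f} GiB reserved")
```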