THUDM / ChatGLM3

ChatGLM3 series: Open Bilingual Chat LLMs

chatglm3-6b-32k runs out of memory when processing long text #1294

Open · SXxinxiaosong opened 4 months ago

SXxinxiaosong commented 4 months ago

### System Info / 系統信息

CUDA 11.7, transformers 4.37.2, Python 3.10

### Who can help? / 谁可以帮助到您?

No response

### Information / 问题信息

### Reproduction / 复现过程

1. GPU: DGX-A800-80G
2. `export CUDA_VISIBLE_DEVICES=1,2`
3. Run the following script:

```python
import torch
from transformers import AutoTokenizer, AutoModel

# model_path points to the local chatglm3-6b-32k checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_path,
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.float16,
)
model.eval()

query = "balabala"
ids = tokenizer.encode(query, add_special_tokens=True)
print(len(ids))  # roughly 30k tokens
input_ids = torch.LongTensor([ids]).to(model.device)  # move the prompt to the model's first device

generated_ids = model.generate(
    input_ids=input_ids,
    max_new_tokens=16,
    # min_new_tokens=len(target_new_id),
    do_sample=False,
    pad_token_id=tokenizer.eos_token_id,
)
# Strip the prompt tokens from the output before decoding
generated_ids = [
    output_ids[len(in_ids):]
    for in_ids, output_ids in zip(input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```
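Since `device_map="auto"` relies on Accelerate to shard the weights, it may be worth confirming that both visible GPUs actually received layers. A minimal check, using the `hf_device_map` attribute that `from_pretrained` sets when a device map is in use:

```python
# Show which device each module was dispatched to by Accelerate;
# if everything landed on one GPU, the second card is not helping.
print(model.hf_device_map)
```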

The following error is raised:

```
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.18 GiB (GPU 0; 79.35 GiB total capacity; 46.59 GiB already allocated; 11.25 GiB free; 66.82 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
```
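The traceback itself suggests tuning the caching allocator via `PYTORCH_CUDA_ALLOC_CONF`. A minimal sketch of how that could be applied (the 128 MiB split size is an illustrative guess, not a value from this thread):

```python
import os

# Must be set before torch makes its first CUDA allocation;
# max_split_size_mb:128 is an assumed example value.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # imported only after the allocator config is in place
```

Note that this only mitigates fragmentation of reserved memory; it does not shrink the activation memory that attention over a ~30k-token prompt actually requires.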

### Expected behavior / 期待表现

The input is about 30k tokens long, and the model is already loaded across two GPUs, yet it still reports out-of-memory.
Asking for help here ~ thanks!
zRzRzRzRzRzRzR commented 2 months ago

The longer the input, the more GPU memory it uses. How much memory do your two cards have in total?
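For reference, per-GPU capacity and current usage can be dumped with standard `torch.cuda` calls; a small sketch:

```python
import torch

# Report capacity and current allocator state for each visible GPU.
for i in range(torch.cuda.device_count()):
    total = torch.cuda.get_device_properties(i).total_memory / 2**30
    allocated = torch.cuda.memory_allocated(i) / 2**30
    reserved = torch.cuda.memory_reserved(i) / 2**30
    print(f"GPU {i}: {total:.1f} GiB total, "
          f"{allocated:.1f} GiB allocated, {reserved:.1f} GiB reserved")
```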