THUDM / ChatGLM2-6B

ChatGLM2-6B: An Open Bilingual Chat LLM | 开源双语对话语言模型

[Help] GPU memory usage keeps growing with each call when using the OpenAI-style streaming API #566

Open beileiyuyue opened 1 year ago

beileiyuyue commented 1 year ago

Is there an existing issue for this?

Current Behavior

The GPU is a single A100-40G. When serving chatglm2-6b via the plain HTTP API, GPU memory usage stays steady at around 13 GB, but when serving the OpenAI-style streaming API, memory usage keeps growing with each call. Right after startup, usage is about 13 GB:

(screenshot, 2023-09-25 10:56: memory usage at ~13 GB)

After a number of calls it has grown to 16 GB:

(screenshot, 2023-09-25 10:57: memory usage at ~16 GB)

If calls keep coming, memory usage climbs to around 30 GB. Could someone advise what might be causing this?

Expected Behavior

No response

Steps To Reproduce

Environment

- OS: CentOS 7
- Python: 3.9
- Transformers: 4.32.1
- PyTorch: 1.12.1
- CUDA Support (`python -c "import torch; print(torch.cuda.is_available())"`) :True

Anything else?

No response

pipidoudou commented 1 year ago

You need to explicitly call `torch.cuda.empty_cache()` to release the cached GPU memory.
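
A minimal sketch of how that suggestion could be applied in a streaming handler. This is not the repository's actual `openai_api.py` code; the `stream_chat` wrapper and `model_generate` callable here are hypothetical stand-ins for the real generation loop. The idea is to run cleanup in a `finally` clause so cached blocks are returned even if the client disconnects mid-stream:

```python
import gc

def release_gpu_memory():
    """Drop Python-level references, then ask PyTorch to return cached CUDA blocks."""
    gc.collect()
    try:
        import torch
        if torch.cuda.is_available():
            torch.cuda.empty_cache()  # return cached allocator blocks to the driver
            torch.cuda.ipc_collect()  # also clean up CUDA IPC handles, if any
    except ImportError:
        pass  # torch not installed in this environment; nothing to free

def stream_chat(model_generate, prompt):
    """Yield streamed chunks from a generator, then free cached memory at the end.

    `model_generate` is a hypothetical callable returning an iterator of chunks,
    standing in for the model's streaming generation method.
    """
    try:
        for chunk in model_generate(prompt):
            yield chunk
    finally:
        # Runs whether the stream finishes normally or the client disconnects.
        release_gpu_memory()
```

Note that `empty_cache()` only returns memory the caching allocator is holding in reserve; if usage still grows without bound, the handler may also be keeping references to past KV caches or history tensors alive, which `gc.collect()` before `empty_cache()` helps surface.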