[BUG/Help] Is there any way to optimize inference speed? 4 tokens/s is too slow - Githubissues

THUDM / ChatGLM2-6B

ChatGLM2-6B: An Open Bilingual Chat LLM

Other

15.72k stars 1.85k forks

[BUG/Help] Is there any way to optimize inference speed? 4 tokens/s is too slow #311

Open heavenkiller2018 opened 1 year ago

heavenkiller2018 commented 1 year ago

Is there an existing issue for this?

[X] I have searched the existing issues

Current Behavior

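I am using a P40 card with 24 GB of memory. A prompt whose question and answer total about 1,200 tokens takes 5 minutes to get a reply, which works out to an inference speed of 4 tokens/s. Isn't that rather low? GPU memory usage is 13 GB.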
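For reference, the 4 tokens/s figure follows from the numbers above: about 1,200 tokens in 5 minutes (300 seconds). The small sketch below just redoes that arithmetic; the helper name `tokens_per_second` is made up for illustration, and it counts question and answer tokens together, as the post does.

```python
def tokens_per_second(total_tokens: int, elapsed_seconds: float) -> float:
    """Average throughput: tokens processed divided by wall-clock seconds."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return total_tokens / elapsed_seconds


# Numbers from the report: ~1,200 tokens (Q + A) in about 5 minutes.
rate = tokens_per_second(1200, 5 * 60)
print(f"{rate:.1f} tokens/s")
```

If only the generated answer tokens were counted, the decoding rate would come out lower than this average.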

Expected Behavior

Is there any way to improve the inference speed? For example, are there any parameters that can be set?

Steps To Reproduce

no

Environment

- OS:
- Python:
- Transformers:
- PyTorch:
- CUDA Support (`python -c "import torch; print(torch.cuda.is_available())"`) :

Anything else?

No response

lilongxian commented 1 year ago

The most common approach that works well right now is 4-bit quantization combined with C++ inference. Take a look at https://github.com/sophgo/ChatGLM2-TPU and https://github.com/li-plus/chatglm.cpp
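To give a rough sense of why 4-bit quantization helps, here is a back-of-the-envelope estimate of the weight memory at different precisions. It is only a sketch: the parameter count takes the "6B" in the model name at face value (an assumption), and `weight_memory_gib` is a made-up helper. Real quantized checkpoints also store scales and may keep some layers at higher precision, so actual sizes will be somewhat larger.

```python
def weight_memory_gib(num_params: float, bits_per_weight: int) -> float:
    """Approximate memory for the weights alone, in GiB."""
    return num_params * bits_per_weight / 8 / 2**30


# Assumption: roughly 6 billion parameters, read off the model name.
PARAMS = 6e9

for label, bits in (("FP16", 16), ("INT8", 8), ("INT4", 4)):
    print(f"{label}: ~{weight_memory_gib(PARAMS, bits):.1f} GiB of weights")
```

The FP16 estimate is in the same range as the 13 GB usage reported above, while 4-bit weights take about a quarter of that, so each generated token has far less data to read from GPU memory.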