THUDM / ChatGLM-6B

ChatGLM-6B: An Open Bilingual Dialogue Language Model | 开源双语对话语言模型
Apache License 2.0

Inference speed benchmark? #24

Open wizd opened 1 year ago

wizd commented 1 year ago

Cool model! I'll give it a try. I'd like to know the minimal hardware requirement to reach 5 tokens/s.

yaleimeng commented 1 year ago

I haven't paid close attention to exact numbers. For reference: with the default settings (no quantization, half precision), an RTX 3090 answers most prompts in roughly 10 seconds, and the replies are usually several hundred characters long. If the output is very short, it finishes in about 2-3 seconds. Note, however, that even at the minimum requirement (INT4 quantization) you still need at least 10 GB of VRAM: the model only occupies about 6 GB at startup, but memory usage climbs by roughly 50% after a few rounds of conversation. With a GPU that has enough VRAM, the inference speed should generally be acceptable.
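For anyone reproducing the INT4 setup mentioned above, a minimal sketch following the usage pattern documented for ChatGLM-6B (the `THUDM/chatglm-6b` checkpoint and its `quantize(4)` helper) would look roughly like this; exact memory figures will vary with context length:

```python
from transformers import AutoTokenizer, AutoModel

# Minimal sketch: load ChatGLM-6B with INT4 weight quantization.
tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)
model = (
    AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)
    .quantize(4)   # INT4 quantization; startup VRAM is around 6 GB, growing with dialogue history
    .half()
    .cuda()
    .eval()
)

response, history = model.chat(tokenizer, "你好", history=[])
print(response)
```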

duzx16 commented 1 year ago

It depends on your hardware, the model precision, the context length, and the generation length. I have only experimented on an A100 with FP16, where the speed is about 20-30 tokens/s at the beginning of generation. Others are welcome to share their benchmarking results in this issue. Please specify the environment and settings.
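For anyone sharing numbers, a timing sketch along these lines (the prompt and measurement method are my own, not an official benchmark) makes results easier to compare:

```python
import time
from transformers import AutoTokenizer, AutoModel

# Sketch: load ChatGLM-6B in FP16 and time a single reply to estimate tokens/s.
tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)
model = AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True).half().cuda().eval()

prompt = "Explain what ChatGLM-6B is in a few sentences."  # example prompt, chosen for illustration
start = time.time()
response, history = model.chat(tokenizer, prompt, history=[])
elapsed = time.time() - start

# Approximate the generated-token count by re-tokenizing the reply.
n_tokens = len(tokenizer(response)["input_ids"])
print(f"{n_tokens} tokens in {elapsed:.2f}s -> {n_tokens / elapsed:.1f} tokens/s")
```

Please also report GPU model, precision (FP16/INT8/INT4), and typical context/generation lengths alongside the number.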

xuguozhi commented 1 year ago

Could you compare the per-token latency on GPUs such as the T4 and V100? Reference: https://huggingface.co/docs/optimum/onnxruntime/usage_guides/gpu
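For per-token latency specifically, one rough approach (a sketch of my own, not taken from the linked Optimum guide, since ChatGLM-6B is loaded via `trust_remote_code` rather than through Optimum) is to time the gaps between steps of the model's streaming interface:

```python
import time
from transformers import AutoTokenizer, AutoModel

# Sketch: approximate per-token latency via ChatGLM-6B's stream_chat API,
# which yields an updated response after each generation step.
tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)
model = AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True).half().cuda().eval()

latencies = []
last = time.time()
for response, history in model.stream_chat(tokenizer, "介绍一下ChatGLM-6B", history=[]):
    now = time.time()
    latencies.append(now - last)  # gap between successive yields ~ per-step latency
    last = now

print(f"steps: {len(latencies)}, "
      f"mean per-step latency: {sum(latencies) / len(latencies) * 1000:.1f} ms")
```

Running the same script on a T4 and a V100 would give the kind of side-by-side comparison asked for here.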