THUDM / ChatGLM-6B

ChatGLM-6B: An Open Bilingual Dialogue Language Model | 开源双语对话语言模型
Apache License 2.0

Inference speed benchmark? #24

Open wizd opened 1 year ago

wizd commented 1 year ago

Cool model! I'll give it a try. I'd like to know the minimal hardware requirement to reach 5 tokens/s.

yaleimeng commented 1 year ago

I haven't paid close attention to exact numbers. For reference: with the default settings (no quantization, half precision), an RTX 3090 answers most prompts in roughly 10 seconds, and the replies are usually several hundred characters long. If the output is very short, it finishes in about 2-3 seconds. Note, however, that even at the minimum requirement (INT4 quantization) you still need at least 10 GB of VRAM: the model only occupies about 6 GB at startup, but memory usage climbs by roughly 50% after a few rounds of conversation. With a GPU that has enough VRAM, the inference speed should generally be acceptable.
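For anyone reproducing the INT4 setup mentioned above, a minimal sketch following the usage pattern documented for ChatGLM-6B (the `THUDM/chatglm-6b` checkpoint and its `quantize(4)` helper) would look roughly like this; exact memory figures will vary with context length:

```python
from transformers import AutoTokenizer, AutoModel

# Minimal sketch: load ChatGLM-6B with INT4 weight quantization.
tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)
model = (
    AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)
    .quantize(4)   # INT4 quantization; startup VRAM is around 6 GB, growing with dialogue history
    .half()
    .cuda()
    .eval()
)

response, history = model.chat(tokenizer, "你好", history=[])
print(response)
```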

duzx16 commented 1 year ago

It depends on your hardware, the model precision, the context length, and the generation length. I have only experimented on an A100 with FP16, where the speed is about 20-30 tokens/s at the beginning of generation. Others are welcome to share their benchmarking results in this issue. Please specify the environment and settings.
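For anyone sharing numbers, a timing sketch along these lines (the prompt and measurement method are my own, not an official benchmark) makes results easier to compare:

```python
import time
from transformers import AutoTokenizer, AutoModel

# Sketch: load ChatGLM-6B in FP16 and time a single reply to estimate tokens/s.
tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)
model = AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True).half().cuda().eval()

prompt = "Explain what ChatGLM-6B is in a few sentences."  # example prompt, chosen for illustration
start = time.time()
response, history = model.chat(tokenizer, prompt, history=[])
elapsed = time.time() - start

# Approximate the generated-token count by re-tokenizing the reply.
n_tokens = len(tokenizer(response)["input_ids"])
print(f"{n_tokens} tokens in {elapsed:.2f}s -> {n_tokens / elapsed:.1f} tokens/s")
```

Please also report GPU model, precision (FP16/INT8/INT4), and typical context/generation lengths alongside the number.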

xuguozhi commented 1 year ago

Could you compare the per-token latency on GPUs such as the T4 and V100? Reference: https://huggingface.co/docs/optimum/onnxruntime/usage_guides/gpu
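For per-token latency specifically, one rough approach (a sketch of my own, not taken from the linked Optimum guide, since ChatGLM-6B is loaded via `trust_remote_code` rather than through Optimum) is to time the gaps between steps of the model's streaming interface:

```python
import time
from transformers import AutoTokenizer, AutoModel

# Sketch: approximate per-token latency via ChatGLM-6B's stream_chat API,
# which yields an updated response after each generation step.
tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)
model = AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True).half().cuda().eval()

latencies = []
last = time.time()
for response, history in model.stream_chat(tokenizer, "介绍一下ChatGLM-6B", history=[]):
    now = time.time()
    latencies.append(now - last)  # gap between successive yields ~ per-step latency
    last = now

print(f"steps: {len(latencies)}, "
      f"mean per-step latency: {sum(latencies) / len(latencies) * 1000:.1f} ms")
```

Running the same script on a T4 and a V100 would give the kind of side-by-side comparison asked for here.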