End to end; no streaming, no batching, just a single request.
In production, you will have many metrics. Almost certainly, you will use some kind of batching, and that affects performance. For vLLM, that means tuning max-num-seqs / max-num-batched-tokens (see https://www.github.com/vllm-project/vllm/issues/2492), which are limited by VRAM. You could tune them to reach your TTFT goals (or even run prefill on a separate machine; it's production, after all). But that is difficult to benchmark because of the large space of variables. We are open to suggestions.
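For anyone who wants to measure this themselves, below is a minimal sketch that streams a single request from an OpenAI-compatible vLLM server and times the first returned chunk. The endpoint, model name, prompt, and server flags are placeholder assumptions, and treating one streamed chunk as one token is an approximation, so take the numbers as indicative only.

```python
# Minimal TTFT / throughput sketch against an OpenAI-compatible vLLM server.
# Hypothetical server launch (illustrative values, tune per the reply above):
#   python -m vllm.entrypoints.openai.api_server --model Qwen/Qwen-7B-Chat \
#       --max-num-seqs 64 --max-num-batched-tokens 8192
import json
import time

import requests

BASE_URL = "http://localhost:8000/v1/completions"  # placeholder endpoint

payload = {
    "model": "Qwen/Qwen-7B-Chat",                  # placeholder model name
    "prompt": "Explain the KV cache in one paragraph.",
    "max_tokens": 256,
    "stream": True,  # streaming is required to observe the first token
}

start = time.perf_counter()
first_token_at = None
n_chunks = 0

with requests.post(BASE_URL, json=payload, stream=True) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if not line or not line.startswith(b"data: "):
            continue  # skip SSE keep-alives and blank lines
        chunk = line[len(b"data: "):]
        if chunk == b"[DONE]":
            break
        data = json.loads(chunk)
        if data["choices"][0].get("text"):
            if first_token_at is None:
                first_token_at = time.perf_counter()  # TTFT ends here
            n_chunks += 1  # ~1 token per chunk for most servers

end = time.perf_counter()
if first_token_at is not None and n_chunks > 1:
    print(f"TTFT:             {first_token_at - start:.3f} s")
    # Decode rate excludes TTFT; end-to-end rate includes it.
    print(f"decode tok/s:     {(n_chunks - 1) / (end - first_token_at):.1f}")
    print(f"end-to-end tok/s: {n_chunks / (end - start):.1f}")
```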
Has this been supported or requested before?
What is this feature about?
Could the Qwen performance report also include first-token latency?
Proposal
Currently, the Qwen performance report only provides tokens/s:
1. It does not state whether this is the end-to-end rate or the rate after excluding first-token latency.
2. It does not provide a Time To First Token (TTFT) metric, which is very important for online, interactive streaming tasks.

We suggest reporting TTFT separately, to help users with model selection.
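To make the distinction in point 1 concrete, here is a toy calculation showing how far the two rates can diverge; all numbers are invented for illustration and are not Qwen measurements.

```python
# Invented numbers, for illustration only (not Qwen measurements).
ttft = 0.8         # seconds until the first token (prefill dominates)
n_tokens = 256     # tokens generated
decode_time = 5.0  # seconds spent generating after the first token

end_to_end_rate = n_tokens / (ttft + decode_time)  # ~44.1 tok/s, includes TTFT
decode_rate = (n_tokens - 1) / decode_time         # ~51.0 tok/s, excludes TTFT
print(f"{end_to_end_rate:.1f} vs {decode_rate:.1f} tok/s")
```

A single tokens/s figure cannot distinguish these two cases, which is why reporting TTFT alongside throughput matters for interactive use.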
Contributions are welcome.