QwenLM / Qwen2.5

Qwen2.5 is the large language model series developed by Qwen team, Alibaba Cloud.

[REQUEST]: Could the Qwen performance report also include time-to-first-token latency? #1011

Open zhufeizzz opened 1 month ago

zhufeizzz commented 1 month ago

Has this been supported or requested before?

What is this feature about?

Could the Qwen performance report also include time-to-first-token latency?

Proposal

The current Qwen performance report only provides a tokens/s figure.

1. It does not state whether this is an end-to-end rate or a rate that excludes the first-token latency.
2. It does not provide a time-to-first-token (TTFT) metric, which is very important for online, interactive streaming tasks.

Please consider reporting TTFT separately; it would help users with model selection.
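For reference, TTFT can be measured client-side against any streaming endpoint by timing the gap between sending the request and receiving the first token. A minimal sketch (the `fake_stream` generator below is a stand-in for a real streaming API, with made-up prefill and per-token delays):

```python
import time

def measure_ttft(stream):
    """Return (TTFT in seconds, end-to-end tokens/s) for an iterable
    that yields tokens as they are generated."""
    start = time.perf_counter()
    ttft = None
    count = 0
    for _ in stream:
        if ttft is None:
            # First token arrived: this gap is the TTFT.
            ttft = time.perf_counter() - start
        count += 1
    total = time.perf_counter() - start
    return ttft, count / total

# Hypothetical stand-in: ~50 ms "prefill" delay, then ~10 ms per token.
def fake_stream(n=5, prefill=0.05, per_token=0.01):
    time.sleep(prefill)
    for i in range(n):
        yield f"tok{i}"
        time.sleep(per_token)

ttft, tps = measure_ttft(fake_stream())
```

The same loop works unchanged over a streaming response from an OpenAI-compatible server, which is why reporting TTFT alongside tokens/s would be cheap for the benchmark harness to add.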

Contributions are welcomed

jklj077 commented 4 weeks ago
  1. End to end; no streaming, no batching, just a single request.
  2. In production you will have many metrics. Most likely you will use some kind of batching, and that affects performance. For vLLM that means max-num-seqs / max-num-batched-tokens (see https://www.github.com/vllm-project/vllm/issues/2492), which are limited by VRAM. You could tune them to reach your TTFT goals (or even run prefilling on another machine; it's production, after all). But that is difficult to benchmark because of the large space of variables. We are open to suggestions.