End to end; no streaming, no batching, just a single request.
In production, you will have many metrics. Almost certainly, you will use some kind of batching, and that affects performance. For vLLM, that means tuning max-num-seqs / max-num-batched-tokens (see https://www.github.com/vllm-project/vllm/issues/2492), which are limited by VRAM. You could tune them to reach your TTFT goals (or even run prefill on a separate machine; it's production, after all). But that is difficult to benchmark because of the large space of variables. We are open to suggestions.
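For anyone who wants to measure this themselves, below is a minimal sketch that streams a single request from an OpenAI-compatible vLLM server and times the first returned chunk. The endpoint, model name, prompt, and server flags are placeholder assumptions, and treating one streamed chunk as one token is an approximation, so take the numbers as indicative only.

```python
# Minimal TTFT / throughput sketch against an OpenAI-compatible vLLM server.
# Hypothetical server launch (illustrative values, tune per the reply above):
#   python -m vllm.entrypoints.openai.api_server --model Qwen/Qwen-7B-Chat \
#       --max-num-seqs 64 --max-num-batched-tokens 8192
import json
import time

import requests

BASE_URL = "http://localhost:8000/v1/completions"  # placeholder endpoint

payload = {
    "model": "Qwen/Qwen-7B-Chat",                  # placeholder model name
    "prompt": "Explain the KV cache in one paragraph.",
    "max_tokens": 256,
    "stream": True,  # streaming is required to observe the first token
}

start = time.perf_counter()
first_token_at = None
n_chunks = 0

with requests.post(BASE_URL, json=payload, stream=True) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if not line or not line.startswith(b"data: "):
            continue  # skip SSE keep-alives and blank lines
        chunk = line[len(b"data: "):]
        if chunk == b"[DONE]":
            break
        data = json.loads(chunk)
        if data["choices"][0].get("text"):
            if first_token_at is None:
                first_token_at = time.perf_counter()  # TTFT ends here
            n_chunks += 1  # ~1 token per chunk for most servers

end = time.perf_counter()
if first_token_at is not None and n_chunks > 1:
    print(f"TTFT:             {first_token_at - start:.3f} s")
    # Decode rate excludes TTFT; end-to-end rate includes it.
    print(f"decode tok/s:     {(n_chunks - 1) / (end - first_token_at):.1f}")
    print(f"end-to-end tok/s: {n_chunks / (end - start):.1f}")
```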
Has this been supported or requested before?
What is this feature about?
Could the Qwen performance report also include first-token latency?
Proposal
Currently, the Qwen performance report only provides tokens/s:
1. It does not state whether this is the end-to-end rate or the rate after excluding first-token latency.
2. It does not provide a Time To First Token (TTFT) metric, which is very important for online, interactive streaming tasks.

We suggest reporting TTFT separately, to help users with model selection.
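To make the distinction in point 1 concrete, here is a toy calculation showing how far the two rates can diverge; all numbers are invented for illustration and are not Qwen measurements.

```python
# Invented numbers, for illustration only (not Qwen measurements).
ttft = 0.8         # seconds until the first token (prefill dominates)
n_tokens = 256     # tokens generated
decode_time = 5.0  # seconds spent generating after the first token

end_to_end_rate = n_tokens / (ttft + decode_time)  # ~44.1 tok/s, includes TTFT
decode_rate = (n_tokens - 1) / decode_time         # ~51.0 tok/s, excludes TTFT
print(f"{end_to_end_rate:.1f} vs {decode_rate:.1f} tok/s")
```

A single tokens/s figure cannot distinguish these two cases, which is why reporting TTFT alongside throughput matters for interactive use.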
Contributions are welcome.