Open benchislett opened 1 month ago
I think there was some original reasoning why they didn't use the ITL for the output length here, basically multiple tokens could be bundled:
We should double-check whether their reasoning is still valid.
@andoorve I see two competing issues here:
First, that some tokens are bundled together when decoded/yielded. This throws off the metrics because we would not count enough total tokens, so the averages would be off.
However, a secondary effect of re-tokenizing the generated text is that some words that were generated as a sequence of tokens can be re-tokenized into a single token representing the combined word. This leads to issues like the profile given above, where long sequences collapse into a small handful of tokens and metrics like TPOT go sky-high.
It makes sense to avoid using `len(ITLs)` to resolve the first issue. But using `len(tokenize(generated_output ...))` introduces a secondary issue. I would propose that the underlying issue is that the measured ITLs are incorrect in the case of multi-token words.
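For concreteness, here is a rough sketch of the two estimates being compared; the names are hypothetical, and `tokenizer` stands in for whatever tokenizer the benchmark uses:

```python
def count_from_itls(itls: list[float]) -> int:
    # One inter-token latency per decode step, plus one for the first token.
    # Under-counts whenever the server bundles several tokens into one chunk.
    return len(itls) + 1

def count_from_retokenization(generated_text: str, tokenizer) -> int:
    # Re-tokenizing the output is not the inverse of generation: runs of
    # whitespace and other mergeable pieces can collapse into fewer tokens.
    return len(tokenizer.encode(generated_text, add_special_tokens=False))
```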
Yes, I agree 100%. From the benchmarking POV we are stuck with either the first issue or the second issue, depending on what we do.
> I would propose that the underlying issue is that the measured ITLs are incorrect in the case of multi-token words.
Yup, exactly. I think this may already have been fixed on the vLLM side, i.e. emitting one token per output stream, but I'm not completely sure. If it's been fixed already, I think we're good to go with this change directly. If not, it just depends on whether the primary issue or the secondary issue causes more inaccuracy.
@andoorve would something like `min(max_expected_length, max(tokenized_length, itl_length + 1))` make sense? This isn't perfectly robust, but I think it would get us closer to the expected answer. I suppose there might be some trouble in determining an upper bound in some cases. Do we even care about over-estimating vs. under-estimating?
I think that makes sense, but as you said, the difficulty would be in getting `max_expected_length`. We don't have any good reason to prefer over-estimating to under-estimating or vice versa, so as long as we make a best effort to keep it close, it should be good enough. Personally, even `max(tokenized_length, itl_length + 1)` could be OK, with some caveats stated in the docs, since it's out of the control of the benchmarker.
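Roughly, the best-effort estimate being discussed would look like this (names are hypothetical):

```python
def estimate_output_tokens(tokenized_length: int, itl_length: int) -> int:
    # The true count is at least the number of streamed chunks plus one
    # (the first token) and at least the re-tokenized length, so take the
    # larger of the two as a best-effort estimate.
    return max(tokenized_length, itl_length + 1)
```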
This PR changes the way that the flexible-inference-benchmark performance analysis post-processor counts the number of tokens in an evaluation.
Previously, the generated text was re-tokenized and the total number of tokens was counted. But this is not always an invertible process. For instance, a sequence of N identical "space" characters may be emitted; when re-tokenized, these collapse into a small collection of "large space" tokens, each representing a long run of whitespace. Llama-7b, for example, can tokenize `" " * 15` into a single token. This means that the ITL count and the expected sequence length do not match, and when dividing `Latency - TTFT` by the incorrect number of output tokens, the resulting `TPOT` is incorrect. To fix this, we simply use the number of ITL samples plus one.

Below is a side-by-side comparison of CServe outputs before and after this patch. The discrepancy is mostly caused by one erroneous sample whose result was a sequence of over 100 spaces that was expected to be only 7 tokens, due to the wide-space tokens.
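For reference, a rough sketch of the corrected per-request computation (names are hypothetical; the actual post-processor may differ in details such as the divisor convention):

```python
def tpot(latency: float, ttft: float, itls: list[float]) -> float:
    # With this patch, the output-token count is taken as len(ITLs) + 1
    # rather than the length of the re-tokenized generated text.
    num_output_tokens = len(itls) + 1
    # Spread the decode-phase time over the output tokens.
    return (latency - ttft) / num_output_tokens
```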
Also included in this PR is a general update to the README, whose references appear out-of-date.