Open benchislett opened 1 month ago
I think there was some original reasoning why they didn't use the ITL for the output length here, basically multiple tokens could be bundled:
We should double-check whether their reasoning is still valid.
@andoorve I see two competing issues here:
First, that some tokens are bundled together when decoded/yielded. This throws off the metrics because we would not count enough total tokens, so the averages would be off.
However, a secondary effect of re-tokenizing the generated text is that some words that were generated as a sequence of tokens can be re-tokenized into a single token representing the combined word. This leads to issues like the profile given above, where long sequences collapse into a small handful of tokens and metrics like TPOT go sky-high.
It makes sense to avoid using `len(ITLs)` to resolve the first issue. But using `len(tokenize(generated_output ...))` introduces a secondary issue. I would propose that the underlying issue is that the measured ITLs are incorrect in the case of multi-token words.
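For concreteness, here is a rough sketch of the two estimates being compared; the names are hypothetical, and `tokenizer` stands in for whatever tokenizer the benchmark uses:

```python
def count_from_itls(itls: list[float]) -> int:
    # One inter-token latency per decode step, plus one for the first token.
    # Under-counts whenever the server bundles several tokens into one chunk.
    return len(itls) + 1

def count_from_retokenization(generated_text: str, tokenizer) -> int:
    # Re-tokenizing the output is not the inverse of generation: runs of
    # whitespace and other mergeable pieces can collapse into fewer tokens.
    return len(tokenizer.encode(generated_text, add_special_tokens=False))
```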
Yes, I agree 100%. From the benchmarking POV we are stuck with either the first issue or the second issue, depending on what we do.
> I would propose that the underlying issue is that the measured ITLs are incorrect in the case of multi-token words.
Yup, exactly. I think this may already have been fixed on the vLLM side, i.e. emitting one token per output stream, but I'm not completely sure. If it's been fixed already, I think we're good to go with this change directly. If not, it just depends on whether the primary issue or the secondary issue causes more inaccuracy.
@andoorve would something like `min(max_expected_length, max(tokenized_length, itl_length + 1))` make sense? This isn't perfectly robust, but I think it would get us closer to the expected answer. I suppose there might be some trouble in determining an upper bound in some cases. Do we even care about over-estimating vs. under-estimating?
I think that makes sense, but as you said, the difficulty would be in getting `max_expected_length`. We don't have any good reason to prefer over-estimating to under-estimating or vice versa, so as long as we make a best effort to keep it close, it should be good enough. Personally, even `max(tokenized_length, itl_length + 1)` could be OK, with some caveats stated in the docs, since it's out of the control of the benchmarker.
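Roughly, the best-effort estimate being discussed would look like this (names are hypothetical):

```python
def estimate_output_tokens(tokenized_length: int, itl_length: int) -> int:
    # The true count is at least the number of streamed chunks plus one
    # (the first token) and at least the re-tokenized length, so take the
    # larger of the two as a best-effort estimate.
    return max(tokenized_length, itl_length + 1)
```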
This PR changes the way that the flexible-inference-benchmark performance analysis post-processor counts the number of tokens in an evaluation.
Previously, the generated text was re-tokenized and the total number of tokens was counted. But this is not always an invertible process. For instance, a sequence of N identical "space" characters may be emitted; when re-tokenized, these collapse into a small collection of "large space" tokens, each representing a long run of whitespace. Llama-7b, for example, can tokenize `" " * 15` into a single token. This means that the ITL count and the expected sequence length do not match, and when dividing `Latency - TTFT` by the incorrect number of output tokens, the resulting `TPOT` is incorrect. To fix this, we simply use the number of ITL samples plus one.

Below is a side-by-side comparison of CServe outputs before and after this patch. The discrepancy is mostly caused by one erroneous sample whose result was a sequence of over 100 spaces that was expected to be only 7 tokens, due to the wide-space tokens.
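For reference, a rough sketch of the corrected per-request computation (names are hypothetical; the actual post-processor may differ in details such as the divisor convention):

```python
def tpot(latency: float, ttft: float, itls: list[float]) -> float:
    # With this patch, the output-token count is taken as len(ITLs) + 1
    # rather than the length of the re-tokenized generated text.
    num_output_tokens = len(itls) + 1
    # Spread the decode-phase time over the output tokens.
    return (latency - ttft) / num_output_tokens
```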
Also included in this PR is a general update to the README, whose references appear out-of-date.