Open DriverSong opened 1 month ago
Hi there 👋
We don't actually report throughput benchmarks for the 8B model on an H100; all 8B throughput benchmarks were run on an NVIDIA L4. We did run throughput benchmarks for the 70B model on an H100, so you should be able to reproduce those on your hardware.
But to answer your question, yes, there absolutely is. The improvement in throughput is very small when the amount of allocatable cache space is large (e.g. when running small models on GPUs with lots of vRAM), since large decoding batches can then fit in cache even without any compression. Because of this, our evaluations focus on model/hardware configurations where most of the available memory is allocated to the model parameters.
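To make the "allocatable cache space" point concrete, here's a rough back-of-envelope sketch (all numbers are illustrative assumptions, not measured values): it estimates how much GPU memory is left for the KV cache after loading FP16 weights for an 8B model on an L4 versus an H100, assuming a 0.9 memory-utilization fraction.

```python
# Rough estimate of memory left for KV cache after loading model weights.
# All sizes and the 0.9 utilization factor are illustrative assumptions.

GIB = 1024 ** 3

def free_cache_gib(gpu_mem_gib: float, n_params_b: float,
                   bytes_per_param: int = 2, mem_util: float = 0.9) -> float:
    """GiB available for KV cache = usable GPU memory minus FP16 weight footprint."""
    usable = gpu_mem_gib * mem_util                      # fraction the engine may use
    weights = n_params_b * 1e9 * bytes_per_param / GIB   # weight bytes converted to GiB
    return usable - weights

# 8B model: an L4 (~24 GiB) leaves little room for cache, an H100 (~80 GiB) leaves a lot.
for gpu, mem_gib in [("L4", 24), ("H100", 80)]:
    print(f"{gpu}: ~{free_cache_gib(mem_gib, 8):.1f} GiB left for KV cache")
```

By this rough estimate the H100 has on the order of 55-60 GiB free for KV cache with an 8B model, versus roughly 7 GiB on an L4, which is why the 8B numbers we report come from the L4.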
An additional factor is the input token length. The shorter the inputs are, the easier it is to fit large batches into limited cache space without compression. As input length increases, eventually the limited cache space will impose a limit on the maximum batch size that can be used during decoding, hurting throughput. At this point, compression can be used to free up cache space and improve throughput.
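Continuing the same back-of-envelope reasoning, the sketch below estimates the cache-bound maximum decoding batch size as a function of sequence length and compression. The per-token KV size uses typical Llama-3-8B dimensions (32 layers, 8 KV heads, head dim 128, FP16), the free-cache figure is the assumed H100 number from the previous sketch, and "compression" is modeled as a plain divisor on KV size, which may not match exactly how --compression-rate is defined in the benchmark script.

```python
# Cache-bound max decoding batch size vs. sequence length, with and without compression.
# Model dims are typical Llama-3-8B values; the free-cache size is an assumed figure.

GIB = 1024 ** 3

def kv_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_el=2) -> int:
    """FP16 bytes stored per token for K and V across all layers."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_el

def max_decode_batch(free_cache_gib: float, seq_len: int, compression: float = 1.0) -> int:
    """Largest batch whose (compressed) KV cache fits in the free cache space."""
    per_seq_bytes = kv_bytes_per_token() * seq_len / compression
    return int(free_cache_gib * GIB // per_seq_bytes)

FREE_CACHE_GIB = 57.0  # assumed free cache for an 8B model on an 80 GiB H100
for seq_len in (128, 512, 2048, 8192):
    plain = max_decode_batch(FREE_CACHE_GIB, seq_len)
    comp = max_decode_batch(FREE_CACHE_GIB, seq_len, compression=4.0)
    print(f"seq_len={seq_len:5d}: max batch {plain:6d} uncompressed, {comp:6d} at 4x compression")
```

For short sequences the cache-bound batch size is in the hundreds or thousands even without compression, so if the scheduler is already limited by something else (the number of prompts, max_num_seqs, etc.), raising that ceiling further can't improve throughput; that would be consistent with the flat pattern you're seeing on the 8B/H100 configuration. Only once the sequences are long enough for the cache to become the binding constraint does compression start to pay off.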
Thanks for taking the time to reproduce the results. If you have any runs with input_len > 500, I'd be interested to see them!
Your current environment
How would you like to use vllm
I've run the benchmark script benchmark_llama3_8b.sh on 1x H100 with the model Meta-Llama-3.1-8B-Instruct. I kept the params unchanged except for --num-prompts, which I set to 1000, and ran the script. The results are as follows: the throughput does not show a significant or regular pattern as --compression-rate changes. Is there a limit to the improvement in throughput from KV compression?