mkserge opened this issue 3 days ago
@mkserge at such low batch sizes the time needed for prefill is very small, so you don't see a big difference in performance. Prefill is usually compute-bound, and for your setup (8x A100) the theoretical peak is around 2.5 PFLOPS. The FLOPs needed for a ~1200-token prefill of a 7B model are around 8.5 TFLOPs, meaning the theoretical time for a batch-size-1 prefill is around 3.5 ms. In reality it is more than this, but it gives a good idea of the magnitude. Therefore, until your prefill token count becomes significant, you won't see a big benefit from caching the prefill.
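A rough back-of-the-envelope sketch of that arithmetic (assuming ~1 FLOP per parameter per prefill token, which reproduces the ~8.5 TFLOPs figure above, and ~312 TFLOPS dense BF16 peak per A100; the common 2·N-FLOPs-per-token rule of thumb would roughly double the numbers without changing the conclusion):

```python
# Back-of-the-envelope prefill-cost estimate for a 7B model on 8x A100.
params = 7e9              # model parameters
prefill_tokens = 1214     # fixed instruction prompt length from the benchmark
peak_flops = 8 * 312e12   # ~2.5 PFLOPS theoretical dense BF16 peak across 8 A100s

flops_needed = params * prefill_tokens            # ~8.5 TFLOPs (at ~1 FLOP/param/token)
theoretical_prefill_s = flops_needed / peak_flops # ~3.4 ms

print(f"FLOPs for prefill: {flops_needed / 1e12:.1f} TFLOPs")
print(f"Theoretical batch-size-1 prefill time: {theoretical_prefill_s * 1e3:.1f} ms")
# Real prefill time is several times higher (attention, tensor-parallel communication,
# imperfect utilization), but it stays small relative to end-to-end request latency
# at low request rates.
```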
Hello,
I am benchmarking KV cache re-use with the Mistral 7B model using tensor parallelism across 8 A100 GPUs (A100-SXM4-40GB).
My instruction prompt is fixed at 1214 tokens, and the maximum sequence length is 1357 tokens (input + output).
From the graph, throughput at a given latency threshold increases significantly, which seems to make sense, but I am a bit surprised by the much smaller gain in latency at low request rates. For example, at request_rate = 1, the average sequence latency only goes from 115.35 ms down to 94.56 ms when re-using the KV cache. Isn't this low, considering that a very large chunk of the input prompt is cached?
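One rough way to read those numbers (a sketch, not a measurement; it assumes generations run close to the 143-token cap implied by the sequence-length limit, and that the latency difference is essentially the prefill cost):

```python
# Hedged sanity check: split the reported per-request latency into a prefill part
# and a decode part, using only the numbers quoted above.
input_tokens = 1214                          # fixed instruction prompt
max_seq_len = 1357                           # input + output cap
output_tokens = max_seq_len - input_tokens   # up to 143 generated tokens (assumed close to the cap)

latency_no_reuse_ms = 115.35
latency_reuse_ms = 94.56

# With the prompt fully cached, the saving is bounded by the real prefill time.
prefill_saving_ms = latency_no_reuse_ms - latency_reuse_ms  # ~20.8 ms
decode_ms_per_token = latency_reuse_ms / output_tokens      # ~0.66 ms/token if outputs hit the cap

print(f"Latency saved by KV cache re-use: {prefill_saving_ms:.1f} ms")
print(f"Implied decode cost: {decode_ms_per_token:.2f} ms/token over {output_tokens} tokens")
# Decoding ~143 tokens dominates the end-to-end latency, so even caching the whole
# 1214-token prompt can only remove the ~20 ms prefill portion of each request.
```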
Results:
For reference, I build the model with
and benchmark it using