Could you add a `.gitattributes` file with `*.ipynb linguist-documentation` and check that the GitHub page stops displaying "74% Jupyter" in the languages section?
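For reference, the whole `.gitattributes` file would just be:

```
*.ipynb linguist-documentation
```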
(I didn't manage to add inline comments in the notebook on the GitHub UI, so I'll add them in this main thread.)
Curiosity question: is `get_size_of_cache` necessary? We know the formula for it: $\text{size} = 2 \, n_{\text{layers}} \, n_{\text{heads}} \, d \cdot \text{precision}$, with $\text{precision} = 2$ for float16 and $\text{precision} = 0.5$ for int4. The quantized cache adds some parameters with `_scale` and `_shift`; does it make a big difference?
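As a rough illustration of that formula (reading it as bytes per cached token, and with made-up Llama-like shapes rather than the notebook's actual model):

```python
# Rough sketch of the cache-size formula above, read as bytes per cached token.
# The shapes are illustrative (Llama-3-8B-like), not taken from the notebook.
def kv_cache_size_gb(n_layers, n_kv_heads, head_dim, n_tokens, precision_bytes):
    bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * precision_bytes  # 2x for keys and values
    return bytes_per_token * n_tokens / 1e9

print(kv_cache_size_gb(32, 8, 128, 8_000, precision_bytes=2))    # float16: ~1.0 GB
print(kv_cache_size_gb(32, 8, 128, 8_000, precision_bytes=0.5))  # int4:    ~0.26 GB
```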
Maybe use a global variable for `attn_implementation` after `ckpt=...`, `device=...`, with a comment `# use attn_implementation="sdpa" if no access to flash attention`.
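Something like this at the top of the notebook (the checkpoint name below is only a placeholder):

```python
# Hypothetical setup cell; ckpt and device mirror the names already used in the
# notebook, the checkpoint string is only a placeholder.
ckpt = "meta-llama/Meta-Llama-3.1-8B-Instruct"
device = "cuda:0"
attn_implementation = "flash_attention_2"  # use attn_implementation="sdpa" if no access to flash attention
```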
About the plots:

It would be great to add a single plot that illustrates why to use kvpress. I propose to recycle the data you created in the notebook with what I believe will interest users the most:

- bfloat16 cache
- int4 cache
- bfloat16 cache + 50% compression
- int4 cache + 50% compression

I would also add a horizontal dashed line at X and clip all curves below this X value, to clearly show that with compression/quantization you can fit more context length on your GPU (a rough sketch follows below). The plot could be saved in an `assets` directory in the root of the repo, along with the `kvpress.jpg` file. I would display it at the end of the intro.
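A minimal matplotlib sketch of what I have in mind (all numbers are made-up placeholders, not results from the notebook):

```python
from pathlib import Path

import matplotlib.pyplot as plt

# Made-up placeholder data, only to illustrate the proposed plot layout.
context_lengths = [8_000, 16_000, 32_000, 64_000, 128_000]
peak_memory_gb = {
    "bfloat16 cache": [20, 26, 38, 62, 110],
    "int4 cache": [18, 21, 27, 39, 63],
    "bfloat16 cache + 50% compression": [19, 23, 31, 47, 79],
    "int4 cache + 50% compression": [17, 19, 23, 31, 47],
}
gpu_memory_limit_gb = 80  # the "X" value, e.g. the memory of the benchmark GPU

fig, ax = plt.subplots()
for label, memory in peak_memory_gb.items():
    # keep only the points that fit under the dashed line, so it is obvious
    # how much more context each setting can handle
    points = [(c, m) for c, m in zip(context_lengths, memory) if m <= gpu_memory_limit_gb]
    ax.plot([c for c, _ in points], [m for _, m in points], marker="o", label=label)
ax.axhline(gpu_memory_limit_gb, linestyle="--", color="gray", label="GPU memory limit")
ax.set_xlabel("Context length (tokens)")
ax.set_ylabel("Peak memory (GB)")
ax.legend()

Path("assets").mkdir(exist_ok=True)
fig.savefig("assets/peak_memory_vs_context_length.png", dpi=150, bbox_inches="tight")
```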
Something I don't understand about memory usage: with int4, the cache size goes from ~8 to ~2, but peak memory only goes from ~33 to ~30. Two questions: I guess it's related to the `_scale` and `_shift` parameters; it would be great to add a comment on this, else it's a bit confusing.
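For what it's worth, a back-of-the-envelope estimate of the `_scale`/`_shift` overhead (assuming per-group quantization with a group size of 64 and fp16 scale/shift, which may not match the actual cache settings):

```python
# Assumed settings: int4 values quantized per group of 64, with one fp16 scale
# and one fp16 shift per group. These are guesses, not read from the code.
group_size = 64
packed_int4_bytes = group_size * 0.5   # 32 bytes of packed int4 values per group
scale_shift_bytes = 2 + 2              # fp16 scale + fp16 shift per group
overhead = scale_shift_bytes / packed_int4_bytes
print(f"~{overhead:.1%} extra on top of the pure int4 size")  # ~12.5%
```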
Thanks for the feedback!
> `attn_implementation`

IMO, it's OK to assume flash attention is available when running the benchmarks.

> Is `get_size_of_cache` necessary?

It is not directly necessary; I added the function as it is more explicit, IMO.

> I would remove the dashed line

Makes sense, thanks for spotting.

> 4. The generation time for the quantized cache is twice as slow as with float16, is it expected?

Interestingly, the prefilling time is similar for both methods. It seems there is a difference between prefilling and generation time.

> Could you use the same y-axis for both bfloat16 and int4?

Yeah, I can change that.

> Maybe remove the 400 tokens curve?

Makes sense, I'll replot starting at 8_000 tokens.

Thanks for the updates and cool plots. Could you add the last plot in the main README? (And maybe create an assets dir for the image and `kvpress.jpg`.)
> Could you add the last plot in the main README?

It is in the README under the evaluation tab (I can also move it somewhere else, or create a new section). I placed the image under `evaluation/assets`, but I can also move it to a new `assets` folder, probably together with the repo logo.
This PR updates the `speed_and_memory.ipynb` notebook. The notebook now plots cache size, peak memory usage, and generation time for the different cache settings. I did not include prefilling speed, as this is mostly independent of the cache size (and the repo isn't designed to optimize this part).

Apart from that, the spelling of "Hugging Face" was changed in various files.