NVIDIA / kvpress

LLM KV cache compression made easy

speed and memory plots #10

Closed maxjeblick closed 3 days ago

maxjeblick commented 6 days ago

This PR updates the speed_and_memory.ipynb notebook.

The notebook now plots

I did not include prefilling speed, as this is mostly independent of the cache size (and the repo isn't designed to optimize this part).

Apart from that, the spelling of "Hugging Face" was changed in various files.

SimJeg commented 4 days ago

Could you add a .gitattributes file with:

`*.ipynb linguist-documentation`

and check that the GitHub page stops displaying "74% Jupyter" in the languages section?

SimJeg commented 4 days ago

(I can't manage to add inline comments to the notebook in the GitHub UI, so I'll add them in this main thread.)

Curiosity question: is get_size_of_cache necessary? We know the formula for it: $\text{size} = 2 \, n_{layers} \, n_{heads} \, d \, n_{tokens} \cdot \text{precision}$ with $\text{precision}=2$ for float16 and $\text{precision}=0.5$ for int4. The quantized cache adds some parameters with `_scale` and `_shift`; does it make a big difference?
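
For reference, a minimal sketch of that back-of-the-envelope computation (the function and variable names are illustrative, not the notebook's actual get_size_of_cache):

```python
def kv_cache_size_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                     n_tokens: int, precision_bytes: float = 2.0) -> float:
    """Approximate KV cache size in GB.

    The factor 2 accounts for keys and values; precision_bytes is 2 for float16
    and 0.5 for int4 (ignoring the extra _scale / _shift tensors of the quantized cache).
    """
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * precision_bytes / 1e9


# Example: a Llama-3.1-8B-style config (32 layers, 8 KV heads, head_dim 128) at 10k cached tokens
print(kv_cache_size_gb(32, 8, 128, 10_000))                        # float16: ~1.3 GB
print(kv_cache_size_gb(32, 8, 128, 10_000, precision_bytes=0.5))   # int4:    ~0.3 GB
```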

SimJeg commented 4 days ago

Maybe use a global variable for `attn_implementation` after `ckpt=...`, `device=...`, with a comment `# use attn_implementation="sdpa" if no access to flash attention`.
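
Concretely, something like this near the top of the notebook (the checkpoint name is only an example):

```python
# Global settings shared by all benchmark cells
ckpt = "meta-llama/Meta-Llama-3.1-8B-Instruct"  # example checkpoint
device = "cuda:0"
attn_implementation = "flash_attention_2"  # use attn_implementation="sdpa" if no access to flash attention
```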

SimJeg commented 4 days ago

About the plots:

SimJeg commented 4 days ago

It would be great to add a single plot that illustrates why one should use kvpress. I propose recycling the data you created in the notebook into what I believe will interest users the most:

Also, I would add a horizontal dashed line at X and clip all curves below this X value to clearly show that with compression / quantization you can fit a longer context on your GPU (see the sketch below).

The plot could be saved in an assets directory at the root of the repo, along with the kvpress.jpg file. I would display it at the end of the intro.
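
For the dashed-line / clipping idea, a minimal matplotlib sketch (the data structure and the memory limit are hypothetical placeholders for whatever the notebook actually measured):

```python
import matplotlib.pyplot as plt

def plot_memory_vs_context(results, gpu_memory_limit_gb, path="assets/memory_vs_context.png"):
    """results: mapping {label: (context_lengths, peak_memory_gb)} built from the
    notebook's measurements (hypothetical structure); gpu_memory_limit_gb is the "X" above."""
    fig, ax = plt.subplots()
    for label, (context_lengths, memory_gb) in results.items():
        # Keep only the points under the memory budget, so each curve visibly
        # stops where the GPU would run out of memory.
        kept = [(c, m) for c, m in zip(context_lengths, memory_gb) if m <= gpu_memory_limit_gb]
        if kept:
            ax.plot(*zip(*kept), label=label)
    ax.axhline(gpu_memory_limit_gb, linestyle="--", color="gray", label="GPU memory limit")
    ax.set_xlabel("Context length (tokens)")
    ax.set_ylabel("Peak memory (GB)")
    ax.legend()
    fig.savefig(path)
    return fig
```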

SimJeg commented 4 days ago

Something I don't understand about memory usage:

maxjeblick commented 4 days ago

Thanks for the feedback!

  1. `attn_implementation`: IMO, it's OK to assume flash attention is available when running the benchmarks.
  2. As you pointed out, `get_size_of_cache` is not strictly necessary. I added the function as it is more explicit, IMO.
  3. "I would remove the dashed line": makes sense, thanks for spotting that.
  4. "The generation time for the quantized cache is twice as slow as with float16, is it expected?": interestingly, the prefilling time is similar for both methods, so there seems to be a difference between prefilling and generation behaviour here.
  5. "Could you use the same y-axis for both bfloat16 and int4?": yeah, I can change that.
  6. "Maybe remove the 400 tokens curve?": makes sense, I'll replot starting at 8,000 tokens.
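
On the generation-time question, a rough way to compare the two caches (a sketch assuming a recent transformers version that supports `cache_implementation="quantized"` with the quanto backend; checkpoint and prompt are placeholders):

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

ckpt = "meta-llama/Meta-Llama-3.1-8B-Instruct"  # example checkpoint
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForCausalLM.from_pretrained(ckpt, torch_dtype=torch.float16, device_map="auto")
inputs = tokenizer("Some long context ...", return_tensors="pt").to(model.device)

def timed_generate(**cache_kwargs):
    # Time prefill + decoding of 256 tokens; synchronize so CUDA work is included.
    torch.cuda.synchronize()
    start = time.perf_counter()
    model.generate(**inputs, max_new_tokens=256, do_sample=False, **cache_kwargs)
    torch.cuda.synchronize()
    return time.perf_counter() - start

t_fp16 = timed_generate()  # default float16 dynamic cache
t_int4 = timed_generate(cache_implementation="quantized",      # requires the quanto package
                        cache_config={"backend": "quanto", "nbits": 4})
print(f"generation time: float16 {t_fp16:.1f}s vs int4 {t_int4:.1f}s")
```
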
SimJeg commented 3 days ago

Thanks for the updates and cool plots. Could you add the last plot to the main README? (And maybe create an assets dir for the image and kvpress.jpg.)

maxjeblick commented 3 days ago

> Could you add the last plot to the main README?

It is in the README, under the evaluation section (I can also move it somewhere else, or create a new section). I placed the image under evaluation/assets, but I can also move it to a new assets folder, probably together with the repo logo.