Could you add a `.gitattributes` file with `*.ipynb linguist-documentation` and check that the GitHub page stops displaying "74% Jupyter" in the languages section?
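For reference, the whole `.gitattributes` file would just be:

```
*.ipynb linguist-documentation
```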
(I didn't manage to add inline comments in the notebook on the GitHub UI, so I'll add them in this main thread.)
Curiosity question: is `get_size_of_cache` necessary? We know the formula for it: $\text{size} = 2 \, n_{\text{layers}} \, n_{\text{heads}} \, d \cdot \text{precision}$, with $\text{precision} = 2$ for float16 and $\text{precision} = 0.5$ for int4. The quantized cache adds some parameters with `_scale` and `_shift`; does it make a big difference?
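As a rough illustration of that formula (reading it as bytes per cached token, and with made-up Llama-like shapes rather than the notebook's actual model):

```python
# Rough sketch of the cache-size formula above, read as bytes per cached token.
# The shapes are illustrative (Llama-3-8B-like), not taken from the notebook.
def kv_cache_size_gb(n_layers, n_kv_heads, head_dim, n_tokens, precision_bytes):
    bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * precision_bytes  # 2x for keys and values
    return bytes_per_token * n_tokens / 1e9

print(kv_cache_size_gb(32, 8, 128, 8_000, precision_bytes=2))    # float16: ~1.0 GB
print(kv_cache_size_gb(32, 8, 128, 8_000, precision_bytes=0.5))  # int4:    ~0.26 GB
```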
Maybe use a global variable for `attn_implementation` after `ckpt=...`, `device=...`, with a comment `# use attn_implementation="sdpa" if no access to flash attention`.
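Something like this at the top of the notebook (the checkpoint name below is only a placeholder):

```python
# Hypothetical setup cell; ckpt and device mirror the names already used in the
# notebook, the checkpoint string is only a placeholder.
ckpt = "meta-llama/Meta-Llama-3.1-8B-Instruct"
device = "cuda:0"
attn_implementation = "flash_attention_2"  # use attn_implementation="sdpa" if no access to flash attention
```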
About the plots:

It would be great to add a single plot that illustrates why to use kvpress. I propose to recycle the data you created in the notebook with what I believe will interest users the most:

- bfloat16 cache
- int4 cache
- bfloat16 cache + 50% compression
- int4 cache + 50% compression

I would also add a horizontal dashed line at X and clip all curves below this X value, to clearly show that with compression/quantization you can fit more context length on your GPU (a rough sketch follows below). The plot could be saved in an `assets` directory in the root of the repo, along with the `kvpress.jpg` file. I would display it at the end of the intro.
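A minimal matplotlib sketch of what I have in mind (all numbers are made-up placeholders, not results from the notebook):

```python
from pathlib import Path

import matplotlib.pyplot as plt

# Made-up placeholder data, only to illustrate the proposed plot layout.
context_lengths = [8_000, 16_000, 32_000, 64_000, 128_000]
peak_memory_gb = {
    "bfloat16 cache": [20, 26, 38, 62, 110],
    "int4 cache": [18, 21, 27, 39, 63],
    "bfloat16 cache + 50% compression": [19, 23, 31, 47, 79],
    "int4 cache + 50% compression": [17, 19, 23, 31, 47],
}
gpu_memory_limit_gb = 80  # the "X" value, e.g. the memory of the benchmark GPU

fig, ax = plt.subplots()
for label, memory in peak_memory_gb.items():
    # keep only the points that fit under the dashed line, so it is obvious
    # how much more context each setting can handle
    points = [(c, m) for c, m in zip(context_lengths, memory) if m <= gpu_memory_limit_gb]
    ax.plot([c for c, _ in points], [m for _, m in points], marker="o", label=label)
ax.axhline(gpu_memory_limit_gb, linestyle="--", color="gray", label="GPU memory limit")
ax.set_xlabel("Context length (tokens)")
ax.set_ylabel("Peak memory (GB)")
ax.legend()

Path("assets").mkdir(exist_ok=True)
fig.savefig("assets/peak_memory_vs_context_length.png", dpi=150, bbox_inches="tight")
```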
Something I don't understand about memory usage: with int4, the cache size goes from ~8 to ~2, but peak memory only goes from ~33 to ~30. Two questions: I guess it's related to the `_scale` and `_shift` parameters; it would be great to add a comment on this, else it's a bit confusing.
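For what it's worth, a back-of-the-envelope estimate of the `_scale`/`_shift` overhead (assuming per-group quantization with a group size of 64 and fp16 scale/shift, which may not match the actual cache settings):

```python
# Assumed settings: int4 values quantized per group of 64, with one fp16 scale
# and one fp16 shift per group. These are guesses, not read from the code.
group_size = 64
packed_int4_bytes = group_size * 0.5   # 32 bytes of packed int4 values per group
scale_shift_bytes = 2 + 2              # fp16 scale + fp16 shift per group
overhead = scale_shift_bytes / packed_int4_bytes
print(f"~{overhead:.1%} extra on top of the pure int4 size")  # ~12.5%
```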
Thanks for the feedback!
> `attn_implementation`

IMO, it's OK to assume flash attention is available when running the benchmarks.

> Is `get_size_of_cache` necessary?

It is not directly necessary; I added the function as it is more explicit, IMO.

> I would remove the dashed line

Makes sense, thanks for spotting.

> 4. The generation time for the quantized cache is twice as slow as with float16, is it expected?

Interestingly, the prefilling time is similar for both methods. It seems there is a difference between prefilling and generation time.

> Could you use the same y-axis for both bfloat16 and int4?

Yeah, I can change that.

> Maybe remove the 400 tokens curve?

Makes sense, I'll replot starting at 8_000 tokens.

Thanks for the updates and cool plots. Could you add the last plot in the main README? (And maybe create an assets dir for the image and `kvpress.jpg`.)
> Could you add the last plot in the main README?

It is in the README under the evaluation tab (I can also move it somewhere else, or create a new section). I placed the image under `evaluation/assets`, but I can also move it to a new `assets` folder, probably together with the repo logo.
This PR updates the `speed_and_memory.ipynb` notebook. The notebook now plots cache size, peak memory usage, and generation time for the different cache settings. I did not include prefilling speed, as this is mostly independent of the cache size (and the repo isn't designed to optimize this part).

Apart from that, the spelling of "Hugging Face" was changed in various files.