ModelTC / llmc

[EMNLP 2024 Industry Track] This is the official PyTorch implementation of "LLMC: Benchmarking Large Language Model Quantization with a Versatile Compression Toolkit".
https://arxiv.org/abs/2405.06001
Apache License 2.0

KV cache / post-RoPE rotation & quantization in QuaRot #148

Closed · sasha-hailo closed this issue 1 month ago

sasha-hailo commented 1 month ago

Hello! First of all, thank you for your effort in creating and sharing this useful repo!

I'm looking into the code of the QuaRot implementation. My apologies in advance if I'm missing something, but I could not find where your code implements rotation and quantization of the KV cache (in particular, of the post-RoPE K values). Do you implement this functionality? If so, could you please point out where it is done?
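For clarity, here is a rough sketch of the operation I'm asking about (the helper names are mine, not from llmc): an orthogonal rotation applied to Q and K after RoPE cancels out in the attention scores, so the rotated K can be quantized before caching.

```python
# Illustrative sketch, not llmc code: QuaRot-style post-RoPE key handling.
# An orthogonal rotation R is applied to Q and K *after* RoPE; since
# R @ R.T == I, the attention scores Q K^T are unchanged, while the rotated
# K is friendlier to quantization (outliers are spread across channels).
import torch

def random_orthogonal(dim: int) -> torch.Tensor:
    # Any orthogonal matrix preserves the identity below; QuaRot uses a
    # Hadamard transform, a random orthogonal matrix is used here for brevity.
    q, _ = torch.linalg.qr(torch.randn(dim, dim))
    return q

def fake_quantize(x: torch.Tensor, n_bits: int = 4) -> torch.Tensor:
    # Symmetric per-token fake quantization: quantize, then dequantize.
    qmax = 2 ** (n_bits - 1) - 1
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / qmax
    return (x / scale).round().clamp(-qmax - 1, qmax) * scale

head_dim = 64
q = torch.randn(2, 16, head_dim)  # post-RoPE queries
k = torch.randn(2, 16, head_dim)  # post-RoPE keys
R = random_orthogonal(head_dim)

k_cache = fake_quantize(k @ R, n_bits=4)       # rotated K is what gets cached
scores = (q @ R) @ k_cache.transpose(-1, -2)   # ~= q @ k.T up to quant error
```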

Thanks in advance!

Harahan commented 1 month ago

No, we do not implement KV cache quantization.

sasha-hailo commented 1 month ago

@Harahan, thank you for your response. Would you consider this a feature request? I see that you are also in the process of implementing SpinQuant (which is great!), and I think that neither QuaRot nor SpinQuant support can be complete without this feature.

Harahan commented 1 month ago

Sorry, we don't plan to add this feature. You can implement it yourself.

sasha-hailo commented 3 weeks ago

@Harahan, sorry for getting back to this point, but the LLMC paper (https://arxiv.org/abs/2405.06001) explicitly mentions evaluations of KV cache quantization (Appendix A.4). How can those results be reproduced? :)

Harahan commented 3 weeks ago

@sasha-hailo For that section, we ran the benchmarks directly with LightLLM (simulated quantization at 2-bit, and real quantization at 4-bit and 8-bit).
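Roughly, the distinction is the following (an illustrative sketch, not LightLLM's actual code): simulated quantization round-trips values through the low-bit grid but keeps float storage, so it only measures accuracy, while real quantization actually stores the low-bit tensor.

```python
# Illustrative sketch, not LightLLM code: "simulated" vs "real" KV quantization.
import torch

def simulate_quant(x: torch.Tensor, n_bits: int) -> torch.Tensor:
    # Simulated: quantize then dequantize, keeping float storage.
    # Useful to measure accuracy at widths (e.g. 2-bit) with no real kernel.
    qmax = 2 ** (n_bits - 1) - 1
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / qmax
    return (x / scale).round().clamp(-qmax - 1, qmax) * scale

def real_quant_int8(x: torch.Tensor):
    # Real: actually store int8 plus a per-token scale; memory shrinks 4x
    # vs fp32, and dequantization happens at attention time.
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 127
    return (x / scale).round().clamp(-128, 127).to(torch.int8), scale

k = torch.randn(16, 64)
k_sim = simulate_quant(k, n_bits=2)   # float tensor, 2-bit accuracy only
k_int8, s = real_quant_int8(k)        # int8 tensor; dequant: k_int8 * s
```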