Vahe1994 / AQLM

Official PyTorch repository for "Extreme Compression of Large Language Models via Additive Quantization" (https://arxiv.org/pdf/2401.06118.pdf) and "PV-Tuning: Beyond Straight-Through Estimation for Extreme LLM Compression" (https://arxiv.org/abs/2405.14852)
Apache License 2.0

Performance issues with ~2bit quantization #113

Closed Devy99 closed 1 month ago

Devy99 commented 1 month ago

Hello, I want to report an issue and ask for suggestions regarding the quantization of CodeLlama 7B with AQLM. I followed the instructions in the README file and quantized the model with the 1x15, g=8 setting, resulting in a quantized model of ~2.02 bpw.

I converted the model to the .bin and .safetensors formats and skipped the subsequent fine-tuning step, just to try the model in inference mode. However, I experienced a dramatic decrease in inference speed, both with the transformers library and with vLLM (about ~1.58 tokens per second, using the reference notebook in the repo).

I tested the quantized model on several GPUs (RTX 3090, RTX A5000, and A100 40GB), but in all cases I failed to speed up inference (about a minute and a half for a single response with 1k max tokens). For vLLM, I am also using the suggested CUDA and PyTorch configurations.
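For what it's worth, a figure like ~1.58 tokens/s can be reproduced with a simple wall-clock measurement along these lines (the model path and prompt are placeholders, not the actual artifacts from this issue):

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "path/to/quantized-codellama-7b-1x15"  # hypothetical path
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.float16, device_map="cuda"
)  # loading an AQLM checkpoint also assumes the aqlm inference package is installed

inputs = tokenizer("def quicksort(arr):", return_tensors="pt").to(model.device)
start = time.perf_counter()
output = model.generate(**inputs, max_new_tokens=1024, do_sample=False)
elapsed = time.perf_counter() - start

new_tokens = output.shape[1] - inputs["input_ids"].shape[1]
print(f"{new_tokens / elapsed:.2f} tokens/s")
```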

However, when I run your model "ISTA-DASLab/Meta-Llama-3-8B-Instruct-AQLM-2Bit-1x16" with the same settings, the inference speed is fast, in line with what is reported in your README.

Did I miss something?

BlackSamorez commented 1 month ago

Hi @Devy99! Unfortunately, we do not have efficient kernels for the 1x15 setup, nor is it actually 1x15 in memory, because there is no 15-bit data type to store those weights in. The codes are stored in int16 anyway, making it basically a 1x16 that never uses the second half of its codebook. As such, I could make 1x15 run on the 1x16 kernels (instead of the slow kernels you encountered) with minimal changes to the code, but at that point you would be better off just using 1x16. 1x15 was created to mitigate the codebooks' memory footprint, achieving a bitwidth that is actually close to exactly 2 bits for a fair comparison with other 2-bit methods. It was never designed to be used in practice, and we're sorry if we misled you in that regard. I would therefore recommend re-quantizing your model as 1x16 for better performance at, alas, exactly the same memory requirements.
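A back-of-envelope illustration of the storage point above (ignoring codebook and scale overhead, which is a simplification on my part):

```python
# With one codebook and group size g=8, each code indexes 8 weights.
# A 15-bit code still has to be stored in an int16, so the in-memory
# cost per weight is the same as for 1x16.
group_size = 8             # weights per code (g=8 in this thread)
stored_bits_per_code = 16  # int16 storage, even though only 15 bits are used

print(stored_bits_per_code / group_size)  # 2.0 bits per weight for both 1x15 and 1x16
```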

Devy99 commented 1 month ago

@BlackSamorez thank you for the prompt response! No problem and, again, thanks for your work 😁...

I have another question regarding quantization. Did you try quantizing to 8 bits and checking the accuracy of the model under 8-bit compression? I don't see any reference to it in the paper.

Devy99 commented 1 month ago

Hi @BlackSamorez, may I ask which part of the code should be changed (and how) to run the 1x15 scheme with the efficient kernel, as is done for 1x16? Since I'll stick with this configuration (1x15), I'll make the change on my machine. Thanks!

BlackSamorez commented 1 month ago

@Devy99 To run anything between 1x9 and 1x16, including 1x15, with the efficient kernel, you'll have to replace the part of this, this, and this if clauses that says codebook_size == 65536 with codebooks.dtype == torch.int16.
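A minimal sketch of the kind of change described above (the helper function and variable names here are hypothetical; only the swapped condition is taken from the reply, and the real checks live inside the three linked if clauses):

```python
import torch

def takes_fast_1x16_kernel(num_codebooks: int, codebooks: torch.Tensor) -> bool:
    # Before: the fast path required a full 2**16-entry codebook,
    #   i.e. a check like `codebook_size == 65536`, which excludes 1x15.
    # After: gate on the storage dtype instead, so every 1x9..1x16 scheme
    #   stored in int16 reuses the efficient 1x16 kernel.
    return num_codebooks == 1 and codebooks.dtype == torch.int16
```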

Devy99 commented 1 month ago

I modified the code following your instructions and re-built the library from source. However, inference is still very slow (on all the aforementioned GPUs). Is there anything else I should check that could be missing?

Devy99 commented 1 month ago

P.S. I am using both the transformers library and vLLM. The first even struggles to start, while the second is seriously slow. This is my first time trying vLLM, but I noticed a difference between your example notebook and my script: yours reports # GPU blocks: 9372, # CPU blocks: 2048, while mine reports just # GPU blocks: 2629, # CPU blocks: 512. I don't know whether this is related to my problem, or whether I can run it exclusively on the GPU (I don't know if it runs anything on the CPU).
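For context, the block counts vLLM prints describe its paged KV cache (CPU blocks are swap space for preempted sequences, not CPU execution of the model); a minimal sketch of the standard vLLM arguments that influence them, with a placeholder model path:

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="path/to/quantized-codellama-7b",  # hypothetical path
    gpu_memory_utilization=0.9,  # fraction of GPU memory for weights + GPU KV-cache blocks
    swap_space=4,                # GiB of CPU swap backing the CPU blocks
)

outputs = llm.generate(["def quicksort(arr):"], SamplingParams(max_tokens=128))
print(outputs[0].outputs[0].text)
```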