Closed — Devy99 closed this issue 1 month ago
Hi @Devy99!
Unfortunately, we do not have efficient kernels for the `1x15` setup, nor is it actually `1x15` in memory, because there are no 15-bit data types to store those weights in. The codes are stored in `int16` anyway, making it basically a `1x16` that never uses the second half of its codebook. As such, I could make `1x15` run on the `1x16` kernels (instead of the slow kernels that you encountered) with minimal changes to the code, but at that point you would be better off just using `1x16`.
`1x15` was created to mitigate the codebooks' memory footprint, achieving a bitwidth that is close to exactly 2 bits for a fair comparison with other 2-bit methods. It was never designed to be used in practice, and we're sorry if we misled you in that regard.
As such, I would recommend you re-quantize your model as `1x16` for better performance at, alas, exactly the same memory requirements.
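The bits-per-weight arithmetic behind this can be sketched as follows (a rough illustration assuming group size `g=8`, as used elsewhere in this thread; the codebook-overhead note is indicative, not an exact accounting):

```python
# Each group of `group_size` weights shares one code of `code_bits` bits,
# so the codes alone contribute code_bits / group_size bits per weight.
def code_bpw(code_bits: int, group_size: int = 8) -> float:
    return code_bits / group_size

print(code_bpw(15))  # 1.875 bits/weight from the codes; the codebook
                     # pushes the total toward ~2 bpw
print(code_bpw(16))  # 2.0 bits/weight
# Either way, the codes are stored as int16, so 1x15 and 1x16
# occupy the same amount of memory.
```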
@BlackSamorez thank you for the prompt response! No problem and, again, thanks for your work 😁...
I have another question regarding quantization. Did you try quantizing to 8 bits and checking the model's accuracy under 8-bit compression? I don't see any reference to it in the paper.
Hi @BlackSamorez, may I ask which part of the code should be changed (and how) to run a `1x15` scheme with the efficient kernel, as for `1x16`? Since I'll stick with this configuration (`1x15`), I'll proceed by changing the code on my machine. Thanks!
@Devy99 To run anything between `1x9` and `1x16`, including `1x15`, with the efficient kernel, you'll have to replace the parts of this, this, and this `if` clause that say `codebook_size == 65536` with `codebooks.dtype == torch.int16`.
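A hypothetical sketch of that condition change (the real `if` clauses live in the AQLM inference code behind the links above; this helper only mirrors the two expressions quoted in the answer):

```python
import torch

def takes_fast_kernel(codebooks: torch.Tensor, codebook_size: int) -> bool:
    # Before: only exactly-65536-entry codebooks (1x16) took the fast path.
    # return codebook_size == 65536
    # After: any int16-coded scheme (1x9 through 1x16) takes it.
    return codebooks.dtype == torch.int16
```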
I modified the code per your instructions and re-built the library from source. However, inference is still very slow (on all the aforementioned GPUs). Is there something else I should check that could be missing?
P.S. I am using both the transformers library and vLLM. The first even struggles to start, while the second is seriously slow. This is the first time I've tried vLLM, but I noticed a difference between your example notebook and my script: yours reports `# GPU blocks: 9372, # CPU blocks: 2048`, while mine reports just `# GPU blocks: 2629, # CPU blocks: 512`. I don't know whether this correlates with my problem, or whether I can run it exclusively on the GPU (I don't know if it offloads anything to the CPU).
Hello, I want to report an issue and ask for suggestions regarding the quantization of CodeLlama 7B with AQLM. I followed the instructions in the README file and quantized this model with the `1x15`, `g=8` setting, resulting in a quantized model of ~2.02 bpw.
I converted the model to the .bin and .safetensors formats and skipped the subsequent fine-tuning step, just to try the model in inference mode. However, I experienced an enormous drop in inference speed, both with the transformers library and with vLLM (about ~1.58 tokens per second, using the reference notebook in the repo).
I tested the quantized model on several GPUs (RTX 3090, RTX A5000, and A100 40GB), but in all cases I failed to speed up inference (about a minute and a half for a single response with 1k max tokens). For vLLM, I am also using the suggested CUDA and PyTorch configurations.
However, when I run your model "ISTA-DASLab/Meta-Llama-3-8B-Instruct-AQLM-2Bit-1x16" with the same settings, the inference speed is good, matching what's reported in your README.
Did I miss something?
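For reference, a minimal, generic throughput check for reproducing numbers like the ~1.58 tokens/s above (a hypothetical helper, not part of the repo):

```python
import time

def tokens_per_second(generate_fn, n_new_tokens: int) -> float:
    # generate_fn should run one generation producing n_new_tokens tokens,
    # e.g. lambda: model.generate(**inputs, max_new_tokens=n_new_tokens)
    start = time.perf_counter()
    generate_fn()
    return n_new_tokens / (time.perf_counter() - start)
```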