IST-DASLab / gptq

Code for the ICLR 2023 paper "GPTQ: Accurate Post-training Quantization of Generative Pretrained Transformers".
https://arxiv.org/abs/2210.17323
Apache License 2.0

qweight is empty when I gave --save option #2

Closed · junsoo999 closed this 1 year ago

junsoo999 commented 1 year ago

Since I want to obtain the quantized model produced by the GPTQ algorithm, I passed the --save option when running the Python script.

However, the qweight of each layer is empty because of the pack function in the Quant3Linear class (quant.py). I think the while loop (lines 147-170) is not executed, so qweight is just an empty ndarray.

If I comment out the while loop, I can get the qweight. What is the role of the while loop? Can I just comment it out and still run the model with transformers?

efrantar commented 1 year ago

Hi, what model size are you using, the smallest one?

Quant3Linear currently assumes that all layers can be evenly divided into 1024 x 1024 blocks, which is not the case for the very smallest model, whose weight matrices are 768 x 768. The // 1024 in this line:

https://github.com/IST-DASLab/gptq/blob/9232a476641b1848cf720dd15bd0e616cd48702d/quant.py#L148

might be causing the issue. I just tested it on the 1.3B model, and there the weights seem non-empty for me.
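For illustration only (this is not the actual quant.py code): if the number of blocks to pack is computed with floor division by 1024, a 768-dimensional layer yields zero iterations, so nothing is ever written into qweight.

```python
# Hypothetical sketch of the failure mode: floor division by 1024
# gives zero blocks for the 768-dim layers of the smallest model.
for rows in (768, 1024, 2048):
    num_blocks = rows // 1024          # 0, 1, 2
    print(rows, "->", num_blocks, "block(s) packed")
```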

In general, the while loop densely packs the quantized weights, i.e. it is what actually stores the weights in 3-bit form.
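As a rough sketch of what dense packing means (the exact bit layout in Quant3Linear.pack differs), 3-bit values can be squeezed into 32-bit words so that 32 quantized weights occupy only three uint32 values; note that individual 3-bit values can straddle a word boundary, which is part of why the packing loop is more involved:

```python
import numpy as np

def pack_3bit(vals):
    """Pack integers in [0, 7] into uint32 words, 3 bits per value.
    Illustrative only; not the layout used by Quant3Linear.pack."""
    bits = 3
    words = [0] * ((len(vals) * bits + 31) // 32)
    for i, v in enumerate(vals):
        word, offset = divmod(i * bits, 32)
        words[word] |= (int(v) & 0b111) << offset
        if offset > 32 - bits:                     # value straddles a word boundary
            words[word + 1] |= (int(v) & 0b111) >> (32 - offset)
    return np.array([w & 0xFFFFFFFF for w in words], dtype=np.uint32)

packed = pack_3bit(list(range(8)) * 4)   # 32 values -> 3 uint32 words
print(packed.size)                        # 3
```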

Note that if you just want the "fake-quantized" model (i.e. the weights are quantized to e.g. only 8 different values for 3bit but are still stored in float16), then you should just put a model.save_pretrained('checkpoint.pt') call at the end of opt.py / bloom.py.
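For concreteness, a sketch of what that could look like at the end of opt.py (the exact placement and the 'checkpoint.pt' name are just illustrative; `model` is the Hugging Face model object the script already holds):

```python
# Append after quantization has been applied to `model`.
# save_pretrained writes a directory with the float16 weights and config.
model.save_pretrained('checkpoint.pt')

# The fake-quantized model can then be reloaded like any HF checkpoint, e.g.:
# from transformers import OPTForCausalLM
# model = OPTForCausalLM.from_pretrained('checkpoint.pt')
```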

junsoo999 commented 1 year ago

Thank you for your reply.

I have one more question. If I quantize the model with 4-bit weights, what is the role of the while loop? Can I comment out the while loop if I only use 4-bit quantization, or should I use the --save option only with bigger models?

Thank you for sharing this wonderful work!

efrantar commented 1 year ago

If you just need (fake) quantized 4-bit weights (e.g. for evaluation in a different framework), then just add a model.save_pretrained('checkpoint.pt') call at the end of the main scripts, as explained above. There is no need to go through any packing code.

If you actually want to save the model in compressed 4-bit form, i.e. packed, you will have to adapt our corresponding 3-bit code (Quant3Linear and related functions) into a 4-bit version, since this repository currently only supports packing for 3-bit.
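For orientation, a minimal sketch of what 4-bit packing could look like (this is not code from this repository): with 4-bit values, exactly eight weights fit into one 32-bit word and nothing straddles a word boundary, so the loop structure simplifies compared to 3-bit.

```python
import numpy as np

def pack_4bit(vals):
    """Pack integers in [0, 15] into uint32 words, 8 values per word.
    Illustrative sketch only; a real 4-bit implementation would also handle
    scales/zero-points and the layout expected by its CUDA kernel."""
    vals = np.asarray(vals, dtype=np.uint32).reshape(-1, 8)
    shifts = np.arange(8, dtype=np.uint32) * 4           # 0, 4, ..., 28
    return np.bitwise_or.reduce(vals << shifts, axis=1)  # one uint32 per 8 values

packed = pack_4bit(np.arange(16) % 16)   # 16 values -> 2 uint32 words
print(packed.dtype, packed.size)         # uint32 2
```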

PeiqinSun commented 1 year ago

Hi, efrantar. Thank you for your nice work. Why is packing only supported when the weight shape is a multiple of 1024? Is it because of the CUDA kernel implementation?

efrantar commented 1 year ago

This should be fixed now with the most recent update (now we require only multiples of 32).