Hi, what model size are you using, the smallest one? `Quant3Linear` currently assumes that all layers are evenly divisible into 1024 x 1024 blocks, which is not the case for the very smallest model, which has 768 x 768 weights. The `// 1024` in this line:
https://github.com/IST-DASLab/gptq/blob/9232a476641b1848cf720dd15bd0e616cd48702d/quant.py#L148
might be causing the issue. I just tested it on 1.3B and there the weights seem non-empty for me.
In general, the while loop densely packs the quantized weights, i.e. it is what actually stores the weights in 3bit form.
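For intuition, here is a minimal NumPy sketch of the idea behind that dense packing: every 32 values of 3 bits each are squeezed into three 32-bit words, with two values straddling word boundaries. This is only an illustration, not the exact code or bit layout used in `Quant3Linear.pack`.

```python
import numpy as np

def pack_3bit(intweight):
    """Densely pack integer weights with values in [0, 7] (3 bits each):
    every 32 input rows become 3 output rows of int32 per column."""
    intweight = intweight.astype(np.uint32)
    rows, cols = intweight.shape
    assert rows % 32 == 0, "sketch assumes the row count is a multiple of 32"
    qweight = np.zeros((rows // 32 * 3, cols), dtype=np.uint32)

    i = 0    # current input row
    row = 0  # current output row
    while row < qweight.shape[0]:
        # word 1: 10 full values, plus the low 2 bits of the next value
        for j in range(i, i + 10):
            qweight[row] |= intweight[j] << (3 * (j - i))
        i += 10
        qweight[row] |= (intweight[i] & 0x3) << 30
        row += 1
        # word 2: high bit of the straddling value, 10 full values,
        # plus the low bit of the next straddling value
        qweight[row] |= (intweight[i] >> 2) & 0x1
        i += 1
        for j in range(i, i + 10):
            qweight[row] |= intweight[j] << (3 * (j - i) + 1)
        i += 10
        qweight[row] |= (intweight[i] & 0x1) << 31
        row += 1
        # word 3: high 2 bits of the straddling value, then 10 full values
        qweight[row] |= (intweight[i] >> 1) & 0x3
        i += 1
        for j in range(i, i + 10):
            qweight[row] |= intweight[j] << (3 * (j - i) + 2)
        i += 10
        row += 1

    return qweight.view(np.int32)
```

With a layout like this, 1024 input features pack into 1024 / 32 * 3 = 96 int32 rows, which is where the actual 3bit storage saving comes from.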
Note that if you just want the "fake-quantized" model (i.e. the weights are quantized to e.g. only 8 different values for 3bit but are still stored in float16), then you should just put a `model.save_pretrained('checkpoint.pt')` call at the end of opt.py / bloom.py.
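For example, a minimal sketch of that (the checkpoint directory name and the use of `AutoModelForCausalLM` for reloading are just illustrative assumptions, not part of the repo's scripts):

```python
# At the end of opt.py / bloom.py, once quantization has run
# (`model` is the quantized Hugging Face model used by the script):
model.save_pretrained('opt-fake-quant-3bit')

# Later, the checkpoint loads like any other float16 model; its weights
# simply take only a few distinct values per quantization group.
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained('opt-fake-quant-3bit')
```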
Thank you for your reply.
I have one more question. If I quantize the model to 4bit weights, what is the role of the while loop? Can I comment it out if I only use 4bit quantization, or should I use the --save option only with the bigger models?
Thank you for sharing this wonderful work!
If you just need (fake) quantized 4bit weights (e.g. for evaluation in a different framework), then just add a `model.save_pretrained('checkpoint.pt')` call at the end of the main scripts, as explained above. There is no need to go through any packing code.
If you actually want to save the model in compressed 4bit form, i.e. packed, you will have to adapt our corresponding 3bit code, i.e. `Quant3Linear` and related functions, since this repository currently only supports packing for 3bit.
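As a rough sketch of what such an adaptation could look like (a hypothetical `pack_4bit` helper, not code that exists in this repo): 4bit packing is actually simpler than 3bit, because exactly eight 4bit values fit into one 32-bit word, so no value ever straddles a word boundary.

```python
import numpy as np

def pack_4bit(intweight):
    """Hypothetical 4bit packing: values in [0, 15], 8 values per int32 word.
    Illustrative only; a real Quant4Linear would also need a matching CUDA kernel."""
    intweight = intweight.astype(np.uint32)
    rows, cols = intweight.shape
    assert rows % 8 == 0, "sketch assumes the row count is a multiple of 8"
    qweight = np.zeros((rows // 8, cols), dtype=np.uint32)
    for i in range(rows):
        qweight[i // 8] |= intweight[i] << (4 * (i % 8))
    return qweight.view(np.int32)
```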
Hi, efrantar. Thank you for your nice work. Why does packing only support weight shapes that are multiples of 1024? Is it because of the CUDA kernel implementation?
This should be fixed now with the most recent update (now we require only multiples of 32).
As I want to obtain the quantized model through the GPTQ algorithm, I passed the --save option when running the Python script.
However, the qweight of each layer is empty because of the pack function in the Quant3Linear class (quant.py). I think the while loop (lines 147-170) is not executed, so the qweight is just an empty ndarray.
If I comment out the while loop, I can get the qweight. What is the role of the while loop? Can I just comment it out and still run the model with transformers?