IST-DASLab / gptq

Code for the ICLR 2023 paper "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers".
https://arxiv.org/abs/2210.17323
Apache License 2.0

Application to T5 / UL2 family #8

Open iiLaurens opened 1 year ago

iiLaurens commented 1 year ago

Do you expect this to work for T5 architecture (and consequently, the very similar UL2) family? And if not, what do you suspect would be an issue and do you expect that some adjustments need to be made?

Google recently released Flan-UL2 which is 20B parameters in size. GPTQ could be a real life-saver here.

efrantar commented 1 year ago

Hi,

In principle, we would expect GPTQ to work on most models. However, applying it to T5 models will require some additional implementation work, since these are (I think) encoder-decoder models: a memory- and compute-efficient GPTQ implementation (similar to the current one in this repository) would probably require sequentially traversing both the encoder and the decoder branch in parallel. See opt_sequential() in opt.py or bloom_sequential() in bloom.py for how we implemented this sequential pass for decoder-only models.
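
A minimal sketch of this kind of sequential, block-by-block pass (the names sequential_quantize and quantize_block_weights are placeholders for illustration, not functions from this repository):

```python
import torch

@torch.no_grad()
def sequential_quantize(blocks, hidden_states, quantize_block_weights):
    """Quantize transformer `blocks` one at a time.

    `hidden_states` is a list of calibration activations feeding the first
    block. After each block is quantized, the calibration batches are re-run
    through the quantized block, so later blocks are calibrated against the
    already-quantized prefix of the model -- the same idea as
    opt_sequential() / bloom_sequential().
    """
    for block in blocks:
        # Quantize all linear layers in this block using the current activations.
        quantize_block_weights(block, hidden_states)
        # Propagate the calibration batches through the now-quantized block
        # to obtain the inputs for the next block.
        new_hidden = []
        for h in hidden_states:
            out = block(h)
            new_hidden.append(out[0] if isinstance(out, tuple) else out)
        hidden_states = new_hidden
    return hidden_states
```

For an encoder-decoder model, one would run a pass like this over the encoder stack and another over the decoder stack, keeping only one block on the GPU at a time.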

qwopqwop200 commented 1 year ago

Do you expect this to work for T5 architecture (and consequently, the very similar UL2) family? And if not, what do you suspect would be an issue and do you expect that some adjustments need to be made?

Google recently released Flan-UL2 which is 20B parameters in size. GPTQ could be a real life-saver here.

I tried GPTQ quantization of FLAN-T5 (https://github.com/qwopqwop200/GPTQ-for-LLaMa/tree/t5) and it seems to work successfully. Additionally, I confirmed that FLAN-UL2 also works. I haven't done exact benchmarks, but it works pretty impressively.

iiLaurens commented 1 year ago

Do you expect this to work for T5 architecture (and consequently, the very similar UL2) family? And if not, what do you suspect would be an issue and do you expect that some adjustments need to be made? Google recently released Flan-UL2 which is 20B parameters in size. GPTQ could be a real life-saver here.

I tried GPTQ quantization of FLAN-T5 (https://github.com/qwopqwop200/GPTQ-for-LLaMa/tree/t5) and it seems to work successfully. Additionally, I confirmed that FLAN-UL2 also works. I haven't done exact benchmarks, but it works pretty impressively.

That's great to hear! Did you need to do anything in particular to get this to work? Did you just run GPTQ on the encoder and decoder separately, as @efrantar seemed to suggest?

johnrobinsn commented 1 year ago

@qwopqwop200 this is great! Thanks much... I was able to quantize flan-t5-small... but ran into this error when trying to quantize flan-ul2...

torch._C._LinAlgError: linalg.cholesky: The factorization could not be completed because the input is not positive-definite (the leading minor of order 2985 is not positive-definite).

this is the command I used...

python t5.py google/flan-ul2 wikitext2 --wbits 4 --act-order --groupsize 128 --save ul2-4bit-128g.pt

any ideas?

efrantar commented 1 year ago

This is a numerics error caused by a layer Hessian that is not positive-definite. You could try applying higher dampening (--percdamp) or using more calibration data (--nsamples) to make the Hessian more clearly positive-definite.
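
Roughly, the dampening adds a fraction of the mean Hessian diagonal back onto the diagonal before the Cholesky factorization, which lifts the eigenvalues away from zero. A small illustrative sketch of that stabilization step (toy values, not code from this repository):

```python
import torch

def dampen_hessian(H: torch.Tensor, percdamp: float = 0.01) -> torch.Tensor:
    """Add `percdamp` times the mean diagonal entry onto the diagonal of H."""
    damp = percdamp * torch.mean(torch.diag(H))
    idx = torch.arange(H.shape[0], device=H.device)
    H[idx, idx] += damp
    return H

# Toy example: a singular "Hessian" that undamped Cholesky would reject.
H = torch.zeros(4, 4)
H[0, 0] = 1.0
L = torch.linalg.cholesky(dampen_hessian(H, percdamp=0.05))  # succeeds after dampening
```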

johnrobinsn commented 1 year ago

@efrantar, thanks for your feedback earlier in the thread.

I see a fairly significant drop-off in performance when attempting 4-bit quantization of t5* models with the https://github.com/qwopqwop200/GPTQ-for-LLaMa/tree/t5 branch, as compared to int8 quantization.

Some details of the perf gap are captured here. https://github.com/qwopqwop200/GPTQ-for-LLaMa/issues/157#issuecomment-1503140441

I am trying to understand your earlier comment in this thread and the quantization code in the referenced repo in the hope of improving the int-4 quant performance on these encoder/decoder models.

This repo appears to quantize the encoder layers (layer by layer) and then the decoder layers (layer by layer), one after the other. You mentioned needing to do the encoder and decoder quantization in parallel (maybe I'm misunderstanding), but can you help me understand this point a bit more, and why they might need to be done in parallel?

Also, any other insight would be appreciated. Thanks!

efrantar commented 1 year ago

I had a slightly different encoder-decoder architecture in mind when I suggested the parallel processing of branches; for T5 specifically, quantizing the encoder first and then the decoder should be correct.
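
To make the order concrete, here is a rough sketch for a Hugging Face T5-style model (quantize_block_weights and calib_batches are assumed placeholder names; attention masks, relative position biases, and dropout are omitted for brevity, so this is not code from the t5 branch):

```python
import torch

@torch.no_grad()
def quantize_encoder_then_decoder(model, calib_batches, quantize_block_weights):
    """Quantize the encoder stack first, then the decoder stack.

    `calib_batches` is a list of dicts with `input_ids` and `decoder_input_ids`;
    `quantize_block_weights(block, inputs, **fwd_kwargs)` stands in for a GPTQ
    pass over the linear layers of a single transformer block.
    """
    # Encoder: propagate the calibration activations block by block.
    hidden = [model.encoder.embed_tokens(b["input_ids"]) for b in calib_batches]
    for block in model.encoder.block:
        quantize_block_weights(block, hidden)
        hidden = [block(h)[0] for h in hidden]
    # Final encoder states (after the stack's layer norm) feed the decoder's
    # cross-attention.
    enc_hidden = [model.encoder.final_layer_norm(h) for h in hidden]

    # Decoder: same pattern, but each block also attends to the encoder output.
    hidden = [model.decoder.embed_tokens(b["decoder_input_ids"]) for b in calib_batches]
    for block in model.decoder.block:
        quantize_block_weights(block, hidden, encoder_hidden_states=enc_hidden)
        hidden = [block(h, encoder_hidden_states=e)[0]
                  for h, e in zip(hidden, enc_hidden)]
    return model
```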

We are also currently looking at some encoder-decoder models in the context of ongoing research projects; if we find anything that could be relevant for quantizing T5 with GPTQ, I will post an update here.