fpgaminer / GPTQ-triton

GPTQ inference Triton kernel
Apache License 2.0

question about the quantization formula #18

Open irasin opened 1 year ago

irasin commented 1 year ago

The weights are decoded using the formula w = (w - z - 1) * s.

I wonder why we need the extra "- 1" here, since the normal dequantization formula is w = (w - z) * s.

fpgaminer commented 1 year ago

When the values are packed, zero is stored as (zero + 1).

irasin commented 1 year ago

Hi, @fpgaminer, thanks.

I want to know why we need to pack the values as (zero + 1). Is it for numerical reasons?

fpgaminer commented 1 year ago

Sorry, I mistyped: zero is stored as (zero - 1).

It's clearer in the non-simplified formula, which is: w = w * s - (z + 1) * s. In other words, (z + 1) is the effective zero point, so since the stored z is the true zero point minus one, the decode reduces to the standard w = (w - z_true) * s.

As to why: it seems the quantization algorithm outputs the zero point in the range [1, 2**bits]. Why it does that, I don't know. You can reference gptq.py to study the implementation of the quantization algorithm, or ask the paper authors. My main focus is the kernel side of things.
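For illustration only, here is a minimal NumPy sketch of the convention discussed in this thread. The names (`qweight`, `stored_zero`, `scale`) are made up for the example and are not the kernel's actual buffers; it just checks that decoding with (q - z - 1) * s, where z is stored as the true zero point minus one, matches the textbook (q - z_true) * s.

```python
import numpy as np

# Illustrative 4-bit example (not the actual GPTQ-triton kernel code).
bits = 4
scale = 0.1          # per-group scale `s`
true_zero = 8        # zero point as the quantizer conceptually uses it

# Per this thread, the packed zero point is stored as (true zero - 1),
# so it fits in [0, 2**bits - 1] even though the quantizer emits
# values in [1, 2**bits].
stored_zero = true_zero - 1

# A few quantized weight values in [0, 2**bits - 1].
qweight = np.array([0, 3, 8, 15])

# Kernel-side decode, as discussed: w = (q - z - 1) * s
w_kernel = (qweight - stored_zero - 1) * scale

# Equivalent "textbook" asymmetric dequantization: w = (q - z_true) * s
w_textbook = (qweight - true_zero) * scale

assert np.allclose(w_kernel, w_textbook)
print(w_kernel)  # e.g. [-0.8 -0.5  0.   0.7]
```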