IST-DASLab / gptq

Code for the ICLR 2023 paper "GPTQ: Accurate Post-training Quantization of Generative Pretrained Transformers".
https://arxiv.org/abs/2210.17323
Apache License 2.0

Please comment on why the A100-specific commit makes it faster? #13

Closed · Qubitium closed 1 year ago

Qubitium commented 1 year ago

Regarding: https://github.com/IST-DASLab/gptq/commit/54d35a8979b73d6ee66459b8ae99ace8061b000d

It would be nice to know the reasoning behind the changes that make it faster specifically on an A100 versus, for example, a 3090.

Thanks!

efrantar commented 1 year ago

Hi,

Since the A100 has very high memory bandwidth but relatively little non-tensor-core compute (which is what we use for matrix-vector products), the initial simple kernels gave only a moderate speedup over FP16 execution on the A100 (meanwhile, on e.g. the A6000, they were pretty close to the optimal 5.3x expected from 3-bit compression on large matrices).
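(That 5.3x figure is just the compression ratio: a matrix-vector product at these sizes is bandwidth-bound, so the best-case speedup from replacing 16-bit weights with 3-bit weights is 16 / 3 ≈ 5.33x.)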

The new kernels decode two quantized weights simultaneously into a fused 2xFP16 value, using a look-up table stored in fast shared memory (replicated to avoid bank conflicts during reads); the result is then multiplied by two fused FP16 inputs in a single step. This substantially reduces the kernel's relative dequantization and compute overhead, which was the main bottleneck on the A100, so the bandwidth savings from quantization translate into better overall speedups.
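For anyone curious what this looks like concretely, here is a minimal CUDA sketch of the idea. This is not the actual kernel from the repo: the byte-per-pair packing, the names, and the one-thread-per-row layout are simplifications assumed for illustration (the real kernels pack weights more densely and tile the work differently).

```cuda
#include <cuda_fp16.h>

#define LUT_SIZE 64  // 2^6 codes: one code per pair of 3-bit weights
#define REPLICAS 32  // one copy of the table per shared-memory bank

// Hypothetical sketch: y[row] = dot(W[row, :], x), where each byte of
// `codes` holds a 6-bit index selecting the pre-dequantized half2 value
// for two adjacent 3-bit weights.
__global__ void matvec_3bit_lut(const unsigned char* codes, // rows * cols/2 pair codes
                                const half2* lut_global,    // 64-entry pair-dequant table
                                const half2* x,             // cols/2 fused FP16 inputs
                                float* y, int rows, int cols) {
    // Replicate the table 32 times so that lane i of a warp always reads
    // replica i: the 4-byte word address of lut[code][lane] is
    // code * 32 + lane, so the bank is simply `lane` and warp reads are
    // conflict-free for any mix of codes.
    __shared__ half2 lut[LUT_SIZE][REPLICAS];
    for (int i = threadIdx.x; i < LUT_SIZE * REPLICAS; i += blockDim.x)
        lut[i / REPLICAS][i % REPLICAS] = lut_global[i / REPLICAS];
    __syncthreads();

    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= rows) return;
    int lane = threadIdx.x % 32;
    int pairs = cols / 2;

    half2 acc = __float2half2_rn(0.0f);
    for (int p = 0; p < pairs; ++p) {
        // One shared-memory read dequantizes two weights at once...
        half2 w2 = lut[codes[row * pairs + p] & 0x3F][lane];
        // ...and one fused FP16 FMA multiplies them by two fused inputs.
        acc = __hfma2(w2, x[p], acc);
    }
    // Sum the two FP16 lanes of the accumulator into the output.
    y[row] = __half2float(__low2half(acc)) + __half2float(__high2half(acc));
}
```

In this sketch, `lut_global` would hold all 64 combinations of the 8 quantization levels as pre-fused half2 values; the `[LUT_SIZE][REPLICAS]` layout is what makes the replication pay off, since the replica index, not the code, determines which bank is hit.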

In general, the new kernels may also be a bit faster on less powerful GPUs, but there the original simple kernels already worked quite well.

Qubitium commented 1 year ago

@efrantar Thank you so much for the explanation. I think this will help others pursue further hardware-specific quantization optimizations.