Closed · Qubitium closed this issue 1 year ago
Hi,
Since the A100 has very high memory bandwidth but only rather little non-tensor-core compute (which is what we use for matrix-vector products), the initial simple kernels gave only a moderate speedup over FP16 execution on the A100. Meanwhile, on e.g. the A6000 they were already pretty close to the optimal 5.3x expected from 3-bit compression on large matrices.
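(For context, the 5.3x ceiling presumably just comes from the storage ratio, since a memory-bound matrix-vector product can speed up at most by roughly the compression factor:

$$\text{max speedup} \approx \frac{16\ \text{bits (FP16)}}{3\ \text{bits}} \approx 5.33\times$$

)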
The new kernels decode two quantized weights simultaneously into a fused 2xFP16 value, using a look-up table stored in fast shared memory (replicated to avoid bank conflicts during reads); the fused pair is then multiplied by two fused FP16 inputs in a single step. This significantly reduces the relative dequantization and compute overhead of the kernel, which was quite high on the A100, so the bandwidth savings from quantization translate into better overall speedups.
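To make the idea concrete, here is a minimal CUDA sketch of the pattern, not the actual kernel from the repo; the packing layout, replication factor, and all names are assumptions for illustration:

```cuda
// Sketch: dequantize two 3-bit weights at once via a shared-memory LUT and
// process them with fused half2 arithmetic. Packing layout and names are
// illustrative, not the gptq repo's actual layout.
#include <cuda_fp16.h>

#define LUT_PAIRS 64  // 2^(2*3) possible pairs of 3-bit codes
#define REPL 8        // number of LUT copies (assumed replication factor)

// One block computes one output row of a matrix-vector product.
// Simplified packing: 5 weight pairs (10 codes, 30 bits) per 32-bit word,
// so `cols` is assumed to be a multiple of 10.
__global__ void matvec_3bit_lut_sketch(const unsigned int* __restrict__ qrows,
                                       const half2* __restrict__ x,
                                       float* __restrict__ y,
                                       float scale, float zero, int cols) {
    // Replicated pair-LUT: entry c holds the dequantized values of the two
    // 3-bit codes (c & 7) and (c >> 3) fused into one half2. The replication
    // dimension varies fastest, so neighbouring threads read different copies
    // and lookups are spread across shared-memory banks.
    __shared__ half2 lut[LUT_PAIRS][REPL];
    for (int c = threadIdx.x; c < LUT_PAIRS; c += blockDim.x) {
        half2 v = __floats2half2_rn(scale * ((c & 7) - zero),
                                    scale * ((c >> 3) - zero));
        for (int r = 0; r < REPL; ++r) lut[c][r] = v;
    }
    __syncthreads();

    int row  = blockIdx.x;
    int copy = threadIdx.x % REPL;                   // this thread's LUT copy
    const unsigned int* qrow = qrows + (size_t)row * (cols / 10);

    float acc = 0.0f;
    // Each thread walks a strided slice of the row, two weights per step;
    // x[j] holds the fused FP16 inputs for columns 2j and 2j+1.
    for (int j = threadIdx.x; j < cols / 2; j += blockDim.x) {
        unsigned int word = qrow[j / 5];
        int pair = (word >> (6 * (j % 5))) & 0x3F;   // two 3-bit codes
        half2 w2 = lut[pair][copy];                  // fused dequantized pair
        half2 p  = __hmul2(w2, x[j]);                // two multiplies in one step
        acc += __half2float(__low2half(p)) + __half2float(__high2half(p));
    }
    // Reduce partial sums into y[row]; y must be zero-initialized by the caller.
    atomicAdd(&y[row], acc);
}
```

The sketch leaves out the real 3-bit packing and per-group quantization-parameter handling; it is only meant to show the pair-LUT lookup and the single half2 multiply per two weights.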
In general, the new kernels may also be a bit faster on less powerful GPUs, but there the original simple kernels already worked quite well.
@efrantar Thank you so much for the explanation. I think this will help others perform further hardware-specific quantization optimizations.
Regarding: https://github.com/IST-DASLab/gptq/commit/54d35a8979b73d6ee66459b8ae99ace8061b000d
It would be nice to know the reason behind the changes that make it faster specifically on the A100, versus for example a 3090.
Thanks!