Closed RonanKMcGovern closed 9 months ago
@RonanKMcGovern Thank you for your question. The earliest proposal of the weight-only quantization method can be found in the paper "Who Says Elephants Can't Run: Bringing Large Scale MoE Models into Cloud Scale Production". EETQ ports the FasterTransformer kernels, so the underlying principles are covered in the NVIDIA GTC 2023 talks.
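To make the weight-only idea concrete, here is a minimal NumPy sketch (an illustration only, not EETQ's actual CUDA kernels): weights are quantized symmetrically to int8 with one scale per output channel, stored in int8, and dequantized on the fly when the matmul runs. Real kernels fuse the dequantization into the GEMM so the int8 weights are only expanded in registers.

```python
import numpy as np

def quantize_per_channel(w):
    # Symmetric int8 quantization with one scale per output channel (row).
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequant_matmul(x, q, scale):
    # Dequantize on the fly, then multiply in float.
    # A fused kernel would do this per tile instead of materializing w.
    return x @ (q.astype(np.float32) * scale).T

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 128)).astype(np.float32)  # [out_features, in_features]
x = rng.standard_normal((4, 128)).astype(np.float32)   # [batch, in_features]

q, s = quantize_per_channel(w)
y_ref = x @ w.T                 # full-precision reference
y_q = dequant_matmul(x, q, s)   # weight-only quantized path
print(np.abs(y_q - y_ref).max())  # small quantization error
```

Note that the activations `x` stay in floating point throughout; only the weight storage is 8-bit, which halves the weight memory traffic in memory-bound decoding.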
Many thanks @SidaZh. Do you have any data (or even an anecdotal sense) of perplexity versus 4-bit AWQ, and also versus bnb nf4?
Ok, I found the answer to my question here: https://github.com/NetEase-FuXi/EETQ/issues/4#issuecomment-1865926480
Closing this out, many thanks.
Where can I read about how EETQ and 8-bit quantization work?
What makes it faster than bitsandbytes and also than bf16 inference?
What is the quantization mechanism?
Are 8-bit calculations run directly in the kernels, or are the quantized values dequantized on the fly before computation? If the latter, why does EETQ still get a speedup versus bf16?