NetEase-FuXi / EETQ

Easy and Efficient Quantization for Transformers
Apache License 2.0

Understanding EETQ and 8-bit quantization #5

Closed: RonanKMcGovern closed this issue 9 months ago

RonanKMcGovern commented 9 months ago

Where can I read about how EETQ and 8-bit quantization work?

What makes it faster than bitsandbytes and also than bf16 inference?

What is the quantization mechanism?

Are the 8-bit calculations run directly in the kernels, or are the quantized values dequantized on the fly before computation? If the latter, why does EETQ still get a speedup versus bf16?

SidaZh commented 9 months ago

@RonanKMcGovern Thank you for your question. The earliest proposal of the weight-only quantization method can be found in the paper "Who Says Elephants Can't Run: Bringing Large Scale MoE Models into Cloud Scale Production". EETQ migrates the FasterTransformer kernels, so the basic principles are covered in the NVIDIA GTC 2023 talk.
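
In case it helps, here is a minimal NumPy sketch of what "weight-only" means (the function names and the symmetric per-channel scheme are illustrative assumptions for this sketch, not EETQ's actual code): weights are stored as int8 with one floating-point scale per output channel, the kernel dequantizes them back to floating point on the fly, and the GEMM itself still runs in floating point, so activations are never quantized.

```python
import numpy as np

def quantize_per_channel_int8(w):
    # Symmetric per-output-channel quantization: one fp scale per row
    # of the (out_features, in_features) weight matrix.
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    scale = np.maximum(scale, 1e-8)  # guard against all-zero rows
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale.astype(np.float32)

def weight_only_matmul(x, q, scale):
    # Emulates the fused kernel: weights are dequantized on the fly,
    # then the GEMM runs in floating point. Activations stay in
    # fp16/bf16/fp32 throughout in a weight-only scheme.
    w_hat = q.astype(np.float32) * scale
    return x @ w_hat.T

# Toy check: the quantization error stays small.
rng = np.random.default_rng(0)
w = rng.standard_normal((8, 16)).astype(np.float32)   # (out, in) weights
x = rng.standard_normal((2, 16)).astype(np.float32)   # (batch, in) activations
q, s = quantize_per_channel_int8(w)
err = np.max(np.abs(x @ w.T - weight_only_matmul(x, q, s)))
print(f"max abs error vs fp32 matmul: {err:.4f}")
```

The speedup over bf16 comes from memory bandwidth rather than arithmetic: decode-time GEMMs are bandwidth-bound, and int8 weights halve the bytes that must be streamed from GPU memory, which more than pays for the dequantization work fused into the kernel.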

RonanKMcGovern commented 9 months ago

Many thanks @SidaZh, have you any data (or even an anecdotal sense) of perplexity versus 4-bit AWQ and also versus bnb nf4?

RonanKMcGovern commented 9 months ago

Ok, I found the answer to my question here: https://github.com/NetEase-FuXi/EETQ/issues/4#issuecomment-1865926480

Closing this out, many thanks.