Closed RonanKMcGovern closed 9 months ago
@RonanKMcGovern Thank you for your question. The earliest proposal of the weight-only quantization method can be found in the paper "Who Says Elephants Can't Run: Bringing Large Scale MoE Models into Cloud Scale Production". EETQ ports the FasterTransformer kernels, so the underlying principles are covered in the NVIDIA GTC 2023 talks.
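To make the weight-only idea concrete, here is a minimal NumPy sketch (an illustration only, not EETQ's actual CUDA kernels): weights are quantized symmetrically to int8 with one scale per output channel, stored in int8, and dequantized on the fly when the matmul runs. Real kernels fuse the dequantization into the GEMM so the int8 weights are only expanded in registers.

```python
import numpy as np

def quantize_per_channel(w):
    # Symmetric int8 quantization with one scale per output channel (row).
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequant_matmul(x, q, scale):
    # Dequantize on the fly, then multiply in float.
    # A fused kernel would do this per tile instead of materializing w.
    return x @ (q.astype(np.float32) * scale).T

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 128)).astype(np.float32)  # [out_features, in_features]
x = rng.standard_normal((4, 128)).astype(np.float32)   # [batch, in_features]

q, s = quantize_per_channel(w)
y_ref = x @ w.T                 # full-precision reference
y_q = dequant_matmul(x, q, s)   # weight-only quantized path
print(np.abs(y_q - y_ref).max())  # small quantization error
```

Note that the activations `x` stay in floating point throughout; only the weight storage is 8-bit, which halves the weight memory traffic in memory-bound decoding.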
Many thanks @SidaZh. Do you have any data (or even an anecdotal sense) of perplexity versus 4-bit AWQ, and also versus bnb nf4?
Ok, I found the answer to my question here: https://github.com/NetEase-FuXi/EETQ/issues/4#issuecomment-1865926480
Closing this out, many thanks.
Where can I read about how EETQ and 8-bit quantization work?
What makes it faster than bitsandbytes and also than bf16 inference?
What is the quantization mechanism?
Are 8-bit calculations run directly in the kernels, or are the quantized values dequantized on the fly before computation? If the latter, why does EETQ still get a speedup versus bf16?