@vgoklani EETQ does not support quantizing directly from bfloat16 to int8 at the moment. If you want to use it, first load the model in fp16, like this:
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16)
and then apply eetq.
We will add support for the bfloat16 type soon.
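For reference, a minimal sketch of that workflow (the checkpoint path below is only a placeholder; any bf16-trained causal-LM checkpoint works the same way):

```python
import torch
from transformers import AutoModelForCausalLM

# Checkpoints trained in bfloat16 (e.g. Llama 2, Mistral) can be loaded straight
# into fp16 by overriding torch_dtype; eetq then quantizes the fp16 weights.
model_path = "meta-llama/Llama-2-7b-hf"  # placeholder path
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16)

# Equivalently, a model already loaded in bf16 can be cast down first:
# model = model.to(torch.float16)
```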
@SidaZh Is what you describe here how TGI handles --quantize eetq right now to get it to 8-bit?
I guess that once you support bf16, accuracy will improve (because casting to float16 loses bfloat16's wider exponent range, so very large and very small values can overflow or underflow)?
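To make the exponent-range point concrete, a quick check with torch.finfo (an illustration, not from the thread):

```python
import torch

# fp16 and bf16 both use 16 bits, but fp16 trades exponent range for mantissa
# precision: its largest finite value is ~65504 vs ~3.4e38 for bf16.
for dtype in (torch.float16, torch.bfloat16):
    info = torch.finfo(dtype)
    print(dtype, "max:", info.max, "smallest normal:", info.tiny)

# A large bf16 value overflows to inf when cast down to fp16.
x = torch.tensor(1e5, dtype=torch.bfloat16)
print(x.to(torch.float16))  # tensor(inf, dtype=torch.float16)
```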
I agree with @RonanKMcGovern
Moreover, I'm confused by this statement:
eetq 8-bit can be regarded as a high-performance CUTLASS implementation of w8a16 (per-channel). From the perspective of PPL, even without correcting for outliers, the performance of per-channel w8a16 quantization is already very good.
Could you please add more clarity here? GPTQ and AWQ both use several techniques to handle outliers and to account for activations when quantizing the weights. It's not clear what's happening in eetq; unfortunately everything is buried inside the C++ code :( Even if the CUTLASS implementation is more efficient at matrix multiplications, it's not clear how it handles the reduced precision going from float16 to int8.
@RonanKMcGovern Yes, bf16 support should be added.
@vgoklani The int8 weight-only quantization method is indeed a simple per-channel, symmetric quantization, without any extra steps to restore accuracy. What is hidden inside CUTLASS is the fusion of the dequantization operator with the fp16 matrix-multiplication operator. Experimental results show that LLM generation quality suffers very little accuracy loss with w8a16, while the accuracy loss for w4a16 cannot be ignored. Therefore, for w4a16 or w3a16, adjustments need to be made through algorithms such as AWQ and GPTQ. Of course, using such accuracy-restoration algorithms will give better accuracy. The goal of eetq here is to be a universal, easy-to-use, and efficient weight-only GEMM inference backend plugin.
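As an illustration of what per-channel symmetric w8a16 means, here is a reference sketch only (not eetq's actual CUTLASS kernel, which keeps the dequantization fused inside the fp16 GEMM):

```python
import torch

def quantize_w8a16_per_channel(weight: torch.Tensor):
    # Symmetric int8: one scale per output channel (row of the
    # [out_features, in_features] weight), no zero-point.
    scales = weight.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
    q_weight = torch.clamp(torch.round(weight / scales), -128, 127).to(torch.int8)
    return q_weight, scales

def w8a16_matmul_reference(x, q_weight, scales):
    # The fused kernel dequantizes int8 weights to fp16 on the fly inside the GEMM;
    # here the dequantization is done explicitly for clarity.
    w_deq = q_weight.to(x.dtype) * scales.to(x.dtype)
    return x @ w_deq.t()

# Quick check in fp32 for portability (the real path keeps activations in fp16).
w = torch.randn(256, 512)
x = torch.randn(4, 512)
qw, s = quantize_w8a16_per_channel(w)
print((w8a16_matmul_reference(x, qw, s) - x @ w.t()).abs().max())  # small error
```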
That's the perfect answer, thank you!
@SidaZh Hi, is bf16 supported now, or is there any plan for this feature?
I believe both the Mistral and Llama 2 models were trained in bfloat16. Is it still possible to use your library:
https://github.com/NetEase-FuXi/EETQ/blob/main/python/eetq/utils/quantizer.py#L21
Does this W8A16Linear support bfloat16? Thanks!