NetEase-FuXi / EETQ

Easy and Efficient Quantization for Transformers
Apache License 2.0

How to handle bfloat16? #4

Closed · vgoklani closed this 11 months ago

vgoklani commented 11 months ago

I believe both the Mistral and Llama 2 models were trained in bfloat16. Is it still possible to use your library?

https://github.com/NetEase-FuXi/EETQ/blob/main/python/eetq/utils/quantizer.py#L21

Does this W8A16Linear support bfloat16?

Thanks!

SidaZh commented 11 months ago

@vgoklani It does not support quantizing from bfloat16 to int8 directly right now. As a workaround, you can first load the checkpoint as an fp16 model, e.g. model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16), and then apply eetq. We will add support for the bfloat16 type soon.
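For completeness, a minimal sketch of that workaround. The eet_accelerator import path and its arguments are my reading of the repo's examples rather than something stated in this thread, so treat them as an assumption:

```python
import torch
from transformers import AutoModelForCausalLM
from eetq.utils import eet_accelerator  # assumed import path, per the repo's examples

model_path = "mistralai/Mistral-7B-v0.1"  # hypothetical bf16 checkpoint

# Cast the bf16 checkpoint to fp16 at load time so the existing fp16 -> int8 path applies.
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16)

# Quantize the linear layers to int8 weights (W8A16) in place.
# The quantize/fused_attn/dev arguments are assumptions based on the repo's examples.
eet_accelerator(model, quantize=True, fused_attn=True, dev="cuda:0")
model.to("cuda:0")
```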

RonanKMcGovern commented 11 months ago

you can first load the checkpoint as an fp16 model, e.g. model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16), and then apply eetq

@SidaZh Is what you describe here how TGI handles --quantize eetq right now to get the model down to 8-bit?

I guess that once you support bf16 it will improve accuracy (because float16 has a narrower exponent range than bf16, so the largest and smallest values get lost)?
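On that exponent-range point, a quick check with plain PyTorch (no EETQ involved) shows the difference in representable range between the two 16-bit formats:

```python
import torch

# bf16 keeps the full fp32 exponent range (8 exponent bits) at the cost of mantissa
# precision; fp16 has only 5 exponent bits, so very large or very small magnitudes
# overflow or underflow when a bf16-trained model is cast to fp16.
print(torch.finfo(torch.bfloat16).max)   # ~3.39e+38
print(torch.finfo(torch.float16).max)    # 65504.0
print(torch.finfo(torch.bfloat16).tiny)  # ~1.18e-38 (smallest normal value)
print(torch.finfo(torch.float16).tiny)   # ~6.10e-05
```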

vgoklani commented 11 months ago

I agree with @RonanKMcGovern

Moreover, I'm confused by this:

eetq 8-bit can be regarded as a high-performance Cutlass implementation of w8a16 (per-channel). From a perplexity (PPL) perspective, even without correcting for outliers, the accuracy of w8a16 per-channel quantization is already very good.

Could you please add more clarity to this statement? Both GPTQ and AWQ use several techniques to handle outliers in the weights and activations. It's not clear what's happening here; unfortunately, everything is buried inside the C++ code :( Even if the Cutlass implementation is more efficient at matrix multiplications, it's not clear how it handles the reduced precision going from float16 to int8.

SidaZh commented 11 months ago

you can first load the checkpoint as an fp16 model, e.g. model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16), and then apply eetq

@SidaZh Is what you describe here how TGI handles --quantize eetq right now to get the model down to 8-bit?

I guess that once you support bf16 it will improve accuracy (because float16 has a narrower exponent range than bf16, so the largest and smallest values get lost)?

Yes, bf16 support should be added.

SidaZh commented 11 months ago

@vgoklani The int8 weight-only quantization method is truly just a simple per-channel, symmetric quantization, without any extra operations to restore accuracy. What is hidden inside Cutlass is the fusion of the dequantization operator with the fp16 matrix-multiplication operator. Experimental results show that LLM generation suffers very little accuracy loss with w8a16, while the loss with w4a16 cannot be ignored; that is why algorithmic adjustments such as AWQ and GPTQ are needed for w4a16 or w3a16. Of course, using such restoration algorithms yields better accuracy. The goal of eetq here is to be a universal, easy-to-use, and efficient weight-only GEMM inference backend plugin.
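To make "per-channel symmetric quantization" concrete, here is a minimal PyTorch sketch of what a W8A16 linear layer boils down to conceptually. This is not EETQ's actual kernel; in EETQ the dequantization step is fused into the Cutlass fp16 GEMM instead of materializing the dequantized weight matrix:

```python
import torch

def quantize_per_channel_symmetric(w: torch.Tensor):
    """Int8-quantize weights of shape [out_features, in_features], one scale per output row."""
    # Symmetric: zero-point is 0; the scale maps each row's max |w| to 127.
    scales = w.abs().amax(dim=1, keepdim=True).clamp_min(1e-8) / 127.0
    w_int8 = torch.clamp(torch.round(w / scales), -127, 127).to(torch.int8)
    return w_int8, scales

def w8a16_linear(x, w_int8, scales):
    # Conceptually: dequantize the int8 weights, then run an ordinary 16-bit matmul.
    # EETQ fuses this dequantize step into the Cutlass GEMM kernel.
    w_deq = w_int8.to(x.dtype) * scales.to(x.dtype)
    return x @ w_deq.t()

# Toy check in fp32 (the real kernel operates on fp16 activations).
w = torch.randn(4096, 4096)
x = torch.randn(1, 4096)
w_q, s = quantize_per_channel_symmetric(w)
print((w8a16_linear(x, w_q, s) - x @ w.t()).abs().max())  # small per-channel quantization error
```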

vgoklani commented 11 months ago

That's the perfect answer, thank you!

thincal commented 5 months ago

Yes, bf16 support should be added.

@SidaZh Hi, is bf16 supported now, or is there any plan for this feature?