Dear author,
Thanks for your amazing work. We are trying to apply it to our own model.
I want to ask: what is the difference between fake quant and real quant?
The reason I ask is that the w3a16 llama2-7b-chat model fake-quantized by OmniQuant has slower inference than the fp16 model when running it with transformers.
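For reference, my rough understanding is that fake quant only simulates low-bit quantization with a quantize-dequantize round trip, so the weights stay in floating point and the matmuls still run at fp16 speed (plus the rounding overhead), while real quant actually packs the weights into low-bit storage with a dedicated kernel. A minimal sketch of what I mean by fake quant (the function name, `n_bits`, and the per-channel min-max scheme are just illustrative, not necessarily what OmniQuant does):

```python
import torch

def fake_quantize(w: torch.Tensor, n_bits: int = 3) -> torch.Tensor:
    # Per-channel asymmetric min-max quantization, immediately dequantized.
    # The result is still a floating-point tensor, so inference kernels are
    # unchanged; the round trip only adds overhead, which could explain the
    # slower inference I am seeing.
    qmax = 2 ** n_bits - 1
    w_min = w.amin(dim=-1, keepdim=True)
    w_max = w.amax(dim=-1, keepdim=True)
    scale = (w_max - w_min).clamp(min=1e-8) / qmax
    zero_point = torch.round(-w_min / scale)
    q = torch.clamp(torch.round(w / scale) + zero_point, 0, qmax)
    return (q - zero_point) * scale  # dequantized back to floating point
```

Is this understanding correct, and does OmniQuant's released w3a16 checkpoint need a real-quant kernel (e.g. packed int3 weights) to actually see a speedup?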