Hi @MaxwellWjj,
Thanks for your interest in our work! Yes, our results are based on fake quantization. Similar to Q8BERT and Outlier Suppression, it is common practice to use fake quantization for accuracy evaluation, and we can assure you that the multiplication results after such operations meet the requirements of the ANT and OliVe frameworks.
To run real quantization on a GPU, one could write corresponding CUDA kernels for simulation or even acceleration. If the kernels are implemented correctly, the model's accuracy should not differ significantly from that of fake quantization.
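For concreteness, here is a minimal sketch of why the two agree, using a plain symmetric int4-style grid for simplicity (illustrative only; not the actual Flint/Abfloat grids or the repo's code). An FP32 matmul on fake-quantized tensors equals an integer matmul followed by a single rescale, which is what a low-precision datapath computes:

```python
import torch

def fake_quant(x, scale, qmax=7):
    """Quantize-dequantize: output is FP32 but only takes values on the grid {-8..7} * scale."""
    q = torch.clamp(torch.round(x / scale), -qmax - 1, qmax)
    return q * scale, q

torch.manual_seed(0)
w, a = torch.randn(16, 32), torch.randn(32, 8)
s_w, s_a = w.abs().max() / 7, a.abs().max() / 7

w_fq, w_q = fake_quant(w, s_w)    # fake-quantized (FP32) tensor and its integer codes
a_fq, a_q = fake_quant(a, s_a)

y_fake = w_fq @ a_fq              # what the fake-quantized model computes in FP32
y_hw = (w_q @ a_q) * (s_w * s_a)  # integer matmul + one rescale, as on low-precision hardware

print(torch.allclose(y_fake, y_hw, atol=1e-4))  # True, up to FP32 rounding
```

The only discrepancies come from FP32 rounding and accumulator width, which is why fake quantization is widely accepted as an accuracy proxy for the low-precision datapath.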
Hi @Sakits,
Thank you very much for your answer! I hope your team can extend the framework in the future to support a real Flint & Abfloat implementation that matches this accuracy, without fake quantization (because, as we know, fake quantization does not match the real 'simple' hardware, and we still need FP32 units if we want to achieve this PTQ performance).
The open-source code helped me a lot. Thanks again :)
Thanks for the open-source OliVe and ANT frameworks! I have a question about the quantization process: did you use fake quantization? I checked the code and found that, before computation, you dequantize all the inputs and weights back to FP32 precision, much like the quantize-dequantize flow shown in I-BERT, Figure 1. The relevant source code is in _quantmodules.py.
My understanding is that the input tensor is in FP32: you divide the tensor by the scale so that it fits the `quant_grid` range, then quantize it by looking up the nearest value in `quant_grid`. However, at the end of the quantizer's forward function, you multiply all the tensor values by the scale again, which means everything is dequantized back to FP32 for further computation. If so, I think the software code is inconsistent with the hardware design, because you claimed the accelerator operates on Exp4+Int4, not FP32.
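To make my reading concrete, here is roughly what I understand that forward to do (my own paraphrase, assuming a nearest-neighbour lookup into `quant_grid`; this is not your actual code):

```python
import torch

def quantizer_forward(x, scale, quant_grid):
    # Bring the FP32 tensor into the range covered by quant_grid.
    x_scaled = x / scale
    # Snap every element to the closest entry of quant_grid (the Exp4/Int4 value set).
    idx = (x_scaled.unsqueeze(-1) - quant_grid).abs().argmin(dim=-1)
    x_q = quant_grid[idx]
    # Dequantize back to FP32 before returning, so downstream ops run in FP32.
    return x_q * scale

# e.g. a toy int4-like grid; the real quant_grid would hold the Exp4/Int4 values
grid = torch.arange(-8, 8, dtype=torch.float32)
x = torch.randn(4, 4)
x_fq = quantizer_forward(x, x.abs().max() / 7, grid)  # FP32 again, only grid*scale values
```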
In other words, take `class Conv2dQuantizer` as an example (line 389, _quantmodules.py). The weight and input should be consistent with the real values in your Table 4, unless the software code is not meant to match the hardware design... But what I found in your code is that you dequantize everything back to FP32 (the original tensor format) before computing, as in the sketch below.
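Here is a minimal sketch of what the layer effectively computes, as I understand it (my paraphrase, not the code at line 389; `fake_quant` below is a simple stand-in for your quantizer objects):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def fake_quant(t, n_bits=4):
    """Illustrative symmetric fake quantizer (quantize then dequantize)."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = t.abs().max() / qmax
    return torch.clamp(torch.round(t / scale), -qmax - 1, qmax) * scale

class Conv2dQuantizerSketch(nn.Conv2d):
    """Paraphrase of Conv2dQuantizer's effective behaviour, not the repo's code."""
    def forward(self, x):
        w_fq = fake_quant(self.weight)  # weight is already back in FP32 here
        x_fq = fake_quant(x)            # activation is already back in FP32 here
        # The convolution itself runs in plain FP32, not on Exp4+Int4 codes.
        return F.conv2d(x_fq, w_fq, self.bias, self.stride,
                        self.padding, self.dilation, self.groups)

conv = Conv2dQuantizerSketch(3, 8, kernel_size=3, padding=1)
y = conv(torch.randn(1, 3, 32, 32))  # computed entirely in FP32
```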
So, I am wondering whether all of your accuracy results are based on dequantization (fake quantization) rather than on real hardware execution.