Closed Njuapp closed 4 years ago
Hi, similar issue to #90 and #86. In short, the quantization implemented in Q8BERT is only a simulation: the ops still run in fp32 precision even though the values are restricted to an int8 grid. To get a real performance gain, you need optimized kernels that make use of specialized hardware support for int8 computation.
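For context, this is roughly what simulated ("fake") quantization looks like: values are snapped to an int8 grid and immediately dequantized, so the matmul itself still runs in fp32. A minimal NumPy sketch with a simplified symmetric per-tensor scale (not Q8BERT's exact scheme):

```python
import numpy as np

def fake_quantize(x, num_bits=8):
    """Quantize-dequantize: restrict values to an int8 grid,
    then return them as fp32 again (simulated quantization)."""
    qmax = 2 ** (num_bits - 1) - 1                      # 127 for int8
    scale = np.abs(x).max() / qmax                      # symmetric per-tensor scale
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)   # values in [-128, 127]
    return (q * scale).astype(np.float32)               # back to fp32

x = np.random.randn(4, 8).astype(np.float32)
w = np.random.randn(8, 3).astype(np.float32)

# The GEMM below is an ordinary fp32 matmul, so there is no speedup --
# only the *values* were quantized, not the kernel that multiplies them.
y = fake_quantize(x) @ fake_quantize(w)
```

This is why accuracy experiments work but latency barely changes: the hot loop is still a standard fp32 GEMM.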
I have tried inference with the quantized BERT on the MRPC dev set, and it only reduces the time from 1:58 to 1:45, which is a minor speed improvement.
I am not sure whether the reason is that I do not have hardware with specialized INT8 GEMM support.
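One way to sanity-check this locally is a microbenchmark: compare a plain fp32 GEMM against a quantize-dequantize-then-GEMM path. Without real int8 kernels, the second path can only be equal or slower, since it runs the same fp32 matmul plus the fake-quantization overhead. A rough sketch (matrix sizes and the `fake_quantize` helper are illustrative, not from Q8BERT):

```python
import time
import numpy as np

def fake_quantize(x, num_bits=8):
    # Simulated quantization: snap to the int8 grid, return fp32.
    qmax = 2 ** (num_bits - 1) - 1
    scale = np.abs(x).max() / qmax
    return (np.clip(np.round(x / scale), -qmax - 1, qmax) * scale).astype(np.float32)

def bench(fn, reps=10):
    # Average wall-clock time over `reps` calls.
    t0 = time.perf_counter()
    for _ in range(reps):
        fn()
    return (time.perf_counter() - t0) / reps

a = np.random.randn(512, 512).astype(np.float32)
b = np.random.randn(512, 512).astype(np.float32)

t_fp32 = bench(lambda: a @ b)
t_fake = bench(lambda: fake_quantize(a) @ fake_quantize(b))
print(f"fp32 GEMM: {t_fp32:.6f}s  fake-quant GEMM: {t_fake:.6f}s")
```

If the fake-quant path is not faster than plain fp32 on your machine, the small end-to-end improvement you measured is expected.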