IntelLabs / nlp-architect

A model library for exploring state-of-the-art deep learning topologies and techniques for optimizing Natural Language Processing neural networks
https://intellabs.github.io/nlp-architect
Apache License 2.0

question: [Why is inference speed not improved after quantization?] #162

Closed Njuapp closed 4 years ago

Njuapp commented 4 years ago

I have tried inference with quantized BERT; on the MRPC dev set it only reduces the run time from 1:58 to 1:45, which is a minor speed improvement.

I am not sure whether the reason is that I do not have hardware with specialized INT8 GEMM support.

ofirzaf commented 4 years ago

Hi, this is similar to #90 and #86. In short, the quantization implemented in Q8BERT is only a simulation: ops still execute in fp32 precision even though the values are constrained to the int8 grid. To get a performance gain you need optimized kernels that exploit hardware with specialized int8 support.
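
To illustrate why simulated ("fake") quantization gives no speedup, here is a minimal PyTorch sketch of the quantize-dequantize pattern; the `fake_quantize` helper below is hypothetical and not Q8BERT's actual code:

```python
import torch

def fake_quantize(x: torch.Tensor, num_bits: int = 8) -> torch.Tensor:
    """Simulated (fake) quantization: snap values to the int8 grid,
    then immediately dequantize back to fp32. The tensor that feeds
    the next op is still fp32, so no int8 kernel is ever invoked."""
    qmax = 2 ** (num_bits - 1) - 1                    # 127 for int8
    scale = x.abs().max().clamp(min=1e-8) / qmax      # per-tensor scale
    x_int = torch.clamp(torch.round(x / scale), -qmax, qmax)
    return x_int * scale                              # back to fp32

# Both operands are fp32 after fake quantization, so the matmul below
# still dispatches to the ordinary fp32 GEMM kernel: quantization error
# is modeled for accuracy evaluation, but there is no speedup.
a = fake_quantize(torch.randn(128, 768))
b = fake_quantize(torch.randn(768, 768))
out = a @ b   # fp32 GEMM, not an int8 GEMM
```

An actual speedup would require keeping the operands in int8 and calling an integer GEMM kernel (e.g. via a runtime with int8 support on hardware that accelerates it), rather than dequantizing before every op as above.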