ELS-RD / transformer-deploy

Efficient, scalable and enterprise-grade CPU/GPU inference server for 🤗 Hugging Face transformer models 🚀
https://els-rd.github.io/transformer-deploy/
Apache License 2.0

Marginal Improvement Between INT8 and FP16 #168

Open alexriggio opened 1 year ago

alexriggio commented 1 year ago

I have quantized a BERT model for binary text classification with INT8, but I am only seeing a marginal latency improvement over FP16.

Tested on both an A4000 and an A100 GPU.

A4000 --> TensorRT INT8: 34.48 ms, TensorRT FP16: 38.72 ms
A100 ---> TensorRT INT8: 11.53 ms, TensorRT FP16: 11.75 ms
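
For reference, this is a minimal sketch of the kind of timing loop that produces numbers like the ones above (not necessarily how they were measured here); `run_fn` is a placeholder for a call into either engine, not a transformer-deploy API:

```python
import time
import numpy as np
import torch

def measure_latency_ms(run_fn, inputs, warmup=10, iters=100):
    # Warm-up so lazy initialization and autotuning do not skew the numbers
    for _ in range(warmup):
        run_fn(inputs)
    torch.cuda.synchronize()
    timings = []
    for _ in range(iters):
        start = time.perf_counter()
        run_fn(inputs)
        torch.cuda.synchronize()  # wait for the GPU before stopping the clock
        timings.append((time.perf_counter() - start) * 1e3)
    return float(np.median(timings))
```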

These are the quantizer nodes that were disabled:

disable bert.encoder.layer.1.intermediate.dense._input_quantizer
disable bert.encoder.layer.2.attention.output.layernorm_quantizer_0
disable bert.encoder.layer.2.attention.output.layernorm_quantizer_1
disable bert.encoder.layer.2.output.layernorm_quantizer_0
disable bert.encoder.layer.2.output.layernorm_quantizer_1
disable bert.encoder.layer.3.attention.output.dense._input_quantizer
disable bert.encoder.layer.10.attention.self.key._input_quantizer
disable bert.encoder.layer.11.attention.output.dense._input_quantizer
disable bert.encoder.layer.11.output.dense._input_quantizer
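
For context, a sketch of how quantizer nodes like these are typically disabled by name with NVIDIA's pytorch-quantization library (the `model` argument and the name list below are placeholders for this setup, not the exact code used):

```python
from pytorch_quantization import nn as quant_nn

# Names copied from the log above (list truncated for brevity)
layers_to_disable = {
    "bert.encoder.layer.1.intermediate.dense._input_quantizer",
    "bert.encoder.layer.3.attention.output.dense._input_quantizer",
    # ...
}

def disable_quantizers(model, names):
    for name, module in model.named_modules():
        if isinstance(module, quant_nn.TensorQuantizer) and name in names:
            module.disable()  # turn off fake quantization for this tensor
            print(f"disable {name}")
```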

The debug logs from the A4000 run are attached here:

trt_logs_int8_quantization.txt

Also, it looks like there is no option to quantize the embeddings. Is there a particular reason not to quantize them?
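
For illustration only: the embedding lookup is a memory-bound gather rather than a matmul, so INT8 typically gives little speedup there, but the output activations could be fake-quantized by hand with pytorch-quantization. The wrapper below is hypothetical, not a transformer-deploy API, and it still needs calibration to set `amax` before export:

```python
import torch
from pytorch_quantization import nn as quant_nn
from pytorch_quantization.tensor_quant import QuantDescriptor

class QuantizedEmbeddingOutput(torch.nn.Module):
    """Hypothetical wrapper that fake-quantizes embedding activations to 8 bits."""

    def __init__(self, embedding: torch.nn.Embedding):
        super().__init__()
        self.embedding = embedding
        # Per-tensor 8-bit descriptor; amax must be set through calibration
        self.output_quantizer = quant_nn.TensorQuantizer(QuantDescriptor(num_bits=8))

    def forward(self, input_ids):
        return self.output_quantizer(self.embedding(input_ids))
```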

Any insight into these results is greatly appreciated. Thanks.

Versions:
Python: 3.10.9
transformers-deploy: 0.5.4
TensorRT: 8.4.1.5
Onnxruntime (GPU): 1.12.0
CUDA: 11.7