renne444 closed this issue 3 months ago
BERT-like models do not support calibration. Please use TRT ModelOpt to insert Q/DQ ops into the ONNX model: https://github.com/NVIDIA/TensorRT-Model-Optimizer/tree/main/onnx_ptq
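For reference, a rough sketch of what that looks like with the ModelOpt ONNX PTQ Python API (the file names, input names, and calibration tensors below are placeholders, and the exact argument names should be verified against the linked onnx_ptq README):

```python
# Sketch only: insert Q/DQ ops into a BERT-like ONNX model with ModelOpt PTQ.
# Paths and calibration data are assumptions, not taken from this issue.
import numpy as np
from modelopt.onnx.quantization import quantize

# Calibration batch keyed by ONNX input name (dummy data for illustration;
# real calibration should use representative tokenized text).
calibration_data = {
    "input_ids": np.random.randint(0, 30522, size=(16, 128), dtype=np.int64),
    "attention_mask": np.ones((16, 128), dtype=np.int64),
}

quantize(
    onnx_path="embedding_model.onnx",
    calibration_data=calibration_data,
    quantize_mode="int8",
    output_path="embedding_model.quant.onnx",
)
```

The resulting ONNX carries explicit Q/DQ nodes, so the TensorRT build picks INT8 precision from the graph itself rather than from a runtime calibrator.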
Thank you for your rapid response. I have a further question for my study: could you please explain in more detail why BERT-like models do not support calibration? Is this related to the architecture or other characteristics of these models?
Usually, TRT PTQ automatically inserts Q/DQ ops (implicit quantization) and gets the best performance, especially for CNNs. For LLM/GPT-like models, however, the layers either fall into a Myelin ForeignNode (poor performance, running at FP16) or simply have no INT8 layer support.
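One way to see where the layers actually end up is to inspect the built engine; a minimal sketch with the TensorRT Python engine inspector (the engine path is a placeholder, and the engine should be built with detailed profiling verbosity to expose full per-layer names):

```python
# Sketch: dump per-layer information to check whether layers run in INT8 or
# are fused into a Myelin ForeignNode (which typically runs at FP16/FP32).
import tensorrt as trt

logger = trt.Logger(trt.Logger.INFO)
runtime = trt.Runtime(logger)
with open("model.engine", "rb") as f:  # placeholder path
    engine = runtime.deserialize_cuda_engine(f.read())

inspector = engine.create_engine_inspector()
# The JSON output lists each layer with its name and tactic; one large
# "ForeignNode" entry means Myelin took over the Transformer block.
print(inspector.get_engine_information(trt.LayerInformationFormat.JSON))
```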
I see TRT OSS provides a demo for deploying BERT-like models, which uses custom plugins to replace some ops such as MHA/LayerNorm, and even supports ASP-QAT. @nvpohanh https://github.com/NVIDIA/TensorRT/tree/release/10.2/demo/BERT
Thanks for your reply.
Description
Hello, I am performing INT8 quantization on a BERT-like embedding model. I noticed that after quantization, the inference speed is much slower than FP16, while the output of the TRT engine is basically consistent with FP32 precision. I suspect that the model has not actually been quantized to INT8.
I have tried both IInt8MinMaxCalibrator and IInt8EntropyCalibrator2, but neither worked. Also, I directly ran
trtexec --onnx="xx" --int8 --minShapes=input_ids:1x1,attention_mask:1x1 --optShapes=input_ids:16x128,attention_mask:16x128 --maxShapes=input_ids:128x512,attention_mask:128x512
and got similar results, with the inference speed and outputs the same as FP32. Do you have any idea?
Environment
TensorRT Version:
NVIDIA GPU: A100
NVIDIA Driver Version: 525.105.17
CUDA Version: 12.5
docker image: nvcr.io/nvidia/tensorrt:24.06-py3
Embedding Model Structure
Log While Building Engine
Use MinMax calibrator
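For context, the MinMax calibrator used here follows the usual TensorRT Python pattern; a minimal sketch (the class name, batch size, dtypes, and the tokenized calibration arrays are assumptions, not the actual code from this issue):

```python
# Sketch of an IInt8MinMaxCalibrator for a BERT-like model.
import numpy as np
import pycuda.autoinit  # creates a CUDA context
import pycuda.driver as cuda
import tensorrt as trt

class BertMinMaxCalibrator(trt.IInt8MinMaxCalibrator):
    def __init__(self, input_ids, attention_mask, batch_size=16, cache_file="calib.cache"):
        trt.IInt8MinMaxCalibrator.__init__(self)
        self.batch_size = batch_size
        self.cache_file = cache_file
        # dtypes must match the network's input dtypes (int32 assumed here).
        self.data = {"input_ids": input_ids, "attention_mask": attention_mask}
        self.index = 0
        # One device buffer per input, sized for a single batch.
        self.buffers = {
            name: cuda.mem_alloc(arr[:batch_size].nbytes)
            for name, arr in self.data.items()
        }

    def get_batch_size(self):
        return self.batch_size

    def get_batch(self, names):
        if self.index + self.batch_size > len(self.data["input_ids"]):
            return None  # no more calibration batches
        ptrs = []
        for name in names:
            batch = np.ascontiguousarray(
                self.data[name][self.index:self.index + self.batch_size]
            )
            cuda.memcpy_htod(self.buffers[name], batch)
            ptrs.append(int(self.buffers[name]))
        self.index += self.batch_size
        return ptrs

    def read_calibration_cache(self):
        try:
            with open(self.cache_file, "rb") as f:
                return f.read()
        except FileNotFoundError:
            return None

    def write_calibration_cache(self, cache):
        with open(self.cache_file, "wb") as f:
            f.write(cache)
```

The calibrator is attached via `config.int8_calibrator` before building the engine; per the discussion above, however, BERT-like models still end up in a Myelin ForeignNode, so the calibration scales are effectively unused.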