Closed by pommedeterresautee 2 years ago
Hello @pommedeterresautee,
The transpose between the initializer + Q/DQ and the matmul will not hurt performance in TensorRT; these nodes are processed during the engine build stage.
As for quant_bert.py, it is no longer used and we will remove it; the functionality is already upstream: https://huggingface.co/docs/transformers/model_doc/qdqbert
For INT8 being slower than FP16, I have created an internal bug to track this, thanks!
Thank you, indeed I have tried the new QDQBert model and it works as expected (2X faster than FP16 on an RTX 3090).
Description
When using pytorch_quantization with Hugging Face models, whatever the sequence length, batch size, or model, INT8 is always slower than FP16. The TensorRT models are produced with trtexec (see below).
Many QDQ nodes sit just before a transpose node followed by a matmul. I am under the impression this may be a source of the performance issue (https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#qdq-placement-recs). According to https://github.com/NVIDIA/sampleQAT/blob/master/postprocess_onnx.py:
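For per-tensor symmetric quantization, a Q/DQ pair commutes with a transpose (quantization is element-wise, so layout does not matter), which is why TensorRT can move Q/DQ nodes past the transpose during engine build. A minimal sketch in plain Python, with a purely illustrative scale value and no TensorRT involved:

```python
# Sketch: per-tensor fake-quantization (Q/DQ) commutes with transpose.
# Pure Python, no TensorRT; the scale below is an arbitrary illustrative value.

def quant_dequant(x, scale):
    """Simulate QuantizeLinear + DequantizeLinear with a per-tensor scale."""
    return [[max(-128, min(127, round(v / scale))) * scale for v in row]
            for row in x]

def transpose(x):
    return [list(col) for col in zip(*x)]

scale = 0.05  # hypothetical per-tensor scale
x = [[0.12, -0.7, 1.3], [0.9, 0.0, -1.1]]

# Q/DQ then transpose ...
a = transpose(quant_dequant(x, scale))
# ... equals transpose then Q/DQ: element-wise ops ignore layout.
b = quant_dequant(transpose(x), scale)
assert a == b
```

This is only the per-tensor case; per-channel weight quantization is axis-dependent, so the graph-level rewrite there is less trivial.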
This may be linked to https://github.com/NVIDIA/TensorRT/issues/1532
Second point: it seems that the quant_bert module (https://github.com/NVIDIA/TensorRT/blob/main/tools/pytorch-quantization/pytorch_quantization/nn/modules/quant_bert.py) is not enabled in quant_modules (https://github.com/NVIDIA/TensorRT/blob/main/tools/pytorch-quantization/pytorch_quantization/quant_modules.py#L26)
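As I understand it, quant_modules.initialize() monkey-patches classes listed in a default (module, class name, quantized class) map, so anything absent from that map is never swapped in. A simplified, hypothetical sketch of that pattern (the names below are illustrative stand-ins, not the actual TensorRT code):

```python
# Hypothetical sketch of the quant_modules monkey-patch pattern.
# FakeLinear / QuantLinear / DEFAULT_QUANT_MAP are illustrative stand-ins.
import types

nn = types.SimpleNamespace()          # stand-in for torch.nn

class FakeLinear:                     # stand-in for torch.nn.Linear
    pass

class QuantLinear(FakeLinear):        # stand-in for quant_nn.QuantLinear
    pass

nn.Linear = FakeLinear

# Analogue of the default quant map: Linear/Conv-style entries only,
# with no BERT entry registered.
DEFAULT_QUANT_MAP = [(nn, "Linear", QuantLinear)]

def initialize(quant_map=DEFAULT_QUANT_MAP):
    """Swap each listed class for its quantized counterpart."""
    originals = [(mod, name, getattr(mod, name)) for mod, name, _ in quant_map]
    for mod, name, quant_cls in quant_map:
        setattr(mod, name, quant_cls)
    return originals  # kept so a deactivate() could restore the originals

initialize()
assert nn.Linear is QuantLinear
# Any class absent from the map (e.g. a BERT attention module) is left
# untouched, which matches the observation that quant_bert is not enabled.
```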
[Netron screenshot of the INT8 quantized model]
Environment
TensorRT Version: 8.2 (preview)
NVIDIA GPU: RTX 3090
NVIDIA Driver Version: 495.29.05
CUDA Version: 11.5
CUDNN Version: 8.3.0.98
Operating System: Linux, Ubuntu 21.04
Python Version (if applicable): 3.9
PyTorch Version (if applicable): 1.10
Baremetal or Container (if so, version): Baremetal
Relevant Files
The ONNX file is too big to attach; it can be reproduced with the script below.
Steps To Reproduce
To recreate both the non-quantized model and the quantized artefacts (requires Hugging Face transformers + pytorch_quantization), run the notebook below (at the very end there are 2 trtexec commands).