NVIDIA / TensorRT

NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.
https://developer.nvidia.com/tensorrt
Apache License 2.0

Missing scale and zero-point for many layers when calibrating a Vision Transformer ONNX model for INT8 precision #4068

Open Shalini194 opened 1 month ago

Shalini194 commented 1 month ago

Description

I generated a calibration cache for a Vision Transformer ONNX model using the EntropyCalibration2 method. When trying to generate an engine file from the cache file for INT8 precision using trtexec, I got many missing scale and zero-point warnings for Constant, Shuffle, and SoftMax layers.

[08/08/2024-14:59:41] [I] [TRT] Reading Calibration Cache for calibrator: EntropyCalibration2
[08/08/2024-14:59:41] [I] [TRT] Generated calibration scales using calibration cache. Make sure that calibration cache has latest scales.
[08/08/2024-14:59:41] [I] [TRT] To regenerate calibration cache, please delete the existing one. TensorRT will generate a new calibration cache.
[08/08/2024-14:59:41] [W] [TRT] Missing scale and zero-point for tensor (Unnamed Layer* 7) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[08/08/2024-14:59:41] [W] [TRT] Missing scale and zero-point for tensor (Unnamed Layer* 8) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[08/08/2024-14:59:41] [W] [TRT] Missing scale and zero-point for tensor (Unnamed Layer* 9) [Shuffle]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[08/08/2024-14:59:41] [W] [TRT] Missing scale and zero-point for tensor (Unnamed Layer* 10) [Shuffle]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[08/08/2024-14:59:41] [W] [TRT] Missing scale and zero-point for tensor (Unnamed Layer* 20) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[08/08/2024-14:59:41] [W] [TRT] Missing scale and zero-point for tensor (Unnamed Layer* 23) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[08/08/2024-14:59:41] [W] [TRT] Missing scale and zero-point for tensor (Unnamed Layer* 26) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[08/08/2024-14:59:41] [W] [TRT] Missing scale and zero-point for tensor /backbone/layers.0/attn/Softmax_output_0, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[08/08/2024-14:59:41] [W] [TRT] Missing scale and zero-point for tensor (Unnamed Layer* 31) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[08/08/2024-14:59:41] [W] [TRT] Missing scale and zero-point for tensor (Unnamed Layer* 43) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor

I have observed a similar warning for the SoftMax layer in a ResNet18 model. I would like to know the reason for these warnings. Is INT8 support for these layers not present in TensorRT?

My end goal is to run the entire model in INT8 precision on NVIDIA hardware. Can you please suggest how I can achieve this?
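For reference, the flow above (an EntropyCalibration2 calibrator feeding an INT8 build) looks roughly like this in the TensorRT Python API. This is a minimal sketch, not the actual script: the model path, input shape, and random calibration data are placeholders. Any tensor that ends up without a calibration scale triggers the "Missing scale and zero-point" warning and falls back to a non-INT8 implementation:

```python
import numpy as np
import tensorrt as trt
import pycuda.autoinit  # noqa: F401  (creates a CUDA context for the calibrator)
import pycuda.driver as cuda


class EntropyCalibrator(trt.IInt8EntropyCalibrator2):
    """Minimal EntropyCalibration2 calibrator; batch shape and cache path are placeholders."""

    def __init__(self, batches, cache_file="vit.cache", input_shape=(1, 3, 224, 224)):
        super().__init__()
        self.batches = iter(batches)
        self.cache_file = cache_file
        # Device buffer for one batch of float32 inputs.
        self.device_input = cuda.mem_alloc(int(np.prod(input_shape)) * np.float32().nbytes)

    def get_batch_size(self):
        return 1

    def get_batch(self, names):
        try:
            batch = next(self.batches)
        except StopIteration:
            return None  # no more data -> calibration finishes
        cuda.memcpy_htod(self.device_input, np.ascontiguousarray(batch, dtype=np.float32))
        return [int(self.device_input)]

    def read_calibration_cache(self):
        try:
            with open(self.cache_file, "rb") as f:
                return f.read()
        except FileNotFoundError:
            return None

    def write_calibration_cache(self, cache):
        with open(self.cache_file, "wb") as f:
            f.write(cache)


# Placeholder calibration data: replace with real preprocessed images.
calib_batches = [np.random.rand(1, 3, 224, 224).astype(np.float32) for _ in range(8)]

logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)
with open("vit.onnx", "rb") as f:  # placeholder ONNX path
    if not parser.parse(f.read()):
        raise RuntimeError("failed to parse ONNX model")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.INT8)                     # same effect as trtexec --int8
config.int8_calibrator = EntropyCalibrator(calib_batches)

serialized = builder.build_serialized_network(network, config)
with open("vit_int8.engine", "wb") as f:
    f.write(serialized)
```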

Environment

TensorRT Version: 8.6.2.3

NVIDIA GPU: Orin Nano 8 GB

CUDA Version: 12.2

CUDNN Version: 8.9.4

Operating System: Jetpack 6.0

Python Version: 3.10

PyTorch Version: 2.3.0

lix19937 commented 1 month ago

How do you perform quantization? trtexec PTQ?

lix19937 commented 1 month ago

If you use trtexec --int8 (PTQ), Softmax runs in float. By the way, for ViT-like models, it is better to use ModelOpt.

Shalini194 commented 1 month ago

Thanks for the prompt response.

I used this calibration script to generate the calibration cache file for ResNet18. The cache file is generated with the warning below.

[screenshot of the warning]

I generated the engine from the above cache file using this command: trtexec --onnx=resnet18.onnx --saveEngine=resnet18.engine --int8 --calib=resnet18.cache --useCudaGraph --dumpLayerInfo --profilingVerbosity=detailed

I find that the Gemm and Softmax layers are running in FP32.

[screenshot attached]

I am converting a pretrained PyTorch model -> ONNX model -> TensorRT engine. Is it possible to achieve INT8 inference for the Gemm and Softmax layers using ModelOpt?

Should we use ONNX PTQ in Model Optimizer?

Shalini194 commented 4 weeks ago

Hi @lix19937 ,

I performed PTQ on the ResNet18 model using ModelOpt ONNX PTQ and dumped the layer info. I find that the Gemm and Softmax layers are still in FP32.

[screenshot attached]

For ViT with ModelOpt, I could find only 3 layers in the layer info, and the Top-1/Top-5 accuracy is 0/3, respectively.

Layer 1 - LayerType: Reformat, FP32 input, INT8 output
Layer 2 - LayerType: CaskConvolution, INT8 input, FP32 output
Layer 3 - LayerType: Myelin, FP32 input, FP32 output

All the nodes after the first Conv layer are fused into a single foreign node with layer type Myelin, and both its input and output are FP32. Does this mean that all the layers after the first Conv run in float?

Attaching both the ResNet18 and ViT layer-info logs for your reference: resnet18_onnxptq_trtexec_int8.txt, vit_onnxptq_trtexec_int8.txt

Can you please clarify why such layers run in float even after applying PTQ with ModelOpt?

lix19937 commented 3 weeks ago

Is it possible to achieve INT8 inference for the Gemm and Softmax layers using ModelOpt?

Quantization of the following op types is supported: ['Add', 'AveragePool', 'BatchNormalization', 'Clip', 'Conv', 'ConvTranspose', 'Gemm', 'GlobalAveragePool', 'MatMul', 'MaxPool', 'Mul'].

For your ResNet18 model, Gemm can run in INT8; Softmax cannot.
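A minimal ModelOpt ONNX PTQ invocation looks roughly like the sketch below. The exact keyword names (onnx_path, calibration_data, quantize_mode, output_path) and the calib.npy file are assumptions here and may differ across ModelOpt versions, so treat this as an outline rather than the definitive API:

```python
# Sketch of ModelOpt ONNX PTQ; argument names are assumptions, check the docs of your
# installed nvidia-modelopt version. calib.npy is a placeholder calibration array.
import numpy as np
from modelopt.onnx.quantization import quantize

calib_data = np.load("calib.npy")  # e.g. (N, 3, 224, 224) float32, preprocessed images

quantize(
    onnx_path="resnet18.onnx",
    calibration_data=calib_data,
    quantize_mode="int8",
    output_path="resnet18.quant.onnx",
)
```

The resulting ONNX carries explicit Q/DQ nodes, and the engine is then built with trtexec as before (e.g. with --int8 or --best).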

Shalini194 commented 3 weeks ago

@lix19937 ,

For the Gemm layer, the input datatype is INT8 and the output datatype is FP32. What will be the computation precision in this case?

I tried to generate TensorRT engines for a couple of classification, segmentation, and transformer models.

The following layers always run in FP32:

Concat, Transpose, Reshape, Gather, Resize, ArgMax, LayerNormalization, InstanceNormalization, SoftMax

And a few layers run in INT8 occasionally and in FP32 at other times:

Add, Expand, Pow, MatMul, Erf, etc.

I tried both implicit quantization using a calibration cache and explicit quantization using ModelOpt ONNX PTQ. I see that these ops are not present in the supported-op list of ModelOpt that you shared. Is there any other way to perform INT8 inference for these layers?

I understand that some layers like Reshape, Gather, and Concat are memory operations that do not perform any computation and don't need quantization. But doesn't creating a TensorRT INT8 engine and deploying it on hardware mean a purely quantized engine file with all layers running in INT8 precision?

lix19937 commented 3 weeks ago

For the Gemm layer, the input datatype is INT8 and the output datatype is FP32. What will be the computation precision in this case?

INT8-in, FP32/FP16-out. For example, in C[m][n] += A[m][k] * B[k][n], A and B are INT8, while the accumulator C is FP32/FP16.
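A tiny NumPy illustration of that arithmetic (the scales and shapes are made up) shows the INT8 operands being multiplied, accumulated in a wider integer type, and dequantized to FP32:

```python
import numpy as np

# Made-up per-tensor scales for the quantized operands.
scale_a, scale_b = 0.02, 0.05

# INT8 inputs A (m x k) and B (k x n), as the GEMM kernel sees them.
A = np.random.randint(-128, 128, size=(4, 8), dtype=np.int8)
B = np.random.randint(-128, 128, size=(8, 3), dtype=np.int8)

# The kernel multiplies INT8 values but accumulates in a wider integer type ...
acc = A.astype(np.int32) @ B.astype(np.int32)

# ... and the dequantized result C comes out in FP32 (or FP16): "int8-in, fp32-out".
C = acc.astype(np.float32) * (scale_a * scale_b)
print(C.dtype)  # float32
```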

But doesn't creating a TensorRT INT8 engine and deploying it on hardware mean a purely quantized engine file with all layers running in INT8 precision?

Yes. Usually, when you use trtexec to build an engine, you use --int8 or, better, --best:

--int8: enable INT8 precision, in addition to FP32
--best: enable all precisions to achieve the best performance

So an INT8 engine deployed on hardware does not mean a purely quantized engine file with all layers running in INT8 precision.
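In the TensorRT Python API, those trtexec flags correspond roughly to the builder-config precision flags below (a sketch; enabling a precision only permits it, and TensorRT still chooses the fastest kernel per layer):

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
config = builder.create_builder_config()

# trtexec --int8: allow INT8 kernels in addition to FP32.
config.set_flag(trt.BuilderFlag.INT8)

# trtexec --best: additionally allow FP16 and TF32, so the builder can pick the
# fastest available precision for every layer.
config.set_flag(trt.BuilderFlag.FP16)
config.set_flag(trt.BuilderFlag.TF32)
```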

lix19937 commented 3 weeks ago

Is it possible to achieve INT8 inference for the Gemm and Softmax layers using ModelOpt? Should we use ONNX PTQ in Model Optimizer?

With pytorch_quantization, I get the following:

[screenshots attached]

Shalini194 commented 3 weeks ago


@lix19937, Thanks for the clarification.

Based on my understanding, TensorRT autotunes tensor types to end up with the fastest engine. If a layer runs faster in INT8 and has quantization scales assigned to its data inputs and outputs, then an INT8-precision kernel is assigned to that layer. Otherwise, a higher-precision FP32 kernel is assigned.

So, can I conclude that, for some reason (not an optimal fit, accuracy issues in INT8, reducing quantization noise), a few layers such as LayerNormalization, SoftMax, Concat, Reshape, etc. are retained in FP32 in order to create an optimal engine?
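One way to confirm which precision each layer actually received is the engine inspector, sketched below. The engine path is a placeholder, and the per-layer detail is only available if the engine was built with detailed profiling verbosity (as in the trtexec command above):

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.INFO)
runtime = trt.Runtime(logger)

with open("resnet18.engine", "rb") as f:  # placeholder engine path
    engine = runtime.deserialize_cuda_engine(f.read())

inspector = engine.create_engine_inspector()
# Each entry reports the layer's tactic and its input/output tensor formats
# (e.g. Int8 vs Float), which shows where TensorRT fell back to FP32/FP16.
for i in range(engine.num_layers):
    print(inspector.get_layer_information(i, trt.LayerInformationFormat.JSON))
```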

Shalini194 commented 3 weeks ago

With pytorch_quantization, I get the following: [screenshots attached]

Thanks a lot for sharing your inputs. I am able to quantize the Gemm layer of the ResNet18 model using ModelOpt ONNX PTQ.

YixuanSeanZhou commented 3 weeks ago

@Shalini194 Quick question: when doing ModelOpt quantization for ResNet18, did you see any regression in accuracy when running the model using TensorRT (compared to ONNX Runtime or PyTorch)?

Were you quantizing the Conv layers? And in the final TRT engine, were Conv and BatchNorm/ReLU fused together into a single INT8 op?

Thanks!

Shalini194 commented 3 weeks ago

@YixuanSeanZhou,

I was able to retain the Top-1 and Top-5 accuracy for the ModelOpt quantized ResNet18 compared to ONNX and PyTorch.

Yes, all the Conv layers were quantized. In the final TensorRT engine, Conv+BN+ReLU are fused into a single INT8 operation, with all layers running in INT8 except for Softmax. You can refer to the ResNet18 logs I shared in previous comments for more details. Thanks.

lix19937 commented 3 weeks ago

So, can I conclude that, for some reason (not an optimal fit, accuracy issues in INT8, reducing quantization noise), a few layers such as LayerNormalization, SoftMax, Concat, Reshape, etc. are retained in FP32 in order to create an optimal engine?

LayerNormalization and SoftMax usually run in FP32; shape-type ops (like Concat, Reshape, Permute) follow the precision of the previous layer.

Shalini194 commented 2 weeks ago

@lix19937, Thanks again for clarifying. For Vision Transformer-like models quantized with Model Optimizer, all the layers after the first Conv layer are mapped to a single layer with the layer type 'Myelin'. The rest of the graph contains multiple MatMul and Add layers that were quantized. How does it run as a single layer with multiple Q/DQ nodes?