NVIDIA / TensorRT

NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.
https://developer.nvidia.com/tensorrt
Apache License 2.0

Quantization Error Analysis on Object Detection #1114

Closed JosephChenHub closed 3 years ago

JosephChenHub commented 3 years ago

## Description

When doing INT8 PTQ (post-training quantization), we notice that the drop in accuracy on image classification or semantic segmentation is rather small. However, for detectors like YOLOv5 there is a large mAP gap between the FP16 and INT8 models. The comparison results are listed below; for example, for YOLOv5s the mAP on the COCO validation set decreases from 0.362 to 0.054 (or 0.316; 0.054 is the result with the original ONNX graph and 0.316 the result with our custom SiLU plugin).

[Image: mAP/FPS comparison table of the FP16 and INT8 models]

The huge gap between 0.362 and 0.054 makes the model unreliable. We observed that the YOLOv5s model uses the SiLU activation function (SiLU(x) = x * sigmoid(x)), which may reduce the accuracy due to its nonlinearity. [Image: plot of the SiLU activation]
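As a toy illustration of that concern (not TensorRT code, only a sketch of the effect): the snippet below fake-quantizes the input of SiLU with a single per-tensor scale, the way linear INT8 quantization with min-max calibration would, and measures how much error propagates through the activation.

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

def fake_quant(x, amax, num_bits=8):
    # Symmetric linear quantization with one per-tensor scale (min-max style).
    scale = amax / (2 ** (num_bits - 1) - 1)
    return np.clip(np.round(x / scale), -127, 127) * scale

x = np.linspace(-6.0, 6.0, 10001)
amax = np.abs(x).max()                      # min-max calibration of the input
err = np.abs(silu(fake_quant(x, amax)) - silu(x))
print(f"max |SiLU(q(x)) - SiLU(x)| = {err.max():.4f}")
```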

Hence, we manually wrote a SiLU plugin to replace the original Sigmoid and Mul operators in the ONNX graph, which looks like this: [Images: ONNX graph before and after replacing Sigmoid + Mul with the custom SiLU node]
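For reference, a hedged sketch of this kind of graph surgery with onnx-graphsurgeon, assuming the custom plugin is registered under the op name "SiLU" so that the parser can map the new node onto it (file names are placeholders):

```python
import onnx
import onnx_graphsurgeon as gs

graph = gs.import_onnx(onnx.load("yolov5s.onnx"))

for sigmoid in [n for n in graph.nodes if n.op == "Sigmoid"]:
    x = sigmoid.inputs[0]
    # Look for the Mul that computes x * sigmoid(x).
    for mul in list(sigmoid.outputs[0].outputs):
        if mul.op == "Mul" and x in mul.inputs:
            y = mul.outputs[0]
            silu = gs.Node(op="SiLU", name=sigmoid.name + "_silu",
                           inputs=[x], outputs=[y])
            graph.nodes.append(silu)
            # Detach the old pair; cleanup() drops the dangling nodes.
            sigmoid.outputs.clear()
            mul.outputs.clear()
            break

graph.cleanup().toposort()
onnx.save(gs.export_onnx(graph), "yolov5s_silu.onnx")
```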

As a result, we can see from the table above that our detector stays stable, though a small gap between the FP16 and INT8 models remains (for YOLOv5s, the custom plugin improves mAP/FPS from 0.054/464 to 0.316/327). Furthermore, we do partial quantization (based on the top-5 most sensitive conv layers; a build-time sketch of this idea is given after the questions below), and the mAP is 0.332, which is still lower than 0.362. So the questions are:

  1. In the original implementation, we see the log line `Layer(PointWiseV2): PWN(Sigmoid_213, Mul_214), Tactic: 24, 345[Int8(128,40,40)] -> 347[Int8(128,40,40)]`. How does TensorRT implement the INT8 operation for this fused layer (Sigmoid + Mul)?
  2. In our own custom plugin, we implement the SiLU as

         #include <cuda_fp16.h>  // for half and hexp()

         template <typename T>
         __device__ T _silu_op(const T input);

         template <>
         __device__ __forceinline__ float _silu_op(const float input) {
             return input / (static_cast<float>(1) + expf(-input));
         }

         template <>
         __device__ __forceinline__ half _silu_op(const half input) {
             typedef half T;
             return input / (static_cast<T>(1) + hexp(-input));
         }

         // Elementwise kernel: one thread per element.
         template <typename T>
         __global__ void _kernel(const T* __restrict__ inputs, T* __restrict__ outputs, const size_t n) {
             const size_t tid = threadIdx.x + blockIdx.x * blockDim.x;
             if (tid < n) {
                 outputs[tid] = _silu_op(inputs[tid]);
             }
         }

     Does there exist a faster way to achieve this op, or will TensorRT optimize this op like `ReLU` and `Sigmoid` in the future?
  3. We notice that the sensitive layers are usually the first or last few conv layers in the CNN, and that the nonlinear modules are also sensitive. Is the PTQ error mainly caused by the nonlinearity when doing linear quantization? How can we prevent that? As far as we know, QAT (quantization-aware training) support in TensorRT is still not available. The drop in accuracy makes QAT even more desirable. When will TensorRT support QAT well?
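For context on the partial quantization mentioned above, here is a hedged sketch (TensorRT 7.x Python API) of one common way to do it: build an INT8 engine but pin the most sensitive layers to FP16. The layer names are hypothetical placeholders, and `calibrator` is assumed to be an existing INT8 calibrator object.

```python
import tensorrt as trt

# Hypothetical names of the most sensitive layers found by a sensitivity scan.
SENSITIVE_LAYERS = {"Conv_0", "Conv_3", "Conv_301", "Conv_305", "Conv_309"}

def build_partial_int8_engine(onnx_path, calibrator, sensitive_layers=SENSITIVE_LAYERS):
    logger = trt.Logger(trt.Logger.INFO)
    builder = trt.Builder(logger)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    parser = trt.OnnxParser(network, logger)
    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            raise RuntimeError(f"ONNX parse failed: {parser.get_error(0)}")

    config = builder.create_builder_config()
    config.max_workspace_size = 1 << 30
    config.set_flag(trt.BuilderFlag.INT8)
    config.set_flag(trt.BuilderFlag.FP16)
    config.set_flag(trt.BuilderFlag.STRICT_TYPES)  # honor per-layer precisions (TRT 7.x)
    config.int8_calibrator = calibrator

    # Keep the sensitive layers out of INT8 by pinning them to FP16.
    for i in range(network.num_layers):
        layer = network.get_layer(i)
        if layer.name in sensitive_layers:
            layer.precision = trt.float16
            layer.set_output_type(0, trt.float16)

    return builder.build_engine(network, config)  # TRT 7.x builder API
```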

## Environment

**TensorRT Version**: 7.2.1
**NVIDIA GPU**:  2080 Ti
**NVIDIA Driver Version**: 10.2
**CUDA Version**: 10.2
**CUDNN Version**: 8.0
**Operating System**: Ubuntu 18.04
**Python Version (if applicable)**: 3.6
**Tensorflow Version (if applicable)**: 
**PyTorch Version (if applicable)**: 1.8.0
**Baremetal or Container (if so, version)**: 

## References
yolov5: https://github.com/ultralytics/yolov5
ttyio commented 3 years ago

Hello @JosephChenHub , good work and thanks for sharing this analysis!

Answers to your questions:

  1. We have had a code generator that fuses pointwise operations since the 7.0 release. On some platforms this can be further fused with the neighboring conv/gemm.
  2. We will keep enhancing the code generator for pointwise ops instead of adding specific kernels for new activations like SiLU. BTW, could you try replacing the SiLU with ReLU (see https://github.com/NVIDIA/TensorRT/issues/997)? Not sure if it also helps in your case.
  3. QAT support is in the next major release (around 2 months away). Currently we have Polygraphy to debug accuracy issues, and we plan to add a developer guide on how to debug them. There is also some open-source code that might give some intuition on how to find these sensitive layers; see --sensitivity in https://github.com/NVIDIA/NeMo/blob/main/examples/asr/quantization/speech_to_text_quant_infer.py (a rough sketch of the idea follows below). It is not a generalized solution, but we have seen it work for networks like QuartzNet and EfficientNet.
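For readers who want to try that per-layer sensitivity scan, here is a rough sketch of the idea (not the NeMo script itself), assuming the model has already been instrumented with pytorch-quantization and that `evaluate(model)` is a user-supplied function returning the validation metric:

```python
from pytorch_quantization import nn as quant_nn

def layer_sensitivity(model, evaluate):
    """Measure the metric with only one quantizer enabled at a time."""
    quantizers = [(name, m) for name, m in model.named_modules()
                  if isinstance(m, quant_nn.TensorQuantizer)]
    for _, q in quantizers:              # start with everything unquantized
        q.disable()
    results = {}
    for name, q in quantizers:
        q.enable()
        results[name] = evaluate(model)  # e.g. mAP on a small validation subset
        q.disable()
    return results                       # note: all quantizers are left disabled
```

Layers whose metric drops the most when their quantizers are enabled are the candidates to keep in higher precision.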
JosephChenHub commented 3 years ago


OK, hope TensorRT becomes even more powerful!

glenn-jocher commented 3 years ago

@JosephChenHub @ttyio thanks for the YOLOv5 quantization analysis and response. One point I might add is that SiLU (sometimes referred to as Swish) is a very common activation layer used in various vision AI models.

In YOLOv5 we originally used nn.LeakyReLU(0.1) (inherited from Joseph Redmon's YOLOv3), and we realized a gain of about +1 AP when moving from nn.LeakyReLU(0.1) to nn.SiLU(), with little training or inference speed/memory penalty.

Considering the quantization problem overall, it is unclear to me whether the SiLU/LeakyReLU(0.1) trade-off shifts, i.e., whether an nn.LeakyReLU(0.1) implementation may be more favorable when targeting TensorRT deployments, given the shortcomings described by @ttyio above.

Disclaimer: not a TensorRT or QAT expert.

EDIT: to modify the YOLOv5 baseline activation function you would change L36 in the Conv() module in yolov5/models/common.py. Naturally, models must be retrained with this change; it is not a change that can be made at export time on SiLU-trained models. https://github.com/ultralytics/yolov5/blob/ed2c74218d6d46605cc5fa68ce9bd6ece213abe4/models/common.py#L30-L43
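For anyone experimenting with this, a minimal sketch (assuming the activations are stored as plain nn.SiLU modules, as in the linked Conv() block) that swaps every SiLU for LeakyReLU(0.1); as noted above, the model must then be retrained before export:

```python
import torch.nn as nn

def replace_silu_with_leaky_relu(module: nn.Module) -> nn.Module:
    """Recursively replace every nn.SiLU with nn.LeakyReLU(0.1)."""
    for name, child in module.named_children():
        if isinstance(child, nn.SiLU):
            setattr(module, name, nn.LeakyReLU(0.1, inplace=True))
        else:
            replace_silu_with_leaky_relu(child)
    return module

# model = replace_silu_with_leaky_relu(model)  # then retrain before ONNX export
```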

ttyio commented 3 years ago

@JosephChenHub, in case you are not aware of it: besides the min-max and entropy calibrators, there is also a percentile-max calibrator you can play with; a 99.99% percentile-max sample is in

https://github.com/NVIDIA/TensorRT/blob/master/demo/BERT/helpers/calibrator.py#L102
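For reference, a stripped-down sketch of such a percentile calibrator, loosely modelled on the linked sample (assumptions: PyCUDA for the device buffer, `batches` is an iterable of fixed-shape contiguous float32 NumPy arrays, and the network has a single input):

```python
import numpy as np
import pycuda.autoinit  # noqa: F401  (creates a CUDA context)
import pycuda.driver as cuda
import tensorrt as trt

class PercentileCalibrator(trt.IInt8LegacyCalibrator):
    def __init__(self, batches, batch_size, cache_file="calib.cache", quantile=0.9999):
        trt.IInt8LegacyCalibrator.__init__(self)
        self.batches = iter(batches)
        self.batch_size = batch_size
        self.cache_file = cache_file
        self.quantile = quantile
        self.device_input = None

    def get_batch_size(self):
        return self.batch_size

    def get_batch(self, names):
        batch = next(self.batches, None)
        if batch is None:
            return None                      # no more calibration data
        if self.device_input is None:
            self.device_input = cuda.mem_alloc(batch.nbytes)
        cuda.memcpy_htod(self.device_input, np.ascontiguousarray(batch))
        return [int(self.device_input)]

    def get_quantile(self):
        return self.quantile                 # 0.9999 -> 99.99% percentile max

    def get_regression_cutoff(self):
        return 1.0

    def read_calibration_cache(self):
        return None

    def write_calibration_cache(self, cache):
        with open(self.cache_file, "wb") as f:
            f.write(cache)

    # Histogram caching is skipped in this sketch.
    def read_histogram_cache(self, length):
        return None

    def write_histogram_cache(self, data, length):
        pass
```

Passing quantile=0.9999 corresponds to the 99.99% percentile-max calibration mentioned above; lowering it clips more of the rare outliers.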

I will close this issue; please reopen if you still have questions, thanks!

H19012 commented 3 years ago


Were you able to get better INT8 mAP by setting a different value in calibrator.py? BTW, YOLOv5 3.0 uses Hardswish, which suffers from the same issue.

m-guise commented 2 years ago

Hi @ttyio, it's been a year since you posted about QAT support. Any updates on when it will be released?