NVIDIA / TensorRT

NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.
https://developer.nvidia.com/tensorrt
Apache License 2.0

Practical aspects about neural networks quantization with TensorRT. #2839

Closed matcosta23 closed 1 year ago

matcosta23 commented 1 year ago

Background

I am currently exploring the topic of deep learning model quantization techniques. In the official NVIDIA TensorRT documentation, we can see that TensorRT supports quantization and applies it to both the activations and the weights of the provided model.

I am working with a repository that offers both explicit quantization (with the pytorch_quantization library) and implicit quantization (implemented through the Builder object of the Python API).

Explicit and Implicit Quantization

I would like to better understand the details described in the Explicit Versus Implicit Quantization section.

Explicit Quantization

When performing explicit quantization, I provide an ONNX graph with Q/DQ nodes to be converted to a '.trt' file. Can I assume that all fake-quantization nodes (as seen in Netron, for example) will be replaced by real INT8 implementations of the associated layers?
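For context, the Q/DQ graph I feed to TensorRT is produced roughly like this (a minimal sketch, assuming a torchvision ResNet-50 as a stand-in for the repository's actual model and training loop):

```python
# Minimal explicit-quantization sketch: insert fake-quant nodes with
# pytorch_quantization and export them as ONNX Q/DQ pairs.
# The model and input shape are assumptions, not the repository's exact setup.
import torch
import torchvision
from pytorch_quantization import quant_modules
from pytorch_quantization import nn as quant_nn

# Monkey-patch torch.nn layers with quantized counterparts (QuantConv2d,
# QuantLinear, ...) that carry TensorQuantizer modules.
quant_modules.initialize()
model = torchvision.models.resnet50(pretrained=True).eval()

# ... calibration / QAT fine-tuning goes here so the quantizers hold valid ranges ...

# Export the fake-quant nodes as ONNX QuantizeLinear/DequantizeLinear (Q/DQ) pairs.
quant_nn.TensorQuantizer.use_fb_fake_quant = True
dummy_input = torch.randn(1, 3, 224, 224)
torch.onnx.export(model, dummy_input, "resnet50_qdq.onnx", opset_version=13)
```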

Implicit Quantization

Moreover, the following statement from the documentation above doesn't say much about which operations are or are not quantized under implicit quantization:

"When processing implicitly quantized networks, TensorRT treats the model as a floating-point model when applying the graph optimizations, and uses INT8 opportunistically to optimize layer execution time. If a layer runs faster in INT8, then it executes in INT8. Otherwise, FP32 or FP16 is used. In this mode, TensorRT is optimizing for performance only, and you have little control over where INT8 is used - even if you explicitly set the precision of a layer at the API level, TensorRT may fuse that layer with another during graph optimization, and lose the information that it must execute in INT8. TensorRT’s PTQ capability generates an implicitly quantized network."

Possible Strategies

What strategies can I use for profiling an inference run from a serialized network generated by TensorRT?

I have already tried Nsight Systems and trtexec with profiling options, but I only get timing information from these two tools. Would you have another profiling approach that would allow me to verify the operating precision of every model layer during an evaluation run?

zerollzeng commented 1 year ago

Your question is hard to answer since it covers a lot of ground. We have some GTC talks about these topics that you can learn from.

What strategies can I use for profiling an inference run from a serialized network generated by TensorRT?

In the TensorRT verbose log, you can see the precision of every layer in the final engine; I think that would be a good start.
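Something like the following sketch (assuming TensorRT >= 8.2, where the engine inspector is available; the engine file name is a placeholder) dumps per-layer information, including precision, from a built engine:

```python
# Sketch: inspect per-layer precision in a serialized engine (assumes TensorRT >= 8.2).
import tensorrt as trt

# Building (or rebuilding) with a VERBOSE logger also prints each layer's chosen precision.
logger = trt.Logger(trt.Logger.VERBOSE)
runtime = trt.Runtime(logger)

with open("model_int8.trt", "rb") as f:
    engine = runtime.deserialize_cuda_engine(f.read())

# The engine inspector reports layer names, tactics, and precisions.
# For full detail, build the engine with
# config.profiling_verbosity = trt.ProfilingVerbosity.DETAILED.
inspector = engine.create_engine_inspector()
print(inspector.get_engine_information(trt.LayerInformationFormat.JSON))
```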

ttyio commented 1 year ago

Closing since there has been no activity for more than 3 weeks, pls reopen if you still have questions, thank you!