Description
I wanted to use Polygraphy to find out which layers lost precision. I used the command below to convert the model and then inspected the layers of the resulting engine, but almost all of the reported precisions are still FP32. Is there anything wrong with my command?

Convert the model with FP16 and a network postprocessing script:
polygraphy convert model.onnx --fp16 --precision-constraints obey --trt-npps add_constraints.py -o model.engine --verbose > log.txt
add_constraints.py:

import tensorrt as trt

def postprocess(network):
    cnt = 0
    for layer in network:
        if "/model/simplefeature" in layer.name or "/model/encoder" in layer.name or "/model/decoder" in layer.name or "/postprocessor" in layer.name:
            if layer.precision == trt.float16:
                layer.precision = trt.float32
            for i in range(layer.num_outputs):
                if layer.get_output_type(i) == trt.float16:
                    layer.set_output_type(i, trt.float32)
            cnt += 1
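For comparison, here is a rough standalone sketch of what the same build looks like with the TensorRT Python API directly (FP16 plus OBEY_PRECISION_CONSTRAINTS, pinning the same layer scopes to FP32). This is not the workflow actually used above, just an assumed equivalent; it also pins the scopes unconditionally rather than checking the current precision first:

# Hypothetical standalone equivalent of the polygraphy convert command above.
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.VERBOSE)

builder = trt.Builder(TRT_LOGGER)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, TRT_LOGGER)

with open("model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("Failed to parse the ONNX model")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)
config.set_flag(trt.BuilderFlag.OBEY_PRECISION_CONSTRAINTS)

# Same scopes as add_constraints.py, forced to FP32.
scopes = ("/model/simplefeature", "/model/encoder", "/model/decoder", "/postprocessor")
for layer in network:
    if any(s in layer.name for s in scopes):
        layer.precision = trt.float32
        for i in range(layer.num_outputs):
            layer.set_output_type(i, trt.float32)

serialized = builder.build_serialized_network(network, config)
with open("model.engine", "wb") as f:
    f.write(serialized)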
some verbose log:
[V] Loaded Module: polygraphy | Version: 0.49.0 | Path: ['/usr/local/lib/python3.10/dist-packages/polygraphy']
[V] Loaded Module: tensorrt | Version: 8.6.2 | Path: ['/usr/lib/python3.10/dist-packages/tensorrt']
[V] [MemUsageChange] Init CUDA: CPU +12, GPU +0, now: CPU 41, GPU 9820 (MiB)
[V] [MemUsageChange] Init builder kernel library: CPU +1154, GPU +1042, now: CPU 1231, GPU 10902 (MiB)
[V] ----------------------------------------------------------------
[V] Input filename: /home/root/code/code/dev/detr_tensorrt/dinov2det/dinov2-small-rtdetr-966-546-op16-ep351-sim.onnx
[V] ONNX IR version: 0.0.8
[V] Opset version: 16
[V] Producer name: pytorch
[V] Producer version: 2.0.0
[V] Domain:
[V] Model version: 0
[V] Doc string:
[V] Setting TensorRT Optimization Profiles
[V] Input tensor: images (dtype=DataType.FLOAT, shape=(1, 3, 546, 966)) | Setting input tensor shapes to: (min=[1, 3, 546, 966], opt=[1, 3, 546, 966], max=[1, 3, 546, 966])
[V] Input tensor: orig_target_sizes (dtype=DataType.INT32, shape=(1, 2)) | Setting input tensor shapes to: (min=[1, 2], opt=[1, 2], max=[1, 2])
[I] Configuring with profiles:[
        Profile 0: {images [min=[1, 3, 546, 966], opt=[1, 3, 546, 966], max=[1, 3, 546, 966]], orig_target_sizes [min=[1, 2], opt=[1, 2], max=[1, 2]]}
    ]
[I] Building engine with configuration:
    Flags               | [FP16, OBEY_PRECISION_CONSTRAINTS]
    Engine Capability   | EngineCapability.DEFAULT
    Memory Pools        | [WORKSPACE: 15388.48 MiB, TACTIC_DRAM: 13765.00 MiB]
    Tactic Sources      | [CUBLAS, CUBLAS_LT, CUDNN, EDGE_MASK_CONVOLUTIONS, JIT_CONVOLUTIONS]
    Profiling Verbosity | ProfilingVerbosity.DETAILED
    Preview Features    | [FASTER_DYNAMIC_SHAPES_0805, DISABLE_EXTERNAL_TACTIC_SOURCES_FOR_CORE_0805]
[V] Graph optimization time: 0.308624 seconds.
[V] Global timing cache in use. Profiling results in this builder pass will be stored.
[V] Detected 2 inputs and 3 output network tensors.
[V] Total Host Persistent Memory: 242640
[V] Total Device Persistent Memory: 61440
[V] Total Scratch Memory: 38880768
[V] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 62 MiB, GPU 452 MiB
[V] [BlockAssignment] Started assigning block shifts. This will take 132 steps to complete.
[V] [BlockAssignment] Algorithm ShiftNTopDown took 16.181ms to assign 9 blocks to 132 nodes requiring 69562880 bytes.
[V] Total Activation Memory: 69562880
[W] TensorRT encountered issues when converting weights between types and that could affect accuracy.
[W] If this is not the desired behavior, please modify the weights or retrain with regularization to adjust the magnitude of the weights.
[W] Check verbose logs for the list of affected weights.
[W] - 1 weights are affected by this issue: Detected FP32 infinity values and converted them to corresponding FP16 infinity.
[W] - 218 weights are affected by this issue: Detected subnormal FP16 values.
[W] - 69 weights are affected by this issue: Detected values less than smallest positive FP16 subnormal value and converted them to the FP16 minimum subnormalized value.
[W] - 6 weights are affected by this issue: Detected finite FP32 values which would overflow in FP16 and converted them to the closest finite FP16 value.
[V] [MemUsageChange] TensorRT-managed allocation in building engine: CPU +50, GPU +128, now: CPU 50, GPU 128 (MiB)
[I] Finished engine building in 743.236 seconds
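The weight-conversion warnings above can be cross-checked on the ONNX file itself. A minimal sketch (assuming model.onnx is the same file passed to polygraphy and fits in memory) that flags initializers with values outside the normal FP16 range:

import numpy as np
import onnx
from onnx import numpy_helper

FP16_MAX = 65504.0            # largest finite FP16 value
FP16_MIN_NORMAL = 6.103515625e-05  # smallest positive normal FP16 value

model = onnx.load("model.onnx")
for init in model.graph.initializer:
    w = numpy_helper.to_array(init).astype(np.float64, copy=False)
    if w.size == 0:
        continue
    amax = np.abs(w).max()
    nonzero = np.abs(w[w != 0])
    if amax > FP16_MAX:
        print(f"{init.name}: max |w| = {amax:.4g} would overflow FP16")
    elif nonzero.size and nonzero.min() < FP16_MIN_NORMAL:
        print(f"{init.name}: contains values that become subnormal (or underflow) in FP16")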
Show layers:
polygraphy inspect model.engine --model-type engine --show layers > log.txt
[I] ==== TensorRT Engine ====
    Name: Unnamed Network 0 | Explicit Batch Engine
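As a cross-check on the polygraphy inspect output, the per-layer information stored in the engine can also be dumped with TensorRT's engine inspector. This only reports detailed layer metadata because the engine was built with ProfilingVerbosity.DETAILED (visible in the build log above); a rough sketch:

import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(logger)
with open("model.engine", "rb") as f:
    engine = runtime.deserialize_cuda_engine(f.read())

inspector = engine.create_engine_inspector()
# One line of metadata per layer in the built engine.
for i in range(engine.num_layers):
    print(inspector.get_layer_information(i, trt.LayerInformationFormat.ONELINE))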
Environment

Hardware: Orin NX 16G
TensorRT Version: 8.6
Docker: dustynv/l4t-pytorch:r36.2.0