NVIDIA / TensorRT

NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.
https://developer.nvidia.com/tensorrt
Apache License 2.0

TensorRT 10 slower than TensorRT 8.6 for models with Instance Normalization layers #3962

Open · david-PHR opened 3 months ago

david-PHR commented 3 months ago

Description

After migrating my backend to TensorRT 10, I've noticed that some models are slower with TensorRT 10. The issue seems to come from how some InstanceNormalization layers are mapped: they no longer use the InstanceNormalization plugin.

Here are the logs for one layer returned by TensorRT before and after the migration:

With TensorRT 8.6:

```
[06/24/2024-10:18:03] [V] [TRT] Parsing node: /encoder/0/resnets.0/norm1/InstanceNormalization [InstanceNormalization]
[06/24/2024-10:18:03] [V] [TRT] Searching for input: /encoder/0/resnets.0/norm1/Reshape_output_0
[06/24/2024-10:18:03] [V] [TRT] Searching for input: /encoder/0/resnets.0/norm1/Constant_1_output_0
[06/24/2024-10:18:03] [V] [TRT] Searching for input: /encoder/0/resnets.0/norm1/Constant_2_output_0
[06/24/2024-10:18:03] [V] [TRT] /encoder/0/resnets.0/norm1/InstanceNormalization [InstanceNormalization] inputs: [/encoder/0/resnets.0/norm1/Reshape_output_0 -> (1, 32, -1)[FLOAT]], [/encoder/0/resnets.0/norm1/Constant_1_output_0 -> (32)[FLOAT]], [/encoder/0/resnets.0/norm1/Constant_2_output_0 -> (32)[FLOAT]],
[06/24/2024-10:18:03] [V] [TRT] Original shape: (1, 32, _), unsqueezing to: (_, _, _, _)
[06/24/2024-10:18:03] [V] [TRT] Local registry did not find InstanceNormalization_TRT creator. Will try parent registry if enabled.
[06/24/2024-10:18:03] [V] [TRT] Global registry found InstanceNormalization_TRT creator.
[06/24/2024-10:18:03] [V] [TRT] Registering layer: /encoder/0/resnets.0/norm1/InstanceNormalization for ONNX node: /encoder/0/resnets.0/norm1/InstanceNormalization
[06/24/2024-10:18:03] [V] [TRT] Original shape: (1, 32, _, 1), squeezing to: (_, _, _)
[06/24/2024-10:18:03] [V] [TRT] Registering tensor: /encoder/0/resnets.0/norm1/InstanceNormalization_output_0 for ONNX tensor: /encoder/0/resnets.0/norm1/InstanceNormalization_output_0
[06/24/2024-10:18:03] [V] [TRT] /encoder/0/resnets.0/norm1/InstanceNormalization [InstanceNormalization] outputs: [/encoder/0/resnets.0/norm1/InstanceNormalization_output_0 -> (1, 32, -1)[FLOAT]],
```

With TensorRT 10:

```
[06/24/2024-10:15:27] [V] [TRT] /encoder/0/resnets.0/norm1/InstanceNormalization [InstanceNormalization] inputs: [/encoder/0/resnets.0/norm1/Reshape_output_0 -> (1, 32, -1)[FLOAT]], [/encoder/0/resnets.0/norm1/Constant_1_output_0 -> (32)[FLOAT]], [/encoder/0/resnets.0/norm1/Constant_2_output_0 -> (32)[FLOAT]],
[06/24/2024-10:15:27] [V] [TRT] Registering layer: /encoder/0/resnets.0/norm1/Constant_1_output_0 required by ONNX-TRT
[06/24/2024-10:15:27] [V] [TRT] Registering layer: /encoder/0/resnets.0/norm1/Constant_2_output_0 required by ONNX-TRT
[06/24/2024-10:15:27] [V] [TRT] Original shape: (32,), unsqueezing to: (1, 32, 1)
[06/24/2024-10:15:27] [V] [TRT] Registering layer: ONNXTRT_ShapeShuffle_0 required by ONNX-TRT
[06/24/2024-10:15:27] [V] [TRT] Registering layer: ONNXTRT_unsqueezeTensor required by ONNX-TRT
[06/24/2024-10:15:27] [V] [TRT] Original shape: (32,), unsqueezing to: (1, 32, 1)
[06/24/2024-10:15:27] [V] [TRT] Registering layer: ONNXTRT_ShapeShuffle_1 required by ONNX-TRT
[06/24/2024-10:15:27] [V] [TRT] Registering layer: ONNXTRT_unsqueezeTensor_2 required by ONNX-TRT
[06/24/2024-10:15:27] [V] [TRT] Registering layer: /encoder/0/resnets.0/norm1/InstanceNormalization for ONNX node: /encoder/0/resnets.0/norm1/InstanceNormalization
[06/24/2024-10:15:27] [V] [TRT] Registering tensor: /encoder/0/resnets.0/norm1/InstanceNormalization_output_0 for ONNX tensor: /encoder/0/resnets.0/norm1/InstanceNormalization_output_0
[06/24/2024-10:15:27] [V] [TRT] /encoder/0/resnets.0/norm1/InstanceNormalization [InstanceNormalization] outputs: [/encoder/0/resnets.0/norm1/InstanceNormalization_output_0 -> (1, 32, -1)[FLOAT]],
```
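For anyone reproducing the comparison, verbose parser logs like the ones above (plus per-layer timings to confirm which layers regressed) can be captured with trtexec. A minimal sketch; the model path is a placeholder:

```sh
# --verbose prints the ONNX parser decisions quoted above;
# --dumpProfile/--separateProfileRun report per-layer latencies after the build.
trtexec --onnx=model.onnx \
        --verbose \
        --dumpProfile --separateProfileRun \
        > trt_build_profile.log 2>&1
```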

Any ideas on how to bring back the InstanceNormalization plugin so that I can recover the expected performance?

Environment

TensorRT Version:

NVIDIA GPU:

NVIDIA Driver Version:

CUDA Version:

CUDNN Version:

Operating System:

Python Version (if applicable):

Tensorflow Version (if applicable):

PyTorch Version (if applicable):

Baremetal or Container (if so, version):

Relevant Files

Model link:

Steps To Reproduce

Commands or scripts:

Have you tried the latest release?:

Can this model run on other frameworks? For example run ONNX model with ONNXRuntime (polygraphy run <model.onnx> --onnxrt):

lix19937 commented 3 months ago

If you use the trtexec tool, please add `--builderOptimizationLevel=5`.
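For reference, a full build command with that option might look like this (model and engine paths are placeholders):

```sh
# Level 5 is the slowest to build but gives the optimizer the most freedom.
trtexec --onnx=model.onnx \
        --builderOptimizationLevel=5 \
        --saveEngine=model_ol5.plan
```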

david-PHR commented 3 months ago

Adding `--builderOptimizationLevel=5` produces these errors at inference time:

```
assert self.context.execute_v2(bindings=bindings), "failure during execution of inference"
AssertionError: failure during execution of inference
[06/24/2024-13:04:14] [TRT] [E] 1: [graphContext.h::~MyelinGraphContext::72] Error Code 1: Myelin ([impl.cpp:cuda_object_deallocate:474] Error 716 destroying stream '0x5f2fc7bc29b0'.)
[06/24/2024-13:04:14] [TRT] [E] 1: [graphContext.h::~MyelinGraphContext::72] Error Code 1: Myelin ([impl.cpp:cuda_object_deallocate:474] Error 716 destroying stream '0x5f2fc7d114c0'.)
[06/24/2024-13:04:14] [TRT] [E] 1: [graphContext.h::~MyelinGraphContext::72] Error Code 1: Myelin ([impl.cpp:cuda_object_deallocate:474] Error 716 destroying stream '0x5f2fc7d8e390'.)
[06/24/2024-13:04:14] [TRT] [E] 1: [graphContext.h::~MyelinGraphContext::72] Error Code 1: Myelin ([impl.cpp:cuda_object_deallocate:474] Error 716 destroying stream '0x5f2fc7deb0b0'.)
[06/24/2024-13:04:14] [TRT] [E] 1: [graphContext.h::~MyelinGraphContext::72] Error Code 1: Myelin ([impl.cpp:cuda_object_deallocate:474] Error 716 destroying stream '0x5f2fc7e540b0'.)
[06/24/2024-13:04:14] [TRT] [E] 1: [graphContext.h::~MyelinGraphContext::72] Error Code 1: Myelin ([impl.cpp:cuda_object_deallocate:474] Error 716 destroying stream '0x5f2fc7eb1720'.)
[06/24/2024-13:04:14] [TRT] [E] 1: [graphContext.h::~MyelinGraphContext::72] Error Code 1: Myelin ([impl.cpp:cuda_object_deallocate:474] Error 716 destroying stream '0x5f2fc7f0f140'.)
[06/24/2024-13:04:14] [TRT] [E] 1: [graphContext.h::~MyelinGraphContext::72] Error Code 1: Myelin ([impl.cpp:cuda_object_deallocate:474] Error 716 destroying stream '0x5f2fc7f6cc10'.)
[06/24/2024-13:04:14] [TRT] [E] 1: [graphContext.h::~MyelinGraphContext::72] Error Code 1: Myelin ([impl.cpp:cuda_object_deallocate:474] Error 716 destroying stream '0x5f2fc7f7a410'.)
[06/24/2024-13:04:14] [TRT] [E] 1: [graphContext.h::~MyelinGraphContext::72] Error Code 1: Myelin ([impl.cpp:cuda_object_deallocate:474] Error 716 destroying stream '0x5f2fc80437b0'.)
[06/24/2024-13:04:14] [TRT] [E] 1: [graphContext.h::~MyelinGraphContext::72] Error Code 1: Myelin ([impl.cpp:cuda_object_deallocate:474] Error 716 destroying stream '0x5f2fc80abde0'.)
[06/24/2024-13:04:14] [TRT] [E] 1: [graphContext.h::~MyelinGraphContext::72] Error Code 1: Myelin ([impl.cpp:cuda_object_deallocate:474] Error 716 destroying stream '0x5f2fc8114220'.)
[06/24/2024-13:04:14] [TRT] [E] 1: [graphContext.h::~MyelinGraphContext::72] Error Code 1: Myelin ([impl.cpp:cuda_object_deallocate:474] Error 716 destroying stream '0x5f2fc8185360'.)
[06/24/2024-13:04:14] [TRT] [E] 1: [multiStreamContext.cpp::maybeDestroyAuxStream::263] Error Code 1: Cuda Runtime (misaligned address)
[06/24/2024-13:04:14] [TRT] [E] 1: [defaultAllocator.cpp::deallocate::52] Error Code 1: Cuda Runtime (misaligned address)
[06/24/2024-13:04:14] [TRT] [E] 1: [defaultAllocator.cpp::deallocate::52] Error Code 1: Cuda Runtime (misaligned address)
```
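For context, the failing assertion wraps the standard synchronous execution call. A minimal sketch of that pattern (the engine path and the device-pointer `bindings` list are placeholders supplied by the caller):

```python
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def infer(engine_path, bindings):
    # Deserialize a prebuilt engine (path is a placeholder).
    with open(engine_path, "rb") as f:
        engine = trt.Runtime(TRT_LOGGER).deserialize_cuda_engine(f.read())
    context = engine.create_execution_context()
    # `bindings` is a list of device pointers, one per I/O tensor.
    # execute_v2 returns False on failure, which triggers the assertion above.
    assert context.execute_v2(bindings=bindings), "failure during execution of inference"
```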

However, with level 4 it seems better. Note that I also found the trtexec argument `--pluginInstanceNorm`. There is an issue when running a model compiled that way (with PluginInstanceNorm) in the nvcr.io/nvidia/pytorch:24.05-py3 container: from these lines and these lines, the expected cuDNN major version is 8, but this image comes with cuDNN major version 9 pre-installed. I solved it with this ugly symlink, which is par for the course with TensorRT: `ln -s /usr/lib/x86_64-linux-gnu/libcudnn.so.9 /usr/lib/x86_64-linux-gnu/libcudnn.so.8`. Is there any restriction on using cuDNN 9 here? Everything seems to work well.
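As an aside, the API-level equivalent of trtexec's `--pluginInstanceNorm` appears to be clearing the parser's NATIVE_INSTANCENORM flag (native import became the default in TensorRT 10). A hedged sketch in Python, assuming the bindings mirror the C++ `IParser::clearFlag` API; the model path is a placeholder:

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(0)
parser = trt.OnnxParser(network, logger)

# Assumption: clearing NATIVE_INSTANCENORM maps InstanceNormalization nodes
# back to the InstanceNormalization_TRT plugin, as --pluginInstanceNorm does.
parser.clear_flag(trt.OnnxParserFlag.NATIVE_INSTANCENORM)

with open("model.onnx", "rb") as f:  # placeholder path
    assert parser.parse(f.read()), parser.get_error(0)
```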

david-PHR commented 3 months ago

I forgot to share this line too, which magically sets the CUDNN_MAJOR version in the cuDNN wrapper code.

lix19937 commented 3 months ago

Yes, you can relink to cudnn.

> Adding `--builderOptimizationLevel=5` produces these errors at inference time:

Can you upload the full log here? @david-PHR