NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.
Apache License 2.0

GroupNormalization plugin failure of TensorRT when running trtexec on GPU A4000 #3950

Open appearancefnp opened 2 weeks ago

appearancefnp commented 2 weeks ago


Hey guys! I wanted to upgrade from TensorRT 8.6 to 10.0. I have an ONNX model that contains the GroupNormalization plugin. trtexec builds and serializes the engine, but it fails when deserializing the model because it tries to load cudnn 8 instead of cudnn 9.


Using docker: nvcr.io/nvidia/tensorrt:24.05-py3

TensorRT Version: 10.0.1


NVIDIA Driver Version: 550.67

CUDA Version: 12.4

CUDNN Version: 9.1 (per container documentation)

Operating System:

Python Version (if applicable): -

Tensorflow Version (if applicable): -

PyTorch Version (if applicable): -

Baremetal or Container (if so, version): nvcr.io/nvidia/tensorrt:24.05-py3

Relevant Files

Model link: https://drive.google.com/file/d/1vmGZpWJ_1sfz2ejbZoO3fFaR5udxOLTi/view?usp=sharing

Steps To Reproduce

  1. Run trtexec: trtexec --onnx=model.onnx
  2. trtexec builds the engine
    [06/17/2024-14:57:28] [I] [TRT] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 3 MiB, GPU 1984 MiB
    [06/17/2024-14:57:28] [I] [TRT] [MemUsageStats] Peak memory usage during Engine building and serialization: CPU: 3059 MiB
    [06/17/2024-14:57:28] [I] Engine built in 886.712 sec.
    [06/17/2024-14:57:28] [I] Created engine with size: 55.3649 MiB
    [06/17/2024-14:57:28] [I] [TRT] Loaded engine size: 55 MiB
    [06/17/2024-14:57:28] [I] Engine deserialized in 0.0301295 sec.
    [06/17/2024-14:57:28] [F] [TRT] Validation failed: Failed to load libcudnn.so.8.

    [06/17/2024-14:57:28] [E] [TRT] std::exception
    [06/17/2024-14:57:28] [F] [TRT] Validation failed: Failed to load libcudnn.so.8. plugin/common/cudnnWrapper.cpp:90
    [06/17/2024-14:57:28] [E] [TRT] std::exception
    [06/17/2024-14:57:28] [F] [TRT] Validation failed: Failed to load libcudnn.so.8. plugin/common/cudnnWrapper.cpp:90
    [06/17/2024-14:57:28] [E] [TRT] std::exception
    [06/17/2024-14:57:28] [F] [TRT] Validation failed: Failed to load libcudnn.so.8. plugin/common/cudnnWrapper.cpp:90
    [06/17/2024-14:57:28] [E] [TRT] std::exception
    [06/17/2024-14:57:28] [F] [TRT] Validation failed: Failed to load libcudnn.so.8. plugin/common/cudnnWrapper.cpp:90
    ...
    [06/17/2024-14:57:28] [E] [TRT] std::exception
    [06/17/2024-14:57:28] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +1, GPU +156, now: CPU 1, GPU 199 (MiB)
    [06/17/2024-14:57:28] [I] Setting persistentCacheLimit to 0 bytes.
    [06/17/2024-14:57:28] [I] Created execution context with device memory size: 155.537 MiB
    [06/17/2024-14:57:28] [I] Using random values for input images
    [06/17/2024-14:57:28] [I] Input binding for images with dimensions 1x500x1000x3 is created.
    [06/17/2024-14:57:28] [I] Output binding for class_heatmaps with dimensions 1x5x125x250 is created.
    [06/17/2024-14:57:28] [I] Starting inference
    [06/17/2024-14:57:28] [F] [TRT] Validation failed: mBnScales != nullptr && mBnScales->mPtr != nullptr plugin/groupNormalizationPlugin/groupNormalizationPlugin.cpp:132
    [06/17/2024-14:57:28] [E] [TRT] std::exception
    [06/17/2024-14:57:28] [E] Error[2]: [pluginV2DynamicExtRunner.cpp::execute::115] Error Code 2: Internal Error (Assertion pluginUtils::isSuccess(status) failed. )
    [06/17/2024-14:57:28] [E] Error occurred during inference

**Commands or scripts**:  
trtexec --onnx=model.onnx

**Have you tried [the latest release](https://developer.nvidia.com/tensorrt)?**: yes

lix19937 commented 1 week ago

Can you upload the full log from trtexec --onnx=model.onnx --verbose?

appearancefnp commented 1 week ago

@lix19937 trtexec.log

lix19937 commented 1 week ago

[06/17/2024-14:57:28] [E] [TRT] std::exception [06/17/2024-14:57:28] [F] [TRT] Validation failed: Failed to load libcudnn.so.8. plugin/common/cudnnWrapper.cpp:90

Make sure libcudnn.so loads successfully. Add its path to LD_LIBRARY_PATH.

appearancefnp commented 2 days ago

> [06/17/2024-14:57:28] [E] [TRT] std::exception [06/17/2024-14:57:28] [F] [TRT] Validation failed: Failed to load libcudnn.so.8. plugin/common/cudnnWrapper.cpp:90
>
> Make sure libcudnn.so loads successfully. Add its path to LD_LIBRARY_PATH.

The problem is that the NVIDIA container ships cudnn 9.1.0, but the plugin is trying to load libcudnn.so.8. This is a version mismatch, not a missing cudnn installation.
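To confirm which cuDNN major versions the dynamic loader can actually find inside the container, a small probe like the one below may help. This is only a sketch: `probe_sonames` is a hypothetical helper name, and it checks whether a shared object can be opened, not whether the plugin would work with it.

```python
import ctypes


def probe_sonames(sonames):
    """Return {soname: bool} telling whether the dynamic loader can open each library."""
    results = {}
    for name in sonames:
        try:
            # CDLL uses the normal loader search path (LD_LIBRARY_PATH, ld.so cache, ...).
            ctypes.CDLL(name)
            results[name] = True
        except OSError:
            # Raised when the library is missing or cannot be loaded.
            results[name] = False
    return results


if __name__ == "__main__":
    for name, ok in probe_sonames(["libcudnn.so.8", "libcudnn.so.9"]).items():
        print(f"{name}: {'loadable' if ok else 'not loadable'}")
```

If only libcudnn.so.9 is reported as loadable, that would be consistent with the version mismatch described above rather than with a wrong LD_LIBRARY_PATH.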

lix19937 commented 2 days ago

You should make sure your environment has only one cudnn installed. Also, why does your nvinfer plugin load cudnn 8?

appearancefnp commented 1 day ago

This is not my plugin - this is the plugin provided in this repo - https://github.com/NVIDIA/TensorRT/tree/release/10.1/plugin/groupNormalizationPlugin

It loads cudnn 8 instead of 9 because the wrong macro is defined here: https://github.com/NVIDIA/TensorRT/blob/release/10.1/plugin/common/cudnnWrapper.cpp#L26

lix19937 commented 1 day ago

According to https://github.com/NVIDIA/TensorRT/tree/release/10.0, the recommended TensorRT and cudnn versions are as follows:

TensorRT GA build

- TensorRT v10.0.1.6
  - Available from direct download links listed below

System Packages

- CUDA
  - Recommended versions:
  - cuda-12.2.0 + cuDNN-8.9
  - cuda-11.8.0 + cuDNN-8.9
- GNU make >= v4.1
- cmake >= v3.13
- python >= v3.8, <= v3.10.x
- pip >= v19.0
- Essential utilities
  - git, pkg-config, wget

This corresponds to https://github.com/NVIDIA/TensorRT/blob/release/10.1/plugin/common/cudnnWrapper.cpp#L26-L42

You can try creating a symbolic link: ln -s libcudnn.so.9 libcudnn.so.8.
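The symbolic link suggestion could be scripted roughly as follows. This is a sketch of the workaround only, not a verified fix: cudnn 8 and cudnn 9 are different major versions, so the plugin may still fail after the library loads. `make_soname_alias` is a hypothetical helper, and the directory passed to it is an assumption; it has to be wherever libcudnn.so.9 actually lives in your container.

```python
import os


def make_soname_alias(lib_dir, target="libcudnn.so.9", alias="libcudnn.so.8"):
    """Create lib_dir/alias as a relative symlink to target.

    Returns True if the link was created, False if alias already exists.
    """
    target_path = os.path.join(lib_dir, target)
    alias_path = os.path.join(lib_dir, alias)
    if not os.path.exists(target_path):
        raise FileNotFoundError(target_path)
    if os.path.lexists(alias_path):
        # Do not overwrite a real cudnn 8 library or an existing link.
        return False
    # Relative link, equivalent to running `ln -s libcudnn.so.9 libcudnn.so.8` inside lib_dir.
    os.symlink(target, alias_path)
    return True
```

Refusing to overwrite an existing libcudnn.so.8 keeps the helper from clobbering a genuine cudnn 8 install if one is present.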

appearancefnp commented 2 hours ago

Why does the container include cudnn 9 then? https://docs.nvidia.com/deeplearning/tensorrt/container-release-notes/index.html#rel-24-06

If TensorRT doesn't work in an NVIDIA container with cudnn 9, why does it ship with it?