geraldstanje opened this issue 3 months ago
Hi, if you want to modify your onnx model, onnx-graphsurgeon is probably your best bet. Examples section shows how to use it.
Btw, the source for onnx-tensorrt is open here. Given that you're using a slightly older version of TRT, the line numbers (and even filenames) may not exactly match. But, you can see a couple of places where such errors are logged. :)
[E] [TRT] onnx2trt_utils.cpp:748: Found unsupported datatype (8) classes
Found unsupported datatype (8) classes? see: https://github.com/huggingface/setfit/blob/main/src/setfit/exporters/onnx.py#L66C82-L66C97
@lix19937 @brb-nv @pranavm-nvidia @sachanub
Hi, if you're open to sharing the onnx file, please consider doing so. Sorry, I'm not familiar with the term 'opcode'. Could you point me to something that'll help me understand?
In the meantime, you can also dig deeper using these pointers:
From what I can tell looking at the verbose log, TRT encountered a model weight (initializer) named classes
of unsupported datatype (possibly a string but I could be wrong). To start with, you can:
1) Open the onnx model in netron
2) Locate the offending tensor and its type (string or something else?)
3) Also, note which op this tensor is a part of [reference]
onnx-graphsurgeon is something you could open your onnx model with (just like netron, but from your command shell), print the model, and make tweaks to it [example]. You could probably narrow down to the classes tensor from the model print-out, so consider giving it a shot.
I'll await your observations.
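For example, a minimal inspection sketch along those lines (this assumes the file is named model.onnx and the offending initializer is literally named classes; adjust to your model):

```python
import onnx
import onnx_graphsurgeon as gs

graph = gs.import_onnx(onnx.load("model.onnx"))

# Print every initializer (constant tensor) with its dtype and shape so the
# unsupported one (e.g. a string tensor named "classes") stands out.
for name, tensor in graph.tensors().items():
    if isinstance(tensor, gs.Constant):
        print(name, tensor.dtype, tensor.shape)
```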
Here is the requested info:
As far as I understand, I need to convert a PyTorch model (I'm using Hugging Face Sentence Transformers: https://github.com/huggingface/setfit) to ONNX before using TensorRT. Is it correct that I need to do: PyTorch -> ONNX -> TensorRT?
"Locate the offending tensor and its type (string or something else?)"
I visualized the model using netron - here is what classes looks like:
Here we see that label is a string:
If strings are not supported, what should I do given that classes is a string? label is a string produced by this lib and is also output to ONNX: https://github.com/huggingface/setfit/blob/main/src/setfit/exporters/onnx.py#L183C5-L183C16
model.onnx is here (added via Git LFS): https://github.com/geraldstanje/onnx_model/
Which Docker image should I use for polygraphy? I have CUDA 12 installed, but polygraphy looks for a CUDA 11.x lib: libcublasLt.so.11. I currently use: nvcr.io/nvidia/pytorch:24.03-py3
[I] RUNNING | Command: /usr/local/bin/polygraphy run model.onnx --onnxrt --providers CUDAExecutionProvider
[I] onnxrt-runner-N0-05/24/24-00:42:23 | Activating and starting inference
[I] Creating ONNX-Runtime Inference Session with providers: ['CUDAExecutionProvider']
2024-05-24 00:42:23.653071659 [E:onnxruntime:Default, provider_bridge_ort.cc:1744 TryGetProviderInfo_CUDA] /onnxruntime_src/onnxruntime/core/session/provider_bridge_ort.cc:1426 onnxruntime::Provider& onnxruntime::ProviderLibrary::Get() [ONNXRuntimeError] : 1 : FAIL : Failed to load library libonnxruntime_providers_cuda.so with error: libcublasLt.so.11: cannot open shared object file: No such file or directory
Can I use https://github.com/NVIDIA/TensorRT/blob/master/tools/onnx-graphsurgeon/examples/06_removing_nodes/remove.py to remove the string node?
@lix19937 @brb-nv @pranavm-nvidia @sachanub
@brb-nv Hi, I've met a similar problem, could you please help me? Thanks a lot! I use trtexec to perform INT8 calibration and quantization like this:
trtexec \
--onnx=onnx_model/model.onnx \
--minShapes=xs:1x1120,xlen:1 \
--optShapes=xs:1x160000,xlen:1 \
--maxShapes=xs:1x480000,xlen:1 \
--minShapesCalib=xs:1x1120,xlen:1 \
--optShapesCalib=xs:1x160000,xlen:1 \
--maxShapesCalib=xs:1x480000,xlen:1 \
--workspace=20480 \
--int8 \
--calib=model_calibration.cache \
--saveEngine=trt_model/model-INT8.plan \
--verbose \
--buildOnly
The calibration cache file was generated with the polygraphy API. But when I run the command above, it gives the following error:
[05/24/2024-13:58:26] [I] [TRT] CPLX_M_rfftrfft__333: broadcasting input1 to make tensors conform, dims(input0)=[2,257,512][NONE] dims(input1)=[1,512,-1][NONE].
[05/24/2024-13:58:26] [I] [TRT] CPLX_M_rfftrfft__333: broadcasting input1 to make tensors conform, dims(input0)=[2,257,512][NONE] dims(input1)=[1,512,-1][NONE].
[05/24/2024-13:58:26] [I] Finish parsing network model
[05/24/2024-13:58:26] [I] FP32 and INT8 precisions have been specified - more performance might be enabled by additionally specifying --fp16 or --best
[05/24/2024-13:58:26] [I] [TRT] CPLX_M_rfftrfft__333: broadcasting input1 to make tensors conform, dims(input0)=[2,257,512][NONE] dims(input1)=[1,512,-1][NONE].
[05/24/2024-13:58:27] [I] [TRT] Calibration table does not match calibrator algorithm type.
[05/24/2024-13:58:28] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +840, GPU +362, now: CPU 1921, GPU 7055 (MiB)
[05/24/2024-13:58:28] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +128, GPU +60, now: CPU 2049, GPU 7115 (MiB)
[05/24/2024-13:58:28] [I] [TRT] Timing cache disabled. Turning it on will improve builder speed.
[05/24/2024-13:58:31] [I] [TRT] Detected 2 inputs and 1 output network tensors.
[05/24/2024-13:58:31] [I] [TRT] Total Host Persistent Memory: 58640
[05/24/2024-13:58:31] [I] [TRT] Total Device Persistent Memory: 0
[05/24/2024-13:58:31] [I] [TRT] Total Scratch Memory: 4194304
[05/24/2024-13:58:31] [I] [TRT] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 132 MiB, GPU 384 MiB
[05/24/2024-13:58:41] [I] [TRT] [BlockAssignment] Algorithm ShiftNTopDown took 10032.1ms to assign 80 blocks to 1209 nodes requiring 17717760 bytes.
[05/24/2024-13:58:41] [I] [TRT] Total Activation Memory: 17717760
[05/24/2024-13:58:41] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 2913, GPU 7775 (MiB)
[05/24/2024-13:58:41] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +1, GPU +10, now: CPU 2914, GPU 7785 (MiB)
[05/24/2024-13:58:41] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +10, now: CPU 2913, GPU 7761 (MiB)
[05/24/2024-13:58:41] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 2913, GPU 7769 (MiB)
[05/24/2024-13:58:41] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +16, now: CPU 130, GPU 272 (MiB)
[05/24/2024-13:58:41] [I] [TRT] Starting Calibration.
[05/24/2024-13:58:41] [E] Error[1]: [calibrator.cpp::add::758] Error Code 1: Cuda Runtime (an illegal memory access was encountered)
[05/24/2024-13:58:41] [E] Error[1]: [executionContext.cpp::commonEmitDebugTensor::1264] Error Code 1: Cuda Runtime (an illegal memory access was encountered)
[05/24/2024-13:58:41] [E] Error[1]: [executionContext.cpp::commonEmitDebugTensor::1297] Error Code 1: Cuda Runtime (an illegal memory access was encountered)
[05/24/2024-13:58:41] [E] Error[1]: [executionContext.cpp::executeInternal::626] Error Code 1: Cuda Runtime (an illegal memory access was encountered)
[05/24/2024-13:58:41] [E] Error[1]: [resizingAllocator.cpp::deallocate::105] Error Code 1: Cuda Runtime (an illegal memory access was encountered)
[05/24/2024-13:58:41] [E] Error[1]: [resizingAllocator.cpp::deallocate::105] Error Code 1: Cuda Runtime (an illegal memory access was encountered)
[05/24/2024-13:58:41] [E] Error[1]: [resizingAllocator.cpp::deallocate::105] Error Code 1: Cuda Runtime (an illegal memory access was encountered)
[05/24/2024-13:58:41] [E] Error[1]: [resizingAllocator.cpp::deallocate::105] Error Code 1: Cuda Runtime (an illegal memory access was encountered)
[05/24/2024-13:58:41] [E] Error[1]: [cudaDriverHelpers.cpp::operator()::29] Error Code 1: Cuda Driver (an illegal memory access was encountered)
[05/24/2024-13:58:41] [E] Error[1]: [cudaDriverHelpers.cpp::operator()::29] Error Code 1: Cuda Driver (an illegal memory access was encountered)
[05/24/2024-13:58:41] [E] Error[1]: [cudaDriverHelpers.cpp::operator()::29] Error Code 1: Cuda Driver (an illegal memory access was encountered)
[05/24/2024-13:58:42] [E] Error[2]: [calibrator.cpp::calibrateEngine::1160] Error Code 2: Internal Error (Assertion context->executeV2(&bindings[0]) failed. )
[05/24/2024-13:58:42] [E] Error[2]: [builder.cpp::buildSerializedNetwork::636] Error Code 2: Internal Error (Assertion engine != nullptr failed. )
[05/24/2024-13:58:42] [E] Engine could not be created from network
[05/24/2024-13:58:42] [E] Cuda failure: an illegal memory access was encountered
Aborted (core dumped)
What's wrong with it? In fact, I have tried to quantize directly with the polygraphy API, but the engine size is not as small as we expected, only going from 156M (FP32 engine) to 95M; what's more, the inference speed shows no improvement compared with the FP32 engine. But when I tried to generate an INT8 engine with the trtexec tool (with no calibration), the engine size became 51M, so I want to perform INT8 calibration and quantization with the trtexec tool to check again. So, can I use the trtexec tool and the calibration cache file to achieve this goal? If it works, what's wrong with my code? I'm looking forward to your help, many thanks.
Hi @geraldstanje
As far as I understand, I need to convert a PyTorch model (I'm using Hugging Face Sentence Transformers: https://github.com/huggingface/setfit) to ONNX before using TensorRT. Is it correct that I need to do: PyTorch -> ONNX -> TensorRT?
Yes, TRT’s primary means of importing a trained model from a framework is through the ONNX interchange format.
Yes, you'll need to remove the unsupported part of the network for the engine to be built. Also, I don't think ArrayFeatureExtractor is supported by TRT right now. So, you're probably better off removing everything after the ArgMax op and implementing it yourself outside the model definition. The graphsurgeon example you pointed to is the most relevant one. :)
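For example, a minimal post-processing sketch outside the model (assuming the clipped graph outputs the ArgMax index plus the Softmax probabilities, and that the removed ArrayFeatureExtractor only mapped that index to a string label; the class names below are hypothetical):

```python
import numpy as np

# Hypothetical class names; in the original model these lived in the string
# "classes" initializer that TRT cannot handle.
CLASSES = np.array(["negative", "positive"])

def postprocess(argmax_index: np.ndarray, probabilities: np.ndarray):
    # Reproduce the ArrayFeatureExtractor step outside the engine:
    # map the predicted index back to its string label.
    label = CLASSES[argmax_index]
    return label, probabilities

# Example: index 1 with its probabilities
print(postprocess(np.array([1]), np.array([[0.1, 0.9]])))
```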
Unsure about the issue with polygraphy. Please try out our latest release.
@yjiangling kindly open a separate issue. It's quite different from OP's issue.
@geraldstanje What is your torch.onnx._export or torch.onnx.export command? Can you show it here?
@lix19937 I'm using the built-in function of the huggingface lib, which calls torch.onnx.export - see: https://github.com/huggingface/setfit/blob/main/src/setfit/exporters/onnx.py#L183; here it calls torch.onnx.export: https://github.com/huggingface/setfit/blob/main/src/setfit/exporters/onnx.py#L96-L103
Importing initializer: classes
The root cause is the ArrayFeatureExtractor op in your ONNX model; the ArrayFeatureExtractor op is not supported by TRT. You can clip the ONNX graph so that the output of the ArgMax becomes one output, while the other output (the output of the Softmax) stays unchanged.
Or you can modify the forward code (to exclude ArrayFeatureExtractor) and then re-export the ONNX model.
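A minimal clipping sketch along these lines (the tensor names "label_argmax" and "probabilities" are assumptions; check the real names in netron or by printing graph.tensors()):

```python
import onnx
import onnx_graphsurgeon as gs

graph = gs.import_onnx(onnx.load("model.onnx"))
tensors = graph.tensors()

# Make the ArgMax output and the Softmax output the new graph outputs,
# so everything after them (including ArrayFeatureExtractor) is dropped.
graph.outputs = [tensors["label_argmax"], tensors["probabilities"]]

# Remove nodes and initializers that are no longer reachable.
graph.cleanup().toposort()

onnx.save(gs.export_onnx(graph), "removed.onnx")
```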
I removed the ArrayFeatureExtractor from the onnx model - here is what it looks like now:
I took the nvcr.io/nvidia/tritonserver:24.04-py3 image and installed polygraphy via pip:
installed packages:
pip list
Package Version
------------------------ -------------
blinker 1.4
colored 2.2.4
cryptography 3.4.8
dbus-python 1.2.18
distlib 0.3.8
distro 1.7.0
filelock 3.13.4
httplib2 0.20.2
importlib-metadata 4.6.4
jeepney 0.7.1
keyring 23.5.0
launchpadlib 1.10.16
lazr.restfulclient 0.14.4
lazr.uri 1.0.6
more-itertools 8.10.0
numpy 1.26.4
nvidia-cuda-runtime-cu12 12.5.39
oauthlib 3.2.0
pip 24.0
platformdirs 4.2.0
polygraphy 0.49.9
PyGObject 3.42.1
PyJWT 2.3.0
pyparsing 2.4.7
python-apt 2.4.0+ubuntu3
SecretStorage 3.3.1
setuptools 69.5.1
six 1.16.0
tensorrt 10.0.1
tensorrt-cu12 10.0.1
tensorrt-cu12-bindings 10.0.1
tensorrt-cu12-libs 10.0.1
virtualenv 20.25.3
wadllib 1.3.6
wheel 0.43.0
zipp 1.0.0
I ran trtexec and got the following output - does that look good? trtexec.txt
I see: [05/25/2024-23:05:17] [I] Multithreading: Disabled [05/25/2024-23:05:17] [I] CUDA Graph: Disabled - why is that?
I also still have a problem here - how do I fix it?
polygraphy run model.plan --trt
[I] RUNNING | Command: /usr/local/bin/polygraphy run model.plan --trt
[I] trt-runner-N0-05/25/24-23:05:50 | Activating and starting inference
[I] Loading bytes from /models/model.plan
[E] 1: [stdArchiveReader.cpp::stdArchiveReaderInitCommon::47] Error Code 1: Serialization (Serialization assertion stdVersionRead == kSERIALIZATION_VERSION failed.Version tag does not match. Note: Current Version: 237, Serialized Engine Version: 236)
[!] Could not deserialize engine. See log for details.
[E] FAILED | Runtime: 0.470s | Command: /usr/local/bin/polygraphy run model.plan --trt
How can I see the expected input and output shapes using polygraphy - for defining the config.pbtxt files for tritonserver?
Is there a way to visualize the model.plan, similar to the model.onnx plot?
How can I use the generated model.plan and test it for accuracy - can I do that with polygraphy, or do I need to deploy it with Triton server?
cc @lix19937 @brb-nv
I ran trtexec and got the following output - does that look good? trtexec.txt
I see: [05/25/2024-23:05:17] [I] Multithreading: Disabled [05/25/2024-23:05:17] [I] CUDA Graph: Disabled - why is that?
That's normal, because you did not enable those features.
I still have a problem here - how do I fix it? [E] 1: [stdArchiveReader.cpp::stdArchiveReaderInitCommon::47] Error Code 1: Serialization (Serialization assertion stdVersionRead == kSERIALIZATION_VERSION failed. Version tag does not match. Note: Current Version: 237, Serialized Engine Version: 236)
The environments don't match. Use the following commands:
polygraphy convert removed.onnx -o model_poly.plan
polygraphy run model_poly.plan --trt
How can I see the expected input and output shapes using polygraphy - for defining the config.pbtxt files for tritonserver?
polygraphy run removed.onnx --trt \
--data-loader-script data_loader.py
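A minimal sketch of such a data_loader.py (Polygraphy looks for a load_data function by default; the input names, shapes and int64 dtype below are assumptions taken from this model, adjust them to yours):

```python
# data_loader.py
import numpy as np

def load_data():
    # Yield one feed_dict (input name -> numpy array) per inference iteration.
    for seq_len in (16, 128):
        yield {
            "input_ids": np.zeros((1, seq_len), dtype=np.int64),
            "attention_mask": np.ones((1, seq_len), dtype=np.int64),
            "token_type_ids": np.zeros((1, seq_len), dtype=np.int64),
        }
```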
Is there a way to visualize the model.plan, similar to the model.onnx plot?
trex - see more at https://github.com/NVIDIA/TensorRT/tree/release/10.0/tools/experimental/trt-engine-explorer/trex
How can I use the generated model.plan and test it for accuracy - can I do that with polygraphy, or do I need to deploy it with Triton server?
polygraphy run removed.onnx --trt --onnxrt --fp16
trtexec:
trtexec --onnx=removed.onnx --saveEngine=model.plan
polygraphy convert:
polygraphy convert removed.onnx -o model_poly.plan
works - but why does trtexec or polygraphy convert allow int64 when triton inference server cannot run it?
Does polygraphy convert removed.onnx -o model_poly.plan generate the same output as trtexec --onnx=removed.onnx --saveEngine=model.plan?
[E] 1: [stdArchiveReader.cpp::stdArchiveReaderInitCommon::47] Error Code 1: Serialization (Serialization assertion stdVersionRead == kSERIALIZATION_VERSION failed. Version tag does not match. Note: Current Version: 237, Serialized Engine Version: 236)
$ polygraphy convert removed.onnx -o model_poly.plan
[W] ModelImporter.cpp:420: Make sure input input_ids has Int64 binding.
[W] ModelImporter.cpp:420: Make sure input attention_mask has Int64 binding.
[W] ModelImporter.cpp:420: Make sure input token_type_ids has Int64 binding.
[W] ModelImporter.cpp:680: Make sure output label has Int64 binding.
[W] Input tensor: input_ids (dtype=DataType.INT64, shape=(-1, -1)) | No shapes provided; Will use shape: [1, 1] for min/opt/max in profile.
[W] This will cause the tensor to have a static shape. If this is incorrect, please set the range of shapes for this input tensor.
[W] Input tensor: attention_mask (dtype=DataType.INT64, shape=(-1, -1)) | No shapes provided; Will use shape: [1, 1] for min/opt/max in profile.
[W] Input tensor: token_type_ids (dtype=DataType.INT64, shape=(-1, -1)) | No shapes provided; Will use shape: [1, 1] for min/opt/max in profile.
[I] Configuring with profiles:[
Profile 0:
{input_ids [min=[1, 1], opt=[1, 1], max=[1, 1]],
attention_mask [min=[1, 1], opt=[1, 1], max=[1, 1]],
token_type_ids [min=[1, 1], opt=[1, 1], max=[1, 1]]}
]
[W] profileSharing0806 is on by default in TensorRT 10.0. This flag is deprecated and has no effect.
[I] Building engine with configuration:
Flags | []
Engine Capability | EngineCapability.STANDARD
Memory Pools | [WORKSPACE: 14930.56 MiB, TACTIC_DRAM: 14930.56 MiB, TACTIC_SHARED_MEMORY: 1024.00 MiB]
Tactic Sources | [EDGE_MASK_CONVOLUTIONS, JIT_CONVOLUTIONS]
Profiling Verbosity | ProfilingVerbosity.DETAILED
Preview Features | [PROFILE_SHARING_0806]
[I] Finished engine building in 2.010 seconds
polygraphy run with tensorrt:
$ polygraphy run removed.onnx --trt --onnxrt --tf32 --execution-providers=cuda
[I] RUNNING | Command: /home/ubuntu/triton_inference_server/create_trt_model/venv/bin/polygraphy run removed.onnx --trt --onnxrt --tf32 --execution-providers=cuda
[I] trt-runner-N0-05/27/24-04:42:14 | Activating and starting inference
[W] ModelImporter.cpp:420: Make sure input input_ids has Int64 binding.
[W] ModelImporter.cpp:420: Make sure input attention_mask has Int64 binding.
[W] ModelImporter.cpp:420: Make sure input token_type_ids has Int64 binding.
[W] ModelImporter.cpp:680: Make sure output label has Int64 binding.
[W] Input tensor: input_ids (dtype=DataType.INT64, shape=(-1, -1)) | No shapes provided; Will use shape: [1, 1] for min/opt/max in profile.
[W] This will cause the tensor to have a static shape. If this is incorrect, please set the range of shapes for this input tensor.
[W] Input tensor: attention_mask (dtype=DataType.INT64, shape=(-1, -1)) | No shapes provided; Will use shape: [1, 1] for min/opt/max in profile.
[W] Input tensor: token_type_ids (dtype=DataType.INT64, shape=(-1, -1)) | No shapes provided; Will use shape: [1, 1] for min/opt/max in profile.
[I] Configuring with profiles:[
Profile 0:
{input_ids [min=[1, 1], opt=[1, 1], max=[1, 1]],
attention_mask [min=[1, 1], opt=[1, 1], max=[1, 1]],
token_type_ids [min=[1, 1], opt=[1, 1], max=[1, 1]]}
]
[W] profileSharing0806 is on by default in TensorRT 10.0. This flag is deprecated and has no effect.
[I] Building engine with configuration:
Flags | [TF32]
Engine Capability | EngineCapability.STANDARD
Memory Pools | [WORKSPACE: 14930.56 MiB, TACTIC_DRAM: 14930.56 MiB, TACTIC_SHARED_MEMORY: 1024.00 MiB]
Tactic Sources | [EDGE_MASK_CONVOLUTIONS, JIT_CONVOLUTIONS]
Profiling Verbosity | ProfilingVerbosity.DETAILED
Preview Features | [PROFILE_SHARING_0806]
[I] Finished engine building in 1.991 seconds
[I] trt-runner-N0-05/27/24-04:42:14
---- Inference Input(s) ----
{input_ids [dtype=int64, shape=(1, 1)],
attention_mask [dtype=int64, shape=(1, 1)],
token_type_ids [dtype=int64, shape=(1, 1)]}
[I] trt-runner-N0-05/27/24-04:42:14
---- Inference Output(s) ----
{label [dtype=int64, shape=(1,)],
probabilities [dtype=float32, shape=(1, 2)]}
[I] trt-runner-N0-05/27/24-04:42:14 | Completed 1 iteration(s) in 4.848 ms | Average inference time: 4.848 ms.
[I] onnxrt-runner-N0-05/27/24-04:42:14 | Activating and starting inference
[I] Creating ONNX-Runtime Inference Session with providers: ['CUDAExecutionProvider']
2024-05-27 04:42:18.834213613 [W:onnxruntime:, transformer_memcpy.cc:74 ApplyImpl] 2 Memcpy nodes are added to the graph main_graph_06a8d9446585464486ee2407d95613e9 for CUDAExecutionProvider. It might have negative impact on performance (including unable to run CUDA graph). Set session_options.log_severity_level=1 to see the detail logs before this message.
2024-05-27 04:42:18.835803344 [W:onnxruntime:, session_state.cc:1166 VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2024-05-27 04:42:18.835830375 [W:onnxruntime:, session_state.cc:1168 VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
[I] onnxrt-runner-N0-05/27/24-04:42:14
---- Inference Input(s) ----
{input_ids [dtype=int64, shape=(1, 1)],
attention_mask [dtype=int64, shape=(1, 1)],
token_type_ids [dtype=int64, shape=(1, 1)]}
[I] onnxrt-runner-N0-05/27/24-04:42:14
---- Inference Output(s) ----
{label [dtype=int64, shape=(1,)],
probabilities [dtype=float64, shape=(1, 2)]}
[I] onnxrt-runner-N0-05/27/24-04:42:14 | Completed 1 iteration(s) in 20.36 ms | Average inference time: 20.36 ms.
[I] Accuracy Comparison | trt-runner-N0-05/27/24-04:42:14 vs. onnxrt-runner-N0-05/27/24-04:42:14
[I] Comparing Output: 'label' (dtype=int64, shape=(1,)) with 'label' (dtype=int64, shape=(1,))
[I] Tolerance: [abs=1e-05, rel=1e-05] | Checking elemwise error
[I] trt-runner-N0-05/27/24-04:42:14: label | Stats: mean=1, std-dev=0, var=0, median=1, min=1 at (0,), max=1 at (0,), avg-magnitude=1
[I] onnxrt-runner-N0-05/27/24-04:42:14: label | Stats: mean=1, std-dev=0, var=0, median=1, min=1 at (0,), max=1 at (0,), avg-magnitude=1
[I] Error Metrics: label
[I] Minimum Required Tolerance: elemwise error | [abs=0] OR [rel=0] (requirements may be lower if both abs/rel tolerances are set)
[I] Absolute Difference | Stats: mean=0, std-dev=0, var=0, median=0, min=0 at (0,), max=0 at (0,), avg-magnitude=0
[I] Relative Difference | Stats: mean=0, std-dev=0, var=0, median=0, min=0 at (0,), max=0 at (0,), avg-magnitude=0
[I] PASSED | Output: 'label' | Difference is within tolerance (rel=1e-05, abs=1e-05)
[I] Comparing Output: 'probabilities' (dtype=float32, shape=(1, 2)) with 'probabilities' (dtype=float64, shape=(1, 2))
[I] Tolerance: [abs=1e-05, rel=1e-05] | Checking elemwise error
[I] trt-runner-N0-05/27/24-04:42:14: probabilities | Stats: mean=0.5, std-dev=0.5, var=0.25, median=0.5, min=5.318e-13 at (0, 0), max=1 at (0, 1), avg-magnitude=0.5
[I] onnxrt-runner-N0-05/27/24-04:42:14: probabilities | Stats: mean=0.5, std-dev=0.5, var=0.25, median=0.5, min=5.318e-13 at (0, 0), max=1 at (0, 1), avg-magnitude=0.5
[I] Error Metrics: probabilities
[I] Minimum Required Tolerance: elemwise error | [abs=5.318e-13] OR [rel=3.0581e-06] (requirements may be lower if both abs/rel tolerances are set)
[I] Absolute Difference | Stats: mean=2.659e-13, std-dev=2.659e-13, var=7.0702e-26, median=2.659e-13, min=1.6263e-18 at (0, 0), max=5.318e-13 at (0, 1), avg-magnitude=2.659e-13
[I] Relative Difference | Stats: mean=1.5291e-06, std-dev=1.5291e-06, var=2.338e-12, median=1.5291e-06, min=5.318e-13 at (0, 1), max=3.0581e-06 at (0, 0), avg-magnitude=1.5291e-06
[I] PASSED | Output: 'probabilities' | Difference is within tolerance (rel=1e-05, abs=1e-05)
[I] PASSED | All outputs matched | Outputs: ['label', 'probabilities']
[I] Accuracy Summary | trt-runner-N0-05/27/24-04:42:14 vs. onnxrt-runner-N0-05/27/24-04:42:14 | Passed: 1/1 iterations | Pass Rate: 100.0%
[I] PASSED | Runtime: 7.003s | Command: /home/ubuntu/triton_inference_server/create_trt_model/venv/bin/polygraphy run removed.onnx --trt --onnxrt --tf32 --execution-providers=cuda
polygraphy run for onnx:
polygraphy run removed.onnx --onnxrt --execution-providers=cuda
[I] RUNNING | Command: /home/ubuntu/triton_inference_server/create_trt_model/venv/bin/polygraphy run removed.onnx --onnxrt --execution-providers=cuda
[I] onnxrt-runner-N0-05/27/24-04:44:20 | Activating and starting inference
[I] Creating ONNX-Runtime Inference Session with providers: ['CUDAExecutionProvider']
2024-05-27 04:44:21.106331814 [W:onnxruntime:, transformer_memcpy.cc:74 ApplyImpl] 2 Memcpy nodes are added to the graph main_graph_06a8d9446585464486ee2407d95613e9 for CUDAExecutionProvider. It might have negative impact on performance (including unable to run CUDA graph). Set session_options.log_severity_level=1 to see the detail logs before this message.
2024-05-27 04:44:21.107955920 [W:onnxruntime:, session_state.cc:1166 VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2024-05-27 04:44:21.107980378 [W:onnxruntime:, session_state.cc:1168 VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
[W] Input tensor: input_ids [shape=BoundedShape(['batch_size', 'sequence'], min=None, max=None)] | Will generate data of shape: [1, 1].
If this is incorrect, please provide a custom data loader.
[W] Input tensor: attention_mask [shape=BoundedShape(['batch_size', 'sequence'], min=None, max=None)] | Will generate data of shape: [1, 1].
If this is incorrect, please provide a custom data loader.
[W] Input tensor: token_type_ids [shape=BoundedShape(['batch_size', 'sequence'], min=None, max=None)] | Will generate data of shape: [1, 1].
If this is incorrect, please provide a custom data loader.
[I] onnxrt-runner-N0-05/27/24-04:44:20
---- Inference Input(s) ----
{input_ids [dtype=int64, shape=(1, 1)],
attention_mask [dtype=int64, shape=(1, 1)],
token_type_ids [dtype=int64, shape=(1, 1)]}
[I] onnxrt-runner-N0-05/27/24-04:44:20
---- Inference Output(s) ----
{label [dtype=int64, shape=(1,)],
probabilities [dtype=float64, shape=(1, 2)]}
[I] onnxrt-runner-N0-05/27/24-04:44:20 | Completed 1 iteration(s) in 5.083 ms | Average inference time: 5.083 ms.
[I] PASSED | Runtime: 1.594s | Command: /home/ubuntu/triton_inference_server/create_trt_model/venv/bin/polygraphy run removed.onnx --onnxrt --execution-providers=cuda
cc @lix19937 @brb-nv
You should ensure that you build and run the engine in the same environment. This kind of error is usually due to different TensorRT versions.
The TensorRT version used by your polygraphy does not match the TensorRT version of your trtexec. If they use the same TensorRT version, the plans will be the same.
I run trtexec and polygraphy (installed via pip install polygraphy) in the same Docker container - using the nvcr.io/nvidia/tritonserver:24.04-py3 image - how do I get trtexec and polygraphy to match the same engine there?
@lix19937
@yjiangling kindly open a separate issue. It's quite different from OP's issue.
Ok, thank you. I have opened a new issue here: https://github.com/NVIDIA/TensorRT/issues/3902
I run trtexec and polygraphy (installed via pip install polygraphy) in the same Docker container - using the nvcr.io/nvidia/tritonserver:24.04-py3 image - how do I get trtexec and polygraphy to match the same engine there?
@lix19937
You can run:
pip list | grep tensorrt
and
trtexec --help | grep 'TensorRT.trtexec'
ldd -r trtexec
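To cross-check from Python, a tiny sketch that prints the TensorRT version the tensorrt wheel (and hence polygraphy's TRT backend) is using; it should match the version trtexec reports:

```python
import tensorrt as trt

# Compare with the version printed by `trtexec --help | grep 'TensorRT.trtexec'`.
print(trt.__version__)
```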
$ /opt/tritonserver/TensorRT/build/trtexec --onnx=removed.onnx --saveEngine=model2.plan --verbose
&&&& RUNNING TensorRT.trtexec [TensorRT v100001] # /opt/tritonserver/TensorRT/build/trtexec --onnx=removed.onnx --saveEngine=model2.plan --verbose
[05/27/2024-16:39:16] [I] === Model Options ===
[05/27/2024-16:39:16] [I] Format: ONNX
[05/27/2024-16:39:16] [I] Model: removed.onnx
[05/27/2024-16:39:16] [I] Output:
[05/27/2024-16:39:16] [I] === Build Options ===
[05/27/2024-16:39:16] [I] Memory Pools: workspace: default, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default, tacticSharedMem: default
[05/27/2024-16:39:16] [I] avgTiming: 8
[05/27/2024-16:39:16] [I] Precision: FP32
[05/27/2024-16:39:16] [I] LayerPrecisions:
[05/27/2024-16:39:16] [I] Layer Device Types:
[05/27/2024-16:39:16] [I] Calibration:
[05/27/2024-16:39:16] [I] Refit: Disabled
[05/27/2024-16:39:16] [I] Strip weights: Disabled
[05/27/2024-16:39:16] [I] Version Compatible: Disabled
[05/27/2024-16:39:16] [I] ONNX Plugin InstanceNorm: Disabled
[05/27/2024-16:39:16] [I] TensorRT runtime: full
[05/27/2024-16:39:16] [I] Lean DLL Path:
[05/27/2024-16:39:16] [I] Tempfile Controls: { in_memory: allow, temporary: allow }
[05/27/2024-16:39:16] [I] Exclude Lean Runtime: Disabled
[05/27/2024-16:39:16] [I] Sparsity: Disabled
[05/27/2024-16:39:16] [I] Safe mode: Disabled
[05/27/2024-16:39:16] [I] Build DLA standalone loadable: Disabled
[05/27/2024-16:39:16] [I] Allow GPU fallback for DLA: Disabled
[05/27/2024-16:39:16] [I] DirectIO mode: Disabled
[05/27/2024-16:39:16] [I] Restricted mode: Disabled
[05/27/2024-16:39:16] [I] Skip inference: Disabled
[05/27/2024-16:39:16] [I] Save engine: model2.plan
[05/27/2024-16:39:16] [I] Load engine:
[05/27/2024-16:39:16] [I] Profiling verbosity: 0
[05/27/2024-16:39:16] [I] Tactic sources: Using default tactic sources
[05/27/2024-16:39:16] [I] timingCacheMode: local
[05/27/2024-16:39:16] [I] timingCacheFile:
[05/27/2024-16:39:16] [I] Enable Compilation Cache: Enabled
[05/27/2024-16:39:16] [I] errorOnTimingCacheMiss: Disabled
[05/27/2024-16:39:16] [I] Preview Features: Use default preview flags.
[05/27/2024-16:39:16] [I] MaxAuxStreams: -1
[05/27/2024-16:39:16] [I] BuilderOptimizationLevel: -1
[05/27/2024-16:39:16] [I] Calibration Profile Index: 0
[05/27/2024-16:39:16] [I] Weight Streaming: Disabled
[05/27/2024-16:39:16] [I] Debug Tensors:
[05/27/2024-16:39:16] [I] Input(s)s format: fp32:CHW
[05/27/2024-16:39:16] [I] Output(s)s format: fp32:CHW
[05/27/2024-16:39:16] [I] Input build shapes: model
[05/27/2024-16:39:16] [I] Input calibration shapes: model
[05/27/2024-16:39:16] [I] === System Options ===
[05/27/2024-16:39:16] [I] Device: 0
[05/27/2024-16:39:16] [I] DLACore:
[05/27/2024-16:39:16] [I] Plugins:
[05/27/2024-16:39:16] [I] setPluginsToSerialize:
[05/27/2024-16:39:16] [I] dynamicPlugins:
[05/27/2024-16:39:16] [I] ignoreParsedPluginLibs: 0
[05/27/2024-16:39:16] [I]
[05/27/2024-16:39:16] [I] === Inference Options ===
[05/27/2024-16:39:16] [I] Batch: Explicit
[05/27/2024-16:39:16] [I] Input inference shapes: model
[05/27/2024-16:39:16] [I] Iterations: 10
[05/27/2024-16:39:16] [I] Duration: 3s (+ 200ms warm up)
[05/27/2024-16:39:16] [I] Sleep time: 0ms
[05/27/2024-16:39:16] [I] Idle time: 0ms
[05/27/2024-16:39:16] [I] Inference Streams: 1
[05/27/2024-16:39:16] [I] ExposeDMA: Disabled
[05/27/2024-16:39:16] [I] Data transfers: Enabled
[05/27/2024-16:39:16] [I] Spin-wait: Disabled
[05/27/2024-16:39:16] [I] Multithreading: Disabled
[05/27/2024-16:39:16] [I] CUDA Graph: Disabled
[05/27/2024-16:39:16] [I] Separate profiling: Disabled
[05/27/2024-16:39:16] [I] Time Deserialize: Disabled
[05/27/2024-16:39:16] [I] Time Refit: Disabled
[05/27/2024-16:39:16] [I] NVTX verbosity: 0
[05/27/2024-16:39:16] [I] Persistent Cache Ratio: 0
[05/27/2024-16:39:16] [I] Optimization Profile Index: 0
[05/27/2024-16:39:16] [I] Weight Streaming Budget: Disabled
[05/27/2024-16:39:16] [I] Inputs:
[05/27/2024-16:39:16] [I] Debug Tensor Save Destinations:
[05/27/2024-16:39:16] [I] === Reporting Options ===
[05/27/2024-16:39:16] [I] Verbose: Enabled
[05/27/2024-16:39:16] [I] Averages: 10 inferences
[05/27/2024-16:39:16] [I] Percentiles: 90,95,99
[05/27/2024-16:39:16] [I] Dump refittable layers:Disabled
[05/27/2024-16:39:16] [I] Dump output: Disabled
[05/27/2024-16:39:16] [I] Profile: Disabled
[05/27/2024-16:39:16] [I] Export timing to JSON file:
[05/27/2024-16:39:16] [I] Export output to JSON file:
[05/27/2024-16:39:16] [I] Export profile to JSON file:
[05/27/2024-16:39:16] [I]
[05/27/2024-16:39:16] [I] === Device Information ===
[05/27/2024-16:39:16] [I] Available Devices:
[05/27/2024-16:39:16] [I] Device 0: "Tesla T4" UUID: GPU-2c78ffbb-6ac9-111e-43ed-0c697b4619d4
[05/27/2024-16:39:16] [I] Selected Device: Tesla T4
[05/27/2024-16:39:16] [I] Selected Device ID: 0
[05/27/2024-16:39:16] [I] Selected Device UUID: GPU-2c78ffbb-6ac9-111e-43ed-0c697b4619d4
[05/27/2024-16:39:16] [I] Compute Capability: 7.5
[05/27/2024-16:39:16] [I] SMs: 40
[05/27/2024-16:39:16] [I] Device Global Memory: 14930 MiB
[05/27/2024-16:39:16] [I] Shared Memory per SM: 64 KiB
[05/27/2024-16:39:16] [I] Memory Bus Width: 256 bits (ECC enabled)
[05/27/2024-16:39:16] [I] Application Compute Clock Rate: 1.59 GHz
[05/27/2024-16:39:16] [I] Application Memory Clock Rate: 5.001 GHz
[05/27/2024-16:39:16] [I]
[05/27/2024-16:39:16] [I] Note: The application clock rates do not reflect the actual clock rates that the GPU is currently running at.
[05/27/2024-16:39:16] [I]
[05/27/2024-16:39:16] [I] TensorRT version: 10.0.1
[05/27/2024-16:39:16] [I] Loading standard plugins
Segmentation fault (core dumped)
$ ldd -r /opt/tritonserver/TensorRT/build/trtexec
linux-vdso.so.1 (0x00007ffe0b1e3000)
libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007ff1c73d7000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007ff1c72f0000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007ff1c72d0000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007ff1c70a7000)
/lib64/ld-linux-x86-64.so.2 (0x00007ff1c774b000)
$ /opt/tritonserver/TensorRT/build/trtexec --help |grep 'TensorRT.trtexec'
&&&& RUNNING TensorRT.trtexec [TensorRT v100001] # /opt/tritonserver/TensorRT/build/trtexec --help
$ pip list | grep "tensorrt"
tensorrt                 10.0.1
tensorrt-cu12            10.0.1
tensorrt-cu12-bindings   10.0.1
tensorrt-cu12-libs       10.0.1
- I built trtexec for v100001 myself: [build_trt.txt](https://github.com/NVIDIA/TensorRT/files/15458361/build_trt.txt)
- Does it matter if the onnx model has opset=0, opset=13 or opset=90 when you run trtexec --onnx=removed.onnx --saveEngine=model.plan?
- Why does ```polygraphy convert removed.onnx -o model_poly.plan``` or ```trtexec --onnx=removed.onnx --saveEngine=model2.plan``` use int64 and not already convert it to e.g. int32, when triton inference server cannot run it?
cc @lix19937 @brb-nv
I get a Segmentation fault with trtexec using engine v100001 - the same worked with trtexec using engine v8603 - any idea?
Your CUDA runtime/driver environment does not match the v10.0.1 requirements, see https://github.com/NVIDIA/TensorRT?tab=readme-ov-file#prerequisites.
Does it matter if the onnx model has opset=0, opset=13 or opset=90 when you run trtexec --onnx=removed.onnx --saveEngine=model.plan?
TensorRT’s primary means of importing a trained model from a framework is through the ONNX interchange format. TensorRT ships with an ONNX parser library to assist in importing models. Where possible, the parser is backward compatible up to opset 9; the ONNX Model Opset Version Converter can assist in resolving incompatibilities. The GitHub version may support later opsets than the version shipped with TensorRT. Refer to the ONNX-TensorRT operator support matrix for the latest information on the supported opset and operators. For TensorRT deployment, we recommend exporting to the latest available ONNX opset.
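For reference, a minimal sketch of using that ONNX version converter (assuming the file is named model.onnx and the target is opset 17):

```python
import onnx
from onnx import version_converter

model = onnx.load("model.onnx")
# Convert the default-domain ops to the target opset where possible.
converted = version_converter.convert_version(model, 17)
onnx.save(converted, "model_opset17.onnx")
```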
Why does polygraphy convert removed.onnx -o model_poly.plan or trtexec --onnx=removed.onnx --saveEngine=model2.plan use int64 and not already convert it to e.g. int32, when triton inference server cannot run it?
You need to know one thing: all three tools/processes need libnvinfer.so or libnvinfer.a. If you make sure they use the same version of the library, it will pass.
Your CUDA runtime/driver environment does not match the v10.0.1 requirements, see https://github.com/NVIDIA/TensorRT?tab=readme-ov-file#prerequisites.
OK, that worked (I used the docker image nvcr.io/nvidia/pytorch:24.05-py3).
TensorRT’s primary means of importing a trained model from a framework is through the ONNX interchange format. TensorRT ships with an ONNX parser library to assist in importing models. Where possible, the parser is backward compatible up to opset 9; the ONNX Model Opset Version Converter can assist in resolving incompatibilities.
Supported ONNX opset for TensorRT 8.6.3 (https://github.com/onnx/onnx-tensorrt/blob/6872a9473391a73b96741711d52b98c2c3e25146/docs/operators.md):
TensorRT 8.6 supports operators up to Opset 17. Latest information of ONNX operators can be found [here](https://github.com/onnx/onnx/blob/master/docs/Operators.md)
TensorRT supports the following ONNX data types: DOUBLE, FLOAT32, FLOAT16, INT8, and BOOL
polygraphy:
polygraphy run model.plan
?
cc @lix19937 @brb-nv
TensorRT 8.6 supports operators up to Opset 17.
This generally means all opset versions [0, 17] of a certain op are supported. For example, if you look at operators.md, BatchNorm has been updated in opsets 15, 14, 9, 7, 6, 1. Given the statement above, ideally, BatchNorm of all those opsets should be supported by TRT.
Does that mean opset 0 to opset 17? Is it better for tensorrt to export the onnx model with opset 0 or opset 17?
I'd try to export with opset 17 and keep in mind any gaps in TRT support by looking at the onnx2trt support matrix.
Also, how can I display some more information like latency, used memory, input shape, output shape for polygraphy run model.plan?
I think trtexec (with --verbose option) shows everything you're looking for. Polygraphy is more suitable for accuracy debugging and not as much for measuring performance.
Also, how can I check that tensorrt uses the GPU to run vs the CPU? Can I check that with polygraphy run too?
I can see this with trtexec with the --verbose option.
I have a model.plan - I don't know which settings (input shape, output shape, batch_size) I defined for trtexec - how can I figure it out? Can I load the model.plan with polygraphy or another tool to get this info?
Also, is it possible to write the tokenizer in C++ (for the huggingface sentence transformer model) for triton inference server? Current code:
from transformers import AutoTokenizer, TensorType
tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')
# Example input sentences
sentences = ["This is an example sentence"]
# Tokenize sentences
tokens = tokenizer(sentences, return_token_type_ids=True, return_tensors=TensorType.NUMPY, max_length=128, truncation=True)
cc @lix19937 @brb-nv
Use trtexec --loadEngine=model.plan --verbose, it can show some of this info.
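If you'd rather inspect the plan from Python, here is a minimal sketch (assuming the TensorRT 8.x Python API and a serialized engine named model.plan) that prints each binding's name, direction, shape and dtype:

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(logger)

# Deserialize the engine built for this same TensorRT version.
with open("model.plan", "rb") as f:
    engine = runtime.deserialize_cuda_engine(f.read())

# List every binding (input/output) with its shape and dtype.
for i in range(engine.num_bindings):
    kind = "input" if engine.binding_is_input(i) else "output"
    print(engine.get_binding_name(i), kind,
          engine.get_binding_shape(i), engine.get_binding_dtype(i))
```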
@geraldstanje ref https://github.com/lix19937/trt-samples-for-hackathon-cn/blob/master/cookbook/01-SimpleDemo/TensorRT8.5/main.cpp - it uses C++ to load the plan and then run inference.
@lix19937 what's the difference between the main.cpp and trtexec --loadEngine=model.plan --verbose?
The effect is basically the same. trtexec is a command line wrapper tool to quickly utilize TensorRT without having to develop your own application. The trtexec tool has two main purposes: it's useful for benchmarking networks on random or user-provided input data, and it's useful for generating serialized engines from models. main.cpp is a demo that shows how to use C++ to run inference and integrate it into your own project.
@lix19937 what is the purpose of workspace? Can workspace still be used?
I compared the tensorrt python lib with trtexec:
-rw-r--r-- 1 root root 91807988 Jun 21 03:26 model.plan <--- generated with tensorrt python code
-rw-r--r-- 1 root root 91815788 Jun 21 03:24 model2.plan <--- generated with trtexec
python version:
import tensorrt as trt

def convert_onnx_to_trt(onnx_model_path, trt_model_path, workspace=140000):
    # Create a TensorRT logger
    TRT_LOGGER = trt.Logger(trt.Logger.VERBOSE)

    # Create a builder and a network
    builder = trt.Builder(TRT_LOGGER)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
    )

    # Create a parser to read the onnx file
    parser = trt.OnnxParser(network, TRT_LOGGER)

    # Parse the ONNX model
    with open(onnx_model_path, 'rb') as model_file:
        if not parser.parse(model_file.read()):
            for error in range(parser.num_errors):
                print(parser.get_error(error))
            return None

    # Set the builder configuration
    config = builder.create_builder_config()
    config.max_workspace_size = workspace * (1024 * 1024)  # Convert to bytes

    # Set optimization profiles
    profile = builder.create_optimization_profile()
    profile.set_shape("input_ids", (1, 1), (1, 128), (1, 512))
    profile.set_shape("attention_mask", (1, 1), (1, 128), (1, 512))
    profile.set_shape("token_type_ids", (1, 1), (1, 128), (1, 512))
    config.add_optimization_profile(profile)

    # Build the engine
    engine = builder.build_engine(network, config)
    if engine is None:
        print("Failed to build the engine!")
        return None

    # Serialize and save the engine
    with open(trt_model_path, 'wb') as engine_file:
        engine_file.write(engine.serialize())
    print(f"Successfully converted {onnx_model_path} to {trt_model_path}")

# Example usage
ONNX_MODEL_PATH = "model.onnx"
TRT_MODEL_PATH = "model.trt"
convert_onnx_to_trt(ONNX_MODEL_PATH, TRT_MODEL_PATH)
bash version:
#!/bin/bash
# readme about trtexec: https://github.com/NVIDIA/TensorRT/blob/master/samples/trtexec/README.md?plain=1
ONNX_MODEL_NAME=$1
TRT_MODEL_NAME=$2
WORKSPACE=140000
# convert onnx model to trt model
/usr/src/tensorrt/bin/trtexec \
--onnx=${ONNX_MODEL_NAME} \
--saveEngine=${TRT_MODEL_NAME} \
--minShapes=input_ids:1x1,attention_mask:1x1,token_type_ids:1x1 \
--optShapes=input_ids:1x128,attention_mask:1x128,token_type_ids:1x128 \
--maxShapes=input_ids:1x512,attention_mask:1x512,token_type_ids:1x512 \
--workspace=${WORKSPACE} \
--verbose \
| tee conversion.txt
# run generated trt model
/usr/src/tensorrt/bin/trtexec --loadEngine=${TRT_MODEL_NAME} --verbose
tensorrt lib:
pip list | grep "tensorrt"
tensorrt 8.6.1
tensorrt-bindings 8.6.1
tensorrt-libs 8.6.1
more infos:
find / -name "tensorrt.so"
/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so
root@7ae30d5c9eea:/workspace# ldd -r /usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so
linux-vdso.so.1 (0x00007ffff81fa000)
libnvinfer.so.8 => /usr/local/lib/python3.10/dist-packages/tensorrt_bindings/../tensorrt_libs/libnvinfer.so.8 (0x00007f48da351000)
libnvonnxparser.so.8 => /usr/local/lib/python3.10/dist-packages/tensorrt_bindings/../tensorrt_libs/libnvonnxparser.so.8 (0x00007f48d9e00000)
libnvparsers.so.8 => /usr/local/lib/python3.10/dist-packages/tensorrt_bindings/../tensorrt_libs/libnvparsers.so.8 (0x00007f48d9800000)
libnvinfer_plugin.so.8 => /usr/local/lib/python3.10/dist-packages/tensorrt_bindings/../tensorrt_libs/libnvinfer_plugin.so.8 (0x00007f48d7343000)
libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f48d7117000)
libm.so.6 => /usr/lib/x86_64-linux-gnu/libm.so.6 (0x00007f48e90d1000)
libgcc_s.so.1 => /usr/lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f48e90b1000)
libc.so.6 => /usr/lib/x86_64-linux-gnu/libc.so.6 (0x00007f48d6eef000)
libpthread.so.0 => /usr/lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f48e90aa000)
libdl.so.2 => /usr/lib/x86_64-linux-gnu/libdl.so.2 (0x00007f48e90a5000)
librt.so.1 => /usr/lib/x86_64-linux-gnu/librt.so.1 (0x00007f48e90a0000)
/lib64/ld-linux-x86-64.so.2 (0x00007f48e91c8000)
libcublas.so.12 => /usr/local/lib/python3.10/dist-packages/tensorrt_bindings/../tensorrt_libs/../nvidia/cublas/lib/libcublas.so.12 (0x00007f48d0600000)
libcublasLt.so.12 => /usr/local/lib/python3.10/dist-packages/tensorrt_bindings/../tensorrt_libs/../nvidia/cublas/lib/libcublasLt.so.12 (0x00007f48ae600000)
libcudnn.so.8 => /usr/local/lib/python3.10/dist-packages/tensorrt_bindings/../tensorrt_libs/../nvidia/cudnn/lib/libcudnn.so.8 (0x00007f48ae200000)
undefined symbol: PyInstanceMethod_Type (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyExc_ValueError (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: _Py_TrueStruct (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyExc_IndexError (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyCapsule_Type (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyModule_Type (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PySlice_Type (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyMemoryView_Type (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: _Py_NoneStruct (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyExc_MemoryError (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyType_Type (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyByteArray_Type (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyCFunction_Type (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyExc_OverflowError (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyProperty_Type (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyExc_BufferError (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyExc_DeprecationWarning (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyExc_RuntimeError (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: _Py_NotImplementedStruct (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyBaseObject_Type (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyExc_StopIteration (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyExc_TypeError (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyMethod_Type (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: _Py_FalseStruct (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyDict_Type (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyFloat_Type (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyExc_SystemError (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyExc_ImportError (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyObject_GenericGetDict (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyObject_GenericSetDict (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyMemoryView_FromObject (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyTuple_SetItem (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyObject_GetBuffer (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyObject_Repr (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyLong_AsLong (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyLong_FromSsize_t (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyByteArray_Size (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyObject_Call (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyIter_Check (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyNumber_And (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyErr_NormalizeException (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyInstanceMethod_New (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyEval_AcquireThread (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyObject_Str (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyThreadState_DeleteCurrent (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyGILState_GetThisThreadState (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyObject_GetAttrString (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyMem_Free (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyErr_Restore (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyType_IsSubtype (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyModule_AddObject (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyErr_WarnEx (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyObject_CheckBuffer (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyCapsule_SetPointer (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyTuple_New (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyObject_SetAttr (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyObject_IsInstance (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyEval_RestoreThread (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyException_SetTraceback (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyNumber_Float (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyUnicode_FromFormat (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyList_Append (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PySlice_AdjustIndices (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyThreadState_GetFrame (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyDict_Contains (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyDict_Next (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyList_Size (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyTuple_Size (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyMemoryView_FromBuffer (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyNumber_Long (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyBuffer_Release (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyObject_GetIter (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyErr_Format (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyObject_CallObject (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyFloat_FromDouble (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyFloat_AsDouble (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyUnicode_DecodeUTF8 (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: _Py_Dealloc (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyByteArray_AsString (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyList_New (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyImport_ImportModule (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyNumber_Check (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: _PyObject_GetDictPtr (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyUnicode_FromString (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyIndex_Check (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: Py_GetVersion (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyCapsule_SetContext (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyFrame_GetLineNumber (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyThread_tss_get (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyBytes_Size (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PySequence_Check (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyList_GetItem (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyException_SetContext (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyErr_Clear (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyObject_HasAttrString (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyWeakref_NewRef (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyDict_New (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyErr_SetString (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyCapsule_GetContext (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyThreadState_Get (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyObject_SetItem (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PySlice_Unpack (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyCapsule_New (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyMem_Calloc (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyObject_SetAttrString (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyGILState_Release (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyCapsule_GetPointer (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyNumber_Xor (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyThread_tss_alloc (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyEval_GetLocals (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyBytes_AsString (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyObject_LengthHint (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyDict_GetItemWithError (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyThread_tss_set (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyObject_GetItem (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyType_Ready (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyEval_SaveThread (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PySequence_GetItem (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyNumber_Invert (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyObject_ClearWeakRefs (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PySequence_Size (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyLong_FromLong (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyEval_GetBuiltins (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyErr_WriteUnraisable (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyObject_RichCompareBool (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyNumber_Or (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyModule_Create2 (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyThread_tss_create (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyLong_AsUnsignedLong (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyLong_FromSize_t (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyFrame_GetBack (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyCapsule_SetName (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyUnicode_AsEncodedString (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyErr_Occurred (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyDict_Copy (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyErr_Fetch (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyThreadState_New (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: _PyThreadState_UncheckedGet (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: _PyType_Lookup (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyObject_CallFunctionObjArgs (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyDict_Size (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyIter_Next (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyCallable_Check (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PySequence_Tuple (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyThreadState_Clear (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyDict_DelItemString (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyUnicode_AsUTF8AndSize (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyGILState_Ensure (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyObject_Malloc (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyCMethod_New (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyCapsule_GetName (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyTuple_GetItem (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyFrame_GetCode (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyException_SetCause (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyUnicode_AsUTF8String (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
undefined symbol: PyBytes_AsStringAndSize (/usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so)
root@7ae30d5c9eea:/workspace# find / -name "trtexec"
/usr/src/tensorrt/bin/trtexec
root@7ae30d5c9eea:/workspace# ldd -r /usr/src/tensorrt/bin/trtexec
linux-vdso.so.1 (0x00007ffd1e97d000)
libpthread.so.0 => /usr/lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f1953bb4000)
libdl.so.2 => /usr/lib/x86_64-linux-gnu/libdl.so.2 (0x00007f1953baf000)
librt.so.1 => /usr/lib/x86_64-linux-gnu/librt.so.1 (0x00007f1953baa000)
libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f195397e000)
libm.so.6 => /usr/lib/x86_64-linux-gnu/libm.so.6 (0x00007f1953895000)
libgcc_s.so.1 => /usr/lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f1953875000)
libc.so.6 => /usr/lib/x86_64-linux-gnu/libc.so.6 (0x00007f195364d000)
/lib64/ld-linux-x86-64.so.2 (0x00007f1953bc7000)
More info:
dpkg -l | grep TensorRT
ii libnvinfer-bin 8.6.1.6-1+cuda12.0 amd64 TensorRT binaries
ii libnvinfer-dev 8.6.1.6-1+cuda12.0 amd64 TensorRT development libraries
ii libnvinfer-dispatch-dev 8.6.1.6-1+cuda12.0 amd64 TensorRT development dispatch runtime libraries
ii libnvinfer-dispatch8 8.6.1.6-1+cuda12.0 amd64 TensorRT dispatch runtime library
ii libnvinfer-headers-dev 8.6.1.6-1+cuda12.0 amd64 TensorRT development headers
ii libnvinfer-headers-plugin-dev 8.6.1.6-1+cuda12.0 amd64 TensorRT plugin headers
ii libnvinfer-lean-dev 8.6.1.6-1+cuda12.0 amd64 TensorRT lean runtime libraries
ii libnvinfer-lean8 8.6.1.6-1+cuda12.0 amd64 TensorRT lean runtime library
ii libnvinfer-plugin-dev 8.6.1.6-1+cuda12.0 amd64 TensorRT plugin libraries
ii libnvinfer-plugin8 8.6.1.6-1+cuda12.0 amd64 TensorRT plugin libraries
ii libnvinfer-vc-plugin-dev 8.6.1.6-1+cuda12.0 amd64 TensorRT vc-plugin library
ii libnvinfer-vc-plugin8 8.6.1.6-1+cuda12.0 amd64 TensorRT vc-plugin library
ii libnvinfer8 8.6.1.6-1+cuda12.0 amd64 TensorRT runtime libraries
ii libnvonnxparsers-dev 8.6.1.6-1+cuda12.0 amd64 TensorRT ONNX libraries
ii libnvonnxparsers8 8.6.1.6-1+cuda12.0 amd64 TensorRT ONNX libraries
ii libnvparsers-dev 8.6.1.6-1+cuda12.0 amd64 TensorRT parsers libraries
ii libnvparsers8 8.6.1.6-1+cuda12.0 amd64 TensorRT parsers libraries
ii tensorrt-dev 8.6.1.6-1+cuda12.0 amd64 Meta package for TensorRT development libraries
/usr/src/tensorrt/bin/trtexec --version
[06/21/2024-03:45:05] [E] Model missing or format not recognized
=== Model Options ===
--uff=<file> UFF model
--onnx=<file> ONNX model
--model=<file> Caffe model (default = no model, random weights used)
--deploy=<file> Caffe prototxt file
--output=<name>[,<name>]* Output names (it can be specified multiple times); at least one output is required for UFF and Caffe
--uffInput=<name>,X,Y,Z Input blob name and its dimensions (X,Y,Z=C,H,W), it can be specified multiple times; at least one is required for UFF models
--uffNHWC Set if inputs are in the NHWC layout instead of NCHW (use X,Y,Z=H,W,C order in --uffInput)
=== Build Options ===
--maxBatch Set max batch size and build an implicit batch engine (default = same size as --batch)
This option should not be used when the input model is ONNX or when dynamic shapes are provided.
--minShapes=spec Build with dynamic shapes using a profile with the min shapes provided
--optShapes=spec Build with dynamic shapes using a profile with the opt shapes provided
--maxShapes=spec Build with dynamic shapes using a profile with the max shapes provided
--minShapesCalib=spec Calibrate with dynamic shapes using a profile with the min shapes provided
--optShapesCalib=spec Calibrate with dynamic shapes using a profile with the opt shapes provided
--maxShapesCalib=spec Calibrate with dynamic shapes using a profile with the max shapes provided
Note: All three of min, opt and max shapes must be supplied.
However, if only opt shapes is supplied then it will be expanded so
that min shapes and max shapes are set to the same values as opt shapes.
Input names can be wrapped with escaped single quotes (ex: 'Input:0').
Example input shapes spec: input0:1x3x256x256,input1:1x3x128x128
Each input shape is supplied as a key-value pair where key is the input name and
value is the dimensions (including the batch dimension) to be used for that input.
Each key-value pair has the key and value separated using a colon (:).
Multiple input shapes can be provided via comma-separated key-value pairs.
--inputIOFormats=spec Type and format of each of the input tensors (default = all inputs in fp32:chw)
See --outputIOFormats help for the grammar of type and format list.
Note: If this option is specified, please set comma-separated types and formats for all
inputs following the same order as network inputs ID (even if only one input
needs specifying IO format) or set the type and format once for broadcasting.
--outputIOFormats=spec Type and format of each of the output tensors (default = all outputs in fp32:chw)
Note: If this option is specified, please set comma-separated types and formats for all
outputs following the same order as network outputs ID (even if only one output
needs specifying IO format) or set the type and format once for broadcasting.
IO Formats: spec ::= IOfmt[","spec]
IOfmt ::= type:fmt
type ::= "fp32"|"fp16"|"int32"|"int8"
fmt ::= ("chw"|"chw2"|"chw4"|"hwc8"|"chw16"|"chw32"|"dhwc8"|
"cdhw32"|"hwc"|"dla_linear"|"dla_hwc4")["+"fmt]
--workspace=N Set workspace size in MiB.
--memPoolSize=poolspec Specify the size constraints of the designated memory pool(s) in MiB.
Note: Also accepts decimal sizes, e.g. 0.25MiB. Will be rounded down to the nearest integer bytes.
In particular, for dlaSRAM the bytes will be rounded down to the nearest power of 2.
Pool constraint: poolspec ::= poolfmt[","poolspec]
poolfmt ::= pool:sizeInMiB
pool ::= "workspace"|"dlaSRAM"|"dlaLocalDRAM"|"dlaGlobalDRAM"
--profilingVerbosity=mode Specify profiling verbosity. mode ::= layer_names_only|detailed|none (default = layer_names_only)
--minTiming=M Set the minimum number of iterations used in kernel selection (default = 1)
--avgTiming=M Set the number of times averaged in each iteration for kernel selection (default = 8)
--refit Mark the engine as refittable. This will allow the inspection of refittable layers
and weights within the engine.
--versionCompatible, --vc Mark the engine as version compatible. This allows the engine to be used with newer versions
of TensorRT on the same host OS, as well as TensorRT's dispatch and lean runtimes.
Only supported with explicit batch.
--useRuntime=runtime TensorRT runtime to execute engine. "lean" and "dispatch" require loading VC engine and do
not support building an engine.
runtime::= "full"|"lean"|"dispatch"
--leanDLLPath=<file> External lean runtime DLL to use in version compatiable mode.
--excludeLeanRuntime When --versionCompatible is enabled, this flag indicates that the generated engine should
not include an embedded lean runtime. If this is set, the user must explicitly specify a
valid lean runtime to use when loading the engine. Only supported with explicit batch
and weights within the engine.
--sparsity=spec Control sparsity (default = disabled).
Sparsity: spec ::= "disable", "enable", "force"
Note: Description about each of these options is as below
disable = do not enable sparse tactics in the builder (this is the default)
enable = enable sparse tactics in the builder (but these tactics will only be
considered if the weights have the right sparsity pattern)
force = enable sparse tactics in the builder and force-overwrite the weights to have
a sparsity pattern (even if you loaded a model yourself)
--noTF32 Disable tf32 precision (default is to enable tf32, in addition to fp32)
--fp16 Enable fp16 precision, in addition to fp32 (default = disabled)
--int8 Enable int8 precision, in addition to fp32 (default = disabled)
--fp8 Enable fp8 precision, in addition to fp32 (default = disabled)
--best Enable all precisions to achieve the best performance (default = disabled)
--directIO Avoid reformatting at network boundaries. (default = disabled)
--precisionConstraints=spec Control precision constraint setting. (default = none)
Precision Constraints: spec ::= "none" | "obey" | "prefer"
none = no constraints
prefer = meet precision constraints set by --layerPrecisions/--layerOutputTypes if possible
obey = meet precision constraints set by --layerPrecisions/--layerOutputTypes or fail
otherwise
--layerPrecisions=spec Control per-layer precision constraints. Effective only when precisionConstraints is set to
"obey" or "prefer". (default = none)
The specs are read left-to-right, and later ones override earlier ones. "*" can be used as a
layerName to specify the default precision for all the unspecified layers.
Per-layer precision spec ::= layerPrecision[","spec]
layerPrecision ::= layerName":"precision
precision ::= "fp32"|"fp16"|"int32"|"int8"
--layerOutputTypes=spec Control per-layer output type constraints. Effective only when precisionConstraints is set to
"obey" or "prefer". (default = none
The specs are read left-to-right, and later ones override earlier ones. "*" can be used as a
layerName to specify the default precision for all the unspecified layers. If a layer has more than
one output, then multiple types separated by "+" can be provided for this layer.
Per-layer output type spec ::= layerOutputTypes[","spec]
layerOutputTypes ::= layerName":"type
type ::= "fp32"|"fp16"|"int32"|"int8"["+"type]
--layerDeviceTypes=spec Specify layer-specific device type.
The specs are read left-to-right, and later ones override earlier ones. If a layer does not have
a device type specified, the layer will opt for the default device type.
Per-layer device type spec ::= layerDeviceTypePair[","spec]
layerDeviceTypePair ::= layerName":"deviceType
deviceType ::= "GPU"|"DLA"
--calib=<file> Read INT8 calibration cache file
--safe Enable build safety certified engine, if DLA is enable, --buildDLAStandalone will be specified
automatically (default = disabled)
--buildDLAStandalone Enable build DLA standalone loadable which can be loaded by cuDLA, when this option is enabled,
--allowGPUFallback is disallowed and --skipInference is enabled by default. Additionally,
specifying --inputIOFormats and --outputIOFormats restricts I/O data type and memory layout
(default = disabled)
--allowGPUFallback When DLA is enabled, allow GPU fallback for unsupported layers (default = disabled)
--consistency Perform consistency checking on safety certified engine
--restricted Enable safety scope checking with kSAFETY_SCOPE build flag
--saveEngine=<file> Save the serialized engine
--loadEngine=<file> Load a serialized engine
--tacticSources=tactics Specify the tactics to be used by adding (+) or removing (-) tactics from the default
tactic sources (default = all available tactics).
Note: Currently only cuDNN, cuBLAS, cuBLAS-LT, and edge mask convolutions are listed as optional
tactics.
Tactic Sources: tactics ::= [","tactic]
tactic ::= (+|-)lib
lib ::= "CUBLAS"|"CUBLAS_LT"|"CUDNN"|"EDGE_MASK_CONVOLUTIONS"
|"JIT_CONVOLUTIONS"
For example, to disable cudnn and enable cublas: --tacticSources=-CUDNN,+CUBLAS
--noBuilderCache Disable timing cache in builder (default is to enable timing cache)
--heuristic Enable tactic selection heuristic in builder (default is to disable the heuristic)
--timingCacheFile=<file> Save/load the serialized global timing cache
--preview=features Specify preview feature to be used by adding (+) or removing (-) preview features from the default
Preview Features: features ::= [","feature]
feature ::= (+|-)flag
flag ::= "fasterDynamicShapes0805"
|"disableExternalTacticSourcesForCore0805"
|"profileSharing0806"
--builderOptimizationLevel Set the builder optimization level. (default is 3)
Higher level allows TensorRT to spend more building time for more optimization options.
Valid values include integers from 0 to the maximum optimization level, which is currently 5.
--hardwareCompatibilityLevel=mode Make the engine file compatible with other GPU architectures. (default = none)
Hardware Compatibility Level: mode ::= "none" | "ampere+"
none = no compatibility
ampere+ = compatible with Ampere and newer GPUs
--tempdir=<dir> Overrides the default temporary directory TensorRT will use when creating temporary files.
See IRuntime::setTemporaryDirectory API documentation for more information.
--tempfileControls=controls Controls what TensorRT is allowed to use when creating temporary executable files.
Should be a comma-separated list with entries in the format (in_memory|temporary):(allow|deny).
in_memory: Controls whether TensorRT is allowed to create temporary in-memory executable files.
temporary: Controls whether TensorRT is allowed to create temporary executable files in the
filesystem (in the directory given by --tempdir).
For example, to allow in-memory files and disallow temporary files:
--tempfileControls=in_memory:allow,temporary:deny
If a flag is unspecified, the default behavior is "allow".
--maxAuxStreams=N Set maximum number of auxiliary streams per inference stream that TRT is allowed to use to run
kernels in parallel if the network contains ops that can run in parallel, with the cost of more
memory usage. Set this to 0 for optimal memory usage. (default = using heuristics)
=== Inference Options ===
--batch=N Set batch size for implicit batch engines (default = 1)
This option should not be used when the engine is built from an ONNX model or when dynamic
shapes are provided when the engine is built.
--shapes=spec Set input shapes for dynamic shapes inference inputs.
Note: Input names can be wrapped with escaped single quotes (ex: 'Input:0').
Example input shapes spec: input0:1x3x256x256, input1:1x3x128x128
Each input shape is supplied as a key-value pair where key is the input name and
value is the dimensions (including the batch dimension) to be used for that input.
Each key-value pair has the key and value separated using a colon (:).
Multiple input shapes can be provided via comma-separated key-value pairs.
--loadInputs=spec Load input values from files (default = generate random inputs). Input names can be wrapped with single quotes (ex: 'Input:0')
Input values spec ::= Ival[","spec]
Ival ::= name":"file
--iterations=N Run at least N inference iterations (default = 10)
--warmUp=N Run for N milliseconds to warmup before measuring performance (default = 200)
--duration=N Run performance measurements for at least N seconds wallclock time (default = 3)
If -1 is specified, inference will keep running unless stopped manually
--sleepTime=N Delay inference start with a gap of N milliseconds between launch and compute (default = 0)
--idleTime=N Sleep N milliseconds between two continuous iterations(default = 0)
--infStreams=N Instantiate N engines to run inference concurrently (default = 1)
--exposeDMA Serialize DMA transfers to and from device (default = disabled).
--noDataTransfers Disable DMA transfers to and from device (default = enabled).
--useManagedMemory Use managed memory instead of separate host and device allocations (default = disabled).
--useSpinWait Actively synchronize on GPU events. This option may decrease synchronization time but increase CPU usage and power (default = disabled)
--threads Enable multithreading to drive engines with independent threads or speed up refitting (default = disabled)
--useCudaGraph Use CUDA graph to capture engine execution and then launch inference (default = disabled).
This flag may be ignored if the graph capture fails.
--timeDeserialize Time the amount of time it takes to deserialize the network and exit.
--timeRefit Time the amount of time it takes to refit the engine before inference.
--separateProfileRun Do not attach the profiler in the benchmark run; if profiling is enabled, a second profile run will be executed (default = disabled)
--skipInference Exit after the engine has been built and skip inference perf measurement (default = disabled)
--persistentCacheRatio Set the persistentCacheLimit in ratio, 0.5 represent half of max persistent L2 size (default = 0)
=== Build and Inference Batch Options ===
When using implicit batch, the max batch size of the engine, if not given,
is set to the inference batch size;
when using explicit batch, if shapes are specified only for inference, they
will be used also as min/opt/max in the build profile; if shapes are
specified only for the build, the opt shapes will be used also for inference;
if both are specified, they must be compatible; and if explicit batch is
enabled but neither is specified, the model must provide complete static
dimensions, including batch size, for all inputs
Using ONNX models automatically forces explicit batch.
=== Reporting Options ===
--verbose Use verbose logging (default = false)
--avgRuns=N Report performance measurements averaged over N consecutive iterations (default = 10)
--percentile=P1,P2,P3,... Report performance for the P1,P2,P3,... percentages (0<=P_i<=100, 0 representing max perf, and 100 representing min perf; (default = 90,95,99%)
--dumpRefit Print the refittable layers and weights from a refittable engine
--dumpOutput Print the output tensor(s) of the last inference iteration (default = disabled)
--dumpRawBindingsToFile Print the input/output tensor(s) of the last inference iteration to file(default = disabled)
--dumpProfile Print profile information per layer (default = disabled)
--dumpLayerInfo Print layer information of the engine to console (default = disabled)
--exportTimes=<file> Write the timing results in a json file (default = disabled)
--exportOutput=<file> Write the output tensors to a json file (default = disabled)
--exportProfile=<file> Write the profile information per layer in a json file (default = disabled)
--exportLayerInfo=<file> Write the layer information of the engine in a json file (default = disabled)
=== System Options ===
--device=N Select cuda device N (default = 0)
--useDLACore=N Select DLA core N for layers that support DLA (default = none)
--staticPlugins Plugin library (.so) to load statically (can be specified multiple times)
--dynamicPlugins Plugin library (.so) to load dynamically and may be serialized with the engine if they are included in --setPluginsToSerialize (can be specified multiple times)
--setPluginsToSerialize Plugin library (.so) to be serialized with the engine (can be specified multiple times)
--ignoreParsedPluginLibs By default, when building a version-compatible engine, plugin libraries specified by the ONNX parser
are implicitly serialized with the engine (unless --excludeLeanRuntime is specified) and loaded dynamically.
Enable this flag to ignore these plugin libraries instead.
=== Help ===
--help, -h Print this message
&&&& FAILED TensorRT.trtexec [TensorRT v8601] # /usr/src/tensorrt/bin/trtexec --version
Let's focus on the actual issue: what exactly fails when you convert the ONNX model with trtexec? If you use trtexec, add the
--verbose
flag, redirect the output to a text file on disk (e.g. trtexec --onnx=model.onnx --verbose 2>&1 | tee trtexec_verbose.log), and upload that file here.
@yjiangling Is there any code that will generate the calibration.cache file?
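For reference, a calibration cache is normally produced as a side effect of one INT8 calibration pass over real data. Below is a minimal sketch (not code from this thread; the single input, the float32 dtype, and the file names are all hypothetical) of a TensorRT Python calibrator whose write_calibration_cache writes calibration.cache, which can later be reused with trtexec:

import numpy as np
import pycuda.autoinit  # noqa: F401 -- creates the CUDA context used by the copies below
import pycuda.driver as cuda
import tensorrt as trt

class CacheWritingCalibrator(trt.IInt8EntropyCalibrator2):
    # Feeds a few real batches to TensorRT once and writes calibration.cache.
    # Assumes a single network input; multi-input models need one pointer per name.
    def __init__(self, batches, cache_file="calibration.cache"):
        super().__init__()
        self.batches = iter(batches)   # e.g. a list of numpy arrays, one per batch
        self.cache_file = cache_file
        self.device_mem = None

    def get_batch_size(self):
        return 1

    def get_batch(self, names):
        try:
            # dtype assumed float32 here purely for illustration
            batch = np.ascontiguousarray(next(self.batches), dtype=np.float32)
        except StopIteration:
            return None                # no more data -> calibration finishes
        if self.device_mem is None:
            self.device_mem = cuda.mem_alloc(batch.nbytes)
        cuda.memcpy_htod(self.device_mem, batch)
        return [int(self.device_mem)]

    def read_calibration_cache(self):
        try:
            with open(self.cache_file, "rb") as f:
                return f.read()
        except FileNotFoundError:
            return None

    def write_calibration_cache(self, cache):
        with open(self.cache_file, "wb") as f:
            f.write(cache)             # this file is what trtexec --calib= expects

During one engine build, attach it with config.set_flag(trt.BuilderFlag.INT8) and config.int8_calibrator = CacheWritingCalibrator(batches); once that build finishes, calibration.cache is on disk and can be reused with trtexec --int8 --calib=calibration.cache.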
Description
Hi, I have an ONNX model that I want to convert using trtexec:
It seems the ONNX model is currently not supported because of those datatypes. How can I convert the model (which tool and which settings) so that TensorRT can use it?
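One possible route, since TensorRT has no string tensor type, is to strip the string-typed pieces from the graph before conversion and keep only the numeric output. A minimal sketch with onnx-graphsurgeon, assuming (as discussed earlier in the thread) that the string output is named label and that the unsupported classes initializer only feeds it; the names may differ in your export:

import onnx
import onnx_graphsurgeon as gs

graph = gs.import_onnx(onnx.load("model.onnx"))

# Drop the string-typed graph output (hypothetically named "label") so that the
# nodes and initializers (e.g. "classes") that only feed it become dead code.
graph.outputs = [out for out in graph.outputs if out.name != "label"]

# cleanup() prunes nodes and initializers that no longer contribute to any output.
graph.cleanup().toposort()

onnx.save(gs.export_onnx(graph), "model_no_strings.onnx")

The resulting model_no_strings.onnx should only expose the numeric output(s), which the ONNX parser can handle; mapping the numeric prediction back to a string label would then have to happen outside TensorRT.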
Environment
TensorRT Version: 8.6.1
NVIDIA GPU: Nvidia T4
NVIDIA Driver Version:
CUDA Version: 12.x
CUDNN Version:
Operating System: Ubuntu 20.04
Python Version (if applicable):
Tensorflow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if so, version):
Relevant Files
Model link:
Steps To Reproduce
pip list:
Commands or scripts: trtexec output:
trtexec output with verbose:
Have you tried the latest release?:
Why does polygraphy want to load a CUDA 11.x library? I run this inside the nvcr.io/nvidia/pytorch:24.03-py3 Docker image.
Can this model run on other frameworks? For example, run the ONNX model with ONNXRuntime (
polygraphy run <model.onnx> --onnxrt
).
cc @pranavm-nvidia @sachanub
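On the libcublasLt.so.11 error: that message typically means the onnxruntime-gpu build inside the container was linked against CUDA 11.x, not that polygraphy itself requires CUDA 11. A quick sanity check (a generic sketch, nothing project-specific) of what the installed onnxruntime can actually load:

import onnxruntime as ort

print(ort.__version__)
# If CUDAExecutionProvider is missing from this list, the installed wheel cannot
# find the CUDA libraries it was built against (e.g. a CUDA 11.x wheel on a CUDA 12 image).
print(ort.get_available_providers())

If the CUDA provider is unavailable, installing an onnxruntime-gpu build that matches the container's CUDA version (or switching to a container that ships a matching onnxruntime) is the usual fix.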