NVIDIA / TensorRT

NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.
https://developer.nvidia.com/tensorrt
Apache License 2.0

trtexec segfault on AGX 64 only #1995

Closed joihn closed 2 years ago

joihn commented 2 years ago

I'm converting an ONNX model to a `.engine` file with `trtexec --fp16 --onnx=/home/maxime/model.onnx --saveEngine=out.engine`.

This command used to work on an NVIDIA Xavier AGX 32 GB with JetPack 4.6. However, I recently upgraded to an NVIDIA Xavier AGX 64 GB and now get the following segfault (tested on both JetPack 4.6 and 4.6.1):

trtexec  --fp16  --onnx=/home/maxime/fp16-weights-291-0.89.onnx --saveEngine=OUTI.engine
&&&& RUNNING TensorRT.trtexec [TensorRT v8201] # /usr/src/tensorrt/bin/trtexec --fp16 --onnx=/home/maxime/model.onnx --saveEngine=OUTI.engine
[05/20/2022-15:18:47] [I] === Model Options ===
[05/20/2022-15:18:47] [I] Format: ONNX
[05/20/2022-15:18:47] [I] Model: /home/maxime/fp16-weights-291-0.89.onnx
[05/20/2022-15:18:47] [I] Output:
[05/20/2022-15:18:47] [I] === Build Options ===
[05/20/2022-15:18:47] [I] Max batch: explicit batch
[05/20/2022-15:18:47] [I] Workspace: 16 MiB
[05/20/2022-15:18:47] [I] minTiming: 1
[05/20/2022-15:18:47] [I] avgTiming: 8
[05/20/2022-15:18:47] [I] Precision: FP32+FP16
[05/20/2022-15:18:47] [I] Calibration: 
[05/20/2022-15:18:47] [I] Refit: Disabled
[05/20/2022-15:18:47] [I] Sparsity: Disabled
[05/20/2022-15:18:47] [I] Safe mode: Disabled
[05/20/2022-15:18:47] [I] DirectIO mode: Disabled
[05/20/2022-15:18:47] [I] Restricted mode: Disabled
[05/20/2022-15:18:47] [I] Save engine: OUTI.engine
[05/20/2022-15:18:47] [I] Load engine: 
[05/20/2022-15:18:47] [I] Profiling verbosity: 0
[05/20/2022-15:18:47] [I] Tactic sources: Using default tactic sources
[05/20/2022-15:18:47] [I] timingCacheMode: local
[05/20/2022-15:18:47] [I] timingCacheFile: 
[05/20/2022-15:18:47] [I] Input(s)s format: fp32:CHW
[05/20/2022-15:18:47] [I] Output(s)s format: fp32:CHW
[05/20/2022-15:18:47] [I] Input build shapes: model
[05/20/2022-15:18:47] [I] Input calibration shapes: model
[05/20/2022-15:18:47] [I] === System Options ===
[05/20/2022-15:18:47] [I] Device: 0
[05/20/2022-15:18:47] [I] DLACore: 
[05/20/2022-15:18:47] [I] Plugins:
[05/20/2022-15:18:47] [I] === Inference Options ===
[05/20/2022-15:18:47] [I] Batch: Explicit
[05/20/2022-15:18:47] [I] Input inference shapes: model
[05/20/2022-15:18:47] [I] Iterations: 10
[05/20/2022-15:18:47] [I] Duration: 3s (+ 200ms warm up)
[05/20/2022-15:18:47] [I] Sleep time: 0ms
[05/20/2022-15:18:47] [I] Idle time: 0ms
[05/20/2022-15:18:47] [I] Streams: 1
[05/20/2022-15:18:47] [I] ExposeDMA: Disabled
[05/20/2022-15:18:47] [I] Data transfers: Enabled
[05/20/2022-15:18:47] [I] Spin-wait: Disabled
[05/20/2022-15:18:47] [I] Multithreading: Disabled
[05/20/2022-15:18:47] [I] CUDA Graph: Disabled
[05/20/2022-15:18:47] [I] Separate profiling: Disabled
[05/20/2022-15:18:47] [I] Time Deserialize: Disabled
[05/20/2022-15:18:47] [I] Time Refit: Disabled
[05/20/2022-15:18:47] [I] Skip inference: Disabled
[05/20/2022-15:18:47] [I] Inputs:
[05/20/2022-15:18:47] [I] === Reporting Options ===
[05/20/2022-15:18:47] [I] Verbose: Disabled
[05/20/2022-15:18:47] [I] Averages: 10 inferences
[05/20/2022-15:18:47] [I] Percentile: 99
[05/20/2022-15:18:47] [I] Dump refittable layers:Disabled
[05/20/2022-15:18:47] [I] Dump output: Disabled
[05/20/2022-15:18:47] [I] Profile: Disabled
[05/20/2022-15:18:47] [I] Export timing to JSON file: 
[05/20/2022-15:18:47] [I] Export output to JSON file: 
[05/20/2022-15:18:47] [I] Export profile to JSON file: 
[05/20/2022-15:18:47] [I] 
Segmentation fault (core dumped)

Environment

TensorRT Version: v8201
CUDA Version: 10.2
Operating System: JetPack 4.6.1

Hardware

NVIDIA GPU: Xavier AGX 64 GB with a CTI carrier board (ref AGX111)

zerollzeng commented 2 years ago

This is due to a missing target in the TRT build; it will be fixed in JetPack 5.0.

joihn commented 2 years ago

I suspected the carrier board's software patch to be partly responsible, so I tried another carrier board I had (salvaged from a Xavier 32 GB devkit, reference 945-82972-0045-0000) and reflashed everything with JetPack 4.6.1 without any issue.

However, when executing `trtexec --fp16 --onnx=/home/maxime/model.onnx --saveEngine=out.engine` I get the following error (different from the previous one):

[05/24/2022-03:19:03] [E] Error[2]: [utils.cpp::checkMemLimit::380] Error Code 2: Internal Error (Assertion upperBound != 0 failed. Unknown embedded device detected. Please update the table with the entry: {{1794, 8, 64}, 51309},)
[05/24/2022-03:19:03] [E] Error[2]: [builder.cpp::buildSerializedNetwork::609] Error Code 2: Internal Error (Assertion enginePtr != nullptr failed. )
[05/24/2022-03:19:03] [E] Engine could not be created from network
[05/24/2022-03:19:03] [E] Building engine failed
[05/24/2022-03:19:04] [E] Failed to create engine from model.
[05/24/2022-03:19:04] [E] Engine set up failed

It indeed looks like a missing target in the TRT build.

a) If I understand correctly, the whole TensorRT suite is incompatible with the Xavier AGX 64 GB, and there is no plan to support it on JetPack 4.6.x? That is very surprising, since this NVIDIA article claims the opposite.

b) Is there any workaround for JetPack 4.6.x?

nvpohanh commented 2 years ago

The OS does support the AGX 64 GB, but TRT was not tested on that device, so we didn't catch this failure. As Zero said, this will be fixed in JP5.0; for JP4.6.x, I think the only workaround is to somehow limit the memory to 32 GB so that the target-checking logic in TRT doesn't fail.
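One untested way to cap usable RAM on a Jetson (an assumption on my part, not an NVIDIA-documented workaround) is the kernel `mem=` boot parameter in `/boot/extlinux/extlinux.conf`, so the OS, and anything that keys off physical memory, sees only 32 GB:

```
# /boot/extlinux/extlinux.conf (illustrative fragment; back up the file first)
LABEL primary
      MENU LABEL primary kernel
      LINUX /boot/Image
      INITRD /boot/initrd
      # mem=32G caps the RAM visible to the kernel (assumption: this is
      # enough to satisfy TRT's 32 GB device-table entry)
      APPEND ${cbootargs} quiet mem=32G
```

A reboot is required for the change to take effect, and half the board's memory is lost while the cap is in place.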

nvpohanh commented 2 years ago

In the latest TRT, we have changed that part of the logic so that it still works even if the device is not in the pre-defined list. So if new devices come out in the future, TRT will not fail at this point again.
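The fix described here presumably replaces a hard assertion on a device lookup table with a fallback default. A minimal sketch of that pattern in Python (all names and values are hypothetical; this is not the actual TRT source):

```python
# Hypothetical sketch of the memory-limit lookup behind the
# "Unknown embedded device detected" error. Keys mimic the
# (chip_id, revision, mem_gb) tuple from the error message; values are
# an upper memory bound in MiB. All entries are illustrative.
KNOWN_DEVICE_LIMITS = {
    (1794, 8, 32): 25655,   # e.g. a 32 GB device (illustrative value)
}

DEFAULT_FRACTION = 0.8  # fall back to a fraction of physical memory


def mem_upper_bound(device_key, physical_mib):
    """Old behavior: assert device_key is in the table (fails on unknown
    devices). New behavior: derive a conservative bound instead."""
    limit = KNOWN_DEVICE_LIMITS.get(device_key)
    if limit is None:
        # Unknown embedded device: compute a default instead of asserting.
        limit = int(physical_mib * DEFAULT_FRACTION)
    return limit


print(mem_upper_bound((1794, 8, 64), 64000))  # unknown key -> 51200
```

The key change is `dict.get` plus a computed default in place of an assertion, so an unlisted device degrades to a conservative limit rather than aborting the build.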

nvpohanh commented 2 years ago

Closing due to >14 days without activity. Please feel free to reopen if the issue still exists. Thanks

koiking213 commented 2 years ago

I also faced this issue on a Jetson AGX Xavier 64 GB model with JetPack 4.6.2. There is currently no JetPack 5.x-compatible BSP for the carrier board I'm using.

@nvpohanh

> I think the only workaround is to somehow limit the memory to 32GB so that the target checking logic in TRT doesn't fail.

Do you have any idea how to do this? I tried limiting kernel memory at `docker run` time by adding `--kernel-memory=32gb`, but it did not work.
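One plausible reason `--kernel-memory` has no effect: it only sets a cgroup limit for the container, while the device detection presumably keys off the board's total physical RAM (which is what produced the `{1794, 8, 64}` entry in the error message). A quick Linux-only check of what total-memory value the platform actually reports, regardless of container limits (assuming, hypothetically, that the detection reads something equivalent to `/proc/meminfo`):

```python
# Read total physical memory from /proc/meminfo. cgroup limits such as
# docker --kernel-memory do not change this value, so any check based on
# physical RAM would still see the full 64 GB inside the container.
def total_mem_kib(path="/proc/meminfo"):
    with open(path) as f:
        for line in f:
            if line.startswith("MemTotal:"):
                return int(line.split()[1])  # MemTotal is reported in KiB
    raise RuntimeError("MemTotal not found in " + path)


print(total_mem_kib() // (1024 * 1024), "GiB of physical RAM visible")
```

If this prints 64 GiB inside the container despite the `--kernel-memory` flag, the cgroup limit is not reaching whatever the target check reads, which would explain why the workaround fails.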