Nik-V9 closed this issue 1 year ago.
Usually it's caused by some bad DLA layers. I'm not an expert in Torch-TensorRT; could you please export the model to ONNX and try it with trtexec (I think it should be something like /usr/src/tensorrt/bin/trtexec --onnx=model.onnx --fp16 --useDLACore=0 --allowGPUFallback)? Thanks!
Hi, thanks for the follow-up!
I converted the PyTorch model to ONNX and tried building the engine with trtexec. I got the same error as with Torch-TensorRT.
trtexec --onnx=model_b_1.onnx --fp16 --useDLACore=0 --allowGPUFallback --saveEngine=model_onnx_dla_b_1.engine
The ONNX model is available here: https://drive.google.com/file/d/1uBvJJ8DTDEVRsNTUX8zDtAp9T4REQbwZ/view?usp=share_link
trtexec log:
&&&& RUNNING TensorRT.trtexec [TensorRT v8400] # /usr/src/tensorrt/bin/trtexec --onnx=model_b_1.onnx --fp16 --useDLACore=0 --allowGPUFallback --saveEngine=model_onnx_dla_b_1.engine
[10/31/2022-09:57:32] [I] === Model Options ===
[10/31/2022-09:57:32] [I] Format: ONNX
[10/31/2022-09:57:32] [I] Model: model_b_1.onnx
[10/31/2022-09:57:32] [I] Output:
[10/31/2022-09:57:32] [I] === Build Options ===
[10/31/2022-09:57:32] [I] Max batch: explicit batch
[10/31/2022-09:57:32] [I] Memory Pools: workspace: default, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default
[10/31/2022-09:57:32] [I] minTiming: 1
[10/31/2022-09:57:32] [I] avgTiming: 8
[10/31/2022-09:57:32] [I] Precision: FP32+FP16
[10/31/2022-09:57:32] [I] LayerPrecisions:
[10/31/2022-09:57:32] [I] Calibration:
[10/31/2022-09:57:32] [I] Refit: Disabled
[10/31/2022-09:57:32] [I] Sparsity: Disabled
[10/31/2022-09:57:32] [I] Safe mode: Disabled
[10/31/2022-09:57:32] [I] DirectIO mode: Disabled
[10/31/2022-09:57:32] [I] Restricted mode: Disabled
[10/31/2022-09:57:32] [I] Build only: Disabled
[10/31/2022-09:57:32] [I] Save engine: model_onnx_dla_b_1.engine
[10/31/2022-09:57:32] [I] Load engine:
[10/31/2022-09:57:32] [I] Profiling verbosity: 0
[10/31/2022-09:57:32] [I] Tactic sources: Using default tactic sources
[10/31/2022-09:57:32] [I] timingCacheMode: local
[10/31/2022-09:57:32] [I] timingCacheFile:
[10/31/2022-09:57:32] [I] Input(s)s format: fp32:CHW
[10/31/2022-09:57:32] [I] Output(s)s format: fp32:CHW
[10/31/2022-09:57:32] [I] Input build shapes: model
[10/31/2022-09:57:32] [I] Input calibration shapes: model
[10/31/2022-09:57:32] [I] === System Options ===
[10/31/2022-09:57:32] [I] Device: 0
[10/31/2022-09:57:32] [I] DLACore: 0(With GPU fallback)
[10/31/2022-09:57:32] [I] Plugins:
[10/31/2022-09:57:32] [I] === Inference Options ===
[10/31/2022-09:57:32] [I] Batch: Explicit
[10/31/2022-09:57:32] [I] Input inference shapes: model
[10/31/2022-09:57:32] [I] Iterations: 10
[10/31/2022-09:57:32] [I] Duration: 3s (+ 200ms warm up)
[10/31/2022-09:57:32] [I] Sleep time: 0ms
[10/31/2022-09:57:32] [I] Idle time: 0ms
[10/31/2022-09:57:32] [I] Streams: 1
[10/31/2022-09:57:32] [I] ExposeDMA: Disabled
[10/31/2022-09:57:32] [I] Data transfers: Enabled
[10/31/2022-09:57:32] [I] Spin-wait: Disabled
[10/31/2022-09:57:32] [I] Multithreading: Disabled
[10/31/2022-09:57:32] [I] CUDA Graph: Disabled
[10/31/2022-09:57:32] [I] Separate profiling: Disabled
[10/31/2022-09:57:32] [I] Time Deserialize: Disabled
[10/31/2022-09:57:32] [I] Time Refit: Disabled
[10/31/2022-09:57:32] [I] Inputs:
[10/31/2022-09:57:32] [I] === Reporting Options ===
[10/31/2022-09:57:32] [I] Verbose: Disabled
[10/31/2022-09:57:32] [I] Averages: 10 inferences
[10/31/2022-09:57:32] [I] Percentile: 99
[10/31/2022-09:57:32] [I] Dump refittable layers:Disabled
[10/31/2022-09:57:32] [I] Dump output: Disabled
[10/31/2022-09:57:32] [I] Profile: Disabled
[10/31/2022-09:57:32] [I] Export timing to JSON file:
[10/31/2022-09:57:32] [I] Export output to JSON file:
[10/31/2022-09:57:32] [I] Export profile to JSON file:
[10/31/2022-09:57:32] [I]
[10/31/2022-09:57:32] [I] === Device Information ===
[10/31/2022-09:57:32] [I] Selected Device: Orin
[10/31/2022-09:57:32] [I] Compute Capability: 8.7
[10/31/2022-09:57:32] [I] SMs: 16
[10/31/2022-09:57:32] [I] Compute Clock Rate: 1.3 GHz
[10/31/2022-09:57:32] [I] Device Global Memory: 30622 MiB
[10/31/2022-09:57:32] [I] Shared Memory per SM: 164 KiB
[10/31/2022-09:57:32] [I] Memory Bus Width: 128 bits (ECC disabled)
[10/31/2022-09:57:32] [I] Memory Clock Rate: 1.3 GHz
[10/31/2022-09:57:32] [I]
[10/31/2022-09:57:32] [I] TensorRT version: 8.4.0
[10/31/2022-09:57:34] [I] [TRT] [MemUsageChange] Init CUDA: CPU +302, GPU +0, now: CPU 327, GPU 9849 (MiB)
[10/31/2022-09:57:37] [I] [TRT] [MemUsageChange] Init builder kernel library: CPU +403, GPU +378, now: CPU 749, GPU 10245 (MiB)
[10/31/2022-09:57:37] [I] Start parsing network model
[10/31/2022-09:57:37] [I] [TRT] ----------------------------------------------------------------
[10/31/2022-09:57:37] [I] [TRT] Input filename: model_b_1.onnx
[10/31/2022-09:57:37] [I] [TRT] ONNX IR version: 0.0.6
[10/31/2022-09:57:37] [I] [TRT] Opset version: 11
[10/31/2022-09:57:37] [I] [TRT] Producer name: pytorch
[10/31/2022-09:57:37] [I] [TRT] Producer version: 1.11.0
[10/31/2022-09:57:37] [I] [TRT] Domain:
[10/31/2022-09:57:37] [I] [TRT] Model version: 0
[10/31/2022-09:57:37] [I] [TRT] Doc string:
[10/31/2022-09:57:37] [I] [TRT] ----------------------------------------------------------------
[10/31/2022-09:57:37] [W] [TRT] onnx2trt_utils.cpp:363: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[10/31/2022-09:57:37] [I] Finish parsing network model
[10/31/2022-09:57:37] [W] Dynamic dimensions required for input: input_0, but no shapes were provided. Automatically overriding shape to: 1x2x2048x2560
[10/31/2022-09:57:37] [W] [TRT] DLA LAYER: CBUF size requirement for layer Conv_33 exceeds the limit.
[10/31/2022-09:57:37] [W] [TRT] Default DLA is enabled but layer Conv_33 is not supported on DLA, falling back to GPU.
[10/31/2022-09:57:37] [W] [TRT] DLA LAYER: CBUF size requirement for layer Conv_35 exceeds the limit.
[10/31/2022-09:57:37] [W] [TRT] Default DLA is enabled but layer Conv_35 is not supported on DLA, falling back to GPU.
[10/31/2022-09:57:37] [W] [TRT] Default DLA is enabled but layer (Unnamed Layer* 774) [Identity] is not supported on DLA, falling back to GPU.
[10/31/2022-10:04:05] [I] [TRT] ---------- Layers Running on DLA ----------
[10/31/2022-10:04:05] [I] [TRT] [DlaLayer] {ForeignNode[Conv_0...Relu_32]}
[10/31/2022-10:04:05] [I] [TRT] [DlaLayer] {ForeignNode[Relu_34...Relu_769]}
[10/31/2022-10:04:05] [I] [TRT] [DlaLayer] {ForeignNode[Conv_770...Conv_820]}
[10/31/2022-10:04:05] [I] [TRT] ---------- Layers Running on GPU ----------
[10/31/2022-10:04:05] [I] [TRT] [GpuLayer] CONVOLUTION: Conv_33
[10/31/2022-10:04:05] [I] [TRT] [GpuLayer] CONVOLUTION: Conv_35
[10/31/2022-10:04:05] [I] [TRT] [GpuLayer] CAST: (Unnamed Layer* 774) [Identity]
[10/31/2022-10:04:06] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +534, GPU +121, now: CPU 1406, GPU 10889 (MiB)
[10/31/2022-10:04:06] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +84, GPU +78, now: CPU 1490, GPU 10967 (MiB)
[10/31/2022-10:04:06] [I] [TRT] Local timing cache in use. Profiling results in this builder pass will not be stored.
[10/31/2022-10:06:55] [E] Error[1]: [nvdlaUtils.cpp::submit::198] Error Code 1: DLA (Failure to submit program to DLA engine.)
[10/31/2022-10:06:55] [E] Error[2]: [builder.cpp::buildSerializedNetwork::620] Error Code 2: Internal Error (Assertion engine != nullptr failed. )
[10/31/2022-10:06:55] [E] Engine could not be created from network
[10/31/2022-10:06:55] [E] Building engine failed
[10/31/2022-10:06:55] [E] Failed to create engine from model or file.
[10/31/2022-10:06:55] [E] Engine set up failed
&&&& FAILED TensorRT.trtexec [TensorRT v8400] # /usr/src/tensorrt/bin/trtexec --onnx=model_b_1.onnx --fp16 --useDLACore=0 --allowGPUFallback --saveEngine=model_onnx_dla_b_1.engine
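Note the warning in the log above: the model has dynamic dimensions and trtexec auto-overrode the shape to 1x2x2048x2560. To rule that out as a variable, the shape can be pinned explicitly with --shapes (the tensor name input_0 is taken from the warning itself); this is a sketch of the invocation, not a command from the thread:

```shell
# Same build, but with the input shape pinned explicitly.
/usr/src/tensorrt/bin/trtexec --onnx=model_b_1.onnx --fp16 \
    --useDLACore=0 --allowGPUFallback \
    --shapes=input_0:1x2x2048x2560 \
    --saveEngine=model_onnx_dla_b_1.engine
```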
[11/02/2022-11:46:03] [I] === Performance summary ===
[11/02/2022-11:46:03] [I] Throughput: 0.406047 qps
[11/02/2022-11:46:03] [I] Latency: min = 2240.97 ms, max = 2241.41 ms, mean = 2241.14 ms, median = 2241.13 ms, percentile(90%) = 2241.2 ms, percentile(95%) = 2241.41 ms, percentile(99%) = 2241.41 ms
[11/02/2022-11:46:03] [I] Enqueue Time: min = 0.967136 ms, max = 1.2666 ms, mean = 1.14105 ms, median = 1.13232 ms, percentile(90%) = 1.24414 ms, percentile(95%) = 1.2666 ms, percentile(99%) = 1.2666 ms
[11/02/2022-11:46:03] [I] H2D Latency: min = 1.78027 ms, max = 1.9176 ms, mean = 1.80922 ms, median = 1.80225 ms, percentile(90%) = 1.80664 ms, percentile(95%) = 1.9176 ms, percentile(99%) = 1.9176 ms
[11/02/2022-11:46:03] [I] GPU Compute Time: min = 2239.08 ms, max = 2239.32 ms, mean = 2239.16 ms, median = 2239.15 ms, percentile(90%) = 2239.22 ms, percentile(95%) = 2239.32 ms, percentile(99%) = 2239.32 ms
[11/02/2022-11:46:03] [I] D2H Latency: min = 0.101562 ms, max = 0.177734 ms, mean = 0.16792 ms, median = 0.174805 ms, percentile(90%) = 0.177734 ms, percentile(95%) = 0.177734 ms, percentile(99%) = 0.177734 ms
[11/02/2022-11:46:03] [I] Total Host Walltime: 24.6277 s
[11/02/2022-11:46:03] [I] Total GPU Compute Time: 22.3916 s
[11/02/2022-11:46:03] [I] Explanations of the performance metrics are printed in the verbose logs.
[11/02/2022-11:46:03] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v8501] # trtexec --onnx=model_b_1.onnx --fp16 --useDLACore=0 --allowGPUFallback
In my test (log above) the issue is fixed in the latest TRT 8.5. Can you wait for the new JetPack release that contains it? Thanks!
Oh, thanks for the response! When can I expect the new JetPack with TRT 8.5 to be released?
The next JetPack 5.1, I think, but I'm not sure about the exact release date. Please stay tuned :-)
Description
Building a DLA engine using Torch-TensorRT failed on the Orin with the error shown in the trtexec log above, while trying to build an HRNet-W32 model on DLA with an input shape of [1, 2, 2048, 2560].
Environment
TensorRT Version: 8.4.0-1+cuda11.4
JetPack Version: 5.0.1dp
NVIDIA GPU: Orin
NVIDIA Driver Version:
CUDA Version: 11.4
CUDNN Version:
Operating System: Ubuntu 20.04.5 LTS
Relevant Files
The Code is available in this repo: https://github.com/castacks/daa_profile
Compiling the TRT Engine on DLA here: https://github.com/castacks/daa_profile/blob/b7c592788671101821cad9a01fc27539f72e22e5/model.py#L103
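For readers without access to the repo, the DLA compile at the link above presumably looks something like the following sketch. The argument names follow the public torch_tensorrt.compile API; the stand-in module and the exact combination of options are illustrative, so check model.py#L103 for the actual call (this also requires a Jetson with torch_tensorrt installed):

```python
import torch
import torch.nn as nn
import torch_tensorrt

# Stand-in module for illustration; the real model is HRNet-W32.
model = nn.Sequential(nn.Conv2d(2, 8, kernel_size=3, padding=1), nn.ReLU())

# "dla:0" targets DLA core 0; allow_gpu_fallback mirrors trtexec's
# --allowGPUFallback, letting unsupported layers run on the GPU.
trt_module = torch_tensorrt.compile(
    model.eval().cuda(),
    inputs=[torch_tensorrt.Input(shape=(1, 2, 2048, 2560))],
    enabled_precisions={torch.half},
    device=torch_tensorrt.Device("dla:0", allow_gpu_fallback=True),
)
```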
Steps To Reproduce
python3 profile.py profile.yaml