NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.

DLA Engine Build Failed on Orin #2426

Closed: Nik-V9 closed this issue 1 year ago

Nik-V9 commented 1 year ago


Building a DLA engine with Torch-TensorRT failed on Orin with the following result:

ERROR: [Torch-TensorRT TorchScript Conversion Context] - 1: [nvdlaUtils.cpp::submit::198] Error Code 1: DLA (Failure to submit program to DLA engine.)
ERROR: [Torch-TensorRT TorchScript Conversion Context] - 2: [builder.cpp::buildSerializedNetwork::620] Error Code 2: Internal Error (Assertion engine != nullptr failed. )
RuntimeError: [Error thrown at core/conversion/conversionctx/ConversionCtx.cpp:147] Building serialized network failed in TensorRT

I am trying to build the HRNet-W32 model on DLA with an input shape of [1, 2, 2048, 2560].


TensorRT Version: 8.4.0-1+cuda11.4
JetPack Version: 5.0.1dp
NVIDIA GPU: Orin
NVIDIA Driver Version:
CUDA Version: 11.4
CUDNN Version:
Operating System: Ubuntu 20.04.5 LTS

Relevant Files

The code is available in this repo: https://github.com/castacks/daa_profile

The TRT engine is compiled for DLA here: https://github.com/castacks/daa_profile/blob/b7c592788671101821cad9a01fc27539f72e22e5/model.py#L103

Steps To Reproduce

Running exp profile.yaml
Loaded config:
{'input_dir': '/home/airlab/DAA/extrinsic', 'num_cam': 1, 'trt_batch_size': 1, 'width': 2448, 'height': 2048, 'padding': 56, 'use_tensorrt': True, 'use_torch2trt': False, 'use_dla': True, 'input_frames': 2, 'models_dir': '/home/airlab/DAA/models', 'full_res_model_chkpt': '120_hrnet32_all_2220.pth', 'fp16_mode': True, 'int8_mode': False}
Compiling TensorRT detector model ...
WARNING: [Torch-TensorRT] - Setting GPU id to 0 for device because device 0 manages DLA on Xavier
WARNING: [Torch-TensorRT TorchScript Conversion Context] - DLA LAYER: CBUF size requirement for layer %input.87 : Tensor = aten::_convolution(%5626, %self.base_model.transition1.0.0.weight, %4054, %7, %7, %7, %10, %6, %11, %9, %10, %9, %9), scope: __module.base_model/__module.base_model.transition1.0/__module.base_model.transition1.0.0 # /home/airlab/.local/lib/python3.8/site-packages/torch/nn/modules/conv.py:443:0 exceeds the limit.
WARNING: [Torch-TensorRT TorchScript Conversion Context] - Default DLA is enabled but layer %input.87 : Tensor = aten::_convolution(%5626, %self.base_model.transition1.0.0.weight, %4054, %7, %7, %7, %10, %6, %11, %9, %10, %9, %9), scope: __module.base_model/__module.base_model.transition1.0/__module.base_model.transition1.0.0 # /home/airlab/.local/lib/python3.8/site-packages/torch/nn/modules/conv.py:443:0 is not supported on DLA, falling back to GPU.
WARNING: [Torch-TensorRT TorchScript Conversion Context] - DLA LAYER: CBUF size requirement for layer %input.91 : Tensor = aten::_convolution(%5626, %self.base_model.transition1.1.0.0.weight, %4054, %4055, %7, %7, %10, %6, %11, %9, %10, %9, %9), scope: __module.base_model/__module.base_model.transition1.1/__module.base_model.transition1.1.0/__module.base_model.transition1.1.0.0 # /home/airlab/.local/lib/python3.8/site-packages/torch/nn/modules/conv.py:443:0 exceeds the limit.
WARNING: [Torch-TensorRT TorchScript Conversion Context] - Default DLA is enabled but layer %input.91 : Tensor = aten::_convolution(%5626, %self.base_model.transition1.1.0.0.weight, %4054, %4055, %7, %7, %10, %6, %11, %9, %10, %9, %9), scope: __module.base_model/__module.base_model.transition1.1/__module.base_model.transition1.1.0/__module.base_model.transition1.1.0.0 # /home/airlab/.local/lib/python3.8/site-packages/torch/nn/modules/conv.py:443:0 is not supported on DLA, falling back to GPU.
ERROR: [Torch-TensorRT TorchScript Conversion Context] - 1: [nvdlaUtils.cpp::submit::198] Error Code 1: DLA (Failure to submit program to DLA engine.)
ERROR: [Torch-TensorRT TorchScript Conversion Context] - 2: [builder.cpp::buildSerializedNetwork::620] Error Code 2: Internal Error (Assertion engine != nullptr failed. )
Traceback (most recent call last):
  File "profile.py", line 127, in <module>
  File "profile.py", line 103, in run_evaluation
    detector = SegDetector(cfg=dict(
  File "/home/airlab/DAA/daa_profile/model.py", line 103, in __init__
    trt_engine = torch_tensorrt.ts.convert_method_to_trt_engine(
  File "/home/airlab/.local/lib/python3.8/site-packages/torch_tensorrt/ts/_compiler.py", line 199, in convert_method_to_trt_engine
    return _C.convert_graph_to_trt_engine(module._c, method_name, _parse_compile_spec(compile_spec))
RuntimeError: [Error thrown at core/conversion/conversionctx/ConversionCtx.cpp:147] Building serialized network failed in TensorRT
zerollzeng commented 1 year ago

This is usually caused by layers that DLA handles badly. I'm not an expert in Torch-TRT; could you please export the model to ONNX and try it with trtexec? I think the command should look like /usr/src/tensorrt/bin/trtexec --onnx=model.onnx --fp16 --useDLACore=0 --allowGPUFallback. Thanks!
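For repeated experiments it can help to assemble that trtexec invocation in a script. The following is a minimal sketch, not part of the original thread: the helper name `build_trtexec_cmd` and its parameters are hypothetical, and the only flags used are the ones quoted in this issue (`--onnx`, `--fp16`, `--useDLACore`, `--allowGPUFallback`, `--saveEngine`).

```python
import shlex


def build_trtexec_cmd(onnx_path, dla_core=0, fp16=True, gpu_fallback=True,
                      engine_path=None,
                      trtexec="/usr/src/tensorrt/bin/trtexec"):
    """Assemble a trtexec command line as a list of arguments.

    Hypothetical helper: it only formats flags, it does not run anything.
    """
    cmd = [trtexec, f"--onnx={onnx_path}"]
    if fp16:
        cmd.append("--fp16")
    # Ask the builder to place layers on the given DLA core.
    cmd.append(f"--useDLACore={dla_core}")
    if gpu_fallback:
        # Let layers that DLA rejects run on the GPU instead.
        cmd.append("--allowGPUFallback")
    if engine_path is not None:
        cmd.append(f"--saveEngine={engine_path}")
    return cmd


if __name__ == "__main__":
    # Print the command so it can be inspected before running it.
    print(shlex.join(build_trtexec_cmd("model.onnx")))
```

The returned list can be passed to `subprocess.run(cmd)` on the Jetson itself; printing it first makes it easy to compare against the command suggested above.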

Nik-V9 commented 1 year ago

Hi, thanks for the follow-up!

I converted the PyTorch model to ONNX and tried building the engine with trtexec. I got the same error as with Torch-TensorRT.

trtexec --onnx=model_b_1.onnx --fp16 --useDLACore=0 --allowGPUFallback --saveEngine=model_onnx_dla_b_1.engine

The ONNX model is available here: https://drive.google.com/file/d/1uBvJJ8DTDEVRsNTUX8zDtAp9T4REQbwZ/view?usp=share_link


&&&& RUNNING TensorRT.trtexec [TensorRT v8400] # /usr/src/tensorrt/bin/trtexec --onnx=model_b_1.onnx --fp16 --useDLACore=0 --allowGPUFallback --saveEngine=model_onnx_dla_b_1.engine
[10/31/2022-09:57:32] [I] === Model Options ===
[10/31/2022-09:57:32] [I] Format: ONNX
[10/31/2022-09:57:32] [I] Model: model_b_1.onnx
[10/31/2022-09:57:32] [I] Output:
[10/31/2022-09:57:32] [I] === Build Options ===
[10/31/2022-09:57:32] [I] Max batch: explicit batch
[10/31/2022-09:57:32] [I] Memory Pools: workspace: default, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default
[10/31/2022-09:57:32] [I] minTiming: 1
[10/31/2022-09:57:32] [I] avgTiming: 8
[10/31/2022-09:57:32] [I] Precision: FP32+FP16
[10/31/2022-09:57:32] [I] LayerPrecisions: 
[10/31/2022-09:57:32] [I] Calibration: 
[10/31/2022-09:57:32] [I] Refit: Disabled
[10/31/2022-09:57:32] [I] Sparsity: Disabled
[10/31/2022-09:57:32] [I] Safe mode: Disabled
[10/31/2022-09:57:32] [I] DirectIO mode: Disabled
[10/31/2022-09:57:32] [I] Restricted mode: Disabled
[10/31/2022-09:57:32] [I] Build only: Disabled
[10/31/2022-09:57:32] [I] Save engine: model_onnx_dla_b_1.engine
[10/31/2022-09:57:32] [I] Load engine: 
[10/31/2022-09:57:32] [I] Profiling verbosity: 0
[10/31/2022-09:57:32] [I] Tactic sources: Using default tactic sources
[10/31/2022-09:57:32] [I] timingCacheMode: local
[10/31/2022-09:57:32] [I] timingCacheFile: 
[10/31/2022-09:57:32] [I] Input(s)s format: fp32:CHW
[10/31/2022-09:57:32] [I] Output(s)s format: fp32:CHW
[10/31/2022-09:57:32] [I] Input build shapes: model
[10/31/2022-09:57:32] [I] Input calibration shapes: model
[10/31/2022-09:57:32] [I] === System Options ===
[10/31/2022-09:57:32] [I] Device: 0
[10/31/2022-09:57:32] [I] DLACore: 0(With GPU fallback)
[10/31/2022-09:57:32] [I] Plugins:
[10/31/2022-09:57:32] [I] === Inference Options ===
[10/31/2022-09:57:32] [I] Batch: Explicit
[10/31/2022-09:57:32] [I] Input inference shapes: model
[10/31/2022-09:57:32] [I] Iterations: 10
[10/31/2022-09:57:32] [I] Duration: 3s (+ 200ms warm up)
[10/31/2022-09:57:32] [I] Sleep time: 0ms
[10/31/2022-09:57:32] [I] Idle time: 0ms
[10/31/2022-09:57:32] [I] Streams: 1
[10/31/2022-09:57:32] [I] ExposeDMA: Disabled
[10/31/2022-09:57:32] [I] Data transfers: Enabled
[10/31/2022-09:57:32] [I] Spin-wait: Disabled
[10/31/2022-09:57:32] [I] Multithreading: Disabled
[10/31/2022-09:57:32] [I] CUDA Graph: Disabled
[10/31/2022-09:57:32] [I] Separate profiling: Disabled
[10/31/2022-09:57:32] [I] Time Deserialize: Disabled
[10/31/2022-09:57:32] [I] Time Refit: Disabled
[10/31/2022-09:57:32] [I] Inputs:
[10/31/2022-09:57:32] [I] === Reporting Options ===
[10/31/2022-09:57:32] [I] Verbose: Disabled
[10/31/2022-09:57:32] [I] Averages: 10 inferences
[10/31/2022-09:57:32] [I] Percentile: 99
[10/31/2022-09:57:32] [I] Dump refittable layers:Disabled
[10/31/2022-09:57:32] [I] Dump output: Disabled
[10/31/2022-09:57:32] [I] Profile: Disabled
[10/31/2022-09:57:32] [I] Export timing to JSON file: 
[10/31/2022-09:57:32] [I] Export output to JSON file: 
[10/31/2022-09:57:32] [I] Export profile to JSON file: 
[10/31/2022-09:57:32] [I] 
[10/31/2022-09:57:32] [I] === Device Information ===
[10/31/2022-09:57:32] [I] Selected Device: Orin
[10/31/2022-09:57:32] [I] Compute Capability: 8.7
[10/31/2022-09:57:32] [I] SMs: 16
[10/31/2022-09:57:32] [I] Compute Clock Rate: 1.3 GHz
[10/31/2022-09:57:32] [I] Device Global Memory: 30622 MiB
[10/31/2022-09:57:32] [I] Shared Memory per SM: 164 KiB
[10/31/2022-09:57:32] [I] Memory Bus Width: 128 bits (ECC disabled)
[10/31/2022-09:57:32] [I] Memory Clock Rate: 1.3 GHz
[10/31/2022-09:57:32] [I] 
[10/31/2022-09:57:32] [I] TensorRT version: 8.4.0
[10/31/2022-09:57:34] [I] [TRT] [MemUsageChange] Init CUDA: CPU +302, GPU +0, now: CPU 327, GPU 9849 (MiB)
[10/31/2022-09:57:37] [I] [TRT] [MemUsageChange] Init builder kernel library: CPU +403, GPU +378, now: CPU 749, GPU 10245 (MiB)
[10/31/2022-09:57:37] [I] Start parsing network model
[10/31/2022-09:57:37] [I] [TRT] ----------------------------------------------------------------
[10/31/2022-09:57:37] [I] [TRT] Input filename:   model_b_1.onnx
[10/31/2022-09:57:37] [I] [TRT] ONNX IR version:  0.0.6
[10/31/2022-09:57:37] [I] [TRT] Opset version:    11
[10/31/2022-09:57:37] [I] [TRT] Producer name:    pytorch
[10/31/2022-09:57:37] [I] [TRT] Producer version: 1.11.0
[10/31/2022-09:57:37] [I] [TRT] Domain:           
[10/31/2022-09:57:37] [I] [TRT] Model version:    0
[10/31/2022-09:57:37] [I] [TRT] Doc string:       
[10/31/2022-09:57:37] [I] [TRT] ----------------------------------------------------------------
[10/31/2022-09:57:37] [W] [TRT] onnx2trt_utils.cpp:363: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[10/31/2022-09:57:37] [I] Finish parsing network model
[10/31/2022-09:57:37] [W] Dynamic dimensions required for input: input_0, but no shapes were provided. Automatically overriding shape to: 1x2x2048x2560
[10/31/2022-09:57:37] [W] [TRT] DLA LAYER: CBUF size requirement for layer Conv_33 exceeds the limit.
[10/31/2022-09:57:37] [W] [TRT] Default DLA is enabled but layer Conv_33 is not supported on DLA, falling back to GPU.
[10/31/2022-09:57:37] [W] [TRT] DLA LAYER: CBUF size requirement for layer Conv_35 exceeds the limit.
[10/31/2022-09:57:37] [W] [TRT] Default DLA is enabled but layer Conv_35 is not supported on DLA, falling back to GPU.
[10/31/2022-09:57:37] [W] [TRT] Default DLA is enabled but layer (Unnamed Layer* 774) [Identity] is not supported on DLA, falling back to GPU.
[10/31/2022-10:04:05] [I] [TRT] ---------- Layers Running on DLA ----------
[10/31/2022-10:04:05] [I] [TRT] [DlaLayer] {ForeignNode[Conv_0...Relu_32]}
[10/31/2022-10:04:05] [I] [TRT] [DlaLayer] {ForeignNode[Relu_34...Relu_769]}
[10/31/2022-10:04:05] [I] [TRT] [DlaLayer] {ForeignNode[Conv_770...Conv_820]}
[10/31/2022-10:04:05] [I] [TRT] ---------- Layers Running on GPU ----------
[10/31/2022-10:04:05] [I] [TRT] [GpuLayer] CONVOLUTION: Conv_33
[10/31/2022-10:04:05] [I] [TRT] [GpuLayer] CONVOLUTION: Conv_35
[10/31/2022-10:04:05] [I] [TRT] [GpuLayer] CAST: (Unnamed Layer* 774) [Identity]
[10/31/2022-10:04:06] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +534, GPU +121, now: CPU 1406, GPU 10889 (MiB)
[10/31/2022-10:04:06] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +84, GPU +78, now: CPU 1490, GPU 10967 (MiB)
[10/31/2022-10:04:06] [I] [TRT] Local timing cache in use. Profiling results in this builder pass will not be stored.
[10/31/2022-10:06:55] [E] Error[1]: [nvdlaUtils.cpp::submit::198] Error Code 1: DLA (Failure to submit program to DLA engine.)
[10/31/2022-10:06:55] [E] Error[2]: [builder.cpp::buildSerializedNetwork::620] Error Code 2: Internal Error (Assertion engine != nullptr failed. )
[10/31/2022-10:06:55] [E] Engine could not be created from network
[10/31/2022-10:06:55] [E] Building engine failed
[10/31/2022-10:06:55] [E] Failed to create engine from model or file.
[10/31/2022-10:06:55] [E] Engine set up failed
&&&& FAILED TensorRT.trtexec [TensorRT v8400] # /usr/src/tensorrt/bin/trtexec --onnx=model_b_1.onnx --fp16 --useDLACore=0 --allowGPUFallback --saveEngine=model_onnx_dla_b_1.engine
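
The important lines in a log like the one above are the GPU-fallback warnings and the `[E]` errors. As a rough sketch (not something the thread participants used), they can be extracted with two regular expressions. The function name `summarize_trtexec_log` is hypothetical, and the patterns assume the message wording shown in this particular log, which may differ in other TensorRT versions.

```python
import re

# Matches e.g. "... but layer Conv_33 is not supported on DLA, falling back to GPU."
FALLBACK_RE = re.compile(r"layer (.+?) is not supported on DLA, falling back to GPU")
# Matches trtexec error lines, which carry an "[E]" severity tag.
ERROR_RE = re.compile(r"\[E\] (.*)")


def summarize_trtexec_log(text):
    """Return (layers that fell back to GPU, error messages) from a trtexec log."""
    fallbacks, errors = [], []
    for line in text.splitlines():
        match = FALLBACK_RE.search(line)
        if match:
            fallbacks.append(match.group(1))
            continue
        match = ERROR_RE.search(line)
        if match:
            errors.append(match.group(1).strip())
    return fallbacks, errors


if __name__ == "__main__":
    sample = (
        "[10/31/2022-09:57:37] [W] [TRT] Default DLA is enabled but layer Conv_33 "
        "is not supported on DLA, falling back to GPU.\n"
        "[10/31/2022-10:06:55] [E] Building engine failed\n"
    )
    layers, errs = summarize_trtexec_log(sample)
    print("GPU fallback layers:", layers)
    print("Errors:", errs)
```

Applied to the full log, this would list Conv_33, Conv_35 and the unnamed Identity layer as GPU fallbacks, with the DLA submit failure as the first error.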
zerollzeng commented 1 year ago
[11/02/2022-11:46:03] [I] === Performance summary ===
[11/02/2022-11:46:03] [I] Throughput: 0.406047 qps
[11/02/2022-11:46:03] [I] Latency: min = 2240.97 ms, max = 2241.41 ms, mean = 2241.14 ms, median = 2241.13 ms, percentile(90%) = 2241.2 ms, percentile(95%) = 2241.41 ms, percentile(99%) = 2241.41 ms
[11/02/2022-11:46:03] [I] Enqueue Time: min = 0.967136 ms, max = 1.2666 ms, mean = 1.14105 ms, median = 1.13232 ms, percentile(90%) = 1.24414 ms, percentile(95%) = 1.2666 ms, percentile(99%) = 1.2666 ms
[11/02/2022-11:46:03] [I] H2D Latency: min = 1.78027 ms, max = 1.9176 ms, mean = 1.80922 ms, median = 1.80225 ms, percentile(90%) = 1.80664 ms, percentile(95%) = 1.9176 ms, percentile(99%) = 1.9176 ms
[11/02/2022-11:46:03] [I] GPU Compute Time: min = 2239.08 ms, max = 2239.32 ms, mean = 2239.16 ms, median = 2239.15 ms, percentile(90%) = 2239.22 ms, percentile(95%) = 2239.32 ms, percentile(99%) = 2239.32 ms
[11/02/2022-11:46:03] [I] D2H Latency: min = 0.101562 ms, max = 0.177734 ms, mean = 0.16792 ms, median = 0.174805 ms, percentile(90%) = 0.177734 ms, percentile(95%) = 0.177734 ms, percentile(99%) = 0.177734 ms
[11/02/2022-11:46:03] [I] Total Host Walltime: 24.6277 s
[11/02/2022-11:46:03] [I] Total GPU Compute Time: 22.3916 s
[11/02/2022-11:46:03] [I] Explanations of the performance metrics are printed in the verbose logs.
[11/02/2022-11:46:03] [I] 
&&&& PASSED TensorRT.trtexec [TensorRT v8501] # trtexec --onnx=model_b_1.onnx --fp16 --useDLACore=0 --allowGPUFallback

In my test, the issue is fixed in the latest TRT 8.5. Can you wait for the new JetPack release that contains it? Thanks!
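
To check which TensorRT build a given trtexec log came from, the banner line can be decoded. This is a sketch under an assumption: that the four-digit code in `[TensorRT v8400]` is major*1000 + minor*100 + patch, which matches the "TensorRT version: 8.4.0" line in the failing log. The function names and the (8, 5, 0) threshold are illustrative; the thread itself only shows that a v8501 build passes.

```python
import re

BANNER_RE = re.compile(r"\[TensorRT v(\d+)\]")


def banner_version(line):
    """Decode a trtexec banner such as '[TensorRT v8400]' into (major, minor, patch).

    Assumes the TensorRT 8.x encoding: major*1000 + minor*100 + patch.
    Returns None if the line has no banner.
    """
    match = BANNER_RE.search(line)
    if match is None:
        return None
    code = int(match.group(1))
    return (code // 1000, (code // 100) % 10, code % 100)


def has_dla_fix(line, fixed_in=(8, 5, 0)):
    """True if the banner reports at least `fixed_in` (an assumed threshold)."""
    version = banner_version(line)
    return version is not None and version >= fixed_in


if __name__ == "__main__":
    failing = "&&&& FAILED TensorRT.trtexec [TensorRT v8400] # trtexec ..."
    passing = "&&&& PASSED TensorRT.trtexec [TensorRT v8501] # trtexec ..."
    print(banner_version(failing), has_dla_fix(failing))
    print(banner_version(passing), has_dla_fix(passing))
```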

Nik-V9 commented 1 year ago

Oh, thanks for the response! When can I expect the new JetPack with TRT 8.5 to be released?

zerollzeng commented 1 year ago

The next one, JP 5.1, I think, but I'm not sure about the exact release date. Please stay tuned :-)