NVIDIA / TensorRT

NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.
https://developer.nvidia.com/tensorrt
Apache License 2.0

DLA Engine Build Failed on Orin #2426

Closed: Nik-V9 closed this issue 1 year ago

Nik-V9 commented 1 year ago

Description

Building a DLA engine using Torch-TensorRT failed on the Orin with the following errors:

ERROR: [Torch-TensorRT TorchScript Conversion Context] - 1: [nvdlaUtils.cpp::submit::198] Error Code 1: DLA (Failure to submit program to DLA engine.)
ERROR: [Torch-TensorRT TorchScript Conversion Context] - 2: [builder.cpp::buildSerializedNetwork::620] Error Code 2: Internal Error (Assertion engine != nullptr failed. )
RuntimeError: [Error thrown at core/conversion/conversionctx/ConversionCtx.cpp:147] Building serialized network failed in TensorRT

I am trying to build an HRNet-W32 model on DLA with an input shape of [1, 2, 2048, 2560].
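The conversion call is roughly the sketch below (placeholder names such as model; the exact code is linked under Relevant Files). The DLA device settings and FP16 precision match the config printed in the log further down:

import torch
import torch_tensorrt

# Rough sketch of the failing conversion (placeholder names; the real code
# is in daa_profile/model.py). The module is TorchScript-ed first, then
# converted into a serialized TRT engine targeting DLA core 0 with GPU
# fallback and FP16 enabled. Torch-TensorRT sets the GPU id to 0 itself,
# which is the first warning in the log below.
scripted = torch.jit.script(model)

trt_engine = torch_tensorrt.ts.convert_method_to_trt_engine(
    scripted,
    method_name="forward",
    inputs=[torch_tensorrt.Input(shape=[1, 2, 2048, 2560], dtype=torch.half)],
    device=torch_tensorrt.Device(dla_core=0, allow_gpu_fallback=True),
    enabled_precisions={torch.half},
)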

Environment

TensorRT Version: 8.4.0-1+cuda11.4
JetPack Version: 5.0.1dp
NVIDIA GPU: Orin
NVIDIA Driver Version:
CUDA Version: 11.4
CUDNN Version:
Operating System: Ubuntu 20.04.5 LTS

Relevant Files

The code is available in this repo: https://github.com/castacks/daa_profile

Compiling the TRT Engine on DLA here: https://github.com/castacks/daa_profile/blob/b7c592788671101821cad9a01fc27539f72e22e5/model.py#L103

Steps To Reproduce

Running exp profile.yaml
Loaded config:
{'input_dir': '/home/airlab/DAA/extrinsic', 'num_cam': 1, 'trt_batch_size': 1, 'width': 2448, 'height': 2048, 'padding': 56, 'use_tensorrt': True, 'use_torch2trt': False, 'use_dla': True, 'input_frames': 2, 'models_dir': '/home/airlab/DAA/models', 'full_res_model_chkpt': '120_hrnet32_all_2220.pth', 'fp16_mode': True, 'int8_mode': False}
Compiling TensorRT detector model ...
WARNING: [Torch-TensorRT] - Setting GPU id to 0 for device because device 0 manages DLA on Xavier
WARNING: [Torch-TensorRT TorchScript Conversion Context] - DLA LAYER: CBUF size requirement for layer %input.87 : Tensor = aten::_convolution(%5626, %self.base_model.transition1.0.0.weight, %4054, %7, %7, %7, %10, %6, %11, %9, %10, %9, %9), scope: __module.base_model/__module.base_model.transition1.0/__module.base_model.transition1.0.0 # /home/airlab/.local/lib/python3.8/site-packages/torch/nn/modules/conv.py:443:0 exceeds the limit.
WARNING: [Torch-TensorRT TorchScript Conversion Context] - Default DLA is enabled but layer %input.87 : Tensor = aten::_convolution(%5626, %self.base_model.transition1.0.0.weight, %4054, %7, %7, %7, %10, %6, %11, %9, %10, %9, %9), scope: __module.base_model/__module.base_model.transition1.0/__module.base_model.transition1.0.0 # /home/airlab/.local/lib/python3.8/site-packages/torch/nn/modules/conv.py:443:0 is not supported on DLA, falling back to GPU.
WARNING: [Torch-TensorRT TorchScript Conversion Context] - DLA LAYER: CBUF size requirement for layer %input.91 : Tensor = aten::_convolution(%5626, %self.base_model.transition1.1.0.0.weight, %4054, %4055, %7, %7, %10, %6, %11, %9, %10, %9, %9), scope: __module.base_model/__module.base_model.transition1.1/__module.base_model.transition1.1.0/__module.base_model.transition1.1.0.0 # /home/airlab/.local/lib/python3.8/site-packages/torch/nn/modules/conv.py:443:0 exceeds the limit.
WARNING: [Torch-TensorRT TorchScript Conversion Context] - Default DLA is enabled but layer %input.91 : Tensor = aten::_convolution(%5626, %self.base_model.transition1.1.0.0.weight, %4054, %4055, %7, %7, %10, %6, %11, %9, %10, %9, %9), scope: __module.base_model/__module.base_model.transition1.1/__module.base_model.transition1.1.0/__module.base_model.transition1.1.0.0 # /home/airlab/.local/lib/python3.8/site-packages/torch/nn/modules/conv.py:443:0 is not supported on DLA, falling back to GPU.
ERROR: [Torch-TensorRT TorchScript Conversion Context] - 1: [nvdlaUtils.cpp::submit::198] Error Code 1: DLA (Failure to submit program to DLA engine.)
ERROR: [Torch-TensorRT TorchScript Conversion Context] - 2: [builder.cpp::buildSerializedNetwork::620] Error Code 2: Internal Error (Assertion engine != nullptr failed. )
Traceback (most recent call last):
  File "profile.py", line 127, in <module>
    run_evaluation(args.experiment)
  File "profile.py", line 103, in run_evaluation
    detector = SegDetector(cfg=dict(
  File "/home/airlab/DAA/daa_profile/model.py", line 103, in __init__
    trt_engine = torch_tensorrt.ts.convert_method_to_trt_engine(
  File "/home/airlab/.local/lib/python3.8/site-packages/torch_tensorrt/ts/_compiler.py", line 199, in convert_method_to_trt_engine
    return _C.convert_graph_to_trt_engine(module._c, method_name, _parse_compile_spec(compile_spec))
RuntimeError: [Error thrown at core/conversion/conversionctx/ConversionCtx.cpp:147] Building serialized network failed in TensorRT
zerollzeng commented 1 year ago

Usually it's caused by some layers that DLA can't handle. I'm not an expert in Torch-TRT, but could you please export the model to ONNX and try it with trtexec (I think it should be something like /usr/src/tensorrt/bin/trtexec --onnx=model.onnx --fp16 --useDLACore=0 --allowGPUFallback)? Thanks!

Nik-V9 commented 1 year ago

Hi, thanks for the follow-up!

I converted the PyTorch model to ONNX and tried building the engine with trtexec. I got the same error as with Torch-TensorRT.
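The export step was roughly as follows (a sketch, not the exact script; opset 11 and the input name input_0 match what the trtexec log below reports):

import torch

# Export sketch (assumes "model" is the loaded HRNet-W32 module).
# Opset 11 and the input name "input_0" match the trtexec log below.
model.eval()
dummy = torch.randn(1, 2, 2048, 2560)
torch.onnx.export(
    model,
    dummy,
    "model_b_1.onnx",
    opset_version=11,
    input_names=["input_0"],
    # The log shows a dynamic batch dim that trtexec overrides to 1.
    dynamic_axes={"input_0": {0: "batch"}},
)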

trtexec --onnx=model_b_1.onnx --fp16 --useDLACore=0 --allowGPUFallback --saveEngine=model_onnx_dla_b_1.engine

The ONNX model is available here: https://drive.google.com/file/d/1uBvJJ8DTDEVRsNTUX8zDtAp9T4REQbwZ/view?usp=share_link

Full trtexec log:

&&&& RUNNING TensorRT.trtexec [TensorRT v8400] # /usr/src/tensorrt/bin/trtexec --onnx=model_b_1.onnx --fp16 --useDLACore=0 --allowGPUFallback --saveEngine=model_onnx_dla_b_1.engine
[10/31/2022-09:57:32] [I] === Model Options ===
[10/31/2022-09:57:32] [I] Format: ONNX
[10/31/2022-09:57:32] [I] Model: model_b_1.onnx
[10/31/2022-09:57:32] [I] Output:
[10/31/2022-09:57:32] [I] === Build Options ===
[10/31/2022-09:57:32] [I] Max batch: explicit batch
[10/31/2022-09:57:32] [I] Memory Pools: workspace: default, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default
[10/31/2022-09:57:32] [I] minTiming: 1
[10/31/2022-09:57:32] [I] avgTiming: 8
[10/31/2022-09:57:32] [I] Precision: FP32+FP16
[10/31/2022-09:57:32] [I] LayerPrecisions: 
[10/31/2022-09:57:32] [I] Calibration: 
[10/31/2022-09:57:32] [I] Refit: Disabled
[10/31/2022-09:57:32] [I] Sparsity: Disabled
[10/31/2022-09:57:32] [I] Safe mode: Disabled
[10/31/2022-09:57:32] [I] DirectIO mode: Disabled
[10/31/2022-09:57:32] [I] Restricted mode: Disabled
[10/31/2022-09:57:32] [I] Build only: Disabled
[10/31/2022-09:57:32] [I] Save engine: model_onnx_dla_b_1.engine
[10/31/2022-09:57:32] [I] Load engine: 
[10/31/2022-09:57:32] [I] Profiling verbosity: 0
[10/31/2022-09:57:32] [I] Tactic sources: Using default tactic sources
[10/31/2022-09:57:32] [I] timingCacheMode: local
[10/31/2022-09:57:32] [I] timingCacheFile: 
[10/31/2022-09:57:32] [I] Input(s)s format: fp32:CHW
[10/31/2022-09:57:32] [I] Output(s)s format: fp32:CHW
[10/31/2022-09:57:32] [I] Input build shapes: model
[10/31/2022-09:57:32] [I] Input calibration shapes: model
[10/31/2022-09:57:32] [I] === System Options ===
[10/31/2022-09:57:32] [I] Device: 0
[10/31/2022-09:57:32] [I] DLACore: 0(With GPU fallback)
[10/31/2022-09:57:32] [I] Plugins:
[10/31/2022-09:57:32] [I] === Inference Options ===
[10/31/2022-09:57:32] [I] Batch: Explicit
[10/31/2022-09:57:32] [I] Input inference shapes: model
[10/31/2022-09:57:32] [I] Iterations: 10
[10/31/2022-09:57:32] [I] Duration: 3s (+ 200ms warm up)
[10/31/2022-09:57:32] [I] Sleep time: 0ms
[10/31/2022-09:57:32] [I] Idle time: 0ms
[10/31/2022-09:57:32] [I] Streams: 1
[10/31/2022-09:57:32] [I] ExposeDMA: Disabled
[10/31/2022-09:57:32] [I] Data transfers: Enabled
[10/31/2022-09:57:32] [I] Spin-wait: Disabled
[10/31/2022-09:57:32] [I] Multithreading: Disabled
[10/31/2022-09:57:32] [I] CUDA Graph: Disabled
[10/31/2022-09:57:32] [I] Separate profiling: Disabled
[10/31/2022-09:57:32] [I] Time Deserialize: Disabled
[10/31/2022-09:57:32] [I] Time Refit: Disabled
[10/31/2022-09:57:32] [I] Inputs:
[10/31/2022-09:57:32] [I] === Reporting Options ===
[10/31/2022-09:57:32] [I] Verbose: Disabled
[10/31/2022-09:57:32] [I] Averages: 10 inferences
[10/31/2022-09:57:32] [I] Percentile: 99
[10/31/2022-09:57:32] [I] Dump refittable layers:Disabled
[10/31/2022-09:57:32] [I] Dump output: Disabled
[10/31/2022-09:57:32] [I] Profile: Disabled
[10/31/2022-09:57:32] [I] Export timing to JSON file: 
[10/31/2022-09:57:32] [I] Export output to JSON file: 
[10/31/2022-09:57:32] [I] Export profile to JSON file: 
[10/31/2022-09:57:32] [I] 
[10/31/2022-09:57:32] [I] === Device Information ===
[10/31/2022-09:57:32] [I] Selected Device: Orin
[10/31/2022-09:57:32] [I] Compute Capability: 8.7
[10/31/2022-09:57:32] [I] SMs: 16
[10/31/2022-09:57:32] [I] Compute Clock Rate: 1.3 GHz
[10/31/2022-09:57:32] [I] Device Global Memory: 30622 MiB
[10/31/2022-09:57:32] [I] Shared Memory per SM: 164 KiB
[10/31/2022-09:57:32] [I] Memory Bus Width: 128 bits (ECC disabled)
[10/31/2022-09:57:32] [I] Memory Clock Rate: 1.3 GHz
[10/31/2022-09:57:32] [I] 
[10/31/2022-09:57:32] [I] TensorRT version: 8.4.0
[10/31/2022-09:57:34] [I] [TRT] [MemUsageChange] Init CUDA: CPU +302, GPU +0, now: CPU 327, GPU 9849 (MiB)
[10/31/2022-09:57:37] [I] [TRT] [MemUsageChange] Init builder kernel library: CPU +403, GPU +378, now: CPU 749, GPU 10245 (MiB)
[10/31/2022-09:57:37] [I] Start parsing network model
[10/31/2022-09:57:37] [I] [TRT] ----------------------------------------------------------------
[10/31/2022-09:57:37] [I] [TRT] Input filename:   model_b_1.onnx
[10/31/2022-09:57:37] [I] [TRT] ONNX IR version:  0.0.6
[10/31/2022-09:57:37] [I] [TRT] Opset version:    11
[10/31/2022-09:57:37] [I] [TRT] Producer name:    pytorch
[10/31/2022-09:57:37] [I] [TRT] Producer version: 1.11.0
[10/31/2022-09:57:37] [I] [TRT] Domain:           
[10/31/2022-09:57:37] [I] [TRT] Model version:    0
[10/31/2022-09:57:37] [I] [TRT] Doc string:       
[10/31/2022-09:57:37] [I] [TRT] ----------------------------------------------------------------
[10/31/2022-09:57:37] [W] [TRT] onnx2trt_utils.cpp:363: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[10/31/2022-09:57:37] [I] Finish parsing network model
[10/31/2022-09:57:37] [W] Dynamic dimensions required for input: input_0, but no shapes were provided. Automatically overriding shape to: 1x2x2048x2560
[10/31/2022-09:57:37] [W] [TRT] DLA LAYER: CBUF size requirement for layer Conv_33 exceeds the limit.
[10/31/2022-09:57:37] [W] [TRT] Default DLA is enabled but layer Conv_33 is not supported on DLA, falling back to GPU.
[10/31/2022-09:57:37] [W] [TRT] DLA LAYER: CBUF size requirement for layer Conv_35 exceeds the limit.
[10/31/2022-09:57:37] [W] [TRT] Default DLA is enabled but layer Conv_35 is not supported on DLA, falling back to GPU.
[10/31/2022-09:57:37] [W] [TRT] Default DLA is enabled but layer (Unnamed Layer* 774) [Identity] is not supported on DLA, falling back to GPU.
[10/31/2022-10:04:05] [I] [TRT] ---------- Layers Running on DLA ----------
[10/31/2022-10:04:05] [I] [TRT] [DlaLayer] {ForeignNode[Conv_0...Relu_32]}
[10/31/2022-10:04:05] [I] [TRT] [DlaLayer] {ForeignNode[Relu_34...Relu_769]}
[10/31/2022-10:04:05] [I] [TRT] [DlaLayer] {ForeignNode[Conv_770...Conv_820]}
[10/31/2022-10:04:05] [I] [TRT] ---------- Layers Running on GPU ----------
[10/31/2022-10:04:05] [I] [TRT] [GpuLayer] CONVOLUTION: Conv_33
[10/31/2022-10:04:05] [I] [TRT] [GpuLayer] CONVOLUTION: Conv_35
[10/31/2022-10:04:05] [I] [TRT] [GpuLayer] CAST: (Unnamed Layer* 774) [Identity]
[10/31/2022-10:04:06] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +534, GPU +121, now: CPU 1406, GPU 10889 (MiB)
[10/31/2022-10:04:06] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +84, GPU +78, now: CPU 1490, GPU 10967 (MiB)
[10/31/2022-10:04:06] [I] [TRT] Local timing cache in use. Profiling results in this builder pass will not be stored.
[10/31/2022-10:06:55] [E] Error[1]: [nvdlaUtils.cpp::submit::198] Error Code 1: DLA (Failure to submit program to DLA engine.)
[10/31/2022-10:06:55] [E] Error[2]: [builder.cpp::buildSerializedNetwork::620] Error Code 2: Internal Error (Assertion engine != nullptr failed. )
[10/31/2022-10:06:55] [E] Engine could not be created from network
[10/31/2022-10:06:55] [E] Building engine failed
[10/31/2022-10:06:55] [E] Failed to create engine from model or file.
[10/31/2022-10:06:55] [E] Engine set up failed
&&&& FAILED TensorRT.trtexec [TensorRT v8400] # /usr/src/tensorrt/bin/trtexec --onnx=model_b_1.onnx --fp16 --useDLACore=0 --allowGPUFallback --saveEngine=model_onnx_dla_b_1.engine
zerollzeng commented 1 year ago

[11/02/2022-11:46:03] [I] === Performance summary ===
[11/02/2022-11:46:03] [I] Throughput: 0.406047 qps
[11/02/2022-11:46:03] [I] Latency: min = 2240.97 ms, max = 2241.41 ms, mean = 2241.14 ms, median = 2241.13 ms, percentile(90%) = 2241.2 ms, percentile(95%) = 2241.41 ms, percentile(99%) = 2241.41 ms
[11/02/2022-11:46:03] [I] Enqueue Time: min = 0.967136 ms, max = 1.2666 ms, mean = 1.14105 ms, median = 1.13232 ms, percentile(90%) = 1.24414 ms, percentile(95%) = 1.2666 ms, percentile(99%) = 1.2666 ms
[11/02/2022-11:46:03] [I] H2D Latency: min = 1.78027 ms, max = 1.9176 ms, mean = 1.80922 ms, median = 1.80225 ms, percentile(90%) = 1.80664 ms, percentile(95%) = 1.9176 ms, percentile(99%) = 1.9176 ms
[11/02/2022-11:46:03] [I] GPU Compute Time: min = 2239.08 ms, max = 2239.32 ms, mean = 2239.16 ms, median = 2239.15 ms, percentile(90%) = 2239.22 ms, percentile(95%) = 2239.32 ms, percentile(99%) = 2239.32 ms
[11/02/2022-11:46:03] [I] D2H Latency: min = 0.101562 ms, max = 0.177734 ms, mean = 0.16792 ms, median = 0.174805 ms, percentile(90%) = 0.177734 ms, percentile(95%) = 0.177734 ms, percentile(99%) = 0.177734 ms
[11/02/2022-11:46:03] [I] Total Host Walltime: 24.6277 s
[11/02/2022-11:46:03] [I] Total GPU Compute Time: 22.3916 s
[11/02/2022-11:46:03] [I] Explanations of the performance metrics are printed in the verbose logs.
[11/02/2022-11:46:03] [I] 
&&&& PASSED TensorRT.trtexec [TensorRT v8501] # trtexec --onnx=model_b_1.onnx --fp16 --useDLACore=0 --allowGPUFallback

In my test the issue is fixed in the latest TRT 8.5 (see the PASSED run above). Can you wait for the new JetPack release that contains it? Thanks!
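For anyone reproducing this without trtexec, the flags above correspond roughly to the following TensorRT Python API sketch (untested here; assumes the TRT 8.x Python bindings and ONNX parser are installed):

import tensorrt as trt

# Sketch mapping the trtexec flags (--fp16 --useDLACore=0 --allowGPUFallback)
# onto the TensorRT Python API. Assumes TRT 8.x.
logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

with open("model_b_1.onnx", "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("ONNX parse failed")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)            # --fp16
config.set_flag(trt.BuilderFlag.GPU_FALLBACK)    # --allowGPUFallback
config.default_device_type = trt.DeviceType.DLA  # --useDLACore=0
config.DLA_core = 0

plan = builder.build_serialized_network(network, config)
if plan is None:
    raise RuntimeError("Engine build failed")
with open("model_onnx_dla_b_1.engine", "wb") as f:
    f.write(plan)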

Nik-V9 commented 1 year ago

Oh, thanks for the response! When can I expect the new JetPack with TRT 8.5 to be released?

zerollzeng commented 1 year ago

The next JetPack, 5.1, I think, but I'm not sure about the exact release date. Please stay tuned :-)