dusty-nv / jetson-inference

Hello AI World guide to deploying deep-learning inference networks and deep vision primitives with TensorRT and NVIDIA Jetson.
https://developer.nvidia.com/embedded/twodaystoademo
MIT License

Segnet fails to load when using DLA, but PASSED using TensorRT.trtexec #1838

Open khsafkatamin opened 6 months ago

khsafkatamin commented 6 months ago

Hi @dusty-nv,

I am using a custom segnet model trained following the steps from Onixaz Pytorch Segmentation. I can run the model on the GPU device without any issues, but when I run it on the DLA device, I get the following error:


[TRT]    =============== Computing costs for 
[TRT]    *************** Autotuning format combination: Half(1572864,524288,1024,1) -> Half(6144,512,32,1) ***************
[TRT]    --------------- Timing Runner: {ForeignNode[/backbone/conv1/Conv.../classifier/classifier.4/Conv]} (DLA)
[TRT]    Skipping tactic 0x0000000000000003 due to exception Assertion context.dlaContext != nullptr failed. 
[TRT]    Fastest Tactic: 0xd15ea5edd15ea5ed Time: inf
[TRT]    *************** Autotuning format combination: Half(1572864,524288,1024,1) -> Half(512,512:16,32,1) ***************
[TRT]    --------------- Timing Runner: {ForeignNode[/backbone/conv1/Conv.../classifier/classifier.4/Conv]} (DLA)
[TRT]    Skipping tactic 0x0000000000000003 due to exception Assertion context.dlaContext != nullptr failed. 
[TRT]    Fastest Tactic: 0xd15ea5edd15ea5ed Time: inf
[TRT]    *************** Autotuning format combination: Half(524288,1:4,1024,1) -> Half(6144,512,32,1) ***************
[TRT]    --------------- Timing Runner: {ForeignNode[/backbone/conv1/Conv.../classifier/classifier.4/Conv]} (DLA)
[TRT]    Skipping tactic 0x0000000000000003 due to exception Assertion context.dlaContext != nullptr failed. 
[TRT]    Fastest Tactic: 0xd15ea5edd15ea5ed Time: inf
[TRT]    *************** Autotuning format combination: Half(524288,1:4,1024,1) -> Half(512,512:16,32,1) ***************
[TRT]    --------------- Timing Runner: {ForeignNode[/backbone/conv1/Conv.../classifier/classifier.4/Conv]} (DLA)
[TRT]    Skipping tactic 0x0000000000000003 due to exception Assertion context.dlaContext != nullptr failed. 
[TRT]    Fastest Tactic: 0xd15ea5edd15ea5ed Time: inf
[TRT]    *************** Autotuning format combination: Half(524288,524288:16,1024,1) -> Half(6144,512,32,1) ***************
[TRT]    --------------- Timing Runner: {ForeignNode[/backbone/conv1/Conv.../classifier/classifier.4/Conv]} (DLA)
[TRT]    Skipping tactic 0x0000000000000003 due to exception Assertion context.dlaContext != nullptr failed. 
[TRT]    Fastest Tactic: 0xd15ea5edd15ea5ed Time: inf
[TRT]    *************** Autotuning format combination: Half(524288,524288:16,1024,1) -> Half(512,512:16,32,1) ***************
[TRT]    --------------- Timing Runner: {ForeignNode[/backbone/conv1/Conv.../classifier/classifier.4/Conv]} (DLA)
[TRT]    Skipping tactic 0x0000000000000003 due to exception Assertion context.dlaContext != nullptr failed. 
[TRT]    Fastest Tactic: 0xd15ea5edd15ea5ed Time: inf
[TRT]    10: [optimizer.cpp::computeCosts::3728] Error Code 10: Internal Error (Could not find any implementation for node {ForeignNode[/backbone/conv1/Conv.../classifier/classifier.4/Conv]}.)
[TRT]    device DLA_0, failed to build CUDA engine
[TRT]    device DLA_0, failed to load fcn_resnet18.onnx
[TRT]    segNet -- failed to load.
segnet:  failed to initialize segNet
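A note on reading the "Autotuning format combination" lines: the numbers inside `Half(...)` appear to be element strides of the tensors on either side of the DLA node. The sketch below is only an illustration under that assumption. It computes dense row-major strides for the binding shapes that trtexec reports later in this thread (`1x3x512x1024` for `input_0`, `1x12x16x32` for `output_0`); the helper name `contiguous_strides` is made up for this example. The combinations containing `1:4` or `:16` look like channel-vectorized layouts and are not covered here.

```python
from math import prod


def contiguous_strides(shape):
    """Return dense row-major (NCHW) element strides for a tensor of `shape`."""
    # The stride of an axis is the number of elements spanned by all later axes.
    return tuple(prod(shape[axis + 1:]) for axis in range(len(shape)))


# Binding shapes as reported by trtexec for this model.
input_shape = (1, 3, 512, 1024)
output_shape = (1, 12, 16, 32)

print(contiguous_strides(input_shape))   # (1572864, 524288, 1024, 1)
print(contiguous_strides(output_shape))  # (6144, 512, 32, 1)
```

These values match the first combination in the log, `Half(1572864,524288,1024,1) -> Half(6144,512,32,1)`, which suggests the builder was trying plain FP16 NCHW input and output for the whole network as a single DLA node.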

I also tested the model with trtexec, and it does not show any error.

Can you please tell me what these errors mean?

[TRT]    ---------- Layers Running on DLA ----------
[TRT]    [DlaLayer] {ForeignNode[/backbone/conv1/Conv.../classifier/classifier.4/Conv]}
[TRT]    ---------- Layers Running on GPU ----------
[TRT]    Trying to load shared library libcublas.so.11
[TRT]    Loaded shared library libcublas.so.11
[TRT]    Using cublas as plugin tactic source
[TRT]    Trying to load shared library libcublasLt.so.11
[TRT]    Loaded shared library libcublasLt.so.11
[TRT]    Using cublasLt as core library tactic source
[TRT]    [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +260, GPU +322, now: CPU 660, GPU 5053 (MiB)
[TRT]    Trying to load shared library libcudnn.so.8
[TRT]    Loaded shared library libcudnn.so.8
[TRT]    Using cuDNN as plugin tactic source
[TRT]    Using cuDNN as core library tactic source
[TRT]    [MemUsageChange] Init cuDNN: CPU +82, GPU +125, now: CPU 742, GPU 5178 (MiB)
[TRT]    Global timing cache in use. Profiling results in this builder pass will be stored.
[TRT]    Constructing optimization profile number 0 [1/1].
[TRT]    Reserving memory for host IO tensors. Host: 0 bytes
[TRT]    --------------- Timing Runner: {ForeignNode[/backbone/conv1/Conv.../classifier/classifier.4/Conv]} (DLA)
[TRT]    Skipping tactic 0x0000000000000003 due to exception Assertion context.dlaContext != nullptr failed. 
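One way to triage logs like the ones above is to group the "Skipping tactic" lines by their exception message and list any "Error Code" lines. The following is a minimal sketch, not part of jetson-inference or TensorRT; the regular expressions and the `summarize` helper are assumptions written against the log lines quoted in this issue.

```python
import re
from collections import Counter

# A few lines copied from the log in this issue.
SAMPLE_LOG = """\
[TRT]    --------------- Timing Runner: {ForeignNode[/backbone/conv1/Conv.../classifier/classifier.4/Conv]} (DLA)
[TRT]    Skipping tactic 0x0000000000000003 due to exception Assertion context.dlaContext != nullptr failed.
[TRT]    Fastest Tactic: 0xd15ea5edd15ea5ed Time: inf
[TRT]    10: [optimizer.cpp::computeCosts::3728] Error Code 10: Internal Error (Could not find any implementation for node {ForeignNode[/backbone/conv1/Conv.../classifier/classifier.4/Conv]}.)
"""

SKIP_RE = re.compile(r"Skipping tactic (0x[0-9a-fA-F]+) due to exception (.+)")
ERROR_RE = re.compile(r"Error Code (\d+): (.+)")


def summarize(log_text):
    """Count skipped tactics per exception message and collect error lines."""
    skipped = Counter()
    errors = []
    for line in log_text.splitlines():
        match = SKIP_RE.search(line)
        if match:
            skipped[match.group(2).strip()] += 1
            continue
        match = ERROR_RE.search(line)
        if match:
            errors.append((int(match.group(1)), match.group(2).strip()))
    return skipped, errors


skipped, errors = summarize(SAMPLE_LOG)
for reason, count in skipped.items():
    print(f"{count} tactic(s) skipped: {reason}")
for code, message in errors:
    print(f"error {code}: {message}")
```

Run against the full jetson-inference log, a summary like this would make it easier to see that every DLA tactic is skipped with the same `context.dlaContext != nullptr` assertion, each attempt ends with `Time: inf`, and the builder then reports Error Code 10 because no implementation is left for the node. That reading is an interpretation of the log, not a confirmed diagnosis.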

The model seems to run without any error when tested with trtexec:

/usr/src/tensorrt/bin/trtexec --onnx=/home/galaxis/projects/amin/jetson-inference/data/pytorch-segmentation/fcn_resnet18.onnx --fp16 --useDLACore=0 --allowGPUFallback

Output:

[05/10/2024-12:24:42] [I] Output:
[05/10/2024-12:24:42] [I] === Build Options ===
[05/10/2024-12:24:42] [I] Max batch: explicit batch
[05/10/2024-12:24:42] [I] Memory Pools: workspace: default, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default
[05/10/2024-12:24:42] [I] minTiming: 1
[05/10/2024-12:24:42] [I] avgTiming: 8
[05/10/2024-12:24:42] [I] Precision: FP32+FP16
[05/10/2024-12:24:42] [I] LayerPrecisions: 
[05/10/2024-12:24:42] [I] Calibration: 
[05/10/2024-12:24:42] [I] Refit: Disabled
[05/10/2024-12:24:42] [I] Sparsity: Disabled
[05/10/2024-12:24:42] [I] Safe mode: Disabled
[05/10/2024-12:24:42] [I] DirectIO mode: Disabled
[05/10/2024-12:24:42] [I] Restricted mode: Disabled
[05/10/2024-12:24:42] [I] Build only: Disabled
[05/10/2024-12:24:42] [I] Save engine: 
[05/10/2024-12:24:42] [I] Load engine: 
[05/10/2024-12:24:42] [I] Profiling verbosity: 0
[05/10/2024-12:24:42] [I] Tactic sources: Using default tactic sources
[05/10/2024-12:24:42] [I] timingCacheMode: local
[05/10/2024-12:24:42] [I] timingCacheFile: 
[05/10/2024-12:24:42] [I] Heuristic: Disabled
[05/10/2024-12:24:42] [I] Preview Features: Use default preview flags.
[05/10/2024-12:24:42] [I] Input(s)s format: fp32:CHW
[05/10/2024-12:24:42] [I] Output(s)s format: fp32:CHW
[05/10/2024-12:24:42] [I] Input build shapes: model
[05/10/2024-12:24:42] [I] Input calibration shapes: model
[05/10/2024-12:24:42] [I] === System Options ===
[05/10/2024-12:24:42] [I] Device: 0
[05/10/2024-12:24:42] [I] DLACore: 0(With GPU fallback)
[05/10/2024-12:24:42] [I] Plugins:
[05/10/2024-12:24:42] [I] === Inference Options ===
[05/10/2024-12:24:42] [I] Batch: Explicit
[05/10/2024-12:24:42] [I] Input inference shapes: model
[05/10/2024-12:24:42] [I] Iterations: 10
[05/10/2024-12:24:42] [I] Duration: 3s (+ 200ms warm up)
[05/10/2024-12:24:42] [I] Sleep time: 0ms
[05/10/2024-12:24:42] [I] Idle time: 0ms
[05/10/2024-12:24:42] [I] Streams: 1
[05/10/2024-12:24:42] [I] ExposeDMA: Disabled
[05/10/2024-12:24:42] [I] Data transfers: Enabled
[05/10/2024-12:24:42] [I] Spin-wait: Disabled
[05/10/2024-12:24:42] [I] Multithreading: Disabled
[05/10/2024-12:24:42] [I] CUDA Graph: Disabled
[05/10/2024-12:24:42] [I] Separate profiling: Disabled
[05/10/2024-12:24:42] [I] Time Deserialize: Disabled
[05/10/2024-12:24:42] [I] Time Refit: Disabled
[05/10/2024-12:24:42] [I] NVTX verbosity: 0
[05/10/2024-12:24:42] [I] Persistent Cache Ratio: 0
[05/10/2024-12:24:42] [I] Inputs:
[05/10/2024-12:24:42] [I] === Reporting Options ===
[05/10/2024-12:24:42] [I] Verbose: Disabled
[05/10/2024-12:24:42] [I] Averages: 10 inferences
[05/10/2024-12:24:42] [I] Percentiles: 90,95,99
[05/10/2024-12:24:42] [I] Dump refittable layers:Disabled
[05/10/2024-12:24:42] [I] Dump output: Disabled
[05/10/2024-12:24:42] [I] Profile: Disabled
[05/10/2024-12:24:42] [I] Export timing to JSON file: 
[05/10/2024-12:24:42] [I] Export output to JSON file: 
[05/10/2024-12:24:42] [I] Export profile to JSON file: 
[05/10/2024-12:24:42] [I] 
[05/10/2024-12:24:42] [I] === Device Information ===
[05/10/2024-12:24:42] [I] Selected Device: Xavier
[05/10/2024-12:24:42] [I] Compute Capability: 7.2
[05/10/2024-12:24:42] [I] SMs: 8
[05/10/2024-12:24:42] [I] Compute Clock Rate: 1.377 GHz
[05/10/2024-12:24:42] [I] Device Global Memory: 31002 MiB
[05/10/2024-12:24:42] [I] Shared Memory per SM: 96 KiB
[05/10/2024-12:24:42] [I] Memory Bus Width: 256 bits (ECC disabled)
[05/10/2024-12:24:42] [I] Memory Clock Rate: 1.377 GHz
[05/10/2024-12:24:42] [I] 
[05/10/2024-12:24:42] [I] TensorRT version: 8.5.2
[05/10/2024-12:24:43] [I] [TRT] [MemUsageChange] Init CUDA: CPU +187, GPU +0, now: CPU 216, GPU 5564 (MiB)
[05/10/2024-12:24:44] [I] [TRT] [MemUsageChange] Init builder kernel library: CPU +106, GPU +100, now: CPU 344, GPU 5678 (MiB)
[05/10/2024-12:24:44] [I] Start parsing network model
[05/10/2024-12:24:44] [I] [TRT] ----------------------------------------------------------------
[05/10/2024-12:24:44] [I] [TRT] Input filename:   /home/galaxis/projects/amin/jetson-inference/data/pytorch-segmentation/fcn_resnet18.onnx
[05/10/2024-12:24:44] [I] [TRT] ONNX IR version:  0.0.7
[05/10/2024-12:24:44] [I] [TRT] Opset version:    14
[05/10/2024-12:24:44] [I] [TRT] Producer name:    pytorch
[05/10/2024-12:24:44] [I] [TRT] Producer version: 2.0.0
[05/10/2024-12:24:44] [I] [TRT] Domain:           
[05/10/2024-12:24:44] [I] [TRT] Model version:    0
[05/10/2024-12:24:44] [I] [TRT] Doc string:       
[05/10/2024-12:24:44] [I] [TRT] ----------------------------------------------------------------
[05/10/2024-12:24:44] [I] Finish parsing network model
[05/10/2024-12:24:48] [I] [TRT] ---------- Layers Running on DLA ----------
[05/10/2024-12:24:48] [I] [TRT] [DlaLayer] {ForeignNode[/backbone/conv1/Conv.../classifier/classifier.4/Conv]}
[05/10/2024-12:24:48] [I] [TRT] ---------- Layers Running on GPU ----------
[05/10/2024-12:24:50] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +260, GPU +190, now: CPU 650, GPU 5970 (MiB)
[05/10/2024-12:24:50] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +82, GPU +80, now: CPU 732, GPU 6050 (MiB)
[05/10/2024-12:24:50] [I] [TRT] Local timing cache in use. Profiling results in this builder pass will not be stored.
[05/10/2024-12:24:56] [I] [TRT] Total Activation Memory: 32512303104
[05/10/2024-12:24:56] [I] [TRT] Detected 1 inputs and 1 output network tensors.
[05/10/2024-12:24:57] [I] [TRT] Total Host Persistent Memory: 96
[05/10/2024-12:24:57] [I] [TRT] Total Device Persistent Memory: 0
[05/10/2024-12:24:57] [I] [TRT] Total Scratch Memory: 0
[05/10/2024-12:24:57] [I] [TRT] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 26 MiB, GPU 22 MiB
[05/10/2024-12:24:57] [I] [TRT] [BlockAssignment] Started assigning block shifts. This will take 2 steps to complete.
[05/10/2024-12:24:57] [I] [TRT] [BlockAssignment] Algorithm ShiftNTopDown took 0.051169ms to assign 2 blocks to 2 nodes requiring 4206592 bytes.
[05/10/2024-12:24:57] [I] [TRT] Total Activation Memory: 4206592
[05/10/2024-12:24:57] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in building engine: CPU +26, GPU +0, now: CPU 26, GPU 0 (MiB)
[05/10/2024-12:24:57] [I] Engine built in 14.6812 sec.
[05/10/2024-12:24:57] [I] [TRT] Loaded engine size: 26 MiB
[05/10/2024-12:24:57] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +26, GPU +0, now: CPU 26, GPU 0 (MiB)
[05/10/2024-12:24:57] [I] Engine deserialized in 0.00459798 sec.
[05/10/2024-12:24:57] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +4, now: CPU 26, GPU 4 (MiB)
[05/10/2024-12:24:57] [I] Setting persistentCacheLimit to 0 bytes.
[05/10/2024-12:24:57] [I] Using random values for input input_0
[05/10/2024-12:24:57] [I] Created input binding for input_0 with dimensions 1x3x512x1024
[05/10/2024-12:24:57] [I] Using random values for output output_0
[05/10/2024-12:24:57] [I] Created output binding for output_0 with dimensions 1x12x16x32
[05/10/2024-12:24:57] [I] Starting inference
[05/10/2024-12:25:00] [I] Warmup completed 10 queries over 200 ms
[05/10/2024-12:25:00] [I] Timing trace has 145 queries over 3.05476 s
[05/10/2024-12:25:00] [I] 
[05/10/2024-12:25:00] [I] === Trace details ===
[05/10/2024-12:25:00] [I] Trace averages of 10 runs:
[05/10/2024-12:25:00] [I] Average on 10 runs - GPU latency: 20.9335 ms - Host latency: 21.4217 ms (enqueue 0.418831 ms)
[05/10/2024-12:25:00] [I] Average on 10 runs - GPU latency: 20.882 ms - Host latency: 21.372 ms (enqueue 0.459833 ms)
[05/10/2024-12:25:00] [I] Average on 10 runs - GPU latency: 20.9385 ms - Host latency: 21.4352 ms (enqueue 0.355542 ms)
[05/10/2024-12:25:00] [I] Average on 10 runs - GPU latency: 20.8787 ms - Host latency: 21.365 ms (enqueue 0.490649 ms)
[05/10/2024-12:25:00] [I] Average on 10 runs - GPU latency: 20.9028 ms - Host latency: 21.4042 ms (enqueue 0.437756 ms)
[05/10/2024-12:25:00] [I] Average on 10 runs - GPU latency: 20.9199 ms - Host latency: 21.4204 ms (enqueue 0.409814 ms)
[05/10/2024-12:25:00] [I] Average on 10 runs - GPU latency: 20.882 ms - Host latency: 21.3914 ms (enqueue 0.393835 ms)
[05/10/2024-12:25:00] [I] Average on 10 runs - GPU latency: 20.936 ms - Host latency: 21.439 ms (enqueue 0.41676 ms)
[05/10/2024-12:25:00] [I] Average on 10 runs - GPU latency: 20.8821 ms - Host latency: 21.3883 ms (enqueue 0.431421 ms)
[05/10/2024-12:25:00] [I] Average on 10 runs - GPU latency: 20.9209 ms - Host latency: 21.4322 ms (enqueue 0.465869 ms)
[05/10/2024-12:25:00] [I] Average on 10 runs - GPU latency: 20.9178 ms - Host latency: 21.425 ms (enqueue 0.431006 ms)
[05/10/2024-12:25:00] [I] Average on 10 runs - GPU latency: 20.9407 ms - Host latency: 21.4461 ms (enqueue 0.395068 ms)
[05/10/2024-12:25:00] [I] Average on 10 runs - GPU latency: 21.0056 ms - Host latency: 21.5125 ms (enqueue 0.414478 ms)
[05/10/2024-12:25:00] [I] Average on 10 runs - GPU latency: 20.9628 ms - Host latency: 21.4675 ms (enqueue 0.46853 ms)
[05/10/2024-12:25:00] [I] 
[05/10/2024-12:25:00] [I] === Performance summary ===
[05/10/2024-12:25:00] [I] Throughput: 47.467 qps
[05/10/2024-12:25:00] [I] Latency: min = 21.3491 ms, max = 21.6797 ms, mean = 21.4235 ms, median = 21.4022 ms, percentile(90%) = 21.5027 ms, percentile(95%) = 21.571 ms, percentile(99%) = 21.6519 ms
[05/10/2024-12:25:00] [I] Enqueue Time: min = 0.251709 ms, max = 0.654846 ms, mean = 0.43047 ms, median = 0.422363 ms, percentile(90%) = 0.566406 ms, percentile(95%) = 0.595459 ms, percentile(99%) = 0.63623 ms
[05/10/2024-12:25:00] [I] H2D Latency: min = 0.468445 ms, max = 0.602295 ms, mean = 0.495972 ms, median = 0.494629 ms, percentile(90%) = 0.51416 ms, percentile(95%) = 0.517578 ms, percentile(99%) = 0.540039 ms
[05/10/2024-12:25:00] [I] GPU Compute Time: min = 20.861 ms, max = 21.1443 ms, mean = 20.9222 ms, median = 20.908 ms, percentile(90%) = 21.0013 ms, percentile(95%) = 21.0591 ms, percentile(99%) = 21.1084 ms
[05/10/2024-12:25:00] [I] D2H Latency: min = 0.00415039 ms, max = 0.00634766 ms, mean = 0.00532332 ms, median = 0.00524902 ms, percentile(90%) = 0.00561523 ms, percentile(95%) = 0.00579834 ms, percentile(99%) = 0.00610352 ms
[05/10/2024-12:25:00] [I] Total Host Walltime: 3.05476 s
[05/10/2024-12:25:00] [I] Total GPU Compute Time: 3.03372 s
[05/10/2024-12:25:00] [I] Explanations of the performance metrics are printed in the verbose logs.
[05/10/2024-12:25:00] [I] 
&&&& PASSED TensorRT.trtexec [TensorRT v8502] # /usr/src/tensorrt/bin/trtexec --onnx=/home/galaxis/projects/amin/jetson-inference/data/pytorch-segmentation/fcn_resnet18.onnx --fp16 --useDLACore=0 --allowGPUFallback

dusty-nv commented 6 months ago

@khsafkatamin Not sure, I haven't tried those models on DLA. It seems DLA doesn't support all the layers. You could check DeepStream for other models that work with DLA.

khsafkatamin commented 6 months ago

@dusty-nv Thank you for your prompt response and suggestion. I will look into DeepStream.

One more thing: I set allowGPUFallback=true, but it still shows the error. Do you know anything about that?