NVIDIA / Deep-Learning-Accelerator-SW

NVIDIA DLA-SW, the recipes and tools for running deep learning workloads on NVIDIA DLA cores for inference applications.

Deep Learning Accelerator

NVIDIA DLA hardware is a fixed-function accelerator engine targeted for deep learning operations. It’s designed to do full hardware acceleration of convolutional neural networks, supporting various layers such as convolution, deconvolution, fully connected, activation, pooling, batch normalization, and others. NVIDIA’s Orin SoCs feature up to two second-generation DLAs while Xavier SoCs feature up to two first-generation DLAs.

DLA software consists of the DLA compiler and the DLA runtime stack. The offline compiler translates the neural network graph into a DLA loadable binary and can be invoked using NVIDIA TensorRT™, NvMedia-DLA or cuDLA. The runtime stack consists of the DLA firmware, kernel mode driver, and user mode driver.
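One common way to invoke the DLA compiler is through TensorRT's `trtexec` tool, which can build an engine that places the network on a DLA core. A minimal sketch of such an invocation, assembled in Python (`model.onnx` and the engine filename are placeholders; flag behavior per the TensorRT documentation):

```python
# Sketch: assemble a trtexec command that compiles an ONNX model
# into a TensorRT engine targeting DLA core 0.
# "model.onnx" / "model_dla.engine" are placeholder paths.
cmd = [
    "trtexec",
    "--onnx=model.onnx",            # input network graph
    "--useDLACore=0",               # target the first DLA core
    "--int8",                       # DLA inference is typically run in INT8
    "--allowGPUFallback",           # run layers unsupported on DLA on the GPU
    "--saveEngine=model_dla.engine" # persist the built engine
]
print(" ".join(cmd))
```

Running this command requires a device with DLA hardware (e.g. Jetson AGX Orin) and TensorRT installed.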

For more details, see the DLA product page.

Why is DLA essential on Orin?

Here is the distribution of DL TOPs between the GPU and DLA on a Jetson AGX Orin 64GB, depending on the power mode:

| | Power mode: MAXN | Power mode: 50W | Power mode: 30W | Power mode: 15W |
|---|---|---|---|---|
| GPU sparse INT8 peak DL performance | 171 TOPs | 109 TOPs | 41 TOPs | 14 TOPs |
| 2x DLA sparse INT8 peak performance | 105 TOPs | 92 TOPs | 90 TOPs | 40 TOPs |
| Total Orin peak INT8 DL performance | 275 TOPs | 200 TOPs | 131 TOPs | 54 TOPs |
| Percentage: DLA peak INT8 performance of total Orin peak DL INT8 performance | 38% | 46% | 69% | 74% |
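The percentage row follows directly from the peak numbers above; a quick check in Python (values taken from the table; rounding may cause off-by-one differences in the totals):

```python
# Peak sparse INT8 DL TOPs from the table above, per power mode.
power_modes = ["MAXN", "50W", "30W", "15W"]
gpu_tops = [171, 109, 41, 14]
dla_tops = [105, 92, 90, 40]

for mode, gpu, dla in zip(power_modes, gpu_tops, dla_tops):
    share = dla / (gpu + dla)
    print(f"{mode}: DLA share of total peak INT8 DL performance = {share:.0%}")
```

This reproduces the 38% / 46% / 69% / 74% figures, showing that the DLA's share of total DL compute grows as the power budget shrinks.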

Note:

DLA Reference Models

In this repo, we will cover a few key DLA-related metrics for standard deep learning model architectures in the context of common reference application implementations.

The goal is to provide a reference baseline for how these network architectures map to DLA, as well as the INT8 accuracy of these networks.

Accuracy on DLA and GPU

| Use case | Network | INT8 Accuracy on Orin's DLA | Layers always running on GPU | Instructions |
|---|---|---|---|---|
| Object Detection | RetinaNet ResNeXt-50 | mAP OpenImages MLPerf validation set*: 0.3741 (GPU INT8: 0.3740, FP32 reference: 0.3757) | NMS (last node of the network) | See the RetinaNet ResNeXt-50 section in scripts/prepare_models/README.md |
| Classification | ResNet-50 | Top-1 ImageNet 2012*: 75.54% (GPU INT8: 76.00%, FP32 reference: 76.46%) | Top-K (last node of the network) | See the ResNet-50 section in scripts/prepare_models/README.md |
| Object Detection | SSD-ResNet-34 | mAP COCO 2017*: 0.21 (GPU INT8: 0.21, FP32 reference: 0.20) | NMS (last node of the network) | See the SSD-ResNet-34 section in scripts/prepare_models/README.md |
| Object Detection | SSD-MobileNetV1 | mAP COCO 2017*: 0.23 (GPU INT8: 0.23, FP32 reference: 0.23) | NMS (last node of the network) | See the SSD-MobileNetV1 section in scripts/prepare_models/README.md |

*Accuracy measured internally by NVIDIA; there may be slight differences compared to previous MLPerf Inference submissions.

Key takeaways:

More resources:

Structured Sparsity Case Study: Object Detection Accuracy with RetinaNet ResNet-34

DLA on the NVIDIA Orin platform supports Structured Sparsity that offers the opportunity to minimize latency and maximize throughput in production. See the TensorRT documentation for details (note that the listed restrictions may not apply anymore to most recent DLA releases).
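Structured Sparsity here refers to a 2:4 pattern: in every aligned group of four consecutive weights, at most two are non-zero. A minimal pure-Python check of that pattern on a flat weight list (illustrative only; actual sparsification is done with training tools, as in the case study below):

```python
def is_2_4_sparse(weights):
    """Check that every aligned group of 4 weights has at most 2 non-zeros."""
    assert len(weights) % 4 == 0, "length must be a multiple of 4"
    return all(
        sum(1 for w in weights[i:i + 4] if w != 0) <= 2
        for i in range(0, len(weights), 4)
    )

print(is_2_4_sparse([0.5, 0.0, -1.2, 0.0,  0.0, 0.3, 0.0, 0.7]))  # True
print(is_2_4_sparse([0.5, 0.1, -1.2, 0.0,  0.0, 0.3, 0.0, 0.7]))  # False: 3 non-zeros in first group
```

The hardware exploits this fixed pattern to skip multiplications by the zeroed weights, which is where the latency and throughput gains come from.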

The case study below shows that training models for Structured Sparsity is expected to maintain accuracy:

| Network | Weight mask pattern | Waymo Open Dataset Test Accuracy – 2D Detection 2020, IoU=0.50:0.95 (all sizes) @ FP16 on Host | Steps involved |
|---|---|---|---|
| RetinaNet ResNet-34 | No mask / Dense | 44.7 | Trained from scratch for 26 epochs. |
| RetinaNet ResNet-34 | 2:4 Sparse | 44.8 | Sparsified the dense model as detailed in the sparsity whitepaper, then trained for another 26 epochs. |

Orin DLA Performance

DLA Dense performance

2x DLA images per second on a Jetson AGX Orin 64GB in dense operation measured with JetPack 5.1.1, depending on the power mode:

| Network | Power mode: MAXN | Power mode: 50W | Power mode: 30W | Power mode: 15W |
|---|---|---|---|---|
| RetinaNet ResNeXt-50 (800x800, bs=1) | 78 | 72 | 71 | 36 |
| ResNet-34 backbone (1280x768, bs=1) | 285 | 260 | 255 | 121 |
| RetinaNet ResNet-34 (1280x768, bs=1) | 108 | 98 | 96 | 45 |
| SSD-ResNet-34 (1200x1200, bs=1) | 83 | 76 | 74 | 36 |
| ResNet-50 (224x224, bs=2) | 2037 | 1948 | 1928 | 1072 |
| SSD-MobileNetV1 (300x300, bs=2) | 2664 | 2506 | 2472 | 1313 |

Key takeaways:

DLA Sparse performance

Generally, Structured Sparsity shows performance improvements over dense operation for Convolution layers that are already math-bound. The more math-bound a layer is in dense operation, the higher the expected dense->sparse speedup after applying a 2:4 sparsity pattern.

2x DLA images per second on a Jetson AGX Orin 64GB in sparse operation measured with JetPack 5.1.1, depending on the power mode (with dense->sparse speedup):

| Network | Power mode: MAXN | Power mode: 50W | Power mode: 30W | Power mode: 15W |
|---|---|---|---|---|
| RetinaNet ResNeXt-50 (800x800, bs=1) | 102 (1.31x) | 96 (1.34x) | 95 (1.34x) | 51 (1.43x) |
| ResNet-34 backbone (1280x768, bs=1) | 384 (1.35x) | 360 (1.38x) | 354 (1.39x) | 176 (1.45x) |
| RetinaNet ResNet-34 (1280x768, bs=1) | 143 (1.32x) | 133 (1.36x) | 131 (1.37x) | 66 (1.47x) |
| SSD-ResNet-34 (1200x1200, bs=1) | 103 (1.24x) | 97 (1.28x) | 95 (1.28x) | 49 (1.36x) |

To reproduce, just add `--sparsity=force` to the trtexec commands from scripts/prepare_models/README.md.
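The dense->sparse speedups quoted in parentheses can be recomputed directly from the dense and sparse throughput tables; for example, at power mode MAXN (numbers taken from the tables above):

```python
# Images per second at power mode MAXN, from the dense and sparse tables above.
dense_ips = {
    "RetinaNet ResNeXt-50": 78,
    "ResNet-34 backbone": 285,
    "RetinaNet ResNet-34": 108,
    "SSD-ResNet-34": 83,
}
sparse_ips = {
    "RetinaNet ResNeXt-50": 102,
    "ResNet-34 backbone": 384,
    "RetinaNet ResNet-34": 143,
    "SSD-ResNet-34": 103,
}

for net, dense in dense_ips.items():
    speedup = sparse_ips[net] / dense
    print(f"{net}: {speedup:.2f}x dense->sparse speedup")
```

This reproduces the 1.31x / 1.35x / 1.32x / 1.24x MAXN figures from the table.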

Key takeaways:

DLA Performance per Watt (Power Efficiency)

DLA and GPU form the perfect team for deep learning inference on the SoC. While the GPU delivers the most TOPs in high-power profiles, the DLA excels at power efficiency.

The table below shows the Perf/W ratio of DLA relative to the GPU (accelerator power only; performance metric: images per second) on a Jetson AGX Orin 64GB, measured with JetPack 5.1.1, depending on the power mode:

| Network | Weight mask mode | Power mode: MAXN | Power mode: 50W | Power mode: 30W | Power mode: 15W |
|---|---|---|---|---|---|
| RetinaNet ResNeXt-50 (800x800, bs=1) | Dense | 3.8x | 3.1x | 2.8x | 4.2x |
| RetinaNet ResNeXt-50 (800x800, bs=1) | Sparse | 4.7x | 4.1x | 3.7x | 5.6x |
| ResNet-34 backbone (1280x768, bs=1) | Dense | 4.3x | 3.7x | 3.4x | 4.9x |
| ResNet-34 backbone (1280x768, bs=1) | Sparse | 3.4x | 3.0x | 2.7x | 4.1x |
| RetinaNet ResNet-34 (1280x768, bs=1) | Dense | 4.2x | 3.5x | 3.1x | 4.6x |
| RetinaNet ResNet-34 (1280x768, bs=1) | Sparse | 3.5x | 2.9x | 2.6x | 3.9x |
| SSD-ResNet-34 (1200x1200, bs=1) | Dense | 3.7x | 3.0x | 2.7x | 4.1x |
| SSD-ResNet-34 (1200x1200, bs=1) | Sparse | 3.5x | 3.3x | 2.9x | 4.3x |
| ResNet-50 (224x224, DLA bs=2, GPU bs=256) | Dense | 3.5x | 2.9x | 2.6x | 3.9x |
| SSD-MobileNetV1 (300x300, DLA bs=2, GPU bs=128) | Dense | 4.2x | 3.6x | 3.2x | 5.0x |
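Each entry in the table is the ratio of two efficiency figures: DLA images/s per watt divided by GPU images/s per watt. The computation itself is a one-liner; the throughput and wattage values below are purely hypothetical placeholders, not measurements:

```python
def perf_per_watt_ratio(dla_ips, dla_watts, gpu_ips, gpu_watts):
    """Ratio of DLA images/s-per-watt to GPU images/s-per-watt."""
    return (dla_ips / dla_watts) / (gpu_ips / gpu_watts)

# Hypothetical illustration only -- not measured values:
# a DLA delivering 100 img/s at 10 W vs. a GPU delivering 500 img/s at 200 W.
print(f"{perf_per_watt_ratio(100, 10, 500, 200):.1f}x")
```

A ratio above 1.0x means the DLA produces more inferences per joule than the GPU for that workload and power mode.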

Key takeaways:

ONNX operators supported on DLA

See operators/README.md for details on ONNX operators already supported on DLA and planned to be supported in future releases.

Setup

Install the Python dependencies (only supported on x86 hosts) with:

```shell
python3 -m pip install -r requirements.txt
```