This sample demonstrates QAT training & deploying YOLOv5s on Orin DLA, which includes:
sudo apt update
sudo apt install libopencv-dev libjsoncpp-dev python3-pip git git-lfs
# If you want to run the mAP benchmark on the COCO dataset, download the COCO tool and validation set
pip3 install pycocotools
cd data/
bash download_coco_validation_set.sh
# cmake >= 3.18 is required
# if the pre-installed cmake is lower than 3.18, src/matx_reformat/build_matx_reformat.sh
# will install it for you
Refer to export/README.md.
git clone --recursive https://github.com/NVIDIA-AI-IOT/cuDLA-samples.git
If your OS version is older than Drive OS 6.0.8.0 or JetPack 6.0, please apply trtexec-dla-standalone-trtv8.5.patch (for TensorRT 8.5; for other versions you may need to apply it manually) to trtexec and re-build it.
cp data/trtexec-dla-standalone-trtv8.5.patch /usr/src/tensorrt/
cd /usr/src/tensorrt/
git apply trtexec-dla-standalone-trtv8.5.patch
cd samples/trtexec
sudo make
Build the DLA loadable and compile the MatX reformat library
# Build INT8 and FP16 loadable from ONNX in this project
bash data/model/build_dla_standalone_loadable.sh
# Build matx used in pre-/post-processing
bash src/matx_reformat/build_matx_reformat.sh
Run the sample in cuDLA hybrid mode
make clean
# Run INT8 inference on single image
make run
# Or run COCO validation
make validate_cudla_int8 # or make validate_cudla_fp16
Run the sample in cuDLA standalone mode
# "make clean" is needed when switch between hybrid mode and standalone mode
make clean
# Run INT8 inference on single image
make run USE_DLA_STANDALONE_MODE=1
# Or run COCO validation
make validate_cudla_int8 USE_DLA_STANDALONE_MODE=1
# or make validate_cudla_fp16 USE_DLA_STANDALONE_MODE=1
Run the sample in cuDLA standalone mode with a deterministic semaphore. This is for running the sample on some older Drive OS releases (we tested with 6.0.6.0) and JetPack.
# "make clean" is needed when switch between hybrid mode and standalone mode
make clean
# Run INT8 inference on single image
make run USE_DLA_STANDALONE_MODE=1 USE_DETERMINISTIC_SEMAPHORE=1
# Or run COCO validation
make validate_cudla_int8 USE_DLA_STANDALONE_MODE=1 USE_DETERMINISTIC_SEMAPHORE=1
# or make validate_cudla_fp16 USE_DLA_STANDALONE_MODE=1 USE_DETERMINISTIC_SEMAPHORE=1
YOLOv5s | Official Data | DLA FP16 | DLA INT8 QAT | GPU INT8 QAT |
---|---|---|---|---|
mAP | 37.4 | 37.5 | 37.1 | 36.8 |
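As a quick sanity check, the accuracy deltas in the table above can be computed directly (values copied from the table):

```python
# mAP values copied from the table above.
mAP = {"official": 37.4, "dla_fp16": 37.5, "dla_int8_qat": 37.1, "gpu_int8_qat": 36.8}

# Delta of each deployment path relative to the official YOLOv5s number.
for name in ("dla_fp16", "dla_int8_qat", "gpu_int8_qat"):
    delta = mAP[name] - mAP["official"]
    print(f"{name}: mAP {mAP[name]} ({delta:+.1f} vs official)")
```

The DLA INT8 QAT loadable stays within 0.3 mAP of the official number, and DLA FP16 actually matches it.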
Note: a variant of the DLA loadable (used for the INT8-output configuration discussed in the performance note below) can be built with:
bash data/model/build_dla_standalone_loadable_v2.sh
Platform | GPU clock | Memory clock | DLA clock | TensorRT Version | DLA Version |
---|---|---|---|---|---|
Orin-X | 1275 MHz | 3200 MHz | 1331 MHz | 8.6 | 3.14 |
Batch Size | DLA INT8(int8:hwc4 in + fp16:chw16 out) (ms) | GPU INT8(int8:chw32 in + fp16:chw16 out) (ms) |
---|---|---|
1 | 3.82 | 1.82 |
2 | 7.68 | 2.91 |
4 | 15.17 | 4.99 |
8 | 30.92 | 9.19 |
12 | 46.71 | 13.27 |
16 | 62.54 | 16.87 |
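The per-batch latencies above can be converted into per-image throughput to compare how the two backends scale with batch size (a small helper; latency values copied from the table):

```python
# Per-batch latency in ms, copied from the table above.
dla_int8_ms = {1: 3.82, 2: 7.68, 4: 15.17, 8: 30.92, 12: 46.71, 16: 62.54}
gpu_int8_ms = {1: 1.82, 2: 2.91, 4: 4.99, 8: 9.19, 12: 13.27, 16: 16.87}

def throughput(latency_ms_by_bs):
    """Per-image throughput (images/s) from per-batch latency (ms)."""
    return {bs: bs / ms * 1000.0 for bs, ms in latency_ms_by_bs.items()}

dla_tp = throughput(dla_int8_ms)
gpu_tp = throughput(gpu_int8_ms)
for bs in dla_int8_ms:
    print(f"bs={bs:2d}  DLA {dla_tp[bs]:7.1f} img/s  GPU {gpu_tp[bs]:7.1f} img/s")
```

Note that DLA throughput is nearly flat across batch sizes (~260 img/s), while the GPU continues to scale with larger batches.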
If the loadable is instead built with int8:hwc4 input + int8:chw32 output, DLA INT8 latency improves to about 2.4 ms (bs=1), but this leads to a small accuracy drop. We will optimize this in the future.

This sample demonstrates how to use cuDLA hybrid mode and cuDLA standalone mode for a CUDA->cuDLA->CUDA pipeline. More details on cuDLA hybrid mode and cuDLA standalone mode can be found at https://docs.nvidia.com/cuda/cuda-for-tegra-appnote/index.html#memory-model.
Using cuDLA hybrid mode allows quick integration with other CUDA tasks: all we need to do is register CUDA memory with cuDLA.
Using cuDLA standalone mode avoids CUDA context creation and thus improves parallelism with other GPU tasks. Standalone mode makes use of NvSci for data transfer and synchronization with other modules such as the camera, GPU, or CPU.
Our cuDLA hybrid mode context code and standalone mode context code have no other dependencies on the rest of the sample, so they can be integrated into your own application quickly. Just copy src/cuda_context_hybird. or src/cuda_context_standalone. to your project, add the necessary include paths and link libraries (check ./Makefile), and you can use the code directly.
I/O | Format |
---|---|
INT8 Input | kDLA_LINEAR, kDLA_HWC4, kCHW32 |
FP16 Input | kDLA_LINEAR, kCHW16 |
INT8 Output | kDLA_LINEAR, kCHW32 |
FP16 Output | kDLA_LINEAR, kCHW16 |
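The vectorized formats above pad the channel dimension up to a multiple of the vector size (4 for kDLA_HWC4, 16 for kCHW16, 32 for kCHW32), which affects buffer allocation. A sketch of the resulting buffer sizes, assuming a hypothetical 3x640x640 input (the usual YOLOv5 input shape; it is an assumption here, not stated above):

```python
import math

def dla_buffer_bytes(c, h, w, vec, elem_size):
    """Size of a channel-vectorized buffer: channels are padded up to a
    multiple of `vec` (4 for kDLA_HWC4, 16 for kCHW16, 32 for kCHW32)."""
    padded_c = math.ceil(c / vec) * vec
    return padded_c * h * w * elem_size

# Hypothetical 3x640x640 network input (shape assumed, not from the doc).
c, h, w = 3, 640, 640
print("INT8 kDLA_HWC4 input:", dla_buffer_bytes(c, h, w, 4, 1), "bytes")
print("INT8 kCHW32 input   :", dla_buffer_bytes(c, h, w, 32, 1), "bytes")
print("FP16 kCHW16 input   :", dla_buffer_bytes(c, h, w, 16, 2), "bytes")
```

For a 3-channel input, HWC4 wastes only one padded channel, while CHW32 pads 3 channels up to 32, which is one reason HWC4 is attractive on the input side.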
https://github.com/NVIDIA/Deep-Learning-Accelerator-SW
https://developer.nvidia.com/blog/maximizing-deep-learning-performance-on-nvidia-jetson-orin-with-dla
https://developer.nvidia.com/blog/deploying-yolov5-on-nvidia-jetson-orin-with-cudla-quantization-aware-training-to-inference