
YOLOv5 cuDLA sample

This sample demonstrates QAT training and deployment of YOLOv5s on Orin DLA, which includes:

  1. YOLOv5s QAT training. - see export
  2. Deploy the YOLOv5s QAT model with cuDLA hybrid mode and cuDLA standalone mode.
    • Convert the QAT model to a PTQ model and INT8 calibration cache. - see export
    • Build a DLA standalone loadable with TensorRT (INT8/FP16). - see data/model
    • Load and run the DLA loadable with cuDLA. - see src
  3. Validate DLA performance and accuracy on the COCO 2017 validation dataset. - see test_coco_map


Prerequisite

sudo apt update
sudo apt install libopencv-dev libjsoncpp-dev python3-pip git git-lfs

# If you want to run the mAP benchmark on the COCO dataset, install the COCO tools and download the dataset
pip3 install pycocotools
cd data/
bash download_coco_validation_set.sh

# cmake >= 3.18
# if pre-installed cmake is lower than 3.18, src/matx_reformat/build_matx_reformat.sh
# will install it for you

YOLOv5 QAT Training and ONNX Export

Refer to export/README.md.

Build and Run

git clone --recursive https://github.com/NVIDIA-AI-IOT/cuDLA-samples.git

If your OS version is older than DRIVE OS 6.0.8.0 or JetPack 6.0, please apply trtexec-dla-standalone-trtv8.5.patch (written for TensorRT 8.5; for other versions you may need to apply it manually) to trtexec and rebuild it.

cp data/trtexec-dla-standalone-trtv8.5.patch /usr/src/tensorrt/
cd /usr/src/tensorrt/
git apply trtexec-dla-standalone-trtv8.5.patch
cd samples/trtexec
sudo make

Build loadable and compile matx reformat lib

# Build INT8 and FP16 loadable from ONNX in this project
bash data/model/build_dla_standalone_loadable.sh
# Build matx used in pre-/post-processing
bash src/matx_reformat/build_matx_reformat.sh

Run the sample with cuDLA hybrid mode

make clean
# Run INT8 inference on single image
make run
# Or run COCO validation
make validate_cudla_int8 # or make validate_cudla_fp16

Run the sample with cuDLA standalone mode

# "make clean" is needed when switch between hybrid mode and standalone mode
make clean
# Run INT8 inference on single image
make run USE_DLA_STANDALONE_MODE=1
# Or run COCO validation
make validate_cudla_int8 USE_DLA_STANDALONE_MODE=1
# or make validate_cudla_fp16 USE_DLA_STANDALONE_MODE=1

Run the sample with cuDLA standalone mode and a deterministic semaphore. This is for running the sample on some older DRIVE OS releases (we tested 6.0.6.0) and JetPack.

# "make clean" is needed when switch between hybrid mode and standalone mode
make clean
# Run INT8 inference on single image
make run USE_DLA_STANDALONE_MODE=1 USE_DETERMINISTIC_SEMAPHORE=1
# Or run COCO validation
make validate_cudla_int8 USE_DLA_STANDALONE_MODE=1 USE_DETERMINISTIC_SEMAPHORE=1
# or make validate_cudla_fp16 USE_DLA_STANDALONE_MODE=1 USE_DETERMINISTIC_SEMAPHORE=1

mAP over COCO 2017

|     | YOLOv5s Official Data | DLA FP16 | DLA INT8 QAT | GPU INT8 QAT |
| --- | --- | --- | --- | --- |
| mAP | 37.4 | 37.5 | 37.1 | 36.8 |

Note:

  1. We use an inference resolution of 1x3x672x672 to get this mAP.
  2. Falling back the last 4 layers of the last head to FP16 can increase mAP from 37.1 to 37.3, but performance drops slightly from 4.0 ms to 4.46 ms. This can be tested with a new loadable built by bash data/model/build_dla_standalone_loadable_v2.sh

Performance

Platform GPU clock Memory clock DLA clock TensorRT Version DLA Version
Orin-X 1275 MHz 3200 MHz 1331 MHz 8.6 3.14
Batch Size DLA INT8(int8:hwc4 in + fp16:chw16 out) (ms) GPU INT8(int8:chw32 in + fp16:chw16 out) (ms)
1 3.82 1.82
2 7.68 2.91
4 15.17 4.99
8 30.92 9.19
12 46.71 13.27
16 62.54 16.87

cuDLA Hybrid Mode and cuDLA Standalone Mode

This sample demonstrates how to use cuDLA hybrid mode and cuDLA standalone mode for a CUDA->cuDLA->CUDA pipeline. More details on cuDLA hybrid mode and cuDLA standalone mode can be found at https://docs.nvidia.com/cuda/cuda-for-tegra-appnote/index.html#memory-model.

Using cuDLA hybrid mode allows quick integration with other CUDA tasks; all we need to do is register the CUDA memory with cuDLA.

(Figure: cuDLA hybrid mode)
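As a rough illustration of that flow, here is a simplified hybrid-mode sketch (not the sample's actual source: the loadable contents, buffer sizes, and single-input/single-output layout are placeholders, and error checking is omitted):

// Simplified cuDLA hybrid-mode sketch: plain CUDA allocations are registered
// with cuDLA and a task is submitted on a CUDA stream. See src/ for the real code.
#include <cuda_runtime.h>
#include <cudla.h>
#include <cstdint>
#include <vector>

void run_on_dla_hybrid(const std::vector<uint8_t>& loadable, size_t inSize, size_t outSize)
{
    cudlaDevHandle dev = nullptr;
    cudlaModule module = nullptr;
    cudlaCreateDevice(0, &dev, CUDLA_CUDA_DLA);   // CUDLA_CUDA_DLA selects hybrid mode
    cudlaModuleLoadFromMemory(dev, loadable.data(), loadable.size(), &module, 0);

    // Ordinary CUDA memory, e.g. written by a pre-processing kernel ...
    void* inGpu = nullptr;
    void* outGpu = nullptr;
    cudaMalloc(&inGpu, inSize);
    cudaMalloc(&outGpu, outSize);

    // ... registered with cuDLA so the DLA can read/write it directly.
    uint64_t* inDla = nullptr;
    uint64_t* outDla = nullptr;
    cudlaMemRegister(dev, reinterpret_cast<uint64_t*>(inGpu), inSize, &inDla, 0);
    cudlaMemRegister(dev, reinterpret_cast<uint64_t*>(outGpu), outSize, &outDla, 0);

    // The DLA task is submitted on a CUDA stream, so it runs in order with
    // any CUDA work enqueued on the same stream.
    cudaStream_t stream = nullptr;
    cudaStreamCreate(&stream);
    cudlaTask task = {};
    task.moduleHandle = module;
    task.numInputTensors = 1;
    task.inputTensor = &inDla;
    task.numOutputTensors = 1;
    task.outputTensor = &outDla;
    cudlaSubmitTask(dev, &task, 1, stream, 0);
    cudaStreamSynchronize(stream);

    // Cleanup (status checks omitted for brevity).
    cudlaMemUnregister(dev, inDla);
    cudlaMemUnregister(dev, outDla);
    cudlaModuleUnload(module, 0);
    cudlaDestroyDevice(dev);
    cudaStreamDestroy(stream);
    cudaFree(inGpu);
    cudaFree(outGpu);
}

Pre- and post-processing kernels enqueued on the same stream before and after cudlaSubmitTask then form the CUDA->cuDLA->CUDA pipeline described above.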

Using cuDLA standalone mode avoids creating a CUDA context, and thus improves parallelism with other GPU tasks. cuDLA standalone mode makes use of NvSci for data transfer and synchronization with other modules such as the camera, GPU, or CPU.

(Figure: cuDLA standalone mode)
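In standalone mode the cuDLA device is created with the CUDLA_STANDALONE flag, and tensor memory comes from NvSciBuf objects rather than CUDA allocations. The sketch below shows only the memory-import step under that assumption; the NvSciBuf allocation itself and the NvSciSync semaphore setup are omitted (see the standalone context code in src for the complete flow):

// Simplified cuDLA standalone-mode sketch: the device is created without a CUDA
// context (cudlaCreateDevice(0, &dev, CUDLA_STANDALONE)) and tensor memory is
// imported from an already-allocated NvSciBuf object.
#include <cudla.h>
#include <nvscibuf.h>
#include <cstdint>

uint64_t* import_tensor_memory(cudlaDevHandle dev, NvSciBufObj bufObj, uint64_t size)
{
    cudlaExternalMemoryHandleDesc memDesc = {};
    memDesc.extBufObject = bufObj;   // NvSciBuf object backing the tensor
    memDesc.size = size;

    uint64_t* dlaPtr = nullptr;
    cudlaImportExternalMemory(dev, &memDesc, &dlaPtr, 0);
    return dlaPtr;                   // usable as an input/output pointer in a cudlaTask
}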

Our cuDLA hybrid mode context code and standalone mode context code have no other dependencies on the rest of the sample, so they can be integrated into a user's application quickly. Just copy src/cuda_context_hybird.* or src/cuda_context_standalone.* to your own project, add the necessary include paths and link libraries (check ./Makefile), and then you can make use of our code directly.

Notes

I/O Format

| Tensor | Supported formats |
| --- | --- |
| INT8 Input | kDLA_LINEAR, kDLA_HWC4, kCHW32 |
| FP16 Input | kDLA_LINEAR, kCHW16 |
| INT8 Output | kDLA_LINEAR, kCHW32 |
| FP16 Output | kDLA_LINEAR, kCHW16 |
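For reference on the kDLA_HWC4 input format used for INT8 input: each pixel stores its channels interleaved and padded to 4, which is why the int8 input needs a reformat before it is fed to the DLA (in this sample that reformat is done on the GPU with the MatX code built earlier). A CPU-side sketch of the layout, assuming a 3-channel int8 image in planar CHW order:

// Illustrative CPU-side packing of a 3-channel int8 CHW tensor into kDLA_HWC4
// (H x W x 4, channels interleaved, unused 4th channel left as zero padding).
// The sample itself does this reformat on the GPU; this only shows the layout.
#include <cstdint>
#include <cstddef>
#include <vector>

std::vector<int8_t> chw_to_hwc4(const int8_t* chw, int channels, int height, int width)
{
    std::vector<int8_t> hwc4(static_cast<size_t>(height) * width * 4, 0);
    for (int c = 0; c < channels; ++c) {          // channels <= 4 for HWC4
        for (int h = 0; h < height; ++h) {
            for (int w = 0; w < width; ++w) {
                const size_t src = (static_cast<size_t>(c) * height + h) * width + w;
                const size_t dst = (static_cast<size_t>(h) * width + w) * 4 + c;
                hwc4[dst] = chw[src];
            }
        }
    }
    return hwc4;
}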

Reference Link

https://github.com/NVIDIA/Deep-Learning-Accelerator-SW
https://developer.nvidia.com/blog/maximizing-deep-learning-performance-on-nvidia-jetson-orin-with-dla
https://developer.nvidia.com/blog/deploying-yolov5-on-nvidia-jetson-orin-with-cudla-quantization-aware-training-to-inference