
GNU General Public License v2.0

JEDI

Jetson-aware Embedded Deep learning Inference acceleration framework with TensorRT

JEDI is a simple framework that applies various parallelization techniques to tkDNN-based deep learning applications running on NVIDIA Jetson boards, such as the NVIDIA Jetson AGX Xavier and the NVIDIA Jetson Xavier NX.

The main goal of this tool is to maximize the throughput of deep learning applications by applying these techniques.
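
To give a rough idea of how stage-level pipelining can raise throughput, here is a minimal sketch in Python. It is not JEDI's implementation (JEDI is built on tkDNN and TensorRT); the functions `preprocess`, `infer`, and `postprocess` are hypothetical placeholders, and `run_pipeline` and its `depth` parameter are names made up for this illustration. Each stage runs in its own thread and hands results to the next stage through a bounded queue, so different frames can be in different stages at the same time.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the frame stream


def _worker(fn, inbox, outbox):
    """Apply fn to each item from inbox and pass the result downstream."""
    while True:
        item = inbox.get()
        if item is _DONE:
            outbox.put(_DONE)  # propagate shutdown to the next stage
            return
        outbox.put(fn(item))


def run_pipeline(frames, stages, depth=4):
    """Run frames through stages, one thread per stage, preserving order."""
    # One queue in front of each stage, plus one for the final results.
    queues = [queue.Queue(maxsize=depth) for _ in range(len(stages) + 1)]

    def feed():
        # Feeding from a separate thread lets the caller drain results
        # concurrently, so the bounded queues cannot deadlock.
        for frame in frames:
            queues[0].put(frame)
        queues[0].put(_DONE)

    threads = [threading.Thread(target=feed)]
    threads += [
        threading.Thread(target=_worker, args=(fn, queues[i], queues[i + 1]))
        for i, fn in enumerate(stages)
    ]
    for t in threads:
        t.start()

    results = []
    while True:
        item = queues[-1].get()
        if item is _DONE:
            break
        results.append(item)
    for t in threads:
        t.join()
    return results


# Hypothetical stand-ins for the stages of a detection application.
def preprocess(frame):
    return frame * 2


def infer(tensor):
    return tensor + 1


def postprocess(output):
    return f"box-{output}"


if __name__ == "__main__":
    print(run_pipeline(range(5), [preprocess, infer, postprocess]))
```

In a real application the stages would be image pre-processing, inference on the GPU or DLA, and post-processing. Python threads here only illustrate the structure; they do not run CPU-bound stages in parallel.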

If you use JEDI in your research, please cite the following paper.

@article{10.1145/3508391,
author = {Jeong, EunJin and Kim, Jangryul and Ha, Soonhoi},
title = {TensorRT-Based Framework and Optimization Methodology for Deep Learning Inference on Jetson Boards},
year = {2022},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
issn = {1539-9087},
url = {https://doi.org/10.1145/3508391},
doi = {10.1145/3508391},
journal = {ACM Trans. Embed. Comput. Syst.},
}

Applied Deep Learning Acceleration Techniques

FPS Results (Latest)

FP16

| Network | Baseline GPU | GPU with JEDI | GPU + DLA with JEDI |
|---|---|---|---|
| Yolov2 relu | 74 | 187 | 295 |
| Yolov2tiny relu | 91 | 625 | 701 |
| Yolov3 relu | 50 | 85 | 128 |
| Yolov3tiny relu | 102 | 614 | 729 |
| Yolov4 relu | 45 | 81 | 128 |
| Yolov4tiny relu | 103 | 620 | 598 |
| Yolov4csp relu | 41 | 94 | 141 |
| CSPNet relu | 40 | 65 | 80 |
| Densenet+Yolo relu | 44 | 86 | 118 |

INT8

| Network | Baseline GPU | GPU with JEDI | GPU + DLA with JEDI |
|---|---|---|---|
| Yolov2 relu | 90 | 401 | 502 |
| Yolov2tiny relu | 96 | 749 | - |
| Yolov3 relu | 67 | 169 | 222 |
| Yolov3tiny relu | 110 | 833 | - |
| Yolov4 relu | 59 | 156 | 216 |
| Yolov4tiny relu | 108 | 810 | - |
| Yolov4csp relu | 49 | 180 | 233 |
| CSPNet relu | 63 | 145 | 147 |
| Densenet+Yolo relu | 61 | 186 | 230 |
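
The FPS numbers in the FP16 and INT8 tables above are easier to compare as speedups over the baseline. The short Python sketch below copies those values and divides each JEDI column by the baseline column. The helper names `speedup` and `speedups` are made up for this example, and missing entries (`-` in the table) are represented as `None`.

```python
# FPS values copied from the FP16 and INT8 tables above:
# (baseline GPU, GPU with JEDI, GPU + DLA with JEDI); None means "-" in the table.
FP16 = {
    "Yolov2 relu": (74, 187, 295),
    "Yolov2tiny relu": (91, 625, 701),
    "Yolov3 relu": (50, 85, 128),
    "Yolov3tiny relu": (102, 614, 729),
    "Yolov4 relu": (45, 81, 128),
    "Yolov4tiny relu": (103, 620, 598),
    "Yolov4csp relu": (41, 94, 141),
    "CSPNet relu": (40, 65, 80),
    "Densenet+Yolo relu": (44, 86, 118),
}
INT8 = {
    "Yolov2 relu": (90, 401, 502),
    "Yolov2tiny relu": (96, 749, None),
    "Yolov3 relu": (67, 169, 222),
    "Yolov3tiny relu": (110, 833, None),
    "Yolov4 relu": (59, 156, 216),
    "Yolov4tiny relu": (108, 810, None),
    "Yolov4csp relu": (49, 180, 233),
    "CSPNet relu": (63, 145, 147),
    "Densenet+Yolo relu": (61, 186, 230),
}


def speedup(baseline, fps):
    """Return fps / baseline, or None when the table has no entry."""
    return None if fps is None else fps / baseline


def speedups(table):
    """Map each network to (GPU speedup, GPU + DLA speedup) over baseline."""
    return {
        name: (speedup(base, gpu), speedup(base, gpu_dla))
        for name, (base, gpu, gpu_dla) in table.items()
    }


if __name__ == "__main__":
    for label, table in (("FP16", FP16), ("INT8", INT8)):
        print(label)
        for name, (gpu, gpu_dla) in speedups(table).items():
            dla_text = "-" if gpu_dla is None else f"{gpu_dla:.2f}x"
            print(f"  {name:<20} GPU {gpu:.2f}x   GPU + DLA {dla_text}")
```

One detail the ratios make visible: for Yolov4tiny in FP16, the GPU + DLA configuration (598 FPS) is slightly slower than the GPU-only JEDI configuration (620 FPS).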

FPS Results (Old)

These results are based on an older version of this software. (The target version is commit )

| Network | Baseline GPU (FP16) | GPU with parallelization techniques (FP16) | GPU + DLA pipelining (FP16) |
|---|---|---|---|
| Yolov2 relu | 74 | 193 | 291 |
| Yolov3 relu | 50 | 87 | 133 |
| Yolov4 relu | 43 | 73 | 90 |
| Yolov4tiny relu | 103 | 459 | 504 |
| CSPNet relu | 40 | 62 | 72 |
| Densenet+Yolo relu | 44 | 86 | 120 |

Index

Supported Platforms

Prerequisite

How to Compile JEDI

After installing the forked version of tkDNN, compile JEDI with the following commands.

git clone https://github.com/urmydata/tkDNN.git
mkdir build && cd build
cmake ..
make

How to Run JEDI

where

JEDI Configuration Parameters

How to Add a New Application in JEDI

Supported and Tested Networks

| Network | Trained Dataset | Input size | Network cfg | Weights |
|---|---|---|---|---|
| YOLO v2 [1] with relu | COCO 2014 trainval | 416x416 | cfg | weights |
| YOLO v2 tiny [1] with relu | COCO 2014 trainval | 416x416 | cfg | weights |
| YOLO v3 [2] with relu | COCO 2014 trainval | 416x416 | cfg | weights |
| YOLO v3 tiny [2] with relu | COCO 2014 trainval | 416x416 | cfg | weights |
| Centernet [4] (DLA34 backend) | COCO 2017 train | 512x512 | - | weights |
| Cross Stage Partial Network [7] with relu | COCO 2014 trainval | 416x416 | cfg | weights |
| Yolov4 [8] with relu | COCO 2014 trainval | 416x416 | cfg | weights |
| Yolov4 tiny [8] with relu | COCO 2014 trainval | 416x416 | cfg | weights |
| Scaled Yolov4 [10] with relu | COCO 2017 train | 512x512 | cfg | weights |
| Densenet+Yolo [9] with relu | COCO 2014 trainval | 416x416 | cfg | weights |

References

  1. Redmon, Joseph, and Ali Farhadi. "YOLO9000: better, faster, stronger." Proceedings of the IEEE conference on computer vision and pattern recognition. 2017.
  2. Redmon, Joseph, and Ali Farhadi. "Yolov3: An incremental improvement." arXiv preprint arXiv:1804.02767 (2018).
  3. Yu, Fisher, et al. "Deep layer aggregation." Proceedings of the IEEE conference on computer vision and pattern recognition. 2018.
  4. Zhou, Xingyi, Dequan Wang, and Philipp Krähenbühl. "Objects as points." arXiv preprint arXiv:1904.07850 (2019).
  5. Sandler, Mark, et al. "Mobilenetv2: Inverted residuals and linear bottlenecks." Proceedings of the IEEE conference on computer vision and pattern recognition. 2018.
  6. He, Kaiming, et al. "Deep residual learning for image recognition." Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.
  7. Wang, Chien-Yao, et al. "CSPNet: A New Backbone that can Enhance Learning Capability of CNN." arXiv preprint arXiv:1911.11929 (2019).
  8. Bochkovskiy, Alexey, Chien-Yao Wang, and Hong-Yuan Mark Liao. "YOLOv4: Optimal Speed and Accuracy of Object Detection." arXiv preprint arXiv:2004.10934 (2020).
  9. Bochkovskiy, Alexey. "Yolo v4, v3 and v2 for Windows and Linux." https://github.com/AlexeyAB/darknet
  10. Wang, Chien-Yao, Alexey Bochkovskiy, and Hong-Yuan Mark Liao. "Scaled-YOLOv4: Scaling Cross Stage Partial Network." arXiv preprint arXiv:2011.08036 (2020).