JEDI

Jetson-aware Embedded Deep learning Inference acceleration framework with TensorRT

JEDI is a simple framework for applying various parallelization techniques to tkDNN-based deep learning applications running on NVIDIA Jetson boards such as NVIDIA Jetson AGX Xavier and NVIDIA Jetson Xavier NX.

The main goal of this tool is to maximize the throughput of deep learning applications by combining these techniques.

If you use JEDI in your research, please cite the following paper.

@article{10.1145/3508391,
author = {Jeong, EunJin and Kim, Jangryul and Ha, Soonhoi},
title = {TensorRT-Based Framework and Optimization Methodology for Deep Learning Inference on Jetson Boards},
year = {2022},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
issn = {1539-9087},
url = {https://doi.org/10.1145/3508391},
doi = {10.1145/3508391},
journal = {ACM Trans. Embed. Comput. Syst.},
}

Applied Deep Learning Acceleration Techniques

Preprocessing parallelization
Postprocessing parallelization
Intra-network pipelining with GPU and DLA
Stream assignment per pipelining stage
Intermediate buffer assignment between pipelining stages
Partial network duplication
INT8 quantization on pipelined networks
Batching
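
As a rough illustration of how pipelining with intermediate buffers between stages can work, the sketch below connects three stages (preprocessing, inference, postprocessing) with bounded queues, each stage running in its own thread. It is not JEDI's implementation: `StageBuffer`, `run_pipeline`, and the three placeholder stage functions are hypothetical names, and the integer arithmetic merely stands in for real image processing and TensorRT execution.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical bounded buffer placed between two pipeline stages.
// (Illustrative only; not a class taken from JEDI.)
template <typename T>
class StageBuffer {
public:
    explicit StageBuffer(std::size_t capacity) : capacity_(capacity) {
        assert(capacity > 0);
    }

    // Blocks while the buffer is full, then stores the item.
    void push(T item) {
        std::unique_lock<std::mutex> lock(mutex_);
        not_full_.wait(lock, [this] { return queue_.size() < capacity_; });
        queue_.push(std::move(item));
        not_empty_.notify_one();
    }

    // Signals that no more items will arrive.
    void close() {
        std::lock_guard<std::mutex> lock(mutex_);
        closed_ = true;
        not_empty_.notify_all();
    }

    // Blocks until an item is available. Returns false once the buffer
    // is closed and fully drained.
    bool pop(T& out) {
        std::unique_lock<std::mutex> lock(mutex_);
        not_empty_.wait(lock, [this] { return !queue_.empty() || closed_; });
        if (queue_.empty()) return false;
        out = std::move(queue_.front());
        queue_.pop();
        not_full_.notify_one();
        return true;
    }

private:
    std::size_t capacity_;
    std::queue<T> queue_;
    std::mutex mutex_;
    std::condition_variable not_full_;
    std::condition_variable not_empty_;
    bool closed_ = false;
};

// Placeholder stage bodies (assumptions standing in for real work).
int preprocess(int frame_id) { return frame_id * 2; }
int infer(int input) { return input + 1; }
int postprocess(int output) { return output * 10; }

// Runs the three stages concurrently; buffer_size bounds each queue.
std::vector<int> run_pipeline(int num_frames, std::size_t buffer_size) {
    StageBuffer<int> pre_to_infer(buffer_size);
    StageBuffer<int> infer_to_post(buffer_size);
    std::vector<int> results;

    std::thread pre_stage([&] {
        for (int i = 0; i < num_frames; ++i) pre_to_infer.push(preprocess(i));
        pre_to_infer.close();
    });
    std::thread infer_stage([&] {
        int item = 0;
        while (pre_to_infer.pop(item)) infer_to_post.push(infer(item));
        infer_to_post.close();
    });
    std::thread post_stage([&] {
        int item = 0;
        while (infer_to_post.pop(item)) results.push_back(postprocess(item));
    });

    pre_stage.join();
    infer_stage.join();
    post_stage.join();
    return results;
}
```

Calling `run_pipeline(4, 2)` processes four frames with at most two items waiting between any two stages. Because each stage here is a single thread, results come out in input order. The buffer capacity trades memory for tolerance to uneven stage latencies.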

FPS Results (Latest)

Test environment: NVIDIA Jetson AGX Xavier (MAXN mode with jetson_clocks), Jetpack 4.3
Input image size: 416x416
The latest experiments were run with the opencv_parallel_num = 0 option.

FP16

Network	Baseline GPU	GPU with JEDI	GPU + DLA with JEDI
Yolov2 relu	74	187	295
Yolov2tiny relu	91	625	701
Yolov3 relu	50	85	128
Yolov3tiny relu	102	614	729
Yolov4 relu	45	81	128
Yolov4tiny relu	103	620	598
Yolov4csp relu	41	94	141
CSPNet relu	40	65	80
Densenet+Yolo relu	44	86	118

INT8

Network	Baseline GPU	GPU with JEDI	GPU + DLA with JEDI
Yolov2 relu	90	401	502
Yolov2tiny relu	96	749	-
Yolov3 relu	67	169	222
Yolov3tiny relu	110	833	-
Yolov4 relu	59	156	216
Yolov4tiny relu	108	810	-
Yolov4csp relu	49	180	233
CSPNet relu	63	145	147
Densenet+Yolo relu	61	186	230

FPS Results (Old)

These results are based on an older version of this software. (The target version is commit )

Test environment: NVIDIA Jetson AGX Xavier (MAXN mode with jetson_clocks), Jetpack 4.3
Input image size: 416x416

Network	Baseline GPU (FP16)	GPU with parallelization techniques (FP16)	GPU + DLA pipelining (FP16)
Yolov2 relu	74	193	291
Yolov3 relu	50	87	133
Yolov4 relu	43	73	90
Yolov4tiny relu	103	459	504
CSPNet relu	40	62	72
Densenet+Yolo relu	44	86	120

Supported Platforms

NVIDIA Jetson boards are supported. (Tested on NVIDIA Jetson AGX Xavier and NVIDIA Jetson Xavier NX)

Prerequisites

Forked tkDNN
All dependencies required by tkDNN
Jetpack 4.3 or higher
libconfig++
OpenMP

How to Compile JEDI

After installing the forked version of tkDNN, compile JEDI with the following commands.

git clone https://github.com/urmydata/tkDNN.git
mkdir build && cd build
cmake ..
make

How to Run JEDI

Run JEDI with the following parameters.

./build/bin/proc -c <JEDI configuration file> -r <JSON result file> -p <tegrastats log output file> -t <inference time output file>

where

-c <JEDI configuration file>: JEDI configuration file (see the JEDI Configuration Parameters section).
-r <JSON result file> (optional): Output file for detection results in COCO JSON format.
-p <tegrastats log output file> (optional): Output file for the tegrastats log recorded during inference, which is used to compute utilization and power.
-t <inference time output file> (optional): Output file that contains the total inference time.
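
To make the relationship between these options concrete, here is a minimal sketch of a parser for the four flags. It is an illustration only and not the parser used by `proc`: `ProcOptions` and `parse_proc_options` are hypothetical names, the sketch assumes `-c` is mandatory because it is the only flag not marked optional, and it does not handle `-h`.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical container for the four options described above.
struct ProcOptions {
    std::string config_path;  // -c (assumed to be required)
    std::string result_path;  // -r (optional)
    std::string power_log;    // -p (optional)
    std::string time_path;    // -t (optional)
};

// Parses "-flag value" pairs; throws std::invalid_argument on bad input.
ProcOptions parse_proc_options(int argc, const char* const argv[]) {
    assert(argv != nullptr);
    ProcOptions opts;
    for (int i = 1; i < argc; ++i) {
        const std::string flag = argv[i];
        if (flag != "-c" && flag != "-r" && flag != "-p" && flag != "-t") {
            throw std::invalid_argument("unknown option: " + flag);
        }
        if (i + 1 >= argc) {
            throw std::invalid_argument("missing value for " + flag);
        }
        const std::string value = argv[++i];
        if (flag == "-c") {
            opts.config_path = value;
        } else if (flag == "-r") {
            opts.result_path = value;
        } else if (flag == "-p") {
            opts.power_log = value;
        } else {
            opts.time_path = value;
        }
    }
    if (opts.config_path.empty()) {
        throw std::invalid_argument("-c <JEDI configuration file> is required");
    }
    return opts;
}
```

Passing the arguments `-c sample.cfg -r result.json -p power.log` fills the first three fields and leaves `time_path` empty, which a caller could treat as "do not write the inference time file".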

Example commands of running JEDI

./build/bin/proc -h                                              # print help message
./build/bin/proc -c sample.cfg -r result.json -p power.log       # an example of running

JEDI Configuration Parameters

The JEDI configuration file uses the libconfig format.
sample.cfg is a sample configuration file with a detailed explanation of each configuration parameter.

How to Add a New Application in JEDI

JEDI provides an interface to add a new tkDNN-based deep learning application.
Currently, YoloApplication and CenternetApplication are implemented.
1. Write your own deep learning application with the inference application implementation interface.
  - readCustomOptions: Add custom options used by this application.
  - createNetwork: Create a tkDNN-based network.
  - referNetworkRTInfo: Refer to the NetworkRT class if any information in this class is needed.
  - initializePreprocessing: Initialize preprocessing and the input dataset.
  - initializePostprocessing: Initialize postprocessing.
  - preprocessing: Execute preprocessing.
  - postprocessing: Execute postprocessing (batched execution must be performed inside this method).
  - Call order: readCustomOptions => createNetwork => referNetworkRTInfo => initializePreprocessing => initializePostprocessing => preprocessing/postprocessing
  - You can also implement your own dataset with the dataset implementation interface.
2. Register your application by adding the following line to your source code.
```
REGISTER_JEDI_APPLICATION([Your application class name]);
```
3. Add your source code to CMakeLists.txt
4. Insert app_type = "[Your application class name]" in the JEDI configuration file.
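
The steps above can be sketched with a simplified, self-contained mock. The method names and call order follow the list above, but everything else is an assumption: `SketchApplication`, `registry`, `SKETCH_REGISTER_APPLICATION`, and `runApplication` are hypothetical stand-ins, the real JEDI interface takes parameters that are omitted here, and the real framework runs preprocessing and postprocessing in parallel rather than in a simple loop.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <memory>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for the JEDI application interface.
// The empty parameter lists are assumptions made for brevity.
class SketchApplication {
public:
    virtual ~SketchApplication() = default;
    virtual void readCustomOptions() = 0;
    virtual void createNetwork() = 0;
    virtual void referNetworkRTInfo() = 0;
    virtual void initializePreprocessing() = 0;
    virtual void initializePostprocessing() = 0;
    virtual void preprocessing() = 0;
    virtual void postprocessing() = 0;
};

// Hypothetical registry keyed by class name, so that a string such as the
// app_type value can select which application to construct.
using Factory = std::function<std::unique_ptr<SketchApplication>()>;
std::map<std::string, Factory>& registry() {
    static std::map<std::string, Factory> instance;
    return instance;
}

// Stand-in for REGISTER_JEDI_APPLICATION; shows the self-registration idea
// only and is not the macro defined by JEDI.
#define SKETCH_REGISTER_APPLICATION(ClassName)                        \
    static const bool ClassName##_registered =                        \
        registry().emplace(#ClassName, [] {                           \
            return std::unique_ptr<SketchApplication>(new ClassName); \
        }).second

// Example application that only records which methods were called.
class MyApplication : public SketchApplication {
public:
    std::vector<std::string> calls;
    void readCustomOptions() override { calls.push_back("readCustomOptions"); }
    void createNetwork() override { calls.push_back("createNetwork"); }
    void referNetworkRTInfo() override { calls.push_back("referNetworkRTInfo"); }
    void initializePreprocessing() override { calls.push_back("initializePreprocessing"); }
    void initializePostprocessing() override { calls.push_back("initializePostprocessing"); }
    void preprocessing() override { calls.push_back("preprocessing"); }
    void postprocessing() override { calls.push_back("postprocessing"); }
};

SKETCH_REGISTER_APPLICATION(MyApplication);

// Drives an application in the documented call order (sequentially here).
void runApplication(SketchApplication& app, int iterations) {
    assert(iterations >= 0);
    app.readCustomOptions();
    app.createNetwork();
    app.referNetworkRTInfo();
    app.initializePreprocessing();
    app.initializePostprocessing();
    for (int i = 0; i < iterations; ++i) {
        app.preprocessing();
        app.postprocessing();
    }
}
```

In this mock, `registry().at("MyApplication")()` builds the application from its class name, mirroring how the `app_type` string in the configuration file selects a registered class. `runApplication(*app, 1)` then invokes the seven methods in the order listed above.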

Supported and Tested Networks

Network	Trained Dataset	Input size	Network cfg	Weights
YOLO v2¹ with relu	COCO 2014 trainval	416x416	cfg	weights
YOLO v2 tiny¹ with relu	COCO 2014 trainval	416x416	cfg	weights
YOLO v3² with relu	COCO 2014 trainval	416x416	cfg	weights
YOLO v3 tiny² with relu	COCO 2014 trainval	416x416	cfg	weights
Centernet⁴ (DLA34 backend)	COCO 2017 train	512x512	-	weights
Cross Stage Partial Network⁷ with relu	COCO 2014 trainval	416x416	cfg	weights
Yolov4⁸ with relu	COCO 2014 trainval	416x416	cfg	weights
Yolov4 tiny⁸ with relu	COCO 2014 trainval	416x416	cfg	weights
Scaled Yolov4¹⁰ with relu	COCO 2017 train	512x512	cfg	weights
Densenet+Yolo⁹ with relu	COCO 2014 trainval	416x416	cfg	weights

References

1. Redmon, Joseph, and Ali Farhadi. "YOLO9000: better, faster, stronger." Proceedings of the IEEE conference on computer vision and pattern recognition. 2017.
2. Redmon, Joseph, and Ali Farhadi. "Yolov3: An incremental improvement." arXiv preprint arXiv:1804.02767 (2018).
3. Yu, Fisher, et al. "Deep layer aggregation." Proceedings of the IEEE conference on computer vision and pattern recognition. 2018.
4. Zhou, Xingyi, Dequan Wang, and Philipp Krähenbühl. "Objects as points." arXiv preprint arXiv:1904.07850 (2019).
5. Sandler, Mark, et al. "Mobilenetv2: Inverted residuals and linear bottlenecks." Proceedings of the IEEE conference on computer vision and pattern recognition. 2018.
6. He, Kaiming, et al. "Deep residual learning for image recognition." Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.
7. Wang, Chien-Yao, et al. "CSPNet: A New Backbone that can Enhance Learning Capability of CNN." arXiv preprint arXiv:1911.11929 (2019).
8. Bochkovskiy, Alexey, Chien-Yao Wang, and Hong-Yuan Mark Liao. "YOLOv4: Optimal Speed and Accuracy of Object Detection." arXiv preprint arXiv:2004.10934 (2020).
9. Bochkovskiy, Alexey. "Yolo v4, v3 and v2 for Windows and Linux" (https://github.com/AlexeyAB/darknet)
10. Wang, Chien-Yao, Alexey Bochkovskiy, and Hong-Yuan Mark Liao. "Scaled-YOLOv4: Scaling Cross Stage Partial Network." arXiv preprint arXiv:2011.08036 (2020).
