Jetson-aware Embedded Deep learning Inference acceleration framework with TensorRT
JEDI is a simple framework for applying various parallelization techniques to tkDNN-based deep learning applications running on NVIDIA Jetson boards such as the NVIDIA Jetson AGX Xavier and NVIDIA Jetson Xavier NX.
Its main goal is to maximize the throughput of deep learning applications.
If you use JEDI in your research, please cite the following paper.
@article{10.1145/3508391,
author = {Jeong, EunJin and Kim, Jangryul and Ha, Soonhoi},
title = {TensorRT-Based Framework and Optimization Methodology for Deep Learning Inference on Jetson Boards},
year = {2022},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
issn = {1539-9087},
url = {https://doi.org/10.1145/3508391},
doi = {10.1145/3508391},
journal = {ACM Trans. Embed. Comput. Syst.},
}

| Network | Baseline GPU | GPU with JEDI | GPU + DLA with JEDI |
|---|---|---|---|
| Yolov2 relu | 74 | 187 | 295 |
| Yolov2tiny relu | 91 | 625 | 701 |
| Yolov3 relu | 50 | 85 | 128 |
| Yolov3tiny relu | 102 | 614 | 729 |
| Yolov4 relu | 45 | 81 | 128 |
| Yolov4tiny relu | 103 | 620 | 598 |
| Yolov4csp relu | 41 | 94 | 141 |
| CSPNet relu | 40 | 65 | 80 |
| Densenet+Yolo relu | 44 | 86 | 118 |

| Network | Baseline GPU | GPU with JEDI | GPU + DLA with JEDI |
|---|---|---|---|
| Yolov2 relu | 90 | 401 | 502 |
| Yolov2tiny relu | 96 | 749 | - |
| Yolov3 relu | 67 | 169 | 222 |
| Yolov3tiny relu | 110 | 833 | - |
| Yolov4 relu | 59 | 156 | 216 |
| Yolov4tiny relu | 108 | 810 | - |
| Yolov4csp relu | 49 | 180 | 233 |
| CSPNet relu | 63 | 145 | 147 |
| Densenet+Yolo relu | 61 | 186 | 230 |

These results are based on an older version of this software. (The target version is commit )

| Network | Baseline GPU (FP16) | GPU with parallelization techniques (FP16) | GPU + DLA pipelining (FP16) |
|---|---|---|---|
| Yolov2 relu | 74 | 193 | 291 |
| Yolov3 relu | 50 | 87 | 133 |
| Yolov4 relu | 43 | 73 | 90 |
| Yolov4tiny relu | 103 | 459 | 504 |
| CSPNet relu | 40 | 62 | 72 |
| Densenet+Yolo relu | 44 | 86 | 120 |

After installing the forked version of tkDNN, compile JEDI with the following commands.
git clone https://github.com/urmydata/tkDNN.git
mkdir build && cd build
cmake ..
make
./build/bin/proc -c <JEDI configuration file> -r <JSON result file> -p <tegrastats log> -t <inference time output file>
where

- `-c <JEDI configuration file>`: JEDI configuration file (an explanation of the JEDI configuration file is shown in here)
- `-r <JSON result file>` (optional): Output file of detection results in COCO JSON format.
- `-p <tegrastats log output file>` (optional): Tegrastats log output file written during inference, which is used for computing the utilization and power.
- `-t <inference time output file>` (optional): Output file that contains the total inference time.

./build/bin/proc -h # print help message
./build/bin/proc -c sample.cfg -r result.json -p power.log # an example of running
`YoloApplication` and `CenternetApplication` are implemented.

- `readCustomOptions`: Add custom options used by this application.
- `createNetwork`: Create a tkDNN-based network.
- `referNetworkRTInfo`: Refer to the NetworkRT class if any information in this class is needed.
- `initializePreprocessing`: Initialize preprocessing and the input dataset.
- `initializePostprocessing`: Initialize postprocessing.
- `preprocessing`: Execute preprocessing.
- `postprocessing`: Execute postprocessing (batched execution must be performed inside this method).

`readCustomOptions` => `createNetwork` => `referNetworkRTInfo` => `initializePreprocessing` => `initializePostprocessing` => `preprocessing`/`postprocessing`
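
The chain above can be read as a fixed lifecycle. The following is a minimal, self-contained sketch of that order, assuming the `=>` chain denotes the order in which the framework calls the methods. The base class `AppLifecycleSketch`, the driver `runLifecycle`, the class `RecordingApplication`, and all parameterless `void` signatures are hypothetical and exist only for illustration; JEDI's real method signatures may differ and are not shown here. How `preprocessing` and `postprocessing` are scheduled relative to each other is not specified above, so the sketch simply calls them once per iteration.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for an application interface. The method names come
// from the list above; the empty signatures are assumptions.
class AppLifecycleSketch {
public:
    virtual ~AppLifecycleSketch() = default;
    virtual void readCustomOptions() = 0;
    virtual void createNetwork() = 0;
    virtual void referNetworkRTInfo() = 0;
    virtual void initializePreprocessing() = 0;
    virtual void initializePostprocessing() = 0;
    virtual void preprocessing() = 0;
    virtual void postprocessing() = 0;
};

// Illustrative application that only records which method was called.
class RecordingApplication : public AppLifecycleSketch {
public:
    std::vector<std::string> calls;

    void readCustomOptions() override { calls.push_back("readCustomOptions"); }
    void createNetwork() override { calls.push_back("createNetwork"); }
    void referNetworkRTInfo() override { calls.push_back("referNetworkRTInfo"); }
    void initializePreprocessing() override { calls.push_back("initializePreprocessing"); }
    void initializePostprocessing() override { calls.push_back("initializePostprocessing"); }
    void preprocessing() override { calls.push_back("preprocessing"); }
    void postprocessing() override { calls.push_back("postprocessing"); }
};

// Hypothetical driver: runs the setup methods once in the listed order,
// then one preprocessing/postprocessing pair per iteration.
inline void runLifecycle(AppLifecycleSketch& app, int iterations) {
    app.readCustomOptions();
    app.createNetwork();
    app.referNetworkRTInfo();
    app.initializePreprocessing();
    app.initializePostprocessing();
    for (int i = 0; i < iterations; ++i) {
        app.preprocessing();
        app.postprocessing();
    }
}
```

Calling `runLifecycle(app, 2)` on a `RecordingApplication` leaves nine entries in `app.calls`: the five setup steps followed by two `preprocessing`/`postprocessing` pairs.
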
Register the application class with `REGISTER_JEDI_APPLICATION([Your application class name]);` and set `app_type = "[Your application class name]"` in the JEDI configuration file.

| Network | Trained Dataset | Input size | Network cfg | Weights |
|---|---|---|---|---|
| YOLO v2 [1] with relu | COCO 2014 trainval | 416x416 | cfg | weights |
| YOLO v2 tiny [1] with relu | COCO 2014 trainval | 416x416 | cfg | weights |
| YOLO v3 [2] with relu | COCO 2014 trainval | 416x416 | cfg | weights |
| YOLO v3 tiny [2] with relu | COCO 2014 trainval | 416x416 | cfg | weights |
| Centernet [4] (DLA34 backend) | COCO 2017 train | 512x512 | - | weights |
| Cross Stage Partial Network [7] with relu | COCO 2014 trainval | 416x416 | cfg | weights |
| Yolov4 [8] with relu | COCO 2014 trainval | 416x416 | cfg | weights |
| Yolov4 tiny [8] with relu | COCO 2014 trainval | 416x416 | cfg | weights |
| Scaled Yolov4 [10] with relu | COCO 2017 train | 512x512 | cfg | weights |
| Densenet+Yolo [9] with relu | COCO 2014 trainval | 416x416 | cfg | weights |