AlexeyAB / darknet

YOLOv4 / Scaled-YOLOv4 / YOLO - Neural Networks for Object Detection (Windows and Linux version of Darknet)
http://pjreddie.com/darknet/

Speedup of INT8/XNOR on Tensor Cores far less than claimed #2365

Closed JC-13 closed 5 years ago

JC-13 commented 5 years ago

I have been testing the speed of my custom-trained yolov3-tiny with 4 classes on a 2080ti (Turing) and a Xavier (Volta). However, both XNOR and INT8 give a <10% speedup compared to normal FP32. All testing has been done with the same 1080p video file as input. The repo was git-pulled yesterday (07/02/19).

[image: benchmark results]

MAKEFILE-Xavier: GPU=1 CUDNN=1 CUDNN_HALF=1 OPENCV=1 AVX=0 OPENMP=0 LIBSO=0
MAKEFILE-2080ti: GPU=1 CUDNN=1 CUDNN_HALF=1 OPENCV=1 AVX=1 OPENMP=1 LIBSO=0
I have also included the Makefile lines for CC=7.5 and CC=7.3 respectively.

Any ideas why there is no significant speedup from using mixed precision?

njgre6 commented 5 years ago

@AlexeyAB I am also fairly interested in this; I was looking at using low-precision inference for a real-time embedded system with object detection. I would love to know why the above results are not as good as the theoretical numbers.

JC-13 commented 5 years ago

It seems the real solution is just to use NVIDIA's trt-yolo-app, based on TensorRT. I can't comment on the accuracy, but the speed was significantly better:

[image: trt-yolo-app benchmark results]

Note: I used 544 because trt-yolo-app only accepts square inputs and can't handle video.
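Feeding a 16:9 frame into a square-input network normally means letterboxing: scale to fit the square, then pad the short side. A minimal sketch of that arithmetic (the helper is hypothetical, not part of trt-yolo-app), assuming a 1920x1080 source and a 544x544 network input:

```python
def letterbox_dims(src_w, src_h, dst):
    """Scale to fit inside a dst x dst square, preserving aspect ratio,
    then pad the short side equally on both edges."""
    scale = dst / max(src_w, src_h)
    new_w, new_h = round(src_w * scale), round(src_h * scale)
    pad_x, pad_y = (dst - new_w) // 2, (dst - new_h) // 2
    return new_w, new_h, pad_x, pad_y

# A 1920x1080 frame into a 544x544 input:
# resized to 544x306, padded with 119 px on top and bottom
print(letterbox_dims(1920, 1080, 544))
```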

AlexeyAB commented 5 years ago

@JC-13 Hi,

> All testing has been done with the same 1080p video file as input. Repo was git pulled yesterday (07/02/19) ... Note: Used 544 because trt-yolo-app only accepts square inputs and can't handle video

Try checking GPU usage during detection. It looks like your CPU just can't capture more than 205-230 frames per second from the video file. Also, post-processing on the CPU is still not optimal. So try testing both repos with a single image.
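The capture-bottleneck point can be illustrated with back-of-the-envelope math: in a pipelined video loop, end-to-end throughput is capped by the slowest stage, so a faster network barely changes the measured FPS once decoding saturates. A sketch, where the 230 fps capture ceiling and the inference rates are illustrative numbers, not measurements:

```python
def pipeline_fps(*stage_fps):
    """End-to-end throughput of a pipelined loop is limited by the slowest stage."""
    return min(stage_fps)

capture = 230  # approx. max frames/s the CPU decodes from a 1080p file (illustrative)
for infer in (240, 480, 960):  # fp32 / fp16 / int8-xnor inference rates (illustrative)
    # Every case is capped at the 230 fps decode ceiling
    print(infer, "->", pipeline_fps(capture, infer))
```

This is why benchmarking on a single image, as suggested above, isolates the network's actual inference speed from the video-decoding ceiling.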

  1. Try to update your code from this GitHub, last couple commits.

  2. Try to train your (not Tiny) Full-XNOR-net model at 608x608 or 544x544 using this cfg file yolov3-spp_xnor_obj.cfg.txt and this pre-trained file: https://drive.google.com/file/d/1d4CkgR--7bEEN0kWy-osR3kjLVFDIrnl/view?usp=sharing

  3. Then try to test it on your image, and divide 1000 ms by the measured ms to get the FPS:

darknet.exe detector test data/obj.data yolov3-spp_xnor_obj.cfg backup/yolov3-spp_xnor_obj_last.weights -thresh 0.15 image2.jpg
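The ms-to-FPS conversion in step 3 is just the reciprocal of the per-image latency:

```python
def fps_from_ms(ms_per_image):
    """Convert per-image inference latency (ms) to frames per second."""
    return 1000.0 / ms_per_image

# e.g. 13.5 ms per image -> ~74.1 FPS
print(round(fps_from_ms(13.5), 1))
```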


> MAKEFILE-Xavier GPU=1 CUDNN=1 CUDNN_HALF=1 OPENCV=1 AVX=0 OPENMP=0 LIBSO=0 MAKEFILE-2080ti GPU=1 CUDNN=1 CUDNN_HALF=1 OPENCV=1 AVX=1 OPENMP=1 LIBSO=0 Have also included the lines in makefile for CC=7.5 and CC=7.3 respectively.

The 2080ti (Turing) is CC 7.5, and the Xavier (Volta) is CC 7.2 (not 7.3):

> It supports CUDA 10 with a compute capability of sm_72.


Also, I haven't optimized INT8 for Tensor Cores, because there is a bug in cuDNN which has to be bypassed in a non-standard way (apparently TensorRT does this): https://github.com/AlexeyAB/darknet/issues/407#issuecomment-454248179

AlexeyAB commented 5 years ago

Commits from Feb 12, 2019 were used. Network resolution is 608x608 in both cases.

Test commands:

| Model | RTX 2070 CUDNN_HALF=0, ms | RTX 2070 CUDNN_HALF=1, ms | Speedup X times |
|---|---|---|---|
| yolov3-spp.cfg 608x608, Float 32/16-bit precision | 40.9 | 27.2 | 1.5x |
| yolov3-spp_xnor_obj.cfg.txt 608x608, CC7.5 (Tensor Cores for XNOR), 1-bit precision | 13.5 | 13.2 | 1.0x |
| Speedup X times | 3.0x | 2.0x | - |
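The speedup figures follow directly from the latency ratios; a quick check using the numbers reported above:

```python
fp32, fp16 = 40.9, 27.2    # yolov3-spp, CUDNN_HALF=0 / 1, ms
xnor0, xnor1 = 13.5, 13.2  # XNOR model, CUDNN_HALF=0 / 1, ms

print(round(fp32 / fp16, 1))   # fp32 -> fp16 speedup: 1.5x
print(round(fp32 / xnor0, 1))  # fp32 -> XNOR speedup: 3.0x
print(round(fp16 / xnor1, 1))  # fp16 -> XNOR speedup: ~2.1x (reported as 2.0x)
```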

There is still room for optimization.

Used: CUDA 10.0, cuDNN 7.4.2, OpenCV 3.2.0, Windows 7 x64, MSVS 2015.

nVidia GPU: GeForce RTX 2070 CC7.5 (Turing, TU106) - 7.5 TFlops-SP (Tensor Cores: 59.7 TFlops-HP).

If CUDNN_HALF=1 is set, then Tensor Cores are used for floats; otherwise they are not. Tensor Cores are used for XNOR in any case, if the GPU has CC >= 7.3 and this line is uncommented: https://github.com/AlexeyAB/darknet/blob/3d9c8530a0aa983225d2607d14b9cb047e63d305/Makefile#L27


This file was used to train the XNOR model: https://drive.google.com/open?id=1IT-vvyxRLlxY5g9rJp_G2U3TXYphjBv8

XNOR-net training process: [image: chart_yolov3-spp_xnor_obj]

LukeAI commented 5 years ago

@AlexeyAB Thanks very much for providing the pretrained feature-extractor weights above - is that using OpenImages? Would you be kind enough to also share your final yolov3-spp_xnor_obj.weights? I want to train my own using the above .cfg and pretrained weights, but would like to compare against yours as a reference.

AlexeyAB commented 5 years ago

@LukeAI Hi,

I can't share yolov3-spp_xnor_obj.weights, but I can share new pre-trained weights: https://drive.google.com/open?id=1IT-vvyxRLlxY5g9rJp_G2U3TXYphjBv8 It should give you a better mAP for your training.

LukeAI commented 5 years ago

Thank you very much! Are these for OpenImages? Presumably trained at 448x448? Would that transfer OK to 608x608?

AlexeyAB commented 5 years ago

@LukeAI It is trained on ImageNet (137 GB, ~1,300,000 images), ILSVRC2012_img_train.tar: https://github.com/AlexeyAB/darknet/blob/master/scripts/get_imagenet_train.sh

You can use it for training on the OpenImages dataset at 608x608.