AlexeyAB / darknet

YOLOv4 / Scaled-YOLOv4 / YOLO - Neural Networks for Object Detection (Windows and Linux version of Darknet)
http://pjreddie.com/darknet/

Speedup of INT8/XNOR on Tensor Cores far less than claimed #2365

Closed JC-13 closed 5 years ago

JC-13 commented 5 years ago

I have been testing the speed of my custom-trained yolov3-tiny with 4 classes on a 2080ti (Turing) and a Xavier (Volta). However, both XNOR and INT8 give a <10% speedup compared to normal FP32. All testing has been done with the same 1080p video file as input. The repo was git-pulled yesterday (07/02/19).

[image: benchmark results]

MAKEFILE-Xavier: GPU=1 CUDNN=1 CUDNN_HALF=1 OPENCV=1 AVX=0 OPENMP=0 LIBSO=0
MAKEFILE-2080ti: GPU=1 CUDNN=1 CUDNN_HALF=1 OPENCV=1 AVX=1 OPENMP=1 LIBSO=0
I have also included the Makefile lines for CC=7.5 and CC=7.3 respectively.

Any ideas why there is no significant speedup from using mixed precision?

njgre6 commented 5 years ago

@AlexeyAB I am also fairly interested in this; I was looking at using low-precision inference for a real-time embedded system with object detection. I would love to know why the above results are not as good as the theoretical numbers.

JC-13 commented 5 years ago

It seems the real solution is just to use NVIDIA's trt-yolo-app, based on TensorRT. I can't comment on the accuracy, but the speed was significantly better:

[image: trt-yolo-app benchmark results]

Note: I used 544 because trt-yolo-app only accepts square inputs and can't handle video.
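Feeding a 16:9 frame into a square-input network normally means letterboxing: scale to fit the square, then pad the short side. A minimal sketch of that arithmetic (the helper is hypothetical, not part of trt-yolo-app), assuming a 1920x1080 source and a 544x544 network input:

```python
def letterbox_dims(src_w, src_h, dst):
    """Scale to fit inside a dst x dst square, preserving aspect ratio,
    then pad the short side equally on both edges."""
    scale = dst / max(src_w, src_h)
    new_w, new_h = round(src_w * scale), round(src_h * scale)
    pad_x, pad_y = (dst - new_w) // 2, (dst - new_h) // 2
    return new_w, new_h, pad_x, pad_y

# A 1920x1080 frame into a 544x544 input:
# resized to 544x306, padded with 119 px on top and bottom
print(letterbox_dims(1920, 1080, 544))
```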

AlexeyAB commented 5 years ago

@JC-13 Hi,

> All testing has been done with the same 1080p video file as input. Repo was git pulled yesterday (07/02/19) ... Note: Used 544 because trt-yolo-app only accepts square inputs and can't handle video

Try checking GPU usage during detection. It looks like your CPU just can't capture more than 205-230 frames per second from the video file. Also, post-processing on the CPU is still not optimal. So try testing both repos with a single image.
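The capture-bottleneck point can be illustrated with back-of-the-envelope math: in a pipelined video loop, end-to-end throughput is capped by the slowest stage, so a faster network barely changes the measured FPS once decoding saturates. A sketch, where the 230 fps capture ceiling and the inference rates are illustrative numbers, not measurements:

```python
def pipeline_fps(*stage_fps):
    """End-to-end throughput of a pipelined loop is limited by the slowest stage."""
    return min(stage_fps)

capture = 230  # approx. max frames/s the CPU decodes from a 1080p file (illustrative)
for infer in (240, 480, 960):  # fp32 / fp16 / int8-xnor inference rates (illustrative)
    # Every case is capped at the 230 fps decode ceiling
    print(infer, "->", pipeline_fps(capture, infer))
```

This is why benchmarking on a single image, as suggested above, isolates the network's actual inference speed from the video-decoding ceiling.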

  1. Try to update your code from this GitHub, last couple commits.

  2. Try to train your (not Tiny) Full-XNOR-net model at 608x608 or 544x544 using this cfg file yolov3-spp_xnor_obj.cfg.txt and this pre-trained file: https://drive.google.com/file/d/1d4CkgR--7bEEN0kWy-osR3kjLVFDIrnl/view?usp=sharing

  3. Then try to test it on your image, and divide 1000 ms by the measured ms to get the FPS:

darknet.exe detector test data/obj.data yolov3-spp_xnor_obj.cfg backup/yolov3-spp_xnor_obj_last.weights -thresh 0.15 image2.jpg
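The ms-to-FPS conversion in step 3 is just the reciprocal of the per-image latency:

```python
def fps_from_ms(ms_per_image):
    """Convert per-image inference latency (ms) to frames per second."""
    return 1000.0 / ms_per_image

# e.g. 13.5 ms per image -> ~74.1 FPS
print(round(fps_from_ms(13.5), 1))
```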


> MAKEFILE-Xavier GPU=1 CUDNN=1 CUDNN_HALF=1 OPENCV=1 AVX=0 OPENMP=0 LIBSO=0 MAKEFILE-2080ti GPU=1 CUDNN=1 CUDNN_HALF=1 OPENCV=1 AVX=1 OPENMP=1 LIBSO=0 Have also included the lines in makefile for CC=7.5 and CC=7.3 respectively.

The 2080ti (Turing) is CC 7.5, and the Xavier (Volta) is CC 7.2 (not 7.3):

> It supports CUDA 10 with a compute capability of sm_72.


Also, I haven't optimized INT8 for Tensor Cores, because there is a bug in cuDNN which has to be bypassed in a non-standard way (apparently TensorRT does this): https://github.com/AlexeyAB/darknet/issues/407#issuecomment-454248179

AlexeyAB commented 5 years ago

Commits from Feb 12, 2019 were used. Network resolution is 608x608 in both cases.

Test commands:

| Model | RTX 2070 CUDNN_HALF=0, ms | RTX 2070 CUDNN_HALF=1, ms | Speedup X times |
|---|---|---|---|
| yolov3-spp.cfg 608x608, Float 32/16-bit precision | 40.9 | 27.2 | 1.5x |
| yolov3-spp_xnor_obj.cfg.txt 608x608, CC7.5 (Tensor Cores for XNOR), 1-bit precision | 13.5 | 13.2 | 1.0x |
| Speedup X times | 3.0x | 2.0x | - |
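The speedup figures follow directly from the latency ratios; a quick check using the numbers reported above:

```python
fp32, fp16 = 40.9, 27.2    # yolov3-spp, CUDNN_HALF=0 / 1, ms
xnor0, xnor1 = 13.5, 13.2  # XNOR model, CUDNN_HALF=0 / 1, ms

print(round(fp32 / fp16, 1))   # fp32 -> fp16 speedup: 1.5x
print(round(fp32 / xnor0, 1))  # fp32 -> XNOR speedup: 3.0x
print(round(fp16 / xnor1, 1))  # fp16 -> XNOR speedup: ~2.1x (reported as 2.0x)
```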

There is still room for optimization.

Used: CUDA 10.0, cuDNN 7.4.2, OpenCV 3.2.0, Windows 7 x64, MSVS 2015.

nVidia GPU: GeForce RTX 2070 CC7.5 (Turing, TU106) - 7.5 TFlops-SP (Tensor Cores: 59.7 TFlops-HP).

If CUDNN_HALF=1 is set, then Tensor Cores are used for floats; otherwise they are not. Tensor Cores are used for XNOR in any case, if the GPU has CC >= 7.3 and this line is uncommented: https://github.com/AlexeyAB/darknet/blob/3d9c8530a0aa983225d2607d14b9cb047e63d305/Makefile#L27


This file was used to train the XNOR model: https://drive.google.com/open?id=1IT-vvyxRLlxY5g9rJp_G2U3TXYphjBv8

XNOR-net training process: [image: chart_yolov3-spp_xnor_obj]

LukeAI commented 5 years ago

@AlexeyAB Thanks very much for providing the pretrained feature-extractor weights above - is that using OpenImages? Would you be kind enough to also share your final yolov3-spp_xnor_obj.weights? I want to train my own using the above .cfg and pretrained weights, but would like to compare against yours as a reference.

AlexeyAB commented 5 years ago

@LukeAI Hi,

I can't share yolov3-spp_xnor_obj.weights, but I can share new pre-trained weights: https://drive.google.com/open?id=1IT-vvyxRLlxY5g9rJp_G2U3TXYphjBv8 It should give you a better mAP for your training.

LukeAI commented 5 years ago

Thank you very much! Are these for OpenImages? Presumably trained at 448x448? Would that transfer OK to 608x608?

AlexeyAB commented 5 years ago

@LukeAI It is trained on ImageNet (137 GB, ~1,300,000 images), ILSVRC2012_img_train.tar: https://github.com/AlexeyAB/darknet/blob/master/scripts/get_imagenet_train.sh

You can use it for training on the OpenImages dataset at 608x608.