AlexeyAB / yolo2_light

Light version of convolutional neural network Yolo v3 & v2 for object detection with a minimum of dependencies (INT8-inference, BIT1-XNOR-inference)

Int8-inference on Tensor Core #18

Open daniel89710 opened 5 years ago

daniel89710 commented 5 years ago

I tried quantized YOLOv3 on a Volta GPU, but it didn't seem to run on Tensor Cores. The cuDNN documentation, section 2.8.2, recommends using "CUDNN_DATA_INT8x32" for Tensor Core operations. https://docs.nvidia.com/deeplearning/sdk/cudnn-developer-guide/index.html

I think cuDNN supports only "CUDNN_DATA_INT8x32" for running INT8 on Tensor Cores. Is this correct?

AlexeyAB commented 5 years ago

> I think cuDNN supports only "CUDNN_DATA_INT8x32" for running INT8 on Tensor Cores. Is this correct?

I think yes: CUDNN_DATA_INT8x32 should be used, and it is supported only on Xavier (CC 7.2). It is not supported on Tesla V100 (CC 7.0) or GeForce RTX 2080 Ti / 2070 (CC 7.5).


https://docs.nvidia.com/deeplearning/sdk/cudnn-developer-guide/index.html#tensor-ops-speedup-tips
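
For reference, a minimal sketch (plain CUDA runtime API, not code from this repo) that prints the device's compute capability so you can see which INT8 path it can take:

```c
#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {  /* device 0 */
        fprintf(stderr, "cudaGetDeviceProperties failed\n");
        return 1;
    }
    printf("GPU: %s, compute capability %d.%d\n", prop.name, prop.major, prop.minor);

    int cc = prop.major * 10 + prop.minor;  /* e.g. 72 for Xavier */
    if (cc == 72) {
        printf("CC 7.2 (Xavier): the CUDNN_DATA_INT8x32 path may be available\n");
    } else {
        printf("CUDNN_DATA_INT8x32 path likely unavailable on this device\n");
    }
    return 0;
}
```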

daniel89710 commented 5 years ago

Thank you for your reply. I tried it on a Tesla V100, so I will try it on Xavier. Should I only change CUDNN_DATA_INT8x4 to CUDNN_DATA_INT8x32?

AlexeyAB commented 5 years ago

@daniel89710 I added a fix. Update your code from GitHub and un-comment these 2 lines:

Then for layers whose number of channels is a multiple of 32, Tensor Cores will be used (Xavier), and for layers whose number of channels is a multiple of 4, DP4A will be used (Pascal and higher).
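
Roughly, the per-layer choice described above could look like the following sketch (an assumed helper with assumed variable names, not the repo's exact code):

```c
#include <cudnn.h>

/* Sketch: configure the vectorized INT8 tensor descriptor for one layer
   based on its channel count c; callers fall back to FP32 otherwise. */
static cudnnStatus_t set_int8_tensor_desc(cudnnTensorDescriptor_t desc,
                                          int n, int c, int h, int w)
{
    if (c % 32 == 0) {
        /* channels multiple of 32 -> INT8x32, Tensor Cores (Xavier, CC 7.2) */
        return cudnnSetTensor4dDescriptor(desc, CUDNN_TENSOR_NCHW_VECT_C,
                                          CUDNN_DATA_INT8x32, n, c, h, w);
    } else if (c % 4 == 0) {
        /* channels multiple of 4 -> INT8x4, DP4A (Pascal and newer) */
        return cudnnSetTensor4dDescriptor(desc, CUDNN_TENSOR_NCHW_VECT_C,
                                          CUDNN_DATA_INT8x4, n, c, h, w);
    }
    /* channel count not vectorizable: keep the FP32 path for this layer */
    return CUDNN_STATUS_NOT_SUPPORTED;
}
```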

Also check that you are using cuDNN >= 7.2.
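
A quick way to confirm the cuDNN version at build time and at runtime (a small standalone sketch, not code from this repo):

```c
#include <stdio.h>
#include <cudnn.h>

int main(void) {
    /* CUDNN_VERSION is the compile-time version from cudnn.h;
       cudnnGetVersion() is the version of the loaded library.
       Values of 7200 and above correspond to cuDNN 7.2.x or newer. */
    printf("compiled against cuDNN %d, running with cuDNN %zu\n",
           (int)CUDNN_VERSION, cudnnGetVersion());
    return 0;
}
```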

daniel89710 commented 5 years ago

@AlexeyAB Thank you for the quick update. I will check.

daniel89710 commented 5 years ago

Hi, @AlexeyAB

I tried your code on Xavier after updating it and un-commenting the 2 lines. However, I get a failure at checkCUDNN(transform_status) in yolo2_forward_network_gpu.cu at line 626. Error: 3 - CUDNN_STATUS_BAD_PARAM.

Do you have any ideas?

AlexeyAB commented 5 years ago

@daniel89710 Hi,

This looks very similar to a bug in cuDNN.

I tried cuDNN 7.3.1 for CUDA 10 + CUDA 10 + MSVS2015 on Windows 7 x64, and I can't even create a descriptor for CUDNN_DATA_INT8x32: it always returns desc_status == CUDNN_STATUS_BAD_PARAM.

```c
cudnnTensorDescriptor_t desc;
cudnnCreateTensorDescriptor(&desc);

/* n = 128, c = 128 (a multiple of 32), h = 3, w = 3 */
cudnnStatus_t desc_status = cudnnSetTensor4dDescriptor(
    desc, CUDNN_TENSOR_NCHW_VECT_C, CUDNN_DATA_INT8x32, 128, 128, 3, 3);
```
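
A small diagnostic addition continuing the snippet above (my suggestion, not code from the issue): printing the status string makes such failures easier to read.

```c
if (desc_status != CUDNN_STATUS_SUCCESS) {
    /* cudnnGetErrorString converts the enum into readable text,
       e.g. "CUDNN_STATUS_BAD_PARAM" */
    fprintf(stderr, "cudnnSetTensor4dDescriptor failed: %s\n",
            cudnnGetErrorString(desc_status));
}
```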

AlexeyAB commented 5 years ago

@daniel89710

Also, what acceleration do you get by using the -quantized flag with the default CUDNN_DATA_INT8x4 (without CUDNN_DATA_INT8x32) on Xavier / Volta / Turing, compared to FP32 calculation (without the -quantized flag)?
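
One way to measure this is with CUDA events around a single forward pass (a generic sketch; the forward-pass call is a placeholder, not this repo's exact API):

```c
#include <stdio.h>
#include <cuda_runtime.h>

/* Times one forward pass with CUDA events; run it once for the FP32 build
   and once with -quantized to compare. */
void time_forward_pass(void)
{
    cudaEvent_t start, stop;
    float ms = 0.0f;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    /* ... run one forward pass here (FP32 or -quantized path) ... */
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    cudaEventElapsedTime(&ms, start, stop);
    printf("forward pass: %.2f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}
```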

The documentation says that we should use INT8x32 to accelerate inference, but it does not say that this is a mandatory condition for using Tensor Cores. It seems both INT8x4 and INT8x32 will use Tensor Cores, but INT8x32 will be faster: https://docs.nvidia.com/deeplearning/sdk/cudnn-developer-guide/index.html#tensor-ops-speedup-tips

> CUDNN_DATA_INT8x32 to accelerate your convolution computation

daniel89710 commented 5 years ago

@AlexeyAB Sorry for late. I tried to use -quantized flag on Xavier and I got a little speed down compared with no -quantized flag. I think we cannot still use Tensor core in the case of using -quantized flag. I checked whether this ran on Tensor core or not using nvprof. But, that seems to be executed on only CUDA core without Tensor core.