Open · trustin77 opened this issue 5 years ago
@trustin77 Hi,
I have not seen step-by-step instructions on how to do this. I used the following documentation (minimal sketches of the main steps are included after this list):
How Float-32 is converted to INT8 in TensorRT: http://on-demand.gputechconf.com/gtc/2017/presentation/s7310-8-bit-inference-with-tensorrt.pdf
How to use CUDNN_DATA_INT8x4 in cuDNN: https://docs.nvidia.com/deeplearning/sdk/cudnn-developer-guide/index.html#cudnnConvolutionForward
How to convert CUDNN_TENSOR_NCHW & INT8 to CUDNN_TENSOR_NCHW_VECT_C & INT8x4: https://devtalk.nvidia.com/default/topic/1028139/cudnn/how-to-reduce-time-spent-in-transforming-tensors-using-cudnnv6-0-for-api-cudnntransformtensor-/post/5264978/#5264978
About optimal input_calibration: https://github.com/AlexeyAB/yolo2_light/issues/24#issuecomment-435361415
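
Roughly, the Float-32 -> INT8 conversion from the TensorRT slides works like this: calibration picks a per-tensor saturation threshold T (by minimizing KL divergence over an activation histogram), then every value is scaled by 127/T, rounded, and clamped. A minimal C sketch, assuming the threshold is already known; the names here are illustrative, not actual darknet/yolo2_light functions:

```c
#include <math.h>
#include <stddef.h>
#include <stdint.h>

/* Symmetric INT8 quantization as in the TensorRT 8-bit inference slides:
 * calibration picks a saturation threshold T per tensor, then
 * scale = 127 / T, and values beyond +-T are clamped.
 * These helpers are illustrative, not actual darknet/yolo2_light code. */
static int8_t quantize_value(float x, float scale)
{
    float q = roundf(x * scale);      /* map the float onto the INT8 grid */
    if (q >  127.f) q =  127.f;       /* saturate values above +T */
    if (q < -127.f) q = -127.f;       /* saturate values below -T */
    return (int8_t)q;
}

static void quantize_tensor(const float *src, int8_t *dst, size_t n, float threshold_T)
{
    float scale = 127.f / threshold_T;   /* per-tensor scale from calibration */
    for (size_t i = 0; i < n; ++i)
        dst[i] = quantize_value(src[i], scale);
}
```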
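For the cuDNN part, this is a rough sketch of an INT8x4 cudnnConvolutionForward call, assuming cuDNN >= 6 on a GPU with compute capability >= 6.1, a 3x3 kernel with pad 1 and stride 1, a channel count divisible by 4, and float output; the function and its parameters are illustrative, not the real darknet code:

```c
#include <stddef.h>
#include <cudnn.h>

/* Sketch of an INT8x4 convolution with cudnnConvolutionForward.
 * All buffers are device memory already laid out as described below. */
void conv_forward_int8x4(cudnnHandle_t handle,
                         const void *x_int8x4,   /* input in NCHW_VECT_C, INT8x4 */
                         const void *w_int8x4,   /* filters in NCHW_VECT_C, INT8x4 */
                         void *y,                /* float output */
                         void *workspace, size_t workspace_bytes,
                         int n, int c, int h, int w_dim, int k)
{
    cudnnTensorDescriptor_t xDesc, yDesc;
    cudnnFilterDescriptor_t wDesc;
    cudnnConvolutionDescriptor_t convDesc;

    cudnnCreateTensorDescriptor(&xDesc);
    cudnnCreateTensorDescriptor(&yDesc);
    cudnnCreateFilterDescriptor(&wDesc);
    cudnnCreateConvolutionDescriptor(&convDesc);

    /* Input and filters must use NCHW_VECT_C layout with INT8x4 data,
     * and c must be a multiple of 4. */
    cudnnSetTensor4dDescriptor(xDesc, CUDNN_TENSOR_NCHW_VECT_C,
                               CUDNN_DATA_INT8x4, n, c, h, w_dim);
    cudnnSetFilter4dDescriptor(wDesc, CUDNN_DATA_INT8x4,
                               CUDNN_TENSOR_NCHW_VECT_C, k, c, 3, 3);

    /* Accumulation is done in INT32. */
    cudnnSetConvolution2dDescriptor(convDesc, 1, 1, 1, 1, 1, 1,
                                    CUDNN_CROSS_CORRELATION, CUDNN_DATA_INT32);

    /* Output kept in float here (INT8x4 output is also possible).
     * With a 3x3 kernel, pad 1, stride 1 the spatial size is unchanged. */
    cudnnSetTensor4dDescriptor(yDesc, CUDNN_TENSOR_NCHW,
                               CUDNN_DATA_FLOAT, n, k, h, w_dim);

    /* Only IMPLICIT_PRECOMP_GEMM supports the INT8x4 path. */
    const float alpha = 1.f, beta = 0.f;
    cudnnConvolutionForward(handle, &alpha, xDesc, x_int8x4, wDesc, w_int8x4,
                            convDesc, CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM,
                            workspace, workspace_bytes, &beta, yDesc, y);

    cudnnDestroyTensorDescriptor(xDesc);
    cudnnDestroyTensorDescriptor(yDesc);
    cudnnDestroyFilterDescriptor(wDesc);
    cudnnDestroyConvolutionDescriptor(convDesc);
}
```

In a real call the workspace size would first be queried with cudnnGetConvolutionForwardWorkspaceSize, and every cudnnStatus_t return value should be checked.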
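For the layout conversion, a sketch of repacking an NCHW INT8 tensor into NCHW_VECT_C INT8x4 with cudnnTransformTensor, as discussed in the devtalk thread above; the buffers are device memory and the function name is illustrative:

```c
#include <cudnn.h>

/* Repack an INT8 NCHW tensor into NCHW_VECT_C (INT8x4) so it can be fed
 * to the INT8x4 convolution. c must be a multiple of 4. */
void nchw_int8_to_nchw_vect_c(cudnnHandle_t handle,
                              const void *src_nchw_int8, void *dst_vect_c_int8x4,
                              int n, int c, int h, int w)
{
    cudnnTensorDescriptor_t srcDesc, dstDesc;
    cudnnCreateTensorDescriptor(&srcDesc);
    cudnnCreateTensorDescriptor(&dstDesc);

    /* Source: plain NCHW layout, one int8 per element. */
    cudnnSetTensor4dDescriptor(srcDesc, CUDNN_TENSOR_NCHW,
                               CUDNN_DATA_INT8, n, c, h, w);
    /* Destination: NCHW_VECT_C layout, channels packed in groups of 4. */
    cudnnSetTensor4dDescriptor(dstDesc, CUDNN_TENSOR_NCHW_VECT_C,
                               CUDNN_DATA_INT8x4, n, c, h, w);

    const float alpha = 1.f, beta = 0.f;
    cudnnTransformTensor(handle, &alpha, srcDesc, src_nchw_int8,
                         &beta, dstDesc, dst_vect_c_int8x4);

    cudnnDestroyTensorDescriptor(srcDesc);
    cudnnDestroyTensorDescriptor(dstDesc);
}
```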
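About input_calibration: roughly, each value in input_calibration= is a per-layer scale R_input applied to that layer's input before the INT8 convolution (quantized exactly as in the first sketch above); the INT32 accumulator then ends up in units of R_input * R_weights, which is divided back out afterwards. A simplified sketch of that rescaling idea, not the exact yolo2_light code; all names are illustrative:

```c
#include <stdint.h>

/* R_input   - this layer's value from input_calibration= in the cfg
 * R_weights - the scale the layer's weights were quantized with
 * The INT32 accumulator of the INT8 convolution is in units of
 * (R_input * R_weights), so dividing by that product recovers an
 * approximate float activation. */
static float dequantize_accumulator(int32_t acc, float R_input, float R_weights)
{
    return (float)acc / (R_input * R_weights);
}
```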
Also about quantization:
Yolo v2 with INT8 - too large a drop in accuracy: http://cs231n.stanford.edu/reports/2017/pdfs/808.pdf
Optimal quantization is INT 4-bit: https://arxiv.org/abs/1510.00149
XNOR 1-bit quantization - the authors avoid binarization at the first and last layer of a CNN: https://arxiv.org/abs/1603.05279
MobileNet quantization: https://arxiv.org/abs/1712.05877
Quantization of old models: https://arxiv.org/abs/1512.06473
About XNOR: https://arxiv.org/abs/1807.03010
Also about XNOR: https://arxiv.org/abs/1803.05849
Hi, @AlexeyAB
I'd like to know more about how the INT8 version is implemented. Is it based on one or more papers? Could you give related links for reference?
Thanks