Thilanka97 opened this issue 5 years ago
@Thilanka97 Hi,
I just tried to reproduce what was done in TensorRT: http://on-demand.gputechconf.com/gtc/2017/presentation/s7310-8-bit-inference-with-tensorrt.pdf
Also, cuDNN is optimized for INT8 or UINT8.
> Also, what did you do to the input image pixel value (and feature map value) precision? Did you convert them into INT8 as well?
Currently I use FP32 for the first layer, so I do nothing with the input RGB image. But the outputs of all other layers are quantized using Entropy Calibration (Kullback-Leibler divergence).
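For anyone following along, the calibration step from those slides looks roughly like the sketch below (a simplified version of the TensorRT pseudocode; the 2048-bin histogram and the name `entropy_calibration_threshold` come from the presentation, not from this repo's exact code):

```c
#include <float.h>
#include <math.h>

#define NUM_BINS 2048
#define NUM_QUANT_LEVELS 128   /* positive INT8 range */

/* KL(P||Q) over bins where both p and q are non-zero. */
static double kl_divergence(const double *p, const double *q, int n) {
    double kl = 0;
    for (int i = 0; i < n; ++i)
        if (p[i] > 0 && q[i] > 0) kl += p[i] * log(p[i] / q[i]);
    return kl;
}

/* hist[] is a histogram of |activation| values collected over the
   calibration images; bin width = max_abs / NUM_BINS. Returns the
   saturation threshold that minimizes the KL divergence. */
float entropy_calibration_threshold(const int *hist, float max_abs) {
    double best_kl = DBL_MAX;
    int best_i = NUM_BINS;

    for (int i = NUM_QUANT_LEVELS; i <= NUM_BINS; ++i) {
        /* Reference P: first i bins, outlier mass folded into bin i-1. */
        double p[NUM_BINS] = {0}, q[NUM_BINS] = {0}, p_sum = 0;
        for (int j = 0; j < i; ++j) p[j] = hist[j];
        for (int j = i; j < NUM_BINS; ++j) p[i - 1] += hist[j];
        for (int j = 0; j < i; ++j) p_sum += p[j];
        if (p_sum == 0) continue;
        for (int j = 0; j < i; ++j) p[j] /= p_sum;

        /* Candidate Q: collapse the i bins into 128 levels, expand back. */
        double bins_per_level = (double)i / NUM_QUANT_LEVELS;
        for (int lvl = 0; lvl < NUM_QUANT_LEVELS; ++lvl) {
            int start = (int)(lvl * bins_per_level);
            int end = (int)((lvl + 1) * bins_per_level);
            double mass = 0;
            int nonzero = 0;
            for (int j = start; j < end; ++j) { mass += p[j]; if (p[j] > 0) ++nonzero; }
            if (nonzero)
                for (int j = start; j < end; ++j)
                    if (p[j] > 0) q[j] = mass / nonzero;  /* spread level mass uniformly */
        }

        double kl = kl_divergence(p, q, i);
        if (kl < best_kl) { best_kl = kl; best_i = i; }
    }
    /* Threshold in activation units = center of the winning bin. */
    return (best_i + 0.5f) * max_abs / NUM_BINS;
}
```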
@AlexeyAB Thank you so much, I will check the links. Btw, do you think it will reduce the accuracy if we quantize the weights to 8-bit before the inference rather than during the inference?
Thanks in advance!
@Thilanka97 I quantize (FP32 -> INT8) weights before the inference. I just don't save them as 8-bit.
@AlexeyAB Hey, thanks for the reply, it helps a lot. As you mentioned in the previous reply, you use 32-bit floating point in the first layer, right? Does that mean that both the weights and the input values should be 32-bit float? I am asking whether you use 32-bit float weights just for the first layer, or whether saving the 8-bit weights as 32-bit float covers it. So the MAC operations of the first layer are 32-bit float operations, and the MAC operations of all other layers are 8-bit int operations? Am I correct?
> I quantize (FP32 -> INT8) weights before the inference. I just don't save them as 8-bit.
Does not saving them as 8-bit mean you save the INT8 weights in the form of 32-bit float?
Thanks in advance!
@Thilanka97
It loads yolov3.weights with FP32 weights.
Then it quantizes the weights FP32 -> INT8 once during initialization, except for the 1st layer and the conv-layer before each [yolo]-layer.
Then during inference it uses the INT8 weights and quantizes the inputs before each conv-layer, so both weights and inputs are INT8. (Only the 1st layer and the conv-layer before each [yolo]-layer are FP32, for both weights and inputs.)
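In sketch form, that quantization step looks like the following (illustrative names, not the repo's exact functions; `scale` here is the calibrated value 127 / threshold):

```c
#include <math.h>

/* Symmetric FP32 -> INT8 quantization with saturation. */
static signed char quantize_value(float x, float scale) {
    float q = roundf(x * scale);
    if (q >  127.0f) q =  127.0f;   /* saturate to the INT8 range */
    if (q < -127.0f) q = -127.0f;
    return (signed char)q;
}

/* Done once at initialization for every quantized conv-layer. */
void quantize_weights(const float *w_fp32, signed char *w_int8,
                      int n, float weight_scale) {
    for (int i = 0; i < n; ++i)
        w_int8[i] = quantize_value(w_fp32[i], weight_scale);
}

/* Done during inference, before each quantized conv-layer. */
void quantize_inputs(const float *in_fp32, signed char *in_int8,
                     int n, float input_scale) {
    for (int i = 0; i < n; ++i)
        in_int8[i] = quantize_value(in_fp32[i], input_scale);
}
```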
@AlexeyAB Thank you so much for the explanation. I am working with tiny yolov2 voc. I checked the mAP with and without quantization for threshold = 0.25: without quantization mAP = 56.2%, and with quantization mAP = 52.51%. These results seem reasonable, right?
Thanks in advance!
@AlexeyAB I have another small question. When I extracted the weights of tiny yolov2 voc, there is a parameter file for each conv layer called "conv_normalize". It contains sets of 3 parameters. For example, for the first layer of tiny yolov2 voc there are 3x16 parameters: 3 parameters for each of the 16 rows. I got to know that these parameters might be the scale, rolling mean and rolling variance. Can you please explain what these parameters do? Are they normalizing the weights or the output feature maps (my guess is in the sketch below)? Did you quantize these as well?
Please explain if possible. I have been stuck here for a while now. Thanks in advance!
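My guess is that they are batch-norm parameters and that they normalize the output feature maps of each conv layer, something like this sketch (all names here are my guesses from the extracted files):

```c
#include <math.h>

/* My guess: per-channel batch-norm applied to the conv output fmap. */
void apply_batchnorm(float *fmap, int channels, int spatial,
                     const float *scale, const float *rolling_mean,
                     const float *rolling_variance) {
    const float eps = 1e-6f;  /* small constant for numerical stability */
    for (int c = 0; c < channels; ++c) {
        float s = scale[c] / sqrtf(rolling_variance[c] + eps);
        for (int i = 0; i < spatial; ++i)
            fmap[c * spatial + i] = s * (fmap[c * spatial + i] - rolling_mean[c]);
    }
}
```

Is that what is happening, or do these parameters normalize the weights instead?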
@AlexeyAB sorry to bother you.
For tiny yolov2 voc (tiny-yolo-voc.cfg in your post), there are 10 calibration inputs, but only 9 conv layers. Why are there 10 values?
Although you said that you use FP32 for the first layer, at the end of the first layer the output is quantized into INT8, right? I mean, the input to the second layer is INT8, right? And in the second layer the MAC operations happen between INT8 weights and INT8 fmaps, and at the end the output fmap of the second layer (which is INT8) is changed to another INT8 value (which has another saturation threshold)? Am I correct here? This is a bit confusing to me.
Also, to quantize both weights and fmap values, do you use the same calibrated values, or do you use different threshold values for weights and fmaps?
Thanks in advance!
> Although you said that you use FP32 for the first layer, at the end of the first layer the output is quantized into INT8, right? I mean, the input to the second layer is INT8, right? And in the second layer the MAC operations happen between INT8 weights and INT8 fmaps, and at the end the output fmap of the second layer (which is INT8) is changed to another INT8 value (which has another saturation threshold)? Am I correct here? This is a bit confusing to me.
Yes.
In general: Layer0(FP32) -> output_FP32 -> input_INT8 -> Layer1(INT8) -> FP32 -> INT8 -> Layer2(INT8) -> ...
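The step between layers is just a dequantize/requantize with per-layer scales; as a sketch (illustrative names, assuming scales of the form 127 / threshold):

```c
#include <stdint.h>

/* An INT8 conv accumulates into int32. Dividing by the product of the
   two scales brings the accumulator back to FP32, and the next layer
   then re-quantizes that FP32 value with its own calibrated threshold. */
float dequantize_conv_output(int32_t acc, float input_scale, float weight_scale) {
    /* acc ~ sum( (x * input_scale) * (w * weight_scale) ) */
    return (float)acc / (input_scale * weight_scale);
}
```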
> Also, to quantize both weights and fmap values, do you use the same calibrated values, or do you use different threshold values for weights and fmaps?
Different calibration params and different algorithms are used for weights and inputs.
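Roughly (following the TensorRT slides; the exact choices in this repo may differ): weights get a no-saturation scale from their max absolute value, while inputs get a saturating scale from the KL-divergence threshold:

```c
#include <math.h>

/* Weights: no saturation, scale from the maximum absolute value. */
float weight_scale_maxabs(const float *w, int n) {
    float max_abs = 0;
    for (int i = 0; i < n; ++i) {
        float a = fabsf(w[i]);
        if (a > max_abs) max_abs = a;
    }
    return max_abs > 0 ? 127.0f / max_abs : 1.0f;
}

/* Inputs: saturating scale from the calibrated KL threshold. */
float input_scale_from_threshold(float kl_threshold) {
    return 127.0f / kl_threshold;  /* values beyond the threshold saturate */
}
```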
Thank you so much for the reply.
> For tiny yolov2 voc (tiny-yolo-voc.cfg in your post), there are 10 calibration inputs, but only 9 conv layers. Why are there 10 values?

@AlexeyAB What about this? Also, you won't be needing a calibration input for the first layer, right? Why are there 10 inputs? I do not understand.
Thanks in advance!
@AlexeyAB Hey, why did you choose 8-bit integer over 8-bit fixed point?
Also, can't we quantize the 32-bit floating point weights to 8-bit before the inference (separately) and store them, then use the stored 8-bit weights during the inference (see the sketch at the end of this post)?
Also, what did you do to the input image pixel value (and feature map value) precision? Did you convert them into INT8 as well?
I am a bit confused about these points. Could you please explain these to me if possible? Thanks in advance!
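What I have in mind for storing pre-quantized weights is something like this sketch (the file format and all names here are hypothetical):

```c
#include <stdio.h>

/* Store already-quantized weights plus the per-layer scale, so that
   inference could skip the FP32 -> INT8 weight conversion. */
int save_int8_weights(const char *path, const signed char *w_int8,
                      int n, float weight_scale) {
    FILE *f = fopen(path, "wb");
    if (!f) return -1;
    fwrite(&weight_scale, sizeof(float), 1, f);  /* needed to dequantize later */
    fwrite(&n, sizeof(int), 1, f);
    fwrite(w_int8, sizeof(signed char), n, f);
    fclose(f);
    return 0;
}
```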