Open bongjeong opened 4 years ago
The DoReFa paper says: 'Here we assume the output of the previous layer has passed through a bounded activation function h, which ensures r ∈ [0, 1].' But the paper does not specify what the bounded activation h is. I think multiplying the activation by 0.1 can reduce the dynamic range of the parameters and make the model perform better. I had an internship at Megvii, and they handled activation functions this way.
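The paper leaves h unspecified; one common choice that satisfies the r ∈ [0, 1] assumption is a clipped ReLU. A minimal sketch (the name `h` just mirrors the paper's notation, not any code from this repo):

```python
def h(x: float) -> float:
    # Clipped ReLU: one possible bounded activation that maps any real
    # input into [0, 1], as the quoted DoReFa assumption requires.
    return min(max(x, 0.0), 1.0)
```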
I think DoReFa uses fully integer computation in its layers (except the first and last). Multiplying the activation by 0.1 is not in quantized format; it requires floating-point computation on the feature map (my guess). What do you think about this?
Yes, I think so. At training time we use simulated quantization, so the activation layer's input is the dequantized result, which is in float format.
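Simulated quantization as described here can be sketched as a quantize-then-dequantize round trip; `quantize_k` is a hypothetical helper name for a DoReFa-style k-bit uniform quantizer, not code from this repo:

```python
def quantize_k(x: float, k: int) -> float:
    # k-bit uniform quantizer on [0, 1]: snap x to one of 2**k - 1 levels.
    # During training this is "simulated": the result is immediately
    # dequantized back to float, so downstream layers see float values.
    n = (1 << k) - 1
    return round(x * n) / n
```

For example, with k = 2, `quantize_k(0.7, 2)` returns 2/3, a float value, which is why the training-time graph stays in floating point.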
The activation quantization is not the same.
In the paper: x (real) is in the range [0, 1]: clamp(input, 0, 1), then quantize(x).
In your implementation: clamp(input * 0.1, 0, 1).
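To make the difference concrete, here is a minimal sketch of the two pipelines. The names `act_paper` and `act_impl` are mine, and `quantize_k` is an assumed k-bit uniform quantizer, not the repo's actual code:

```python
def quantize_k(x: float, k: int) -> float:
    # Assumed DoReFa-style k-bit uniform quantizer on [0, 1].
    n = (1 << k) - 1
    return round(x * n) / n

def clamp01(x: float) -> float:
    return min(max(x, 0.0), 1.0)

def act_paper(x: float, k: int = 2) -> float:
    # Paper: bound the input to [0, 1], then quantize.
    return quantize_k(clamp01(x), k)

def act_impl(x: float, k: int = 2) -> float:
    # This implementation (per the thread): scale by 0.1 first.
    # The 0.1 multiply is a float operation on the feature map.
    return quantize_k(clamp01(0.1 * x), k)
```

For x = 3.0 and k = 2, `act_paper` gives 1.0 while `act_impl` gives 1/3, so the two schemes are not numerically equivalent.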