Similar to what happens for weight scaling, you can have one scale factor for the entire tensor being quantized, or one per channel of said tensor. Other ways of slicing the tensor to compute scale factors are also possible, although arguably less common (e.g., per-row, per-group, etc.). A minimal sketch of the two common cases is shown below.
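To make this concrete, here is a rough sketch in plain PyTorch (not Brevitas's actual implementation); the tensor shape and the symmetric int8 range of 127 are illustrative assumptions:

```python
import torch

# Hypothetical 4D conv weight: (out_channels, in_channels, kH, kW)
w = torch.randn(8, 16, 3, 3)

# Per-tensor: a single scale for the whole tensor
scale_per_tensor = w.abs().max() / 127.0

# Per-(output-)channel: one scale per slice along dim 0
scale_per_channel = w.abs().amax(dim=(1, 2, 3)) / 127.0  # shape: (8,)

print(scale_per_tensor.shape, scale_per_channel.shape)
```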
The choice of per-tensor vs. per-channel scaling depends on the network topology, the hardware constraints of the device where you plan to execute your network, and other factors.
As a rule of thumb, the finer the granularity of your scale factors, the better the final accuracy of the quantized network tends to be. At the same time, the computational cost and memory usage of your network will increase, since scale factors are stored in high precision (one value per channel rather than one per tensor).
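As a rough illustration of the accuracy side of that tradeoff, the sketch below fake-quantizes a weight tensor whose channels are given deliberately different magnitudes; the `fake_quant` helper and the channel scaling are hypothetical, purely for demonstration:

```python
import torch

def fake_quant(x, scale):
    # Symmetric int8 quantize/dequantize with a given scale
    return (x / scale).round().clamp(-128, 127) * scale

# Make channel magnitudes span two orders of magnitude (illustrative)
w = torch.randn(8, 16, 3, 3) * torch.logspace(-2, 0, 8).view(8, 1, 1, 1)

s_tensor = w.abs().max() / 127.0
s_channel = (w.abs().amax(dim=(1, 2, 3)) / 127.0).view(8, 1, 1, 1)

err_tensor = (w - fake_quant(w, s_tensor)).pow(2).mean()
err_channel = (w - fake_quant(w, s_channel)).pow(2).mean()
print(f"per-tensor MSE:  {err_tensor.item():.6f}")
print(f"per-channel MSE: {err_channel.item():.6f}")  # typically lower
```

When channel ranges differ a lot, a single per-tensor scale must cover the largest channel, wasting quantization levels on the small ones; per-channel scales avoid that, which is why they usually recover accuracy.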
I'm looking at the MobileNetV1 example and I see that `scaling_per_output_channel` is `True` in the QuantReLU after the first layer (init_block) and then after each pointwise convolutional layer except for the last stage. On the other hand, in ProxylessNAS Mobile14, `scaling_per_output_channel` is `False` after the first layer and then it's `True` after each first 1x1 convolutional layer in `ProxylessBlock`. So what's the purpose of `scaling_per_output_channel`? Thank you