Similar to what happens for weight scaling, you can have one scale factor for the entire tensor being quantized, or one per channel of said tensor. Other ways of slicing the tensor to compute scale factors are also possible, although arguably less common (e.g., per-row, per-group, etc.). A minimal sketch of the two common cases is shown below.
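To make this concrete, here is a rough sketch in plain PyTorch (not Brevitas's actual implementation); the tensor shape and the symmetric int8 range of 127 are illustrative assumptions:

```python
import torch

# Hypothetical 4D conv weight: (out_channels, in_channels, kH, kW)
w = torch.randn(8, 16, 3, 3)

# Per-tensor: a single scale for the whole tensor
scale_per_tensor = w.abs().max() / 127.0

# Per-(output-)channel: one scale per slice along dim 0
scale_per_channel = w.abs().amax(dim=(1, 2, 3)) / 127.0  # shape: (8,)

print(scale_per_tensor.shape, scale_per_channel.shape)
```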
The choice of per-tensor vs. per-channel scaling depends on the network topology, the hardware constraints of the device where you plan to execute your network, and other factors.
As a rule of thumb, the finer the granularity of your scale factors, the better the final accuracy of the quantized network tends to be. At the same time, the computational cost and memory usage of your network will increase, since scale factors are stored in high precision (one value per channel rather than one per tensor).
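As a rough illustration of the accuracy side of that tradeoff, the sketch below fake-quantizes a weight tensor whose channels are given deliberately different magnitudes; the `fake_quant` helper and the channel scaling are hypothetical, purely for demonstration:

```python
import torch

def fake_quant(x, scale):
    # Symmetric int8 quantize/dequantize with a given scale
    return (x / scale).round().clamp(-128, 127) * scale

# Make channel magnitudes span two orders of magnitude (illustrative)
w = torch.randn(8, 16, 3, 3) * torch.logspace(-2, 0, 8).view(8, 1, 1, 1)

s_tensor = w.abs().max() / 127.0
s_channel = (w.abs().amax(dim=(1, 2, 3)) / 127.0).view(8, 1, 1, 1)

err_tensor = (w - fake_quant(w, s_tensor)).pow(2).mean()
err_channel = (w - fake_quant(w, s_channel)).pow(2).mean()
print(f"per-tensor MSE:  {err_tensor.item():.6f}")
print(f"per-channel MSE: {err_channel.item():.6f}")  # typically lower
```

When channel ranges differ a lot, a single per-tensor scale must cover the largest channel, wasting quantization levels on the small ones; per-channel scales avoid that, which is why they usually recover accuracy.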
I'm looking at the MobileNetV1 example and I see that `scaling_per_output_channel` is `True` in the QuantReLU after the first layer (init_block) and then after each pointwise convolutional layer except for the last stage. On the other hand, in ProxylessNAS Mobile14, `scaling_per_output_channel` is `False` after the first layer and then it's `True` after each first 1x1 convolutional layer in `ProxylessBlock`. So what's the purpose of `scaling_per_output_channel`? Thank you