Xilinx / brevitas

Brevitas: neural network quantization in PyTorch
https://xilinx.github.io/brevitas/

Export ONNX QOperator #882

Open surajpandey353 opened 6 months ago

surajpandey353 commented 6 months ago

Hi Team Brevitas,

I am trying a simple toy model to check what the exported ONNX model with QOps looks like. As per the ONNX_export_tutorial.ipynb, you can pass the quantized input through a QuantIdentity layer with return_quant_tensor = True, or alternatively set input_quant = Uint8ActPerTensorFloat. I have the following toy model:

import torch.nn as nn

import brevitas.nn as qnn
from brevitas.quant import Int8WeightPerChannelFloat, Uint8ActPerTensorFloat, Int32Bias

model = nn.ModuleList()
model.append(qnn.QuantConv2d(514, 256, kernel_size=1,
                             weight_quant=Int8WeightPerChannelFloat,
                             input_quant=Uint8ActPerTensorFloat,
                             output_quant=Uint8ActPerTensorFloat,
                             bias_quant=Int32Bias,
                             return_quant_tensor=True))
model.append(qnn.QuantReLU(return_quant_tensor=True))
model.append(qnn.QuantConv2d(256, 256, kernel_size=1,
                             weight_quant=Int8WeightPerChannelFloat,
                             input_quant=None,
                             output_quant=Uint8ActPerTensorFloat,
                             bias_quant=Int32Bias,
                             return_quant_tensor=True))
model.append(qnn.QuantReLU(return_quant_tensor=True))
model.append(qnn.QuantConv2d(256, 256, kernel_size=1,
                             weight_quant=Int8WeightPerChannelFloat,
                             input_quant=None,
                             output_quant=Uint8ActPerTensorFloat,
                             bias_quant=Int32Bias,
                             return_quant_tensor=True))
model.append(qnn.QuantReLU(return_quant_tensor=True))
model.append(qnn.QuantConv2d(256, 514, kernel_size=1,
                             weight_quant=Int8WeightPerChannelFloat,
                             input_quant=None,
                             output_quant=Uint8ActPerTensorFloat,
                             bias_quant=Int32Bias,
                             return_quant_tensor=False))
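
For reference, the export call is along these lines (the wrapper module and input shape are just illustrative, since a ModuleList has no forward of its own, and I am assuming export_onnx_qop from brevitas.export as in the export tutorial):

import torch

from brevitas.export import export_onnx_qop

# Illustrative wrapper so the ModuleList above can be traced end to end.
class ToyModel(nn.Module):
    def __init__(self, layers):
        super().__init__()
        self.layers = layers

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

# Export to the QOperator representation (QLinearConv etc.); shape is illustrative.
export_onnx_qop(ToyModel(model).eval(), args=torch.randn(1, 514, 8, 8),
                export_path="toy_qop.onnx")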

Ideally, as per the model definition, there should be a single QuantizeLinear before the first layer and no dequantization in between. A DequantizeLinear should appear only at the end of the model, since return_quant_tensor = False in the last layer.

But the graph visualization in Netron shows a DequantizeLinear before every QuantReLU op, which I find weird behaviour, since each QuantReLU receives a quant tensor as input and returns a quant tensor. If I skip the ReLU activations between the convolutions, I get the expected graph, with a QuantizeLinear before the first layer and a DequantizeLinear at the end of the last layer.

Dependencies: torch 1.13.0, brevitas 0.10.2

Could someone explain why this happens? Is it intended, or is there something wrong with the way I have defined the model?

Thanks!

Giuseppe5 commented 6 months ago

Would you be able to provide the full script used to generate the ONNX model?

I know it's just a few more lines on top of what you have already posted, but it's just to be sure we replicate exactly what you see.

Many thanks!

surajpandey353 commented 6 months ago

Hi @Giuseppe5,

Here is a minimal reproducible example: toy_model.ipynb.zip

Giuseppe5 commented 6 months ago

Thanks for sharing! The structure that you see is due to the fact that there is no real quantized ReLU op in ONNX, so we need to rely on the floating point version of it. This means that the output of QLinearConv has to be dequantized and then re-quantized before we get into the next QLinearConv.
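
For illustration, you can see this directly by listing the node types of the exported graph (the file name below is just a placeholder); around each activation the pattern is roughly QLinearConv -> DequantizeLinear -> Relu -> QuantizeLinear -> QLinearConv:

import onnx

# List the op types of the exported QOp graph; the path is illustrative.
exported = onnx.load("toy_qop.onnx")
print([node.op_type for node in exported.graph.node])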

I hope this explains this behaviour.

Barrot commented 5 months ago

> Thanks for sharing! The structure that you see is due to the fact that there is no real quantized ReLU op in ONNX, so we need to rely on the floating point version of it. This means that the output of QLinearConv has to be dequantized and then re-quantized before we get into the next QLinearConv.
>
> I hope this explains this behaviour.

The regular ONNX ReLU supports float and integer values: https://github.com/onnx/onnx/blob/main/docs/Operators.md#relu

Giuseppe5 commented 5 months ago

In general, we are working to deprecate QOp support in favor of QCDQ (#834), so we probably won't change this behavior.
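
As a rough sketch of the QCDQ path (reusing the illustrative wrapper from earlier in the thread; export_onnx_qcdq comes from brevitas.export and the input shape is again just a placeholder):

import torch

from brevitas.export import export_onnx_qcdq

# QCDQ represents quantization as QuantizeLinear/DequantizeLinear pairs around
# standard float ops, so the graph no longer mixes QLinear* ops with float-only ReLU.
export_onnx_qcdq(ToyModel(model).eval(), args=torch.randn(1, 514, 8, 8),
                 export_path="toy_qcdq.onnx")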

Sorry for any inconvenience.