PINTO0309 / onnx2tf

Self-Created Tools to convert ONNX files (NCHW) to TensorFlow/TFLite/Keras format (NHWC). The purpose of this tool is to solve the massive Transpose extrapolation problem in onnx-tensorflow (onnx-tf). I don't need a Star, but give me a pull request.

[Question] Quantized ONNX to Quantized TFLite #680

Closed · Illia-tsar closed this issue 1 month ago

Illia-tsar commented 1 month ago

Issue Type

Others

OS

Linux

onnx2tf version number

1.20.0

onnx version number

1.16.1

onnxruntime version number

1.18.1

onnxsim (onnx_simplifier) version number

0.4.33

tensorflow version number

2.17.0

Download URL for ONNX

https://drive.google.com/file/d/1oTmAKn6qh3JTiQ5-ld9d54NAq9qUrsCL/view?usp=sharing

Parameter Replacement JSON

NA

Description

Hello! I need to convert an already quantized ONNX model (using Quantize-Dequantize, QDQ, nodes) to a TensorFlow Lite (TFLite) model with 8-bit precision. The key requirement is to preserve the existing quantization parameters from the ONNX model and translate them directly into the TFLite model without re-quantizing. As a result, I expect to obtain a TFLite model without QDQ nodes, where the weights, biases, and activations are adjusted according to the scales and zero points in the QuantizeLinear/DequantizeLinear nodes of the ONNX model. Is it possible to achieve this behaviour with onnx2tf? Am I missing something? I would be grateful for any help. Thanks in advance!
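
For reference, a minimal sketch of the standard onnx2tf integer-quantization flow, which (as I understand it) re-quantizes from the float graph using calibration data instead of reusing the existing QDQ parameters; the parameter names are assumed to mirror the CLI flags and the file paths are hypothetical:

```python
import onnx2tf

# Standard flow: onnx2tf converts the float graph and TFLiteConverter
# re-quantizes it with calibration data, so the QDQ scales/zero points
# from the ONNX model are not carried over as-is.
onnx2tf.convert(
    input_onnx_file_path="model_qdq.onnx",   # hypothetical path
    output_folder_path="saved_model",
    output_integer_quantized_tflite=True,    # -oiqt on the CLI
)
```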

PINTO0309 commented 1 month ago

Not possible at this time.

It would require implementing a very special process that directly generates the tflite FlatBuffer format, because TFLiteConverter has no such conversion scheme. And you are right: there is no OP in TFLite that behaves the same as ONNX's QuantizeLinear/DequantizeLinear.
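
For concreteness, ONNX expresses quantization as explicit QuantizeLinear/DequantizeLinear ops in the graph, while TFLite stores scale and zero point as tensor metadata. A minimal numpy sketch of the ONNX per-tensor int8 semantics, written from the operator spec rather than from either tool:

```python
import numpy as np

# ONNX QuantizeLinear (per-tensor, int8): q = saturate(round(x / scale) + zero_point)
def quantize_linear(x, scale, zero_point=0):
    return np.clip(np.round(x / scale) + zero_point, -128, 127).astype(np.int8)

# ONNX DequantizeLinear: x_hat = (q - zero_point) * scale
def dequantize_linear(q, scale, zero_point=0):
    return (q.astype(np.float32) - zero_point) * scale

x = np.array([0.1, -1.2, 0.55], dtype=np.float32)
q = quantize_linear(x, scale=0.01)
print(q, dequantize_linear(q, scale=0.01))
```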


I understand that this is somewhat different from your intended conversion flow, but you may want to try the following tool.

https://github.com/alibaba/TinyNeuralNetwork

Illia-tsar commented 1 month ago

@PINTO0309 Thank you for the response! I understand that directly generating the FlatBuffer format for TFLite is necessary and that there's no straightforward conversion scheme using the existing TFLiteConverter. Could you please provide a more detailed explanation or any documentation/references on how to implement this "special process" to directly generate the TFLite model from the ONNX model while preserving the quantization parameters? I'm considering exploring this approach and potentially contributing a Pull Request if it works out. Any guidance or insights would be greatly appreciated!
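
As a starting point for that exploration, a small sketch (model path hypothetical) showing that in a TFLite FlatBuffer the quantization parameters live on the tensors themselves, which is what a direct-generation approach would have to populate from the ONNX QDQ nodes:

```python
import tensorflow as tf

# List the per-tensor quantization parameters stored in a TFLite model.
interpreter = tf.lite.Interpreter(model_path="model_int8.tflite")  # hypothetical file
interpreter.allocate_tensors()

for detail in interpreter.get_tensor_details():
    qp = detail["quantization_parameters"]
    if len(qp["scales"]) > 0:
        print(detail["name"], detail["dtype"].__name__,
              "scale:", qp["scales"][:1], "zero_point:", qp["zero_points"][:1])
```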

Illia-tsar commented 1 month ago

I was also thinking about an alternative approach to this task (a rough sketch follows the list below):

  1. Start by iterating over the nodes in the ONNX graph. Identify QuantizeLinear and DequantizeLinear (QDQ) nodes that are applied to weights, biases, and activations.
  2. For weights and biases, retrieve the scales and zero points from the corresponding QuantizeLinear nodes. Adjust the weights and biases by applying these parameters, effectively converting them to int8 format.
  3. Apply a similar process to activations, given that their corresponding scales and zero-points are known.
  4. After applying the necessary transformations, remove all QDQ nodes from the graph. This results in a simplified int8 ONNX graph with the quantization parameters directly embedded in the weights, biases, and activations.
  5. Run this full int8 ONNX graph through the onnx2tf Python library. Since the graph is already quantized, the conversion process should focus on transferring the layers and preserving the data types without any intermediate quantization.
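
A minimal, untested sketch of steps 1, 2, and 4 for weights stored as float initializers feeding a QuantizeLinear node (per-tensor quantization only; per-channel scales with an axis attribute, bias handling, and the actual graph rewiring are omitted, and all names are illustrative):

```python
import numpy as np
import onnx
from onnx import numpy_helper

model = onnx.load("model_qdq.onnx")  # hypothetical path
graph = model.graph
inits = {init.name: init for init in graph.initializer}

def get_const(name):
    """Return an initializer as a numpy array, or None if it is not a constant."""
    return numpy_helper.to_array(inits[name]) if name in inits else None

quantized_weights = {}  # QuantizeLinear output name -> int8 array
for node in graph.node:
    if node.op_type != "QuantizeLinear":
        continue
    data = get_const(node.input[0])
    scale = get_const(node.input[1])
    zero_point = get_const(node.input[2]) if len(node.input) > 2 else np.int8(0)
    if data is None or scale is None:
        continue  # activation QDQ: scale/zero point are static, but the data is runtime-only
    # QuantizeLinear semantics (per-tensor): q = saturate(round(x / scale) + zero_point)
    q = np.clip(np.round(data / scale) + zero_point, -128, 127).astype(np.int8)
    quantized_weights[node.output[0]] = q

print(f"folded {len(quantized_weights)} weight tensors to int8")
```

From here the corresponding Q/DQ pairs would be removed and the int8 initializers written back into the graph before handing it to onnx2tf.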

Questions:

  1. Does this approach sound sensible to you?
  2. What are the potential limitations or constraints when applying scales and zero points to activations to ensure they are compatible with TFLite conversion?

Thanks in advance!

PINTO0309 commented 1 month ago

I am too busy to review and reply this week.

Illia-tsar commented 1 month ago

That's fine. Anyway, thanks!