Samsung / ONE

On-device Neural Engine

[circle-quantizer] Support int8 quantization #12165

Open BalyshevArtem opened 7 months ago

BalyshevArtem commented 7 months ago

What

Let's support int8 quantization in circle-quantizer.

Why

Onert-micro supports int8-quantized kernels and includes faster CMSIS-NN kernels, which work with int8 quantization, not uint8. Supporting int8 in circle-quantizer will make it easier for users to obtain such models and therefore to use the CMSIS-NN kernels.
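For context, CMSIS-NN assumes the tflite affine int8 mapping real_value = scale * (q - zero_point) with q in [-128, 127]. A minimal numpy sketch of that mapping (the values here are only illustrative, not part of circle-quantizer):

```python
import numpy as np

def quantize_int8(x, scale, zero_point):
    # Affine mapping from the tflite int8 spec: q = round(x / scale) + zero_point
    q = np.round(x / scale) + zero_point
    return np.clip(q, -128, 127).astype(np.int8)

def dequantize_int8(q, scale, zero_point):
    return scale * (q.astype(np.float32) - zero_point)

x = np.array([-1.0, -0.25, 0.0, 0.5, 1.0], dtype=np.float32)
scale, zero_point = 1.0 / 128, 0  # weights are symmetric in the tflite spec (zero_point = 0)
print(dequantize_int8(quantize_int8(x, scale, zero_point), scale, zero_point))
```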

How

TBD:

BalyshevArtem commented 7 months ago

@jinevening Could you please give your opinion about this feature?

jinevening commented 7 months ago

Support for int8 quantization looks good to me. I think there are two possible approaches.

  1. Implement a quantization algorithm in circle-quantizer: the CMSIS-NN kernels follow the tflite int8 quantization spec (https://www.tensorflow.org/lite/performance/quantization_spec?hl=ko), which is slightly different from our backend's (e.g., we allow different scales between the input and output of pooling operators). This conflicts with the current implementation of QuantizeWithMinMax. I think it would be better to add a separate pass, e.g., TFLiteQuantizeWithMinMax, that quantizes a circle model based on the tflite spec (record-minmax does not have to be changed).

  2. Use TFLiteConverter: We can make a tool to convert circle to tflite, e.g., circle2tflite, use TFLiteConverter for quantization, and then convert the quantized tflite back to circle using tflite2circle. We may need an additional tool (or library) to convert our calibration dataset (.h5, list format) into a format TFLiteConverter can consume.

I prefer the second approach, because it would require lower implementation/maintenance cost than the first. I also believe that circle2tflite has value of its own, not just for this use case.
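For reference, the int8 knobs TFLiteConverter exposes for the second approach look roughly like the sketch below. This only illustrates the converter interface; the SavedModel path and calibration data are placeholders, and how a circle2tflite result would be fed into this flow is left open.

```python
import numpy as np
import tensorflow as tf

def representative_dataset():
    # Placeholder calibration data; in practice this would come from our .h5 / list dataset.
    for _ in range(100):
        yield [np.random.rand(1, 224, 224, 3).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_saved_model("model_dir")  # placeholder path
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
# Force full-integer int8 quantization per the tflite quantization spec.
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

with open("model_int8.tflite", "wb") as f:
    f.write(converter.convert())
```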

seanshpark commented 7 months ago

circle2tflite

this is not recommended by our policy.

jinevening commented 7 months ago

May I ask what the policy is?

seanshpark commented 7 months ago

May I ask what the policy is?

Please have an offline talk with @lemmaa

BalyshevArtem commented 7 months ago
  1. TFLiteQuantizeWithMinMax that quantizes a circle model based on the tflite spec (record-minmax does not have to be changed).

So, shall we use the first approach then, creating TFLiteQuantizeWithMinMax to quantize into int8? Also, do you know whether there are any problems with quantizing into int16? Is the policy the same as tflite's?

jinevening commented 7 months ago

I had a short conversation with @lemmaa, but we couldn't reach a final decision yet. There are a couple of issues to discuss.

  1. Implementation & maintenance cost
  2. Integration with existing features (mixed precision, verifier, fake quantizer)

@lemmaa If TF int8/int16 is supported, should our quantization features (mixed precision, verifier, fake quantizer) be extended to support those types? I think that would require a lot of effort but bring not much benefit.

@BalyshevArtem I have several questions.

  1. Could you list the operators you want to support in onert-micro? It does not have to be exact.
  2. Can you describe the target model? For example, full integer with single precision, full integer with mixed precision or not full integer with mixed precision.
  3. Does this task have a deadline?

Also, do you know whether there are any problems with quantizing into int16? Is the policy the same as tflite's?

Our int16 quantizer is slightly different from the tflite quantizer. We have an option (e.g., --TF-style_maxpool) to sync with the tflite quantizer, but AFAIK that's not enough. You may need to implement a new quantizer (e.g., something like int16_tf), or use our quantizer with additional options to sync with the tflite quantizer.
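For comparison, tflite's own int16 path is the "16x8" mode (int16 activations, int8 weights), which is selected through a dedicated OpsSet rather than the int8 one; a minimal sketch, again with a placeholder model and calibration data:

```python
import numpy as np
import tensorflow as tf

def representative_dataset():
    # Placeholder calibration data.
    for _ in range(100):
        yield [np.random.rand(1, 224, 224, 3).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_saved_model("model_dir")  # placeholder path
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
# tflite "16x8" quantization: int16 activations, int8 weights.
converter.target_spec.supported_ops = [
    tf.lite.OpsSet.EXPERIMENTAL_TFLITE_BUILTINS_ACTIVATIONS_INT16_WEIGHTS_INT8
]
tflite_16x8_model = converter.convert()
```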

BalyshevArtem commented 7 months ago
  1. Could you list the operators you want to support in onert-micro? It does not have to be exact.

There should definitely be all kernels for which there is a CMSIS-NN implementation: Conv2D, DepthwiseConv2D, TransposeConv2D, FullyConnected, Add, Mul, MaxPooling, AvgPooling, Softmax, LSTM, SVDF. In the case of full int8 quantization of the network, it would also be useful to support as many of the other operations that can occur as possible: Reshape, StridedSlice, Gather, Concatenation, and so on.

2. Can you describe the target model? For example, full integer with single precision, full integer with mixed precision or not full integer with mixed precision.

I think full integer with single precision, and not-full integer with mixed precision, are our main goals for the current experiments: use either a fully int8-quantized model, or a mostly float model where the largest parts, or those that have a CMSIS-NN implementation, are quantized into int8.

3. Does this task have a deadline?

No

jinevening commented 7 months ago

@BalyshevArtem Thanks for the reply. I guess that you will be the assignee of this task. Could you give your opinion about the above issue?

BalyshevArtem commented 7 months ago

@BalyshevArtem Thanks for the reply. I guess that you will be the assignee of this task. Could you give your opinion about the above issue?

Implementation & maintenance cost - I think those of us working on the onert-micro side can implement and maintain this int8 quantization.

Integration with existing features (mixed precision, verifier, fake quantizer) - It seems to me that we don't need any of these extensions right now.

jinevening commented 7 months ago

Thanks for the opinion. So, the changes will be limited to:

  1. Interface of quantizer (adding int8 as a new quantized_dtype)
  2. int8 quantization pass
  3. Some optimization passes to sync with tflite int8 quantization
  4. onert-micro kernels

We will not mix tflite_int8 with the existing uint8/int16 in a single model (at least for a while).

One thing to note is that one-quantize has options related to the quantized_dtype.

Please make sure those options do not conflict with the new quantized_dtype, e.g., by throwing an exception for the new dtype or by extending the existing option's behavior.
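As a side note, once a quantized .tflite model is available (from either approach), the per-tensor scales and zero points can be dumped with the tflite Interpreter to check them against the tflite int8 spec; a minimal sketch, assuming a placeholder model_int8.tflite:

```python
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="model_int8.tflite")  # placeholder path
interpreter.allocate_tensors()

# Print name, dtype, and quantization parameters of every tensor in the model.
for detail in interpreter.get_tensor_details():
    qp = detail["quantization_parameters"]
    print(detail["name"], detail["dtype"], qp["scales"], qp["zero_points"])
```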