Samsung / ONE

On-device Neural Engine

[onert] Support hybrid quantization #11047

Open · hseok-oh opened 1 year ago

hseok-oh commented 1 year ago

Supported

Not yet

glistening commented 1 year ago

@hseok-oh Are you going to support hybrid quantization kernels for the following ops?

  • BatchMatMul
  • LSTM
  • RNN

If so, may I ask why, specifically which model you want to run?

hseok-oh commented 1 year ago

> Are you going to support hybrid quantization kernels for the following ops? If so, may I ask why, specifically which model you want to run?

The list is just based on the operator spec, not on any model requirement.

hseok-oh commented 1 year ago

Please refer to the compiler's quantizer issue: #9535

glistening commented 1 year ago

I would like to check the details of weight quantization.

1.

int8 restricted range looks reasonable at this moment (it is HW-friendly, e.g. for NEON optimization).

See https://www.tensorflow.org/lite/performance/quantization_spec. CONV_2D, DEPTHWISE_CONV_2D and FULLY_CONNECTED support this kind of range.

We may need to implement FullyConnected for the int8 restricted range; the existing kernel seems to use the uint8 type. We could choose to keep uint8, but for consistency, and to avoid making circle-quantizer complex, it would be good to introduce an int8 version. (A short worked example of the range difference follows this comment.)

2.

For this, I have no preference yet.

@hseok-oh, (@chunseoklee) Please give your opinion.
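For concreteness, a small worked example of the range difference (illustrative numbers, not from the original discussion): with symmetric restricted (narrow) range int8, zero_point = 0 and scale = max(|w_min|, |w_max|) / 127, so a weight tensor spanning [-0.5, 0.3] gets scale = 0.5 / 127 ≈ 0.0039 and quantized values in [-127, 76]. A full-range asymmetric formulation would instead use all 256 int8 values, e.g. scale = (w_max - w_min) / 255 ≈ 0.0031 with zero_point ≈ 31, so the same tensor maps onto [-128, 127].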

hseok-oh commented 1 year ago

> uint8 or int8?

int8 for weight (hybrid) quantization, for two reasons:

  1. uint8 is an outdated quantization scheme.
  2. uint8 hybrid quantization is int8 quantization internally anyway.

> restricted range (also called narrow range) or full range?

tensorflow/lite/tools/optimize/quantize_weights.cc:601

TfLiteStatus QuantizeWeights(flatbuffers::FlatBufferBuilder* builder,
                             const Model* input_model,
                             uint64_t weights_min_num_elements,
                             bool use_hybrid_evaluation,
                             QuantizerType quantizer_type) {
  // By default we require that only weights with more than
  // kWeightsMinSizeDefault elements are quantized.
  if (quantizer_type == QuantizerType::MLIR_QUANTIZER) {
    return mlir::lite::QuantizeWeights(
        builder, input_model, weights_min_num_elements, use_hybrid_evaluation);
  }
  CustomOpMap custom_op_map;
  return QuantizeWeightsInt8(builder, input_model, use_hybrid_evaluation,
                             weights_min_num_elements, custom_op_map,
                             kUseUpdatedHybridSchemeDefault);
}

tensorflow/lite/tools/optimize/quantize_weights.cc:415

    for (std::pair<int32_t, TensorPerChannel> tensor_pair : tensor_map) {
      // Quantize the tensor.
      if (tensor_pair.second.is_per_channel) {
        TF_LITE_ENSURE_STATUS(utils::SymmetricQuantizeTensorPerChannel(
            model.get(), tensor_pair.second.t, tensor_pair.second.channel_dim,
            nullptr));
      } else {
        TF_LITE_ENSURE_STATUS(
            utils::SymmetricQuantizeTensor(model.get(), tensor_pair.second.t));
      }
    }

Per-channel

tensorflow/lite/tools/optimize/quantization_utils.cc:598

  // Quantize the input data with respect to channel_dim_index.
  TF_LITE_ENSURE_STATUS(SymmetricPerChannelQuantization(
      tensor, float_input_data, channel_dim_index, &scales, &final_buffer,
      error_reporter));

tensorflow/lite/tools/optimize/quantization_utils.cc:322

  // Calculate scales per channel using max and min values from tensor.
  std::vector<float> scale_invs(channel_dim_size);
  const float half_scale = kMaxQuantizedValue;
  for (int channel_idx = 0; channel_idx < channel_dim_size; channel_idx++) {
    const float half_range =
        std::max(std::abs(tensor->quantization->min[channel_idx]),
                 std::abs(tensor->quantization->max[channel_idx]));
    output_scales->at(channel_idx) = half_range / half_scale;
    if (half_range == 0) {
      scale_invs[channel_idx] = 0;
    } else {
      scale_invs[channel_idx] = half_scale / half_range;
    }
  }

tensorflow/lite/tools/optimize/quantization_utils.cc:42 kMaxQuantizedValue = 127

[-127, 127]: the int8 weight in per-channel hybrid quantization uses the narrow range.
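To see the per-channel path above in one place, here is a minimal self-contained sketch (an illustrative re-implementation under the assumption of a [channel][element] weight layout, not the onert or TensorFlow Lite code):

```
// Per-channel symmetric narrow-range int8 quantization:
// scale[c] = max|w| over channel c / 127, zero_point is always 0,
// so quantized values stay inside [-127, 127].
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

struct PerChannelQuantized {
  std::vector<int8_t> values; // same layout as the float input
  std::vector<float> scales;  // one scale per channel
};

PerChannelQuantized QuantizePerChannelSymmetric(const std::vector<float> &weights,
                                                int num_channels) {
  const int elems_per_channel = static_cast<int>(weights.size()) / num_channels;
  PerChannelQuantized out;
  out.values.resize(weights.size());
  out.scales.resize(num_channels);

  for (int c = 0; c < num_channels; ++c) {
    const float *channel = weights.data() + c * elems_per_channel;

    // Half range = largest absolute value in this channel.
    float half_range = 0.f;
    for (int i = 0; i < elems_per_channel; ++i)
      half_range = std::max(half_range, std::abs(channel[i]));

    out.scales[c] = half_range / 127.f;
    const float inv_scale = (half_range == 0.f) ? 0.f : 127.f / half_range;

    for (int i = 0; i < elems_per_channel; ++i) {
      const int32_t q = static_cast<int32_t>(std::round(channel[i] * inv_scale));
      // Clamp to the narrow range; -128 is never produced.
      out.values[c * elems_per_channel + i] =
          static_cast<int8_t>(std::min(127, std::max(-127, q)));
    }
  }
  return out;
}
```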

Per-tensor

tensorflow/lite/tools/optimize/quantization_utils.cc:480

  float min_value, max_value, scaling_factor;
  tensor_utils::SymmetricQuantizeFloats(float_data, num_elements,
                                        quantized_buffer.data(), &min_value,
                                        &max_value, &scaling_factor);

tensorflow/lite/kernels/internal/reference/portable_tensor_utils.h:37

void SymmetricQuantizeFloats(const float* values, const int size,
                             int8_t* quantized_values, float* min, float* max,
                             float* scaling_factor) {
  PortableSymmetricQuantizeFloats(values, size, quantized_values, min, max,
                                  scaling_factor);
}

tensorflow/lite/kernels/internal/reference/portable_tensor_utils.cc:40

void PortableSymmetricQuantizeFloats(const float* values, const int size,
                                     int8_t* quantized_values, float* min_value,
                                     float* max_value, float* scaling_factor) {
  auto minmax = std::minmax_element(values, values + size);
  *min_value = *minmax.first;
  *max_value = *minmax.second;

  PortableSymmetricQuantizeFloats(values, size, quantized_values, *min_value,
                                  *max_value, scaling_factor);
}

void PortableSymmetricQuantizeFloats(const float* values, const int size,
                                     int8_t* quantized_values, float min_value,
                                     float max_value, float* scaling_factor) {
  const int32_t kScale = 127;
  const float range = std::max(std::abs(min_value), std::abs(max_value));
  if (range == 0) {
    memset(quantized_values, 0, size * sizeof(int8_t));
    *scaling_factor = 1;
    return;
  }
  *scaling_factor = range / kScale;
  const float scaling_factor_inv = kScale / range;
  for (int i = 0; i < size; ++i) {
    const int32_t quantized_value =
        static_cast<int32_t>(TfLiteRound(values[i] * scaling_factor_inv));
    // Clamp: just in case some odd numeric offset.
    quantized_values[i] = static_cast<int8_t>(
        std::min(kScale, std::max(-kScale, quantized_value)));
  }
}

[-127, 127]: the int8 weight in per-tensor hybrid quantization also uses the narrow range.

hseok-oh commented 1 year ago

> per-channel or per-layer?

per-layer first, and per-channel later if we need it (we already have an FC per-layer hybrid kernel).
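For reference, a rough sketch of what a per-layer (per-tensor) hybrid FullyConnected evaluation does, assuming symmetric int8 weights with a single scale (this is only an illustration, not the actual onert kernel):

```
// Hybrid FC: weights are pre-quantized int8, the float input is quantized
// on the fly, the GEMV accumulates in int32, and the accumulator is scaled
// back to float by input_scale * weight_scale.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

std::vector<float> HybridFullyConnected(const std::vector<float> &input,    // [in_dim]
                                        const std::vector<int8_t> &weights, // [out_dim * in_dim]
                                        float weight_scale, int out_dim) {
  const int in_dim = static_cast<int>(input.size());

  // Dynamically quantize the input to symmetric narrow-range int8.
  float max_abs = 0.f;
  for (float v : input) max_abs = std::max(max_abs, std::abs(v));
  const float input_scale = (max_abs == 0.f) ? 1.f : max_abs / 127.f;
  std::vector<int8_t> q_input(in_dim);
  for (int i = 0; i < in_dim; ++i) {
    const int32_t q = static_cast<int32_t>(std::round(input[i] / input_scale));
    q_input[i] = static_cast<int8_t>(std::min(127, std::max(-127, q)));
  }

  // int8 x int8 GEMV with int32 accumulation, then rescale to float.
  std::vector<float> output(out_dim);
  for (int o = 0; o < out_dim; ++o) {
    int32_t acc = 0;
    for (int i = 0; i < in_dim; ++i)
      acc += static_cast<int32_t>(weights[o * in_dim + i]) * q_input[i];
    output[o] = acc * input_scale * weight_scale;
  }
  return output;
}
```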

glistening commented 1 year ago

I investigated this further.

(EDIT) I updated this comment to fix a mistake.

TensorFlow Lite Quantization Spec[^1]

| Op | Input | Weight |
|---|---|---|
| FullyConnected | per-tensor, [-128, 127] | per-tensor, zero_point = 0, [-127, 127] |
| Conv2D | per-tensor, [-128, 127] | per-axis, zero_point = 0, [-127, 127] |
| DConv2D | per-tensor, [-128, 127] | per-axis, zero_point = 0, [-127, 127] |

[^1]: https://www.tensorflow.org/lite/performance/quantization_spec

```
CONV_2D
  Input 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  Input 1 (Weight):
    data_type  : int8
    range      : [-127, 127]
    granularity: per-axis (dim = 0)
    restriction: zero_point = 0
  Input 2 (Bias):
    data_type  : int32
    range      : [int32_min, int32_max]
    granularity: per-axis
    restriction: (scale, zero_point) = (input0_scale * input1_scale[...], 0)
  Output 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor

DEPTHWISE_CONV_2D
  Input 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  Input 1 (Weight):
    data_type  : int8
    range      : [-127, 127]
    granularity: per-axis (dim = 3)
    restriction: zero_point = 0
  Input 2 (Bias):
    data_type  : int32
    range      : [int32_min, int32_max]
    granularity: per-axis
    restriction: (scale, zero_point) = (input0_scale * input1_scale[...], 0)
  Output 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor

FULLY_CONNECTED
  Input 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  Input 1 (Weight):
    data_type  : int8
    range      : [-127, 127]
    granularity: per-tensor
    restriction: zero_point = 0
  Input 2 (Bias):
    data_type  : int32
    range      : [int32_min, int32_max]
    granularity: per-tensor
    restriction: (scale, zero_point) = (input0_scale * input1_scale[...], 0)
  Output 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
```
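One note on the bias restriction in the spec above (numbers are illustrative, not from the thread): the int32 bias has to use scale = input0_scale * input1_scale so that it can be added directly to the int32 accumulator of input-times-weight products. For example, with input0_scale = 0.02 and a weight scale of 0.005, the bias scale is 0.0001, so a float bias of 0.37 is stored as round(0.37 / 0.0001) = 3700.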

ONERT Implementation

TensorFlow Lite Implementation