Samsung / ONE

On-device Neural Engine

[onert] Support hybrid quantization #11047

Open · hseok-oh opened 1 year ago

hseok-oh commented 1 year ago

Supported

Not yet

glistening commented 1 year ago

@hseok-oh Are you going to support hybrid quantization kernels for the following ops?

  • BatchMatMul
  • LSTM
  • RNN

If so, may I ask why, specifically which model you want to run?

hseok-oh commented 1 year ago

> Are you going to support hybrid quantization kernels for the following ops? If so, may I ask why, specifically which model you want to run?

The list is just based on the operator spec, not on any model requirement.

hseok-oh commented 1 year ago

Please refer to the compiler's quantizer issue: #9535

glistening commented 1 year ago

I would like to check the details of weight quantization.

1.

int8 restricted range looks reasonable at this moment (it is HW-friendly, e.g. for NEON optimization).

See https://www.tensorflow.org/lite/performance/quantization_spec. CONV_2D, DEPTHWISE_CONV_2D and FULLY_CONNECTED support this kind of range.

We may need to implement FullyConnected for the int8 restricted range; the existing kernel seems to use the uint8 type. We could choose to keep uint8, but for consistency, and to avoid making circle-quantizer complex, it would be good to introduce an int8 version. (A short worked example of the range difference follows this comment.)

2.

For this, I have no preference yet.

@hseok-oh, (@chunseoklee) Please give your opinion.
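For concreteness, a small worked example of the range difference (illustrative numbers, not from the original discussion): with symmetric restricted (narrow) range int8, zero_point = 0 and scale = max(|w_min|, |w_max|) / 127, so a weight tensor spanning [-0.5, 0.3] gets scale = 0.5 / 127 ≈ 0.0039 and quantized values in [-127, 76]. A full-range asymmetric formulation would instead use all 256 int8 values, e.g. scale = (w_max - w_min) / 255 ≈ 0.0031 with zero_point ≈ 31, so the same tensor maps onto [-128, 127].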

hseok-oh commented 1 year ago

> uint8 or int8?

int8 for weight (hybrid) quantization, for two reasons:

  1. uint8 is an outdated quantization scheme.
  2. uint8 hybrid quantization is int8 quantization internally anyway.

> restricted range (also called narrow range) or full range?

tensorflow/lite/tools/optimize/quantize_weights.cc:601

TfLiteStatus QuantizeWeights(flatbuffers::FlatBufferBuilder* builder,
                             const Model* input_model,
                             uint64_t weights_min_num_elements,
                             bool use_hybrid_evaluation,
                             QuantizerType quantizer_type) {
  // By default we require that only weights with more than
  // kWeightsMinSizeDefault elements are quantized.
  if (quantizer_type == QuantizerType::MLIR_QUANTIZER) {
    return mlir::lite::QuantizeWeights(
        builder, input_model, weights_min_num_elements, use_hybrid_evaluation);
  }
  CustomOpMap custom_op_map;
  return QuantizeWeightsInt8(builder, input_model, use_hybrid_evaluation,
                             weights_min_num_elements, custom_op_map,
                             kUseUpdatedHybridSchemeDefault);
}

tensorflow/lite/tools/optimize/quantize_weights.cc:415

    for (std::pair<int32_t, TensorPerChannel> tensor_pair : tensor_map) {
      // Quantize the tensor.
      if (tensor_pair.second.is_per_channel) {
        TF_LITE_ENSURE_STATUS(utils::SymmetricQuantizeTensorPerChannel(
            model.get(), tensor_pair.second.t, tensor_pair.second.channel_dim,
            nullptr));
      } else {
        TF_LITE_ENSURE_STATUS(
            utils::SymmetricQuantizeTensor(model.get(), tensor_pair.second.t));
      }
    }

Per-channel

tensorflow/lite/tools/optimize/quantization_utils.cc:598

  // Quantize the input data with respect to channel_dim_index.
  TF_LITE_ENSURE_STATUS(SymmetricPerChannelQuantization(
      tensor, float_input_data, channel_dim_index, &scales, &final_buffer,
      error_reporter));

tensorflow/lite/tools/optimize/quantization_utils.cc:322

  // Calculate scales per channel using max and min values from tensor.
  std::vector<float> scale_invs(channel_dim_size);
  const float half_scale = kMaxQuantizedValue;
  for (int channel_idx = 0; channel_idx < channel_dim_size; channel_idx++) {
    const float half_range =
        std::max(std::abs(tensor->quantization->min[channel_idx]),
                 std::abs(tensor->quantization->max[channel_idx]));
    output_scales->at(channel_idx) = half_range / half_scale;
    if (half_range == 0) {
      scale_invs[channel_idx] = 0;
    } else {
      scale_invs[channel_idx] = half_scale / half_range;
    }
  }

tensorflow/lite/tools/optimize/quantization_utils.cc:42 kMaxQuantizedValue = 127

[-127, 127]: the int8 weight in per-channel hybrid quantization uses the narrow range.
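To see the per-channel path above in one place, here is a minimal self-contained sketch (an illustrative re-implementation under the assumption of a [channel][element] weight layout, not the onert or TensorFlow Lite code):

```
// Per-channel symmetric narrow-range int8 quantization:
// scale[c] = max|w| over channel c / 127, zero_point is always 0,
// so quantized values stay inside [-127, 127].
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

struct PerChannelQuantized {
  std::vector<int8_t> values; // same layout as the float input
  std::vector<float> scales;  // one scale per channel
};

PerChannelQuantized QuantizePerChannelSymmetric(const std::vector<float> &weights,
                                                int num_channels) {
  const int elems_per_channel = static_cast<int>(weights.size()) / num_channels;
  PerChannelQuantized out;
  out.values.resize(weights.size());
  out.scales.resize(num_channels);

  for (int c = 0; c < num_channels; ++c) {
    const float *channel = weights.data() + c * elems_per_channel;

    // Half range = largest absolute value in this channel.
    float half_range = 0.f;
    for (int i = 0; i < elems_per_channel; ++i)
      half_range = std::max(half_range, std::abs(channel[i]));

    out.scales[c] = half_range / 127.f;
    const float inv_scale = (half_range == 0.f) ? 0.f : 127.f / half_range;

    for (int i = 0; i < elems_per_channel; ++i) {
      const int32_t q = static_cast<int32_t>(std::round(channel[i] * inv_scale));
      // Clamp to the narrow range; -128 is never produced.
      out.values[c * elems_per_channel + i] =
          static_cast<int8_t>(std::min(127, std::max(-127, q)));
    }
  }
  return out;
}
```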

Per-tensor

tensorflow/lite/tools/optimize/quantization_utils.cc:480

  float min_value, max_value, scaling_factor;
  tensor_utils::SymmetricQuantizeFloats(float_data, num_elements,
                                        quantized_buffer.data(), &min_value,
                                        &max_value, &scaling_factor);

tensorflow/lite/kernels/internal/reference/portable_tensor_utils.h:37

void SymmetricQuantizeFloats(const float* values, const int size,
                             int8_t* quantized_values, float* min, float* max,
                             float* scaling_factor) {
  PortableSymmetricQuantizeFloats(values, size, quantized_values, min, max,
                                  scaling_factor);
}

tensorflow/lite/kernels/internal/reference/portable_tensor_utils.cc:40

void PortableSymmetricQuantizeFloats(const float* values, const int size,
                                     int8_t* quantized_values, float* min_value,
                                     float* max_value, float* scaling_factor) {
  auto minmax = std::minmax_element(values, values + size);
  *min_value = *minmax.first;
  *max_value = *minmax.second;

  PortableSymmetricQuantizeFloats(values, size, quantized_values, *min_value,
                                  *max_value, scaling_factor);
}

void PortableSymmetricQuantizeFloats(const float* values, const int size,
                                     int8_t* quantized_values, float min_value,
                                     float max_value, float* scaling_factor) {
  const int32_t kScale = 127;
  const float range = std::max(std::abs(min_value), std::abs(max_value));
  if (range == 0) {
    memset(quantized_values, 0, size * sizeof(int8_t));
    *scaling_factor = 1;
    return;
  }
  *scaling_factor = range / kScale;
  const float scaling_factor_inv = kScale / range;
  for (int i = 0; i < size; ++i) {
    const int32_t quantized_value =
        static_cast<int32_t>(TfLiteRound(values[i] * scaling_factor_inv));
    // Clamp: just in case some odd numeric offset.
    quantized_values[i] = static_cast<int8_t>(
        std::min(kScale, std::max(-kScale, quantized_value)));
  }
}

[-127, 127]: the int8 weight in per-tensor hybrid quantization also uses the narrow range.

hseok-oh commented 1 year ago

> per-channel or per-layer?

per-layer first, and per-channel later if we need it (we already have an FC per-layer hybrid kernel).
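For reference, a rough sketch of what a per-layer (per-tensor) hybrid FullyConnected evaluation does, assuming symmetric int8 weights with a single scale (this is only an illustration, not the actual onert kernel):

```
// Hybrid FC: weights are pre-quantized int8, the float input is quantized
// on the fly, the GEMV accumulates in int32, and the accumulator is scaled
// back to float by input_scale * weight_scale.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

std::vector<float> HybridFullyConnected(const std::vector<float> &input,    // [in_dim]
                                        const std::vector<int8_t> &weights, // [out_dim * in_dim]
                                        float weight_scale, int out_dim) {
  const int in_dim = static_cast<int>(input.size());

  // Dynamically quantize the input to symmetric narrow-range int8.
  float max_abs = 0.f;
  for (float v : input) max_abs = std::max(max_abs, std::abs(v));
  const float input_scale = (max_abs == 0.f) ? 1.f : max_abs / 127.f;
  std::vector<int8_t> q_input(in_dim);
  for (int i = 0; i < in_dim; ++i) {
    const int32_t q = static_cast<int32_t>(std::round(input[i] / input_scale));
    q_input[i] = static_cast<int8_t>(std::min(127, std::max(-127, q)));
  }

  // int8 x int8 GEMV with int32 accumulation, then rescale to float.
  std::vector<float> output(out_dim);
  for (int o = 0; o < out_dim; ++o) {
    int32_t acc = 0;
    for (int i = 0; i < in_dim; ++i)
      acc += static_cast<int32_t>(weights[o * in_dim + i]) * q_input[i];
    output[o] = acc * input_scale * weight_scale;
  }
  return output;
}
```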

glistening commented 1 year ago

I investigated this further.

(EDIT) I updated this comment to fix a mistake.

TensorFlow Lite Quantization Spec[^1]

| Op | Input | Weight |
|---|---|---|
| FullyConnected | per-tensor, [-128, 127] | per-tensor, zero_point = 0, [-127, 127] |
| Conv2D | per-tensor, [-128, 127] | per-axis, zero_point = 0, [-127, 127] |
| DConv2D | per-tensor, [-128, 127] | per-axis, zero_point = 0, [-127, 127] |

[^1]: https://www.tensorflow.org/lite/performance/quantization_spec

```
CONV_2D
  Input 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  Input 1 (Weight):
    data_type  : int8
    range      : [-127, 127]
    granularity: per-axis (dim = 0)
    restriction: zero_point = 0
  Input 2 (Bias):
    data_type  : int32
    range      : [int32_min, int32_max]
    granularity: per-axis
    restriction: (scale, zero_point) = (input0_scale * input1_scale[...], 0)
  Output 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor

DEPTHWISE_CONV_2D
  Input 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  Input 1 (Weight):
    data_type  : int8
    range      : [-127, 127]
    granularity: per-axis (dim = 3)
    restriction: zero_point = 0
  Input 2 (Bias):
    data_type  : int32
    range      : [int32_min, int32_max]
    granularity: per-axis
    restriction: (scale, zero_point) = (input0_scale * input1_scale[...], 0)
  Output 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor

FULLY_CONNECTED
  Input 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  Input 1 (Weight):
    data_type  : int8
    range      : [-127, 127]
    granularity: per-tensor
    restriction: zero_point = 0
  Input 2 (Bias):
    data_type  : int32
    range      : [int32_min, int32_max]
    granularity: per-tensor
    restriction: (scale, zero_point) = (input0_scale * input1_scale[...], 0)
  Output 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
```
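One note on the bias restriction in the spec above (numbers are illustrative, not from the thread): the int32 bias has to use scale = input0_scale * input1_scale so that it can be added directly to the int32 accumulator of input-times-weight products. For example, with input0_scale = 0.02 and a weight scale of 0.005, the bias scale is 0.0001, so a float bias of 0.37 is stored as round(0.37 / 0.0001) = 3700.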

ONERT Implementation

TensorFlow Lite Implementation