hseok-oh opened this issue 1 year ago
@hseok-oh Are you going to support hybrid quantization kernels for the following operations?

- BatchMatMul
- LSTM
- RNN

If so, may I ask why? Specifically, which model do you want to run?
> Are you going to support hybrid quantization kernels for the following operations? If so, may I ask why? Specifically, which model do you want to run?
The list is based on the operator spec, not on any model requirement.
Please refer to the compiler's quantizer issue: #9535
I would like to check the details of weights quantization.

**`uint8` or `int8`? Restricted range (also called narrow range) or full range?**

`int8` restricted range looks reasonable at this moment, since it is HW-friendly (e.g., for NEON optimization). See https://www.tensorflow.org/lite/performance/quantization_spec. `CONV_2D`, `DEPTHWISE_CONV_2D`, and `FULLY_CONNECTED` support this kind. We may need to implement `FullyConnected` for the `int8` restricted range; the existing kernel seems to use the `uint8` type. We could keep using `uint8`, but for consistency, and to avoid making `circle-quantizer` complex, it would be good to introduce an `int8` version.
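To make the HW-friendliness point concrete, here is a minimal sketch of my own (not ONE or TFLite code; `HybridDotProduct` is a hypothetical name): with symmetric `int8` weights in the restricted range, the weight zero-point is 0, so a hybrid kernel's inner loop is a plain integer dot product with no zero-point correction term, which maps directly onto SIMD multiply-accumulate instructions such as NEON's.

```cpp
#include <cstdint>

// Illustrative only: hybrid (float activation, int8 weight) dot product
// with symmetric weights (zero_point == 0).
float HybridDotProduct(const int8_t* weights, const int8_t* quantized_input,
                       int size, float weight_scale, float input_scale) {
  int32_t acc = 0;
  for (int i = 0; i < size; ++i) {
    // No zero-point terms: symmetric quantization keeps this a pure MAC loop.
    acc += static_cast<int32_t>(weights[i]) *
           static_cast<int32_t>(quantized_input[i]);
  }
  // Rescale the integer accumulator back to float.
  return static_cast<float>(acc) * weight_scale * input_scale;
}
```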
**per-channel or per-layer?**

For this, I have no preference yet.

@hseok-oh (@chunseoklee), please give your opinion.
> `uint8` or `int8`?

`int8` for weight (hybrid) quantization, for two reasons:

- `uint8` is an outdated quantization scheme.
- `uint8` hybrid quantization is `int8` quantization internally.

> restricted range (also called narrow range) or full range?
`tensorflow/lite/tools/optimize/quantize_weights.cc:601` — the default (non-MLIR) path calls `QuantizeWeightsInt8`:

```cpp
TfLiteStatus QuantizeWeights(flatbuffers::FlatBufferBuilder* builder,
                             const Model* input_model,
                             uint64_t weights_min_num_elements,
                             bool use_hybrid_evaluation,
                             QuantizerType quantizer_type) {
  // By default we require that only weights with more than
  // kWeightsMinSizeDefault elements are quantized.
  if (quantizer_type == QuantizerType::MLIR_QUANTIZER) {
    return mlir::lite::QuantizeWeights(
        builder, input_model, weights_min_num_elements, use_hybrid_evaluation);
  }
  CustomOpMap custom_op_map;
  return QuantizeWeightsInt8(builder, input_model, use_hybrid_evaluation,
                             weights_min_num_elements, custom_op_map,
                             kUseUpdatedHybridSchemeDefault);
}
```
`tensorflow/lite/tools/optimize/quantize_weights.cc:415` — weights are quantized symmetrically, per-channel or per-tensor:

```cpp
for (std::pair<int32_t, TensorPerChannel> tensor_pair : tensor_map) {
  // Quantize the tensor.
  if (tensor_pair.second.is_per_channel) {
    TF_LITE_ENSURE_STATUS(utils::SymmetricQuantizeTensorPerChannel(
        model.get(), tensor_pair.second.t, tensor_pair.second.channel_dim,
        nullptr));
  } else {
    TF_LITE_ENSURE_STATUS(
        utils::SymmetricQuantizeTensor(model.get(), tensor_pair.second.t));
  }
}
```
`tensorflow/lite/tools/optimize/quantization_utils.cc:598` — the per-channel path:

```cpp
// Quantize the input data with respect to channel_dim_index.
TF_LITE_ENSURE_STATUS(SymmetricPerChannelQuantization(
    tensor, float_input_data, channel_dim_index, &scales, &final_buffer,
    error_reporter));
```
`tensorflow/lite/tools/optimize/quantization_utils.cc:322`:

```cpp
// Calculate scales per channel using max and min values from tensor.
std::vector<float> scale_invs(channel_dim_size);
const float half_scale = kMaxQuantizedValue;
for (int channel_idx = 0; channel_idx < channel_dim_size; channel_idx++) {
  const float half_range =
      std::max(std::abs(tensor->quantization->min[channel_idx]),
               std::abs(tensor->quantization->max[channel_idx]));
  output_scales->at(channel_idx) = half_range / half_scale;
  if (half_range == 0) {
    scale_invs[channel_idx] = 0;
  } else {
    scale_invs[channel_idx] = half_scale / half_range;
  }
}
```
`tensorflow/lite/tools/optimize/quantization_utils.cc:42` defines `kMaxQuantizedValue = 127`, so the quantized range is `[-127, 127]`.

Weights under per-channel hybrid quantization are `int8` with the narrow range `[-127, 127]`.
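As a worked example (my own numbers, not taken from the sources above): for a channel whose float weights span min = -0.8 and max = 2.54, the code above gives `half_range = 2.54` and `scale = 2.54 / 127 = 0.02`, with an implicit zero-point of 0. A self-contained sketch of that scale derivation (`PerChannelScale` is a hypothetical helper):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>

// Illustrative re-implementation of the per-channel scale math above;
// names are my own, not from the TFLite sources.
float PerChannelScale(float channel_min, float channel_max) {
  const float half_scale = 127.0f;  // kMaxQuantizedValue (narrow range)
  const float half_range =
      std::max(std::abs(channel_min), std::abs(channel_max));
  return half_range / half_scale;  // zero_point is implicitly 0
}

int main() {
  // min = -0.8, max = 2.54 -> scale = 2.54 / 127 = 0.02
  std::printf("scale = %f\n", PerChannelScale(-0.8f, 2.54f));
  return 0;
}
```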
`tensorflow/lite/tools/optimize/quantization_utils.cc:480` — the per-tensor path:

```cpp
float min_value, max_value, scaling_factor;
tensor_utils::SymmetricQuantizeFloats(float_data, num_elements,
                                      quantized_buffer.data(), &min_value,
                                      &max_value, &scaling_factor);
```
`tensorflow/lite/kernels/internal/reference/portable_tensor_utils.h:37`:

```cpp
void SymmetricQuantizeFloats(const float* values, const int size,
                             int8_t* quantized_values, float* min, float* max,
                             float* scaling_factor) {
  PortableSymmetricQuantizeFloats(values, size, quantized_values, min, max,
                                  scaling_factor);
}
```
`tensorflow/lite/kernels/internal/reference/portable_tensor_utils.cc:40`:

```cpp
void PortableSymmetricQuantizeFloats(const float* values, const int size,
                                     int8_t* quantized_values, float* min_value,
                                     float* max_value, float* scaling_factor) {
  auto minmax = std::minmax_element(values, values + size);
  *min_value = *minmax.first;
  *max_value = *minmax.second;
  PortableSymmetricQuantizeFloats(values, size, quantized_values, *min_value,
                                  *max_value, scaling_factor);
}

void PortableSymmetricQuantizeFloats(const float* values, const int size,
                                     int8_t* quantized_values, float min_value,
                                     float max_value, float* scaling_factor) {
  const int32_t kScale = 127;
  const float range = std::max(std::abs(min_value), std::abs(max_value));
  if (range == 0) {
    memset(quantized_values, 0, size * sizeof(int8_t));
    *scaling_factor = 1;
    return;
  }
  *scaling_factor = range / kScale;
  const float scaling_factor_inv = kScale / range;
  for (int i = 0; i < size; ++i) {
    const int32_t quantized_value =
        static_cast<int32_t>(TfLiteRound(values[i] * scaling_factor_inv));
    // Clamp: just in case some odd numeric offset.
    quantized_values[i] = static_cast<int8_t>(
        std::min(kScale, std::max(-kScale, quantized_value)));
  }
}
```
So the quantized range is again `[-127, 127]` (`kScale = 127`): weights under per-tensor hybrid quantization are also `int8` with the narrow range.
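A minimal, self-contained walkthrough of that per-tensor path (my own mirror of the math above, not the actual TFLite source): quantizing `{-2.0, 0.0, 1.0}` gives `range = 2.0`, `scaling_factor = 2.0 / 127`, and quantized values -127, 0, and 64.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstdio>

int main() {
  const float values[] = {-2.0f, 0.0f, 1.0f};
  const int32_t kScale = 127;  // narrow range: [-127, 127]
  float range = 0.0f;
  for (float v : values) range = std::max(range, std::abs(v));
  const float scaling_factor = range / kScale;      // 2.0 / 127 ~= 0.0157
  const float scaling_factor_inv = kScale / range;
  std::printf("scaling_factor = %f\n", scaling_factor);
  for (float v : values) {
    int32_t q = static_cast<int32_t>(std::round(v * scaling_factor_inv));
    q = std::min(kScale, std::max(-kScale, q));     // clamp into [-127, 127]
    std::printf("%+.2f -> %d\n", v, q);             // -127, 0, 64
  }
  return 0;
}
```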
> per-channel or per-layer?

per-layer first, and per-channel later if we need it. (We already have a per-layer hybrid FullyConnected kernel.)
I investigated more. (EDIT: I updated this comment to fix my mistake.)
Summary[^1]:

| Op | Input | Weight |
| --- | --- | --- |
| FullyConnected | per-tensor, [-128, 127] | per-tensor, zero_point = 0, [-127, 127] |
| Conv2D | per-tensor, [-128, 127] | per-axis, zero_point = 0, [-127, 127] |
| DConv2D | per-tensor, [-128, 127] | per-axis, zero_point = 0, [-127, 127] |
[^1]: https://www.tensorflow.org/lite/performance/quantization_spec