apache / tvm

Open deep learning compiler stack for cpu, gpu and specialized accelerators
https://tvm.apache.org/
Apache License 2.0

[RFC][Quantization] Support quantized models from TensorflowLite #2351

Closed FrozenGene closed 5 years ago

FrozenGene commented 5 years ago

Let me reference @ajtulloch 's comment about quantization workflow firstly:

  1. Implement a model in a standard ML framework, generally using fp16/bfloat16/fp32 compute precision as this has highest throughput on most commonly-used training hardware.

  2. (optionally) insert fake quantization (here, called simulated quantization) nodes at quantization boundaries (i.e. if your backend implements a fused Int8Conv + Int8Relu, you'd insert them after a Conv + Relu block), to simulate the quantization numerics at training time.

  3. Train the model as usual

  4. Implement a graph rewriting pass (i.e. TF's toco, C2's int8_converter, MXNet's quantization, etc) that rewrites the graph to target the int8 operators directly — i.e. remapping subgraphs of e.g. FP32Conv + FP32Relu to be a fused Int8ConvRelu operator. This requires computing output quantization parameters at requantization boundaries, which can be done either by

    • calibration to an example set of activations, via e.g. l-p norm or kl minimization (c2/tf/mxnet/tensorrt)
    • using activation ranges learned during training (c2/tf).
  5. Using this quantized graph, evaluate various metrics to verify the quantization-induced error/loss is acceptable.

  6. Deploy the quantized graph.

However, some frameworks, such as TensorFlow, can already do steps 1 through 5 well. For example, TensorFlow has quantization-aware training, which performs step 2 and achieves good accuracy in the end.

In industry, one common scenario is that a company splits the algorithm work and the engine / framework work into two different teams. The algorithm team simply hands a model to the engine team to boost its performance. So if the algorithm team can use TensorFlow's quantization-aware training, they will know the accuracy before delivering the model to the engine team, and the engine team is only responsible for boosting the performance.

For the above reason, I will make several PRs to support importing existing quantized models (TFLite INT8 models) into TVM. This is not a replacement of https://github.com/dmlc/tvm/pull/2116; it is just a supplement to TVM's own quantization.

After some initial investigation and effort, on the MobileNet V1 model, INT8 gives a speedup of about 30% compared with FP32 on ARM CPU.

Welcome any feedback.

jnorwood commented 5 years ago

Yes, right. The scaling constant computed during training is based on the range of values seen after the fused-in activations (at least that is true for the tflite quantized models I've looked at). That includes being after the relu6 positive clipping as well. During inference, the min and max saturation values just handle saturation of values that fall outside the range expected from training... whether or not there was a fused-in activation operation during training.

zhenhuaw-me commented 5 years ago

It appears to me this would let them simulate smaller than 8 bit quantizations.

If simulating smaller-than-8-bit quantization is the case, 8 bits should be able to hold the activation min/max values.

8 bits could hold it. But what would the value of output_min / output_max be? I think that is the point @jnorwood wants to make, because we cannot simply use out_dtype to decide what the value range is. But if we insert a clip op in the frontend, I think it could also be handled; we just need some logic to calculate the min / max. See my next comment.

I was saying that the reasoning "It appears to me this would let them simulate smaller than 8 bit quantizations" may not be the only possibility.

zhenhuaw-me commented 5 years ago

During inference, the min and max saturation values are just handling saturation of values seen outside the range expected from the training...

I guess the saturation is exactly what activations (ReLU family) mean, semantically. :)

FrozenGene commented 5 years ago

Although the quantized conv result is held in uint8, it could be static casted to signed int8, or even fewer than 8 bit quantization. That would require both min and max saturations, as in the reference tflite quantized conv implementation

Ah, I see. That finally makes sense. So, this is not about activation. This is about what representation one is using for storing the floating point values. For example, if it is 7-bits, we will need the output min/max saturations. Cool, I will add them into the API and add corresponding documentation.

See @jackwish's comment. As my code calculate_activation_range_uint8 shows, only when there is no activation do we get the full range of the data type, i.e. 0 - 255 for uint8. If we have RELU6, we get the range computed in https://github.com/tensorflow/tensorflow/blob/v2.0.0-beta1/tensorflow/lite/kernels/kernel_util.cc#L152

So what if we are at 7 bits? We could still use 8 bits to represent output_min / output_max in conv's compute kernel, i.e. output_min / output_max would be 0 / 255, but in our frontend we would do something like this:

# If we are 7 bits
if weight_tensor_type == TensorType.UINT7:
    # implement this function
    output_min, output_max = self.calculate_activation_range_uint7(
        output_scale, output_zero_point, fused_activation_fn)
    # insert clip
    out = _op.clip(out, output_min, output_max)

That is to say, no matter whether we have an activation, we will have one clip. If there is no activation, we clamp to 0 / 127 (even though the value is stored in the 8-bit range 0 / 255). If we have an activation, for example RELU6, the code changes too (see https://github.com/tensorflow/tensorflow/blob/v2.0.0-beta1/tensorflow/lite/kernels/kernel_util.cc#L152):

   *act_min = std::max(qmin, quantize(0.0));
   *act_max = std::min(qmax, quantize(6.0));

q_min is 0, q_max is 127.

So, if we decide to insert a clip operator in the frontend, we can handle fewer than 8 bits too.

One potential optimization: if TVM supported a data type like UINT7, we could follow the same logic as UINT8, which means we could avoid inserting the clip operator in the frontend when there is no activation (just set out_dtype to UINT7). However, I don't think this is the bottleneck.
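For reference, here is a minimal Python sketch of such a range helper, mirroring the TFLite CalculateActivationRangeUint8 logic linked above (the activation-name strings and exact signature are illustrative assumptions, not TFLite's or TVM's API):

def calculate_activation_range_uint8(scale, zero_point, fused_activation_fn=None):
    # Return the (min, max) clamp values in the quantized uint8 domain.
    qmin, qmax = 0, 255

    def quantize(x):
        return zero_point + int(round(x / scale))

    if fused_activation_fn is None:
        return qmin, qmax                                          # no activation: full range
    if fused_activation_fn == "RELU":
        return max(qmin, quantize(0.0)), qmax
    if fused_activation_fn == "RELU6":
        return max(qmin, quantize(0.0)), min(qmax, quantize(6.0))
    if fused_activation_fn == "RELU_N1_TO_1":
        return max(qmin, quantize(-1.0)), min(qmax, quantize(1.0))
    raise ValueError("unsupported fused activation")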

jnorwood commented 5 years ago

I guess the saturation is exactly what activations (ReLU family) mean, semantically. :)

In the case of the tflite quantized models I've looked at, the batch normalization and relu6 operations in training are fused into the conv operations used during inference. You probably need to fuse the relu6 to match their results.

This paper removes the relu6 and batch norm associated with the depthwise convs in a mobilenet modification. You would still need the min and max values for those depthwise conv operations even though there is no fused activation. So, that is all I was trying to say ... those min and max values are really to saturate the quantization range, rather than representing an activation operation.

https://arxiv.org/pdf/1803.08607.pdf

FrozenGene commented 5 years ago

https://arxiv.org/pdf/1803.08607.pdf

Qualcomm's way? Let us look at Google's TFLite model: [image: quantized TFLite model graph]

We have a quantized model that does not remove RELU6 after dw conv / conv. I think we should focus on TFLite's code / TFLite's way.

Coming back to Qualcomm's paper: if we decide to support that approach, we could also write logic in the frontend and insert the correct clip operator. However, I think we have no obvious reason to support it.

jnorwood commented 5 years ago

If no activation, we will clamp it to 0 / 127.

In the tflite quantized conv implementation (I posted an excerpt from their code previously) the offset is added in prior to the clamping. The tflite quantized models in their repository use uint8 asymmetric quantization with non-zero offsets for activations and weights, and int32 for biases. In that case the min and max values passed into the quantized conv are always 0 and 255.

It appears to me, though, that whoever wrote that conv code might have also considered supporting the return of signed int8 quantized values ... since they provided a signed int32 min saturation value. If signed int8 quantization is a tflite quantization conversion option, then it is probably a good idea to make sure we cover that case.

The Intel quantization uses uint8 with a fixed 0 offset for activations, int8 with a fixed 0 offset for weights, and int32 with a fixed 0 offset for biases. That simplifies the terms of the convolution inner loops (a lot, as has been discussed here before). It also reflects Intel's AVX-512 DL Boost hardware int8 capabilities/limitations. So it is probably a good idea to support that mode as well.
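To illustrate why zero offsets simplify the inner loops, here is a small NumPy sketch (not from this thread) of the standard gemmlowp-style four-term expansion of one quantized dot product; with both zero points fixed at 0, only the first term survives:

import numpy as np

def quantized_dot_terms(a_q, w_q, a_zero, w_zero):
    # Expand sum((a - a_zero) * (w - w_zero)) into the four gemmlowp-style terms.
    a_q = a_q.astype(np.int32)
    w_q = w_q.astype(np.int32)
    n = a_q.size
    term1 = np.dot(a_q, w_q)        # data-dependent, stays in the inner loop
    term2 = w_zero * a_q.sum()      # depends only on the activations
    term3 = a_zero * w_q.sum()      # precomputable from the weights alone
    term4 = n * a_zero * w_zero     # fully constant
    return term1 - term2 - term3 + term4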

FrozenGene commented 5 years ago

In that case min and max values passed into the quantized conv are always 0 and 255.

Not true. When there is an activation, the range is not always 0 ~ 255. For example, for RELU:

    auto quantize = [scale, zero_point](float f) {
      return zero_point + static_cast<int32_t>(TfLiteRound(f / scale));
    };
    *act_min = std::max(qmin, quantize(0.0));
    *act_max = qmax;

We have verified that computing it this way makes the result the same as TFLite's.

jnorwood commented 5 years ago

In the tflite quantized MobileNet V1 from the repository, the first conv operation has input data with a non-zero offset. The offset is 128. So either provide a conv which uses signed int8 and a 0 offset, or do what tflite does and handle it as a quantized uint8 convolution with an offset value of 128.

You can see the quantization offsets in Netron in the node properties for the input data:

[image: Netron screenshot (mobilenetv2)]

jnorwood commented 5 years ago

Not true. When there is activation, the range is not always 0 ~ 255. For example RELU,

I believe tflite extends the quantization range so it always includes 0, as done in the gemmlowp quantization example below. I have dumped my min and max saturation input values from the six quantized tflite models (two mobilenets and four inceptions). They are all 0 and 255.

https://github.com/google/gemmlowp/blob/master/doc/quantization_example.cc

// Given the min and max values of a float array, return
// reasonable quantization parameters to use for this array.
QuantizationParams ChooseQuantizationParams(float min, float max) {
  // We extend the [min, max] interval to ensure that it contains 0.
  // Otherwise, we would not meet the requirement that 0 be an exactly
  // representable value.
  min = std::min(min, 0.f);
  max = std::max(max, 0.f);

FrozenGene commented 5 years ago

Not true. When there is activation, the range is not always 0 ~ 255. For example RELU,

I believe tflite extends the quantization range so it always includes 0, as done in the gemmlowp quantization example below. I have dumped my min and max saturation input values from the six quantized tflite models (two mobilenets and four inceptions). They are all 0 and 255.


I think you may not have fully understood my previous comment. One question I want to ask: do your quantized models have a standalone conv + relu / relu6 like our model? If not, the range is obviously 0 ~ 255, no matter how many models you check. Please see: https://github.com/tensorflow/tensorflow/blob/v2.0.0-beta1/tensorflow/lite/kernels/kernel_util.cc#L138 — @jackwish and I have emphasized this function many times.

Please construct a quantized model like ours: [image: quantized model graph]

I am sure you will observe a different result.

jnorwood commented 5 years ago

I think you may not have fully understood my previous comment. One question I want to ask: do your quantized models have a standalone conv + relu / relu6 like our model? [...]

The quantized mobilenet v1 inference model is from the tflite model repository. The training model includes relu6 and batch normalization operations, but these are fused into convolution operations in the inference model, as the Netron diagram shows.

The link you reference shows floating point activation values that would be applied during training. They do represent the range bound that would be expected of the upscaled values in the accumulator in the inference model. However the min and max saturation values passed into the inference quantized convolution are applied after downscale ... I previously provided the code and the link. They are int32 values, not float values. They are applied after both downscale and offset are applied. They are 0..255 even though the scaled up range expected is 0..6 from the fused-in relu6 operation.

If the convolution and relu operations were separate, you would still see 0 and 255 for those min and max values because they are applied after downscale and after offset are applied to the convolution accumulator. The min and max values only function to saturate the downscaled result to the quantized uint8 bit range, avoiding wrap-around overflow/underflow of the 8 bit value if the downscaled accumulator were simply masked to 8 bits.
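A tiny Python sketch of that output stage (using a float downscale for clarity instead of TFLite's fixed-point multiplier; names are illustrative):

def requantize_and_clamp(acc, real_multiplier, output_offset,
                         output_min=0, output_max=255):
    # Downscale the int32 accumulator, add the output offset,
    # then saturate to the quantized uint8 range.
    scaled = int(round(acc * real_multiplier)) + output_offset
    return max(output_min, min(output_max, scaled))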

zhenhuaw-me commented 5 years ago

@jnorwood Having read the long discussion again, I finally understand what you are trying to say. Let me ask this question: considering ReLU6 in float, do you think it is saturating the input float values into [0, 6]?

jnorwood commented 5 years ago


@jnorwood Have read again the long discussions, I finally understand what you are trying to say. Let me ask this question: considering ReLU6 in float, do you think it is saturating input float values into [0, 6]?

The 0..6.0 float clamping is applied during training if relu6 is used as activation. It may also be used to force the range for creating the downscale constants and offsets that are used in inference. That seems so, from your activation code excerpt.

The gemmlowp example indicates that they always extend a range if it doesn't include 0. I believe their reason was that an exact zero representation is needed in the range... perhaps for padding. I didn't see that in the activation code excerpt, but perhaps that is handled elsewhere.

On the quantized inference side, those min and max values are applied after the downscale and offset are applied, and it seems to me more appropriate to recognize that they are needed for the quantization bits saturation whether or not an activation operation was used in the training model.

I've only seen 0 and 255 for those input min and max values in the six quantized tflite models I've converted. I dumped them all to check.

No, there is no saturation being applied to input values during inference. The input values are uint8 in the tflite models. There is extra info stored in the model file indicating the input range and offset. In some model operations that input info is needed for rescale ... For example in the multiple input concat operations in the inception_v3 model, the input ranges are different, so a rescale is required.

The tf training models associated with the quantized tflite models have activation and bn operations that are effectively fused together with the conv, along with the fake quantization ops. No separate activation nodes appear in the associated inference models.

FrozenGene commented 5 years ago

If the convolution and relu operations were separate, you would still see 0 and 255 for those min and max values because they are applied after downscale and after offset are applied to the convolution accumulator. [...]

I have emphasized that the model diagram above is a quantized model. Let me show more detail of its properties: [image: node properties]. That is to say, not all relu / relu6 ops can be fused into the convolution in a TFLite quantized model in a real production environment. MobileNet V1 is just one simple reference; we should consider more. Then what are the min / max now? They are what the previously linked code https://github.com/tensorflow/tensorflow/blob/v2.0.0-beta1/tensorflow/lite/kernels/kernel_util.cc#L138 computes — NOT simply 0 ~ 255.

jnorwood commented 5 years ago

I'm using the tensorflow tflite quantized model, mobilenet_v1_1.0_224_quant.tflite. from https://github.com/tensorflow/models/blob/master/research/slim/nets/mobilenet_v1.md

I view it with Netron, which shows no relu6 nodes. It also shows no fused relu6 nodes in the node properties. So... if you are discussing some different model, I can't comment on it without further info on how to reproduce it.

I dump the min and max parameters passed to the reference implementation of quantized conv, and they are all 0 and 255.

I created a tf branch which automates this dump: https://github.com/jnorwood/tensorflow/tree/tflite_cmake (last updated 14 days ago). There is a readme.md at https://github.com/jnorwood/tensorflow/blob/tflite_cmake/tensorflow/lite/README_CMAKE.md that shows how I built it using cmake and how to execute the command that dumps the data, including the min and max parameters. I just ran it again and am attaching the screen capture, showing that all the min and max inputs are 0, 255 for that inference model.

[image: screenshot from 2019-06-18 10-41-59]

shoubhik commented 5 years ago

Thanks. Let's lay down the high-level API design for some of the quantized operators. A large portion of this is coming from the following relevant discussions. Thanks to @jackwish, @FrozenGene and @jnorwood for sharing their experiences with quantization, and also @shoubhik for helping design this RFC.

Other non-TVM related links that were used to understand quantization

  • GemmLowP - Doc
  • TFlite reference code

Covered frameworks for now - TFLite and MXNet
Target network for now - Inception V3 from TFLite (I will create one for MXNet)
Target platforms for now - ARM and Intel (will create separate issues as the project progresses)

List of required operators - quantize, quantized_conv2d, quantized_relu, quantized_pool2d, quantized_fully_connected, quantized_concat, dequantize

It will be good if we can agree on the Relay ops - their inputs/outputs and attributes. The initial proposal for the quantize, quantized_conv2d and dequantize ops is as follows (other quantized_* operators will be along the same lines as quantized_conv2d).

Op quantize

def quantize(data, scale, zero_point, out_dtype):
    """
    Quantize takes the scale and zero_point attributes and quantizes the 
    FP32 input data to int8/uint8 tensor.

    Parameters
    -----------
    data: FP32 tensor
           The input tensor in FP32.

    scale: FP32 scalar (An attribute of the op)
           The float scalar to scale the int8 values back to FP32.

    zero_point: Int32 zero point (An attribute of the op)
           The zero point of the distribution.

    out_dtype: String
           The dtype of the output. Can only be int8/uint8

    Returns
    -------
    quantized_data: int8/uint8 tensor
           The quantized tensor.

    """

Key points to discuss

  • The scale and zero_point calculations happen outside the relay graph, i.e., the framework parsers will have to compute the scale and offset if only min and max are provided. Reference implementation in TFLite. This can also be thought of as a framework parser utility where we handle min/max, symmetric/asymmetric etc. and generate the scale and zero_point the way each framework does (a sketch of such a utility follows).
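A hedged sketch of such a parser utility, following the gemmlowp example quoted earlier in the thread (asymmetric uint8; names are illustrative):

def choose_quantization_params(rmin, rmax, qmin=0, qmax=255):
    # Extend the float range so that 0.0 is exactly representable.
    rmin = min(rmin, 0.0)
    rmax = max(rmax, 0.0)
    scale = (rmax - rmin) / (qmax - qmin)
    if scale == 0.0:                      # degenerate all-zero range
        scale = 1.0
    # Nudge the zero point to the nearest integer inside [qmin, qmax].
    zero_point = int(round(qmin - rmin / scale))
    zero_point = max(qmin, min(qmax, zero_point))
    return scale, zero_point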

Op quantized_conv2d

def quantized_conv2d(quantized_data, quantized_kernel, 
        input_scale, input_zero_point,
        kernel_scale, kernel_zero_point,
        output_scale, output_zero_point,
        out_dtype,

        # All the old remaining ones from conv2d
        strides=(1, 1),
        padding=(0, 0),
        dilation=(1, 1),
        groups=1,
        channels=None,
        kernel_size=None,
        data_layout="NCHW",
        kernel_layout="OIHW",
        out_layout=""):
    """

    Quantized 2D convolution. Takes the quantized input and kernel tensors together
    with their scales and zero points, and produces a quantized output. The scale and
    zero_point calculations happen outside the relay graph, i.e., the framework parsers
    will have to compute the scale and offset if only min and max are provided. 

    Parameters
    -----------
    quantized_data: int8/uint8 tensor
           The quantized input tensor in int8/uint8.

    quantized_kernel: int8/uint8 tensor
           The quantized kernel tensor in int8/uint8.

    input_scale: FP32 scalar (An attribute of the op)
           The float scalar to scale the quantized_data int8 values back to FP32.

    input_zero_point: Int32 zero point (An attribute of the op)
           The zero point of the quantized_data distribution.

    kernel_scale: FP32 scalar (An attribute of the op)
           The float scalar to scale the quantized_kernel int8 values back to FP32.

    kernel_zero_point: Int32 zero point (An attribute of the op)
           The zero point of the quantized_kernel distribution.

    output_scale: FP32 scalar (An attribute of the op)
           The output scale is set during the quantization process using training/calibration.
           The float scalar to scale the quantized_output int8 values back to FP32.

    output_zero_point: Int32 zero point (An attribute of the op)
           The output zero point is set during the quantization process using training/calibration.
           The zero point of the quantized_output distribution.

    out_dtype: String
           The dtype of the quantized_output. Can only be int8/uint8.
           The requantization from int32 to int8/uint8 is a part of the op compute.

    ..... Other attributes are same as before.

    Returns
    -------
    quantized_output: int8/uint8 tensor
           The quantized tensor.

    """

Key points to discuss further

  • This op has a set of computations that ideally could be pre-computed, but that is difficult because fold-constant only works across Relay ops and not within a Relay op. This has been discussed in more detail on the discuss forum.

    • First pre-computable - The core computation has some compute with kernel (Term 2 and Term 4 in the above link) that will be the part of tvm compute. This is very hard to avoid. We need a fused compute to get the best performance.
    • Second pre-computable - The output scale and zero_point are used to calculate an integer multiplier and shift to keep all the computations in the integer domain. This computation changes for each op (e.g. concat handles it differently from conv), so it is also kept inside the quantized_conv2d op. It could be avoided by changing the API and replacing output_scale with output_multiplier and output_shift (see the sketch after this list). But this seems very specific to TFLite, and one might want to handle the output_scale and output_offset in a different manner. I am not sure about this part, so please comment.
  • The op already has the requantization portion accounted for. As far as I understand, the requantization portion is just a clamp for out_dtype. (The handling of output_multiplier and output_shift, as mentioned above, is for the calculation of output quantized tensor and not for requantization).
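For reference, a sketch of how a float requantization scale is typically decomposed into an integer multiplier and shift (gemmlowp/TFLite style; illustrative only, not a proposed TVM API):

import math

def quantize_multiplier(real_multiplier):
    # Approximate real_multiplier as multiplier * 2^(shift - 31),
    # with multiplier a 32-bit value in [2^30, 2^31).
    if real_multiplier == 0.0:
        return 0, 0
    mantissa, shift = math.frexp(real_multiplier)   # mantissa in [0.5, 1)
    multiplier = int(round(mantissa * (1 << 31)))
    if multiplier == (1 << 31):                     # rounding pushed it to 2^31
        multiplier //= 2
        shift += 1
    return multiplier, shift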

Op dequantize

Dequantization is required while connecting a quantized operator and an FP32 operator. This might be a temporary stage where we do not have a quantized implementation of the second op. Dequantization might also be required at the end of the network to keep the output of the graph in FP32.

def dequantize(quantized_data, scale, zero_point, out_dtype):
    """
    Dequantize takes the scale and zero_point attributes and dequantizes the 
    int8/uint8 tensor to FP32 tensor.

    Parameters
    -----------
    quantized_data: int8/uint8 quantized input tensor
           The input tensor in int8/uint8.

    scale: FP32 scalar (An attribute of the op)
           The float scalar to scale the int8 values back to FP32.

    zero_point: Int32 zero point (An attribute of the op)
           The zero point of the distribution.

    out_dtype: String
           The dtype of the output. Can only be float32.

    Returns
    -------
    data: FP32 tensor
           The dequantized tensor.

    """

We need to add in_dtype in the dequantize op as the calculations will be different, especially the range to use.

zhenhuaw-me commented 5 years ago

We need to add in_dtype in the dequantize op as the calculations will be different, especially the range to use.

Guess the input tensor has such information already?

shoubhik commented 5 years ago

We need to add in_dtype in the dequantize op as the calculations will be different, especially the range to use.

Guess the input tensor has such information already?

@jackwish, the input data is generally an Expr, which can be a Var, IntImm, or some other type of Expr. How will I get in_dtype from an Expr?

tqchen commented 5 years ago

Sorry for the delayed reply in this discussion due to recent conference trips. Here are a few thoughts.

Let us pick a concise namespace for the quantization dialect. Two possible candidates:

  • relay.op.qnn, e.g. relay.op.qnn.conv2d. The qnn name is consistent with QNNPACK.
  • relay.op.tflite

In both cases, they are a dialect of relay, which means by default we do not want to introduce special implementations, but will instead translate them into existing core ops. We need to have a special op_level for these ops.

I still think we should minimize the number of operators, and directly translate to lower ops if possible. This includes things like quantize/dequantize, and qnn.concat. Please discuss this alternative and list pros and cons.

anijain2305 commented 5 years ago

Thanks @tqchen

Of the two choices, I am inclined towards relay.op.qnn. My hope is that different frameworks converge to the same qnn ops. The relay.op.tflite option seems very framework-specific as of now. I agree that these new ops should have a special op_level.

I am still unclear about where to draw the boundary between directly translating to lower ops and creating a new qnn op. For example, if we are targeting devices that do not have any FP32 compute units, we might have to create a long sequence of existing Relay ops to approximate the FP32 computation with fixed point/integer computation. So, encapsulating them would be a good idea.

Basically, we need some kind of abstraction that can be shared across frameworks for these framework operations. For now, I was treating this abstraction as a new qnn relay op. The rationale behind this choice is that once we convert from the framework to a Relay graph, we can eyeball the graph and make some sense by reading the graph. Directly translating will lose the readability of the Relay quantized graph.

However given the tradeoffs, we can very well create a new class that can be shared across frameworks. What are your thoughts on this?

tqchen commented 5 years ago

re "we might have to create a long sequence of existing Relay ops to approximate the FP32 computation".

This is certainly a problem for traditional frameworks, but it won't be a problem for tvm/relay. Because we have automatic fusion and code generation, the long sequence of ops will be fused again into a single fused op. We can generate code that is just as efficient, sometimes even more efficient (because we can fuse different ops together). So I will always recommend breaking things down into primitive ops if possible.

anijain2305 commented 5 years ago

I completely agree with breaking down into primitive ops. Even the relay.op.qnn should be broken down into primitive ops. If the primitive op does not exist, we will discuss and maybe create one. I understand the Relay fusion part. I am trying to make another point.

I am trying to understand when to directly translate to primitive ops OR create a new qnn op that will be later lowered to primitive ops using a relay pass. If the lowering sequence is very long, it might be better to create a new qnn op.

PS - The first Relay pass that we can run is qlower or qrewrite (can be a part of framework parser as well, if it looks ugly in build_module) and the resulting sequence will only be a sequence of existing relay primitive ops.

zhenhuaw-me commented 5 years ago

For

relay.op.qnn, e.g. relay.op.qnn.conv2d. The qnn name is consistent with QNNPACK.

and

My hope is that different frameworks converge to same qnn ops.

AFAIK, QNNPACK takes the same quantization approach as TensorFlow/TFLite. I think that when we talk about an op in this scenario, we mean the quantization arithmetic formula itself rather than how it is translated into code, which is the same for QNNPACK and TensorFlow/TFLite. So I guess one dialect should be enough for both. And I guess the convergence is more reasonable if qnn stands simply for generic quantized nn, not for QNNPACK.

anijain2305 commented 5 years ago

@jackwish Yes, qnn stands for a generic quantized nn, and not QNNPACK. I think @tqchen also means the same thing.

tqchen commented 5 years ago

OK, it seems we are converging on qnn. Perhaps we could propose the list of op names.

anijain2305 commented 5 years ago

Finally, we are starting to converge :)

I am proposing them on the basis of Resnet network for now.

  • relay.op.qnn.conv2d
  • relay.op.qnn.dense
  • relay.op.qnn.relu
  • relay.op.qnn.max_pool2d
  • relay.op.qnn.avg_pool2d
  • relay.op.qnn.concat (used in Inception)
  • relay.op.qnn.quantize
  • relay.op.qnn.dequantize

anijain2305 commented 5 years ago

All of the above qnn ops will be lowered to existing Relay primitive ops using a Relay pass (for example, using the ForwardRewrite infra). For example, relay.op.qnn.conv2d can be lowered to:

fn (%quantized_data: Tensor[(2, 1, 2, 4), uint8], %weight: Tensor[(3, 1, 2, 2), uint8]) -> Tensor[(2, 3, 1, 3), uint8] {
  %0 = nn.conv2d(%quantized_data, %weight, kernel_size=[2, 2], out_dtype="int32")
  %1 = cast(%0, dtype="float32")
  %2 = multiply(%1, 0.25098f)
  %3 = round(%2)
  %4 = cast(%3, dtype="int32")
  %5 = clip(%4, a_min=0, a_max=255)
  cast(%5, dtype="uint8")
}

I have yet to understand what needs to be done with softmax. Will have to look at a quantized model to understand.

zhenhuaw-me commented 5 years ago

I have yet to understand what needs to be done with softmax.

Maybe compute softmax in float, as it seems that we are not expecting everything to be in integer (just like your conv2d lowering proposal)?

anijain2305 commented 5 years ago

[images: three design diagrams (0001, 0002, 0003)]

FrozenGene commented 5 years ago

@anijain2305 Generally good. About the performance on hardware, let us say ARM CPU: for depthwise convolution we can even optimize without tensorize. After some optimization work for int8 using a pure TVM schedule without tensorize, we can also beat QNNPACK (on some workloads we tested we were even more than 50% faster on the ARM64 platform).

However, for normal convolution it is hard to achieve the best performance without tensorize. When we use tensorize, one thing we do is combine bias_add / requantize into qnn.conv2d to avoid extra memory accesses. As @jackwish's previous investigation showed, this is very important for ARM CPU performance. So, if we implement it as in the diagram, this is my only concern.

anijain2305 commented 5 years ago

@FrozenGene Thanks for the quick feedback on the design.

I understand the performance concern. Let's try to tackle it in fusion. Fusion already performs compute_inline to bring the computation to the right location. Hopefully, with some tagging and some arm-twisting, we can achieve the same tensorized schedule that you are suggesting.

jnorwood commented 5 years ago

I just want to point out, again, that the output_activation_min and output_activation_max are required even if there is no specified activation operation, since they provide saturation to the quantization range ... avoiding overflow error.

Also, if you fuse activation operations during training, prior to the re-quantization, then you gain the extra bit of resolution for quantization. I believe tflite has done this in all their quantized inference models in their repository.

anijain2305 commented 5 years ago

@jnorwood Yes, I understand your point. We can use the clip to saturate the values even if Relu was not fused. It fits in the design and the proposed abstractions.

anijain2305 commented 5 years ago

@tqchen What are your thoughts?

Seems like we are agreeing on the proposed design abstraction. There is a concern of not being able to achieve the best schedule performance. We can try to tackle it with fusion and schedule_tagging.

tqchen commented 5 years ago

Can we elaborate a bit on whether avg_pool2d and relu are necessary, or whether they are more of a direct mapping to the standard ops? Do we allow a mix of standard ops and qnn ones?

FrozenGene commented 5 years ago

@tqchen, if we use avg_pool2d, we also need to modify it, but the modification is small. For example, we should accumulate the UInt8 sum in Int16 to avoid overflow. In our internal implementation, we use q_avg_pool2d to distinguish it from avg_pool2d. Relu doesn't need to be modified. However, if we have activation fns, we should have output_min / output_max calculated by the calculate_activation_range_uint8 function mentioned before, and then insert a clip operator.
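A rough Relay sketch of such a lowering (widen, pool, clip, narrow); this is illustrative only, not the proposed q_avg_pool2d API, and uses int32 as the wider accumulator type:

from tvm import relay

def lower_q_avg_pool2d(q_data, pool_size, strides, padding,
                       output_min=0, output_max=255):
    # Accumulate in a wider integer type to avoid overflow, then clamp back
    # to the quantized uint8 range (the clip also covers a fused activation
    # if output_min / output_max were tightened accordingly).
    x = relay.cast(q_data, "int32")
    x = relay.nn.avg_pool2d(x, pool_size=pool_size,
                            strides=strides, padding=padding)
    x = relay.clip(x, a_min=output_min, a_max=output_max)
    return relay.cast(x, "uint8")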

anijain2305 commented 5 years ago

[image: q_conv2d lowering diagram]

anijain2305 commented 5 years ago

@tqchen Added the case for qrelu. (I think the asymmetric lowering can be improved further, but that's not the point.)

Similarly for quantized avg pool2d, as @FrozenGene mentioned, we will still need to upcast the tensor to int32 to avoid saturation. Additionally, we would need to handle the zero points.

anijain2305 commented 5 years ago

Do we allow mix of standard ops and qnn ones?

The framework-parsed graph might have a mix (as shown in the lowering of qconv2d). But in the relay.build function, my first pass would be a quantize_rewrite pass that converts all the qnn ops to existing relay ops, resulting in a whole graph consisting of only primitive ops.

tqchen commented 5 years ago

I agree that mixed-precision might make avg_pool2d's case a bit tricky. However, assuming that the zero-point won't change, we might just do avg_pool2d(x.astype("i32")).astype("i8").

max_pool2d though should be the same given that the maximum rule is the same regardless of zero point.

Most of the current operators' lowering rules cast the domain back to float and then back into i32, as in the case of qnn.relu. This could be quite inefficient. In most cases of the current symmetric quantization, we try to keep everything in i32 as much as possible.

In particular, referring to the current quantization pass, every value sits in a domain, which could be fixed point with an implied scale, or floating point. Conversion between domains might be necessary and should be kept to a minimum. The default way always converts the integer domain back to f32 and uses f32 to exchange values between layers, which may not be the most efficient way.

anijain2305 commented 5 years ago

In particular, referring to the current quantization pass, every value sits in a domain, which could be fixed point with an implied scale, or floating point. Conversion between domains might be necessary and should be kept to a minimum. The default way always converts the integer domain back to f32 and uses f32 to exchange values between layers, which may not be the most efficient way.

So, I think we are trying to make two things work together here, which are very difficult to merge. The first is to perform the quantization in the framework and then convert it to a Relay graph; this is what this issue is trying to focus on. The other is to perform the quantization in TVM itself. Your comment that the conversion between domains should be minimal applies to the entity that quantizes the network. For example, relu, bias_add etc. are all fused into TFLite conv2d for the same reason.

If we are converting the framework-quantized model to a Relay graph, then I think we should perform the same computation as defined by the framework quantized graph. If the original graph has domain conversions, then we will have to respect that as well. We can perform some graph optimizations, like removing a dequantize followed by a quantize with the same quantization parameters. I think even with all these inefficiencies, our fusion algorithms and fast kernels should be able to provide better performance than the framework's execution of the quantized graph.

Please let me know your thoughts on this.

tqchen commented 5 years ago

You can also view the domain-conversion minimization as an optimization pass here. The resulting graph is, to some extent, semantically equivalent to the original one that converts to f32 and back and forth. The idea is that we can be smarter when lowering qnn ops into the relay sequence.

For example, when lowering the qconv2d -> qrelu sequence, we don't have to convert the result of qconv2d to f32 and then back to i8; the values can be represented directly in the i8 domain without having to go back to f32. The mechanism in the current realize pass might help in this case.

There are also two separate steps in the current tvm quantizer. We always first make the choice (this step was done by the other frameworks here), and then decide how best to translate to low-level operators (the realize stage in quantization). The realize stage in the current quantization pass would serve as a good reference.

tqchen commented 5 years ago

To elaborate further on the choice of domain and how it is relatively independent of which operator you would like to perform: many operators can actually perform the computation using different number representations (domains).

It concerns how you represent the numbers of a certain layer, in either of two ways. For example, we can represent 2.5 either directly as a floating point value, or as the integer 25 together with an implied scale of 0.1 (a fixed point representation).

Each operator in qnn could take values from either f32 or i8. In the default setting of the current proposal, if the value comes from f32, the op first converts its representation from f32 -> i8, then performs the computation internally in i8, then converts back to f32.

So in the default lowering rules you proposed, every quantized operator (say qnn.relu) has three stages: convert_to_i8_dom -> relu_in_i8 -> convert_to_fp32_dom. However, when we have two consecutive ops that can perform operations in a different domain, in this case the fixed point domain, we do not have to convert the domain into f32 and then back to i8; instead we can directly do the domain conversion and possibly gain more efficiency.

anijain2305 commented 5 years ago

Thanks @tqchen for the detailed explanation.

Actually, my proposal is simpler. My qnn.relu does not convert to the three stages that you mentioned. It only performs relu_in_i8.

The frameworks (at least TFLite and MXNet) do not go back to FP32 unless the operator is not supported in i8 format or the accuracy is very bad in i8.

For example, TFLite qconv2d will translate to qnn.conv2d + qnn.requantize or, as you explained, conv_in_i8/i32 -> convert_to_i8_dom, but there won't be any FP32.

To complete the picture, suppose the quantized framework graph is (fw stands for framework)

fw.quantize -> fw.qconv2d -> fw.qrelu -> fw.dequantize

The Relay graph would be

qnn.quantize -> qnn.conv2d -> qnn.requantize -> qnn.relu -> qnn.dequantize

i.e. convert_to_i8 -> conv_in_i8/i32 -> convert_to_i8 -> relu_in_i8 -> convert_to_FP32

Essentially, if the framework does not convert back to FP32 in between, we would not go to FP32.

jnorwood commented 5 years ago

To complete the picture, suppose the quantized framework graph is (fw stands for framework)

fw.quantize -> fw.qconv2d -> fw.qrelu -> fw.dequantize

If you do the qconv2d and qrelu operations sequentially, using their analogous fp operations, the output from qrelu will have the (potentially worse) resolution of the initial qconv2d. So, you need to be careful if you are trying to use the fully sequential, separate operation results as a reference.

I can see that you might want the graph to represent all the operations prior to optimizing the implementation. I just want to point out that the qrelu implementation can avoid the lowered resolution and can be completely cost free by revising the downscale multiplier and zero point of a preceding quantized output operation (qconv2d in this case). It is cost free because the clipping values are required in any case to do the quantized range saturation.

The operation of revising the downscale multiplier of a previous graph operation is also useful to achieve zero cost replacement of the scale normalization operations in the quantized concat operations in the inception models.

anijain2305 commented 5 years ago

I can see that you might want the graph to represent all the operations prior to optimizing the implementation. I just want to point out that the qrelu implementation can avoid the lowered resolution and can be completely cost free by revising the downscale multiplier and zero point of a preceding quantized output operation (qconv2d in this case). It is cost free because the clipping values are required in any case to do the quantized range saturation.

Yes, you are correct. And that's exactly what TFLite does. In the case of a fused TFLite conv2d, the conversion will be different.

TFLite.conv2d (fused relu)

will be converted to following Relay graph

qnn.conv2d -> nn.bias_add -> qnn.requantize -> clip

In this case, the cost-free conversion is manifested in the clip operation.
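Expressed with existing Relay ops (following the earlier qnn.conv2d lowering in this thread), the fused-relu case only changes the clip bounds. The helper below is an illustrative sketch, not the final parser code, and all parameter names are assumptions:

from tvm import relay

def lower_tflite_conv2d_fused_relu6(q_data, q_weight, q_bias,
                                    requant_scale, output_zero_point,
                                    act_qmin, act_qmax, **conv_attrs):
    # conv in int32, add bias, requantize with a float multiplier (as in the
    # earlier lowering example), then a single clip whose bounds encode both
    # the fused ReLU6 and the uint8 saturation.
    out = relay.nn.conv2d(q_data, q_weight, out_dtype="int32", **conv_attrs)
    out = relay.nn.bias_add(out, q_bias)
    out = relay.cast(out, "float32")
    out = relay.round(relay.multiply(out, relay.const(requant_scale, "float32")))
    out = relay.add(relay.cast(out, "int32"), relay.const(output_zero_point, "int32"))
    out = relay.clip(out, a_min=act_qmin, a_max=act_qmax)  # e.g. quantize(0.0) / quantize(6.0)
    return relay.cast(out, "uint8")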

We will have to add framework parsers for each framework, and most probably the resulting sequence of operators will be different for each framework.

My example in my last comment was to explain the fp32 and i8 boundaries and domain conversions of my proposal that @tqchen was pointing out.

zhenhuaw-me commented 5 years ago

Several comments :)

Regarding @anijain2305 's ReLU proposal.

The symmetric and asymmetric paths may merge into one - the asymmetric one - where the zero point for the symmetric approach is 0. Actually, handling the ReLU family is a bit more complicated depending on the input tensor type and the expected output tensor type:

  • int8/uint8 input, int8/uint8 output: clip out the unwanted value range, taking the zero point into consideration.
  • int32 input, int32 output: assuming the int32 is symmetric, clipping out the unwanted value range should be fine for ReLU. But what about ReLU6?
  • int32 input, int8/uint8 output: the scale and zero point of the input and output need to be taken into consideration. This breaks into a ReLU with int32 input/output plus a Requantize from the proposal. For ReLU6, the integer representation of the FP32 value 6.0 must be calculated; otherwise we can hardly know the expected output integer value range.

The list above is not necessarily exhaustive. As I stated before, we need to keep in mind how the floating point values are represented in integers, and how we can arrange the arithmetic to preserve the floating point computation that is being represented.
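A small NumPy sketch of the first case (uint8 in / uint8 out ReLU6), just to make the zero-point handling concrete (illustrative only):

import numpy as np

def qnn_relu6_uint8(q_data, scale, zero_point):
    # Clamp between the quantized representations of 0.0 and 6.0.
    q_zero = zero_point                              # quantize(0.0)
    q_six = zero_point + int(round(6.0 / scale))     # quantize(6.0)
    return np.clip(q_data, max(0, q_zero), min(255, q_six)).astype("uint8")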

Similarly for quantized avg pool2d, as @FrozenGene mentioned, we will still need to upcast the tensor to int32 to avoid saturation. Additionally, we would need to handle the zero points.

Zero point is not needed when handling pooling. The UINT8 representation of FP32 doesn't need to be updated in the semantics of pooling.

It seems that we have put in many Quantize/Dequantize ops, either explicitly or implicitly, so that the quantization ops can reuse existing nn ops. This could be bad for performance. Maybe some passes need to be introduced to handle this, I guess.

tqchen commented 5 years ago

OK, given that most of the qnn ops are already in the integer domain, we might be just fine. Minimizing requantize is still useful. And in the case where the scale is a power of two, using shift and normalize might be better than a float scale and round.

zhenhuaw-me commented 5 years ago

Scales are rarely a power of two, though (I assume you mean values such as 0100b or 0.0010b). They basically have long fractional parts.
