FrozenGene commented 5 years ago

Let me reference @ajtulloch 's comment about quantization workflow firstly:

Implement a model in a standard ML framework, generally using fp16/bfloat16/fp32 compute precision as this has highest throughput on most commonly-used training hardware.

(optionally) insert fake quantization (here, called simulated quantization) nodes at quantization boundaries (i.e. if your backend implements a fused Int8Conv + Int8Relu, you'd insert them after a Conv + Relu block), to simulate the quantization numerics at training time.

Train the model as usual

Implement a graph rewriting pass (i.e. TF's toco, C2's int8_converter, MXNet's quantization, etc) that rewrites the graph to target the int8 operators directly — i.e. remapping subgraphs of e.g. FP32Conv + FP32Relu to be a fused Int8ConvRelu operator. This requires computing output quantization parameters at requantization boundaries, which can be done either by

calibration to an example set of activations, via e.g. l-p norm or kl minimization (c2/tf/mxnet/tensorrt)

using activation ranges learned during training (c2/tf).

Using this quantized graph, evaluate various metrics to verify the quantization-induced error/loss is acceptable.

Deploy the quantized graph.

However, we have framework can do step 1 -> step 5 well like Tensorflow. For example, Tensorflow has quantization-aware training which will do step 2 and get good accuracy at last.

In the industry development, one common scenario is company will divide algorithm and engine / framework into two different teams. Algorithm team just send an model to engine team to boost the performance. So if algorithm team can use Tensorflow's quantization-aware training, they will know the accuracy before delivering the model to engine team. Engine team just be responsible for boosting the performance.

I will make several PRs to support importing exist quantized model (TFLite INT8 model) In TVM for previous reason. This is not an replacement of https://github.com/dmlc/tvm/pull/2116, it is just a supplement for TVM's quantization.

After initial investigation and effort, in the Mobilenet V1 model, INT8 can get speed up about 30% when compared with FP32 on ARM CPU.

[x] Support TFLite FP32 Relay frontend. PR: https://github.com/dmlc/tvm/pull/2365
[ ] Support TFLite INT8 Relay frontend
[ ] Extend the attribute of the convolution and related ops to support quantization
[ ] Auto-TVM on ARM CPU can work with INT8

Welcome any feedback.

tqchen commented 5 years ago

Starting from TFLite importer to relay sounds great. cc @jroesch @ajtulloch @yzhliu

ZihengJiang commented 5 years ago

If you want to support transforming quantized model, be careful to transform ops like quantize to small ops like multiply and add for reusing kernels and optimizations like fusion

FrozenGene commented 5 years ago

If you want to support transforming quantized model, be careful to transform ops like quantize to small ops like multiply and add for reusing kernels and optimizations like fusion

Thanks for reminding. However, I don't fully understand your reminder. Do you mean I should be careful quantize or multiply / add ops? If we import existing quantized model like TFLite, we shouldn't see quantize ops any more.

jnorwood commented 5 years ago

Hi, I recently wrote some code to read in the tflite quantized examples and translate them to nnef output. Their operations are pretty similar to nnvm ops. I translated the two mobilenets and the four inception models. There's a cmake config that pulls down all the models and converts them. Please feel free to use whatever you want from it. I forked the NNEF Tools project, https://github.com/jnorwood and put the converter under the contrib/converters/tflite_converters/tflite_to_nnef

I only added processing for the ops I needed, and I only did quantized data. tflite uses uint8 quantization, btw, with offsets for both weights and features. Biases are int32. NNEF passes quantization configuration in a separate file from the graph. Also, note that tflite uses nhwc everywhere.

anijain2305 commented 5 years ago

@FrozenGene I am interested in contributing to this Issue. Is it possible to share the progress?

FrozenGene commented 5 years ago

Hey, @anijain2305 Thanks for your interest. Currently, I am doing https://github.com/dmlc/tvm/pull/3141. After that, I will start it. BTW, our internal support is based on NNVM and we have completed support it, we have the same result compared with TFLite and have better performance than TFLite. However, I have to spare some time translating to Relay when to make PR. But I have to say that I am busy this month in our product development and it will go to open source progress in my company. I will @ you when that PR is ready.

anijain2305 commented 5 years ago

Thanks. Let's lay down the high-level API design for some of the quantized operators. A large portion of this is coming from the following relevant discussions. Thanks to @jackwish, @FrozenGene and @jnorwood for sharing their experiences with quantization, and also @shoubhik for helping design this RFC.

Discussion

Other non-TVM related links that were used to understand quantization

GemmLowP - Doc
TFlite reference code

Covered frameworks for now - TFLite and MxNet Target network for now - Inception V3 from TFLite. (I will create one for Mxnet) Target platforms for now - ARM and Intel (will create separate Issue as the project progresses)

List of required operators - quantize, quantized_conv2d, qunatized_relu, quantized_pool2d, quantized_fully_connected, quantized_concat, dequantize

It will be good if we can agree on Relay ops - its inputs/outputs and the attributes. The initial proposal for the quantize, quantizedconv2d and dequantize ops is as follows (other quantized* operators will be on the same lines as that of quantized_conv2d)

Op quantize

def quantize(data, scale, zero_point, out_dtype):
    """
    Quantize takes the scale and zero_point attributes and quantizes the 
    FP32 input data to int8/uint8 tensor.

    Parameters
    -----------
    data: FP32 tensor
           The input tensor in FP32.

    scale: FP32 scalar (An attribute of the op)
           The float scalar to scale the int8 values back to FP32.

    zero_point: Int32 zero point (An attribute of the op)
           The zero point of the distribution.

    out_dtype: String
           The dtype of the output. Can only be int8/uint8

    Returns
    -------
    quantized_data: int8/uint8 tensor
           The quantized tensor.

    """

Key points to discuss

The scale and zero_point calculations happen outside the relay graph, i.e., the framework parsers will have to compute the scale and offset if only min and max are provided. Reference implementation in TFLite. This can also be thought as a framework parser utils where we can handle min/max, symmetric/asymmetric etc and generate the scale and zero_point as frameworks handles them.

Op quantized_conv2d

def quantized_conv2d(quantized_data, quantized_kernel, 
        input_scale, input_zero_point,
        kernel_scale, kernel_zero_point,
        output_scale, output_zero_point,
        out_dtype,

        # All the old remaining ones from conv2d
        strides=(1, 1),
        padding=(0, 0),
        dilation=(1, 1),
        groups=1,
        channels=None,
        kernel_size=None,
        data_layout="NCHW",
        kernel_layout="OIHW",
        out_layout=""):
    """

    Quantize takes the scale and zero_point attributes and quantizes the 
    FP32 input data to int8/uint8 tensor. The scale and zero_point calculations
    happen outside the relay graph, i.e., the framework parsers will have to compute
    the scale and offset if only min and max are provided. 

    Parameters
    -----------
    quantized_data: int8/uint8 tensor
           The quantized input tensor in int8/uint8.

    quantized_kernel: FP32 tensor
           The quantized kernel tensor in int8/uint8.

    input_scale: FP32 scalar (An attribute of the op)
           The float scalar to scale the quantized_data int8 values back to FP32.

    input_zero_point: Int32 zero point (An attribute of the op)
           The zero point of the quantized_data distribution.

    kernel_scale: FP32 scalar (An attribute of the op)
           The float scalar to scale the quantized_kernel int8 values back to FP32.

    kernel_zero_point: Int32 zero point (An attribute of the op)
           The zero point of the quantized_kernel distribution.

    output_scale: FP32 scalar (An attribute of the op)
           The output scale is set during the quantization process using training/calibration.
           The float scalar to scale the quantized_output int8 values back to FP32.

    output_zero_point: Int32 zero point (An attribute of the op)
           The output zero point is set during the quantization process using training/calibration.
           The zero point of the quantized_output distribution.

    out_dtype: String
           The dtype of the quantized_output. Can only be int8/uint8.
           The requantization from int32 to int8/uint8 is a part of the op compute.

    out_dtype: String
           The dtype of the output. Can only be int8/uint8

    ..... Other attributes are same as before.

    Returns
    -------
    quantized_output: int8/uint8 tensor
           The quantized tensor.

    """

Key points to discuss further

This op has a set of computations that can be pre-computed ideally but difficult to do because fold-constant only works on Relay ops and not within a Relay op. This has been discussed in more detail in discuss forum.
- First pre-computable - The core computation has some compute with kernel (Term 2 and Term 4 in the above link) that will be the part of tvm compute. This is very hard to avoid. We need a fused compute to get the best performance.
- Second pre-computable - The output scale and zero_point are used to calculate int multiplier and shifts to keep all the computations in Int domain. This computation changes for each op (e.g. concat will handle this in a different manner compared to conv). So, this computation is also kept inside quantized_conv2d op. This can be avoided by changing the API and replacing output_scale with output_multiplier and output_shift. But, this seems very specific to TFLite and one might want to handle the output_scale and output_offset in a different manner. I am not sure about this part, so please comment.
The op already has the requantization portion accounted for. As far as I understand, the requantization portion is just a clamp for out_dtype. (The handling of output_multiplier and output_shift, as mentioned above, is for the calculation of output quantized tensor and not for requantization).

Op dequantize

Dequantization is required while connecting a quantized operator and an FP32 operator. This might be a temporary stage where we do not have a quantized implementation of the second op. Dequantization might also be required at the end of the network to keep the output of the graph in FP32.

def dequantize(quantized_data, scale, zero_point, out_dtype):
    """
    Dequantize takes the scale and zero_point attributes and dequantizes the 
    int8/uint8 tensor to FP32 tensor.

    Parameters
    -----------
    quantized_data: int8/uint8 quantized input tensor
           The input tensor in int8/uint8.

    scale: FP32 scalar (An attribute of the op)
           The float scalar to scale the int8 values back to FP32.

    zero_point: Int32 zero point (An attribute of the op)
           The zero point of the distribution.

    out_dtype: String
           The dtype of the output. Can only be float32.

    Returns
    -------
    data: FP32 tensor
           The dequantized tensor.

    """

FrozenGene commented 5 years ago

@anijain2305

For the q_conv2d, we will add two more arguments.

  output_min=0, 
  output_max=0

These will be used for restrict the output range, which could be calculated previously. see TFLite's CalculateActivationRangeUint8 function.

From my experience, we needn't q_relu. But we need q_add / q_concate and so on. I suggest we use MobilenetV2 quant model for example, which is used very widely and have common ops we should consider. For example, depthwise convolution / add / pool and so on.

jnorwood commented 5 years ago

From my experience, we needn't q_relu. But we need q_add / q_concate and so on. I suggest we use MobilenetV2 quant model for example,

Yes, I believe the MobilenetV2 relu_6 is effectively fused in by the downscale saturation. You might need it if you want to support their way of training, though.

Yes Mobilenet has the q_add, but I suggest the Inceptionv3 for q_concatenate, since it also has concat nodes feeding into concat nodes, and tflite also has to rescale inputs inside the concat operations.

Also, the MobilenetV2 q_add inputs require rescale... but in both q_concat and q_add you can recalculate the prior op downscale multipliers so you can eliminate the extra rescales.

Also, depending on your allocation capabilities, you can get rid of all concats.

zhenhuaw-me commented 5 years ago

Hi @anijain2305 regarding the requantization, if the it is not going to put in conv op, the op may suppose to output FP32, otherwise the semantic is confusing. The requantization can convert FP32 to INT8. The multiplier/shift based reuantization approach introduced by TFLite is also adopted by Caffe2/QNNPACK.

And, maybe we can put the quantization parameters in tensor, as the scale and zero point are describing the INT8 tensor data rather than the op. The op are supposed to read these parameters and get things done.

anijain2305 commented 5 years ago

For the q_conv2d, we will add two more arguments.
  output_min=0, 
  output_max=0
These will be used for restrict the output range, which could be calculated previously.

I see what you are saying, but I am not sure if this is the right approach. In my opinion, it will be better to put it out of conv. The reason we have these 2 extra min/maxes is because of fused activation in TFLite. It seems better to keep it separate so that both MxNet and TFLite can share quantized_conv2d. In case of TFLite, when we see a fused conv, we can add one more clamp operator in the sequence of ops at the end.

anijain2305 commented 5 years ago

Yes, I believe the MobilenetV2 relu_6 is effectively fused in by the downscale saturation. You might need it if you want to support their way of training, though.

Yes Mobilenet has the q_add, but I suggest the Inceptionv3 for q_concatenate, since it also has concat nodes feeding into concat nodes, and tflite also has to rescale inputs inside the concat operations.

Make sense. For now, I was thinking of not worrying about depth-wise conv. So, decided to take Inception V3 into account. I think given we are in the starting position, I don't have any big inclination towards any network. My motive is to focus on getting the right infrastructure in the start and showcase it with one large network. The performance micro-optimizations can then phased.

anijain2305 commented 5 years ago

Hi @anijain2305 regarding the requantization, if the it is not going to put in conv op, the op may suppose to output FP32, otherwise the semantic is confusing. The requantization can convert FP32 to INT8. The multiplier/shift based reuantization approach introduced by TFLite is also adopted by Caffe2/QNNPACK.

Makes sense. Does it make sense to add accumulator_dtype as one of the attributes of quantized_conv2d. This will be set to int32 for TFLite, Caffe2, QNNPACK. But, if some network needs accumulation in FP32, then it will support that as well.

And, maybe we can put the quantization parameters in tensor, as the scale and zero point are describing the INT8 tensor data rather than the op. The op are supposed to read these parameters and get things done.

Not sure about this. The good thing is the conv2d relay operator can be shared across FP32 and quantized tensor types. The bad thing is compute depends on the quantized tensor type now. This might require new Relay optimizations, preventing us to fully use the existing infrastructure.

FrozenGene commented 5 years ago

For the q_conv2d, we will add two more arguments.
  output_min=0, 
  output_max=0
These will be used for restrict the output range, which could be calculated previously.
I see what you are saying, but I am not sure if this is the right approach. In my opinion, it will be better to put it out of conv. The reason we have these 2 extra min/maxes is because of fused activation in TFLite. It seems better to keep it separate so that both MxNet and TFLite can share quantized_conv2d. In case of TFLite, when we see a fused conv, we can add one more clamp operator in the sequence of ops at the end.

No matter whether we have fused activation function, we always need output_min / output_max. Because we will get conv int32 result, but we will need uint8 result. So we must restrict int32 to uint8. If we don't have fused activation function, (When we have quantized model of TFLite, we don't have fused activation many cases), the output_min / output_max will be 0 / 255 to restrict int32 result. If we have relu6, output_min / output_max will be 0 / 6. So I think we are better put these two into conv argument. And we could avoid producing another clamp, just be calculated in conv2d requantize int32 -> uint8 process and it is nature.

anijain2305 commented 5 years ago

For the q_conv2d, we will add two more arguments.
  output_min=0, 
  output_max=0
These will be used for restrict the output range, which could be calculated previously.
I see what you are saying, but I am not sure if this is the right approach. In my opinion, it will be better to put it out of conv. The reason we have these 2 extra min/maxes is because of fused activation in TFLite. It seems better to keep it separate so that both MxNet and TFLite can share quantized_conv2d. In case of TFLite, when we see a fused conv, we can add one more clamp operator in the sequence of ops at the end.
No matter whether we have fused activation function, we always need output_min / output_max. Because we will get conv int32 result, but we will need uint8 result. So we must restrict int32 to uint8. If we don't have fused activation function, (When we have quantized model of TFLite, we don't have fused activation many cases), the output_min / output_max will be 0 / 255 to restrict int32 result. If we have relu6, output_min / output_max will be 0 / 6. So I think we are better put these two into conv argument. And we could avoid producing another clamp, just be calculated in conv2d requantize int32 -> uint8 process and it is nature.

In the case the activation is not fused, the values have to clamped to 0/255 or uint8 range, which is basically the out_dtype. So, we do not need any extra information for the quantized_conv2d for going back to uint8/int8 other than out_dtype. Correct?

Now, If the activation is fused, I agree that we will have two clamps now. One inside the quantized_conv2d (0/255), and one for the relu6 (0/6). I think this is fine. We can also write a Relay that replaces two back-to-back clamping with one clamp Relay operator.

The reason I am saying this is that TFLite chooses one way to handle things, which other frameworks might not. So, it is necessary to come up with right abstractions first. The performance can be then be achieved by writing Relay passes.

FrozenGene commented 5 years ago

For the q_conv2d, we will add two more arguments.
  output_min=0, 
  output_max=0
These will be used for restrict the output range, which could be calculated previously.
I see what you are saying, but I am not sure if this is the right approach. In my opinion, it will be better to put it out of conv. The reason we have these 2 extra min/maxes is because of fused activation in TFLite. It seems better to keep it separate so that both MxNet and TFLite can share quantized_conv2d. In case of TFLite, when we see a fused conv, we can add one more clamp operator in the sequence of ops at the end.
No matter whether we have fused activation function, we always need output_min / output_max. Because we will get conv int32 result, but we will need uint8 result. So we must restrict int32 to uint8. If we don't have fused activation function, (When we have quantized model of TFLite, we don't have fused activation many cases), the output_min / output_max will be 0 / 255 to restrict int32 result. If we have relu6, output_min / output_max will be 0 / 6. So I think we are better put these two into conv argument. And we could avoid producing another clamp, just be calculated in conv2d requantize int32 -> uint8 process and it is nature.
In the case the activation is not fused, the values have to clamped to 0/255 or uint8 range, which is basically the out_dtype. So, we do not need any extra information for the quantized_conv2d for going back to uint8/int8 other than out_dtype. Correct?

Now, If the activation is fused, I agree that we will have two clamps now. One inside the quantized_conv2d (0/255), and one for the relu6 (0/6). I think this is fine. We can also write a Relay that replaces two back-to-back clamping with one clamp Relay operator.

The reason I am saying this is that TFLite chooses one way to handle things, which other frameworks might not. So, it is necessary to come up with right abstractions first. The performance can be then be achieved by writing Relay passes.

Yes, I agree when we don't have activation, we don't need anything. However, Another thing we should consider: How to integrate with other libraries, such as QNNPACK. QNNPACK also need output min / output max too. https://github.com/pytorch/QNNPACK/blob/master/include/qnnpack.h#L62-L63

tqchen commented 5 years ago

Here are some points to discuss:

namespace for the tflite quantize style dialect
List of ops that might need tvm's compute declaration
set of possible passes that lower the rest into the core ops

Some of the discussions involve fusion, and that is something where TVM might be able to help. For example, in the current symmetric scheme, clip, relu6, and subsequent downcasting ops are automatically fused into the conv2d ops. While the conv2d op can simply just output int32(because followup ops will get fused).

I agree with @anijain2305 that we could try to get something minimum that is working, then start thinking about possible rewriting rules to get to some useful patterns if we decide that manual intervention is necessary.

Ideally, we should have a generic schedule template that works for any fused patterns, just as those in the current symmetric version, so we do not need to have all the different variants of fused conv2d ops

also cc @vinx13 @ZihengJiang

jnorwood commented 5 years ago

I want to point out that the min and max values you mentioned are not related to the activation range in the original model. They are saturation values. In the case of mobilenet, for example, which has relu_6 use everywhere, I'm printing out the min and max activation values from the tflite mobilenet V2 below. The model uses uint8 downscale between layers, and uses the min and max value to clamp/saturate the values to 0..255 for all layers in that model. The thing it could be used for (but isn't here) is for more or fewer quantization bits or for signed int quantization ... but tflite is using all uint8 quantization for MobilenetV2.

the amin and amax values below are tflite output_activation_min, output_activation_max from their quantized reference ops for conv and dw_conv.

(base) jay@jay-desktop:~/tensorflow/tensorflow/lite/dbg$ grep conv mod2.log ` (base) jay@jay-desktop:~/tensorflow/tensorflow/lite/dbg$ grep conv mod2.log ---------conv in_h=224, in_w=224,out_h=112,out_w=112,f_h=3,f_w=3,mpy=1992157658,shft=-7,amin=0, amax=255 -------dwconv in_h=112, in_w=112,out_h=112,out_w=112,f_h=3,f_w=3,mpy=1254985768,shft=-1,amin=0, amax=255 ---------conv in_h=112, in_w=112,out_h=112,out_w=112,f_h=1,f_w=1,mpy=2090511665,shft=-5,amin=0, amax=255 -------dwconv in_h=112, in_w=112,out_h=56,out_w=56,f_h=3,f_w=3,mpy=1729896231,shft=-1,amin=0, amax=255 ---------conv in_h=56, in_w=56,out_h=56,out_w=56,f_h=1,f_w=1,mpy=2081950125,shft=-6,amin=0, amax=255 -------dwconv in_h=56, in_w=56,out_h=56,out_w=56,f_h=3,f_w=3,mpy=2080045879,shft=-4,amin=0, amax=255 ---------conv in_h=56, in_w=56,out_h=56,out_w=56,f_h=1,f_w=1,mpy=1890535782,shft=-6,amin=0, amax=255 -------dwconv in_h=56, in_w=56,out_h=28,out_w=28,f_h=3,f_w=3,mpy=1151606277,shft=-5,amin=0, amax=255 ---------conv in_h=28, in_w=28,out_h=28,out_w=28,f_h=1,f_w=1,mpy=2089579858,shft=-7,amin=0, amax=255 -------dwconv in_h=28, in_w=28,out_h=28,out_w=28,f_h=3,f_w=3,mpy=1410648286,shft=-4,amin=0, amax=255 ---------conv in_h=28, in_w=28,out_h=28,out_w=28,f_h=1,f_w=1,mpy=1767908551,shft=-7,amin=0, amax=255 -------dwconv in_h=28, in_w=28,out_h=14,out_w=14,f_h=3,f_w=3,mpy=1850037283,shft=-6,amin=0, amax=255 ---------conv in_h=14, in_w=14,out_h=14,out_w=14,f_h=1,f_w=1,mpy=1260482936,shft=-6,amin=0, amax=255 -------dwconv in_h=14, in_w=14,out_h=14,out_w=14,f_h=3,f_w=3,mpy=1269068532,shft=-4,amin=0, amax=255 ---------conv in_h=14, in_w=14,out_h=14,out_w=14,f_h=1,f_w=1,mpy=1456865727,shft=-7,amin=0, amax=255 -------dwconv in_h=14, in_w=14,out_h=14,out_w=14,f_h=3,f_w=3,mpy=1464063813,shft=-4,amin=0, amax=255 ---------conv in_h=14, in_w=14,out_h=14,out_w=14,f_h=1,f_w=1,mpy=1364297475,shft=-7,amin=0, amax=255 -------dwconv in_h=14, in_w=14,out_h=14,out_w=14,f_h=3,f_w=3,mpy=1948805937,shft=-5,amin=0, amax=255 ---------conv in_h=14, in_w=14,out_h=14,out_w=14,f_h=1,f_w=1,mpy=2136047634,shft=-7,amin=0, amax=255 -------dwconv in_h=14, in_w=14,out_h=14,out_w=14,f_h=3,f_w=3,mpy=1671906928,shft=-5,amin=0, amax=255 ---------conv in_h=14, in_w=14,out_h=14,out_w=14,f_h=1,f_w=1,mpy=1327474777,shft=-6,amin=0, amax=255 -------dwconv in_h=14, in_w=14,out_h=14,out_w=14,f_h=3,f_w=3,mpy=1330877207,shft=-5,amin=0, amax=255 ---------conv in_h=14, in_w=14,out_h=14,out_w=14,f_h=1,f_w=1,mpy=1497258311,shft=-7,amin=0, amax=255 -------dwconv in_h=14, in_w=14,out_h=7,out_w=7,f_h=3,f_w=3,mpy=1076915935,shft=-6,amin=0, amax=255 ---------conv in_h=7, in_w=7,out_h=7,out_w=7,f_h=1,f_w=1,mpy=1124144746,shft=-6,amin=0, amax=255 -------dwconv in_h=7, in_w=7,out_h=7,out_w=7,f_h=3,f_w=3,mpy=1083785823,shft=-2,amin=0, amax=255 ---------conv in_h=7, in_w=7,out_h=7,out_w=7,f_h=1,f_w=1,mpy=1240259613,shft=-5,amin=0, amax=255 ---------conv in_h=1, in_w=1,out_h=1,out_w=1,f_h=1,f_w=1,mpy=1553319078,shft=-10,amin=0, amax=255

`

jnorwood commented 5 years ago

similarly, for the tflite quantized inception v3 model, all those output_activation_min, output_activation_max are 0 and 255 I'll attach a zip file with the log. inv3.zip

jnorwood commented 5 years ago

to explain a little further ... during training they determine the range of input values, and they determine the downscale multiplier that will shrink the observed range to 0..255 (for the uint8 quantization). The fp downscale multiplier is coverted to integer mpy and right-shift constants, which are the mpy and shft values in my log. At inference time, the downscaled accumulator (after applying the downscale) may be outside the uint8 quantization range, and so they clamp/saturate to that range. In these current models, they are using uint8 quantization ... so the range is 0..255, but it appears to me they are providing the min and max to support other numbers of bits in the quantization. I see support for several 4 bit gpu implementations recently, so maybe this is to support something like that.

zhenhuaw-me commented 5 years ago

Some comments for @anijain2305 's reply :)

Hi @anijain2305 regarding the requantization, if the it is not going to put in conv op, the op may suppose to output FP32, otherwise the semantic is confusing. The requantization can convert FP32 to INT8. The multiplier/shift based reuantization approach introduced by TFLite is also adopted by Caffe2/QNNPACK.

Makes sense. Does it make sense to add accumulator_dtype as one of the attributes of quantized_conv2d. This will be set to int32 for TFLite, Caffe2, QNNPACK. But, if some network needs accumulation in FP32, then it will support that as well.

A network uses operators (or layers or anything we'd like to call it) regardless of the accumulation format. The format is part of a software system's mechanism. So, I guess we don't need a accumulator_dtype and the out_dtype is what we want. The discussion is about whether we put requantization inside the conv2d op.

And, maybe we can put the quantization parameters in tensor, as the scale and zero point are describing the INT8 tensor data rather than the op. The op are supposed to read these parameters and get things done.

Not sure about this. The good thing is the conv2d relay operator can be shared across FP32 and quantized tensor types. The bad thing is compute depends on the quantized tensor type now. This might require new Relay optimizations, preventing us to fully use the existing infrastructure.

I was saying extending existing tensor rather than introduce new tensor type. I assume that this won't lead to new Relay opt :)

EDIT: Btw, the channel-wise quantization parameter is likely to be included in TensorFlow/TFLite, also the TVM stack as a roadmap. In this way, it could be easier to manage a tensor described parameter.

zhenhuaw-me commented 5 years ago

Regarding @jnorwood 's comments on output min/max of conv2d.

Your observations about the values of output min max are correct. But they are still activations. One thing I always try to deliver is that, the INT8 values in quantization are a representation of original FP32 values.

When we talking about ReLU6 activations, it means that in FP32 format, the op outputs FP32 values in range [0, 6]. For INT8 quantization, INT8 data is an representation of FP32 value, which means, the output min/max (which is typically [0, 255] of INT8 type in pre-provided quantized MobileNet) are representing [0, 6] of FP32 type - the INT8 0/255 is actually FP32 0/6. Try the output scale (0.023528477177023888) with the activation min/max, we will get value range like [0, 5.999761581420898] (from output of the first conv of the pre-provided quantized MobileNet).

Conclusions can easily draw once we have this in mind :)

anijain2305 commented 5 years ago

I would suggest to design the infrastructure that supports both symmetric/asymmetric quantization. We can certainly start with symmetric to flush the flow, while keeping in mind that we can share as much infrastructure as possible between them.

namespace for the tflite quantize style dialect

I think this is required for both asymmetric and symmetric quantization. These ops will be rewritten to low-level instructions by a Relay pass. How about using relay.op._quantization as the namespace? So, the operations can be relay.op._quantization.conv2d or relay.op._quantization.quantize.

List of ops that might need tvm's compute declaration

I am not sure yet. The only unknown to me are the special rounding operations that are used in converting the Floating point to Integer multiplication in scaling the quantized conv matrix. But, they might already be covered in current low-level ops.

set of possible passes that lower the rest into the core ops

I was hoping to re-use the FForwardRewrite infrastructure to lower the ops. Do you anticipate more passes here?

jnorwood commented 5 years ago

We can certainly start with symmetric to flush the flow, while keeping in mind that we can share as much infrastructure as possible between them.

All the tflite quantized models I've tested use the asymmetric uint8 quantization. If you are planning to use those as examples, it will be hard to debug if you throw in the change to symmetric.

zhenhuaw-me commented 5 years ago

We can certainly start with symmetric to flush the flow, while keeping in mind that we can share as much infrastructure as possible between them.

All the tflite quantized models I've tested use the asymmetric uint8 quantization. If you are planning to use those as examples, it will be hard to debug if you throw in the change to symmetric.

TensorFlow quantization-aware training supports both asymmetric/symmetric. We are seeing asymmetric models because it is the default. If we'd like to start from symmetric approach, set the symmetric and go on. Which, requires extra effort I think...

anijain2305 commented 5 years ago

This is most probably out of the context of the issue, but is it possible for all of the people commenting here to join a conference call for an hour and figure out the next steps? I can take notes and document them here for everybody else to see. I think it will be more productive.

tqchen commented 5 years ago

re "conference calls". I totally agree that calling or in person sync will speed up reaching consensus. Doing most of the development in the public archivable process is preferred https://docs.tvm.ai/contribute/committer_guide.html#public-archive-principle

We do need to acknowledge the overhead of the asynchronous communication, but should also acknowledge the gains we get by leaving a trace for the broader community. I would encourage us to try to rely more on asynchronous communication in public channels first. The main bottleneck of asynchronous discussion is the overhead of latency and a good way to improve it is to

Here is a possible proposal:

Everyone who are primarily driving this process, send out a proposal
- List of points to be discussed.
- List questions
- List pros and cons of decisions if there is a decision being made.
Every one can critique

We could also use the slack for semi-sync chats, but please note that everything relates to design decision need to be properly sent back to the public channel. I understand that there is more overhead in this approach, but I believe it is a price worth paying to get more people involved.

jnorwood commented 5 years ago

TensorFlow quantization-aware training supports both asymmetric/symmetric. We are seeing asymmetric models because it is the default. If we'd like to start from symmetric approach, set the symmetric and go on. Which, requires extra effort I think...

You might also consider symmetric signed int8 for weights, and unsigned uint8 for for source and destination, since uint8 will give an extra bit of precision following activations. Intel appears to preferentially support this form in their examples, and their new DLBoost avx512 vector instructions also appear to preferentially support this form.

https://intel.github.io/mkl-dnn/ex_int8_simplenet.html

https://www.intel.ai/nervana/wp-content/uploads/sites/53/2018/05/Lower-Numerical-Precision-Deep-Learning-Inference-Training.pdf

These instructions enable lower precision multiplies with higher precision accumulates. Multiplying two 8-bit values and accumulating the result to 32-bits requires 3 instructions and requires one of the 8-bit vectors to be in 𝑢𝑛𝑠𝑖𝑔𝑛𝑒𝑑𝑖𝑛𝑡8(𝑢8) format, the other in 𝑠𝑖𝑔𝑛𝑒𝑑𝑖𝑛𝑡8(𝑠8) format with the accumulation in 𝑠𝑖𝑔𝑛𝑒𝑑𝑖𝑛𝑡32(𝑠32) format.

zhenhuaw-me commented 5 years ago

TensorFlow quantization-aware training supports both asymmetric/symmetric. We are seeing asymmetric models because it is the default. If we'd like to start from symmetric approach, set the symmetric and go on. Which, requires extra effort I think...

You might also consider symmetric signed int8 for weights, and unsigned uint8 for for source and destination, since uint8 will give an extra bit of precision following activations. Intel appears to preferentially support this form in their examples, and their new DLBoost avx512 vector instructions also appear to preferentially support this form.

https://intel.github.io/mkl-dnn/ex_int8_simplenet.html

https://www.intel.ai/nervana/wp-content/uploads/sites/53/2018/05/Lower-Numerical-Precision-Deep-Learning-Inference-Training.pdf

These instructions enable lower precision multiplies with higher precision accumulates. Multiplying two 8-bit values and accumulating the result to 32-bits requires 3 instructions and requires one of the 8-bit vectors to be in 𝑢𝑛𝑠𝑖𝑔𝑛𝑒𝑑𝑖𝑛𝑡8(𝑢8) format, the other in 𝑠𝑖𝑔𝑛𝑒𝑑𝑖𝑛𝑡8(𝑠8) format with the accumulation in 𝑠𝑖𝑔𝑛𝑒𝑑𝑖𝑛𝑡32(𝑠32) format.

I am sorry, but I fail to get the reasoning between your comment uint8 will give an extra bit of precision following activations, and the material you listed. Would you please make it a bit more clear? AFAIK, uint8 and int8 has same value capacity, so there could be no extra precision.

eqy commented 5 years ago

@jackwish If relu activations are used, there is no need to use half of the representation space for negative values; thus the extra bit of precision.

zhenhuaw-me commented 5 years ago

This makes sense.

Best Regards Zhenhua

eqy notifications@github.com 于2019年6月2日周日上午9:29写道：

@jackwish https://github.com/jackwish If relu activations are used, there is no need to use half of the representation space for negative values; thus the extra bit of precision.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/dmlc/tvm/issues/2351?email_source=notifications&email_token=ABFVHDPXH4PBFFVSL2MXIQTPYMO55A5CNFSM4GMOMOS2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODWXLWTQ#issuecomment-497990478, or mute the thread https://github.com/notifications/unsubscribe-auth/ABFVHDM763W7HEQXQ55I6PLPYMO55ANCNFSM4GMOMOSQ .

anijain2305 commented 5 years ago

Ok, lets try to finalize the high-level design points. Lets first discuss the

Namespace for the tflite quantize style dialect

Requirements

This should support both symmetric and asymmetric.
These ops should never go through codegen. They will be lowered to low-level Relay ops (like existing conv, round etc) using FForwardRewrite or similar kind of Relay infrastructure.

Proposal

How about using relay.op._quantization as the namespace? So, the operations can be relay.op._quantization.conv2d or relay.op._quantization.quantize.

Pros

Separation of concerns - Restricts the number of ops for which TVM compute has to be written.
Good readability/debugging - Framework parsing will be easier compared to directly lowering to low-level Relay ops. Also, one can look at the quantized annotation ops and understand the quantization flow.

Cons

Getting the best performance might require some new Relay passes. It might require working on a peephole optimizer or some complicated fusion. (Symmetric quantization might already work very well with existing Relay infrastructure. Asymmetric most probably will need more efforts.)

Let me know your thoughts on this. As we achieve consensus, I can start prototyping these operators with stubbing implementation.

tqchen commented 5 years ago

@FrozenGene @jackwish can you also try to send a proposal as well? it would be great to have a global picture of what is in everyone's mind

FrozenGene commented 5 years ago

@tqchen We are very busy at our one internal project this period. I will talk with @jackwish next Monday. However, sending the proposal maybe should wait us finishing this project. Sorry for that.

apivovarov commented 5 years ago

After NCHW support was removed from tflite.py 3 weeks ago #3141 all TFLite models can not be compiled for ARM cpu and Mali GPU.

anijain2305 commented 5 years ago

@tqchen @FrozenGene @jackwish

I have added a prototype patch. I think it will be helpful to use that patch to drive the discussion further.

FrozenGene commented 5 years ago

@anijain2305 see the code quickly and I know your thought (combine operator to complete q_conv2d). However as commented before, how do we integrate with qnnpack when we don't have output_min / output_max? I think we could have these two arguments, if mxnet don't have, we could leave them the default values.

anijain2305 commented 5 years ago

@FrozenGene Thanks for replying. I might be wrong, but I don't think it is a good design to take one codegen backend like QNNPACK and make changes all the way into Relay APIs to make the connection. In my opinion, APIs must be minimal.

But, your point of using QNNPACK is completely valid. I have been thinking about that myself, dreading the painful experience of write tensorized kernel for Intel x86, and hoping to somehow use OpenVINO/MKLDNN. But, similarly, I don't think adding MKLDNN/OpenVINO arguments in the Relay API will be right choice either there.

One way to handle this is to separate out the Relay operators API that we are discussing and the infrastructure to use external codegen like QNNPACK. I think it is entirely possible to write Relay passes for each codegen backend and then rewrite/fuse the Relay ops in a manner that the codegen backend can understand. In this case, we do not creep in the backend specific idiosyncracies into the Relay op API, while also having a well-defined infrastructure that shows how to add external codegens.

FrozenGene commented 5 years ago

@anijain2305 I understand your thought. I agree we should make the api minimal. However, no matter what way, q_conv2d’s int32 output should be clamped into uint8 range. If you don’t pass min / max, you also need do output = std::max(output, 0) and output = std::min(output, 255) then return output. So why not we set the default the value output_min = 0 / output_max = 255, and make the computation be output = std::max(output, output_min) and output= std::min(output, output_max) which will be suitable for tflite / mxnet / qnnpack and so on... API design is very important, we should consider as far as we could(tflite / mxnet , even other library we should also consider, qnnpack is a very high performance library on arm cpu, we can not avoid discussing it in my opinion), otherwise we have to do tricky workaround in the future when we do something. This is my point I wish to express before.

shoubhik commented 5 years ago

@FrozenGene a clarifying question to your above comment. If we pass in the output scale and shift can we not compute int32-> int8 by simply adding more nodes in the graph.

anijain2305 commented 5 years ago

@FrozenGene For the output_min and max, isn't the out_dtype enough? If its uint8, we can clamp at 0 and 255. If its int8, we can clamp at -128 and 127. I don't see any reason the values will be any different, unless you want to fuse the quantized relu in the quantized convolution from the starting itself. Please let me know if I am understanding something wrong. I think we should not fuse operators in the frontend and let Relay graph fusion take care of that.

Let's see what others think about this. @tqchen @yzhliu @ZihengJiang What are your thoughts on this?

jnorwood commented 5 years ago

The tflite quantized convolution reference implementation passes in both limits as int32 values. It appears to me this would let them simulate smaller than 8 bit quantizations, if that is something you want to support.

this is from tensorflow/lite/kernels/internal/reference/conv.h

` acc = MultiplyByQuantizedMultiplier(acc, output_multiplier, output_shift);

      acc += output_offset;

      acc = std::max(acc, output_activation_min);

      acc = std::min(acc, output_activation_max);

`

zhenhuaw-me commented 5 years ago

It appears to me this would let them simulate smaller than 8 bit quantizations.

If simulating 8 smaller bit is the case, 8 bit should be able to hold activation min/max value.

FrozenGene commented 5 years ago

@FrozenGene a clarifying question to your above comment. If we pass in the output scale and shift can we not compute int32-> int8 by simply adding more nodes in the graph.

doesn't understand your comment fully. do you mean could we avoid int32 -> int8 computation? If so, I think we can not. We need requant operation (int32 -> int8) (https://github.com/tensorflow/tensorflow/blob/v2.0.0-beta1/tensorflow/lite/kernels/internal/reference/conv.h#L171)

FrozenGene commented 5 years ago

It appears to me this would let them simulate smaller than 8 bit quantizations.

If simulating 8 smaller bit is the case, 8 bit should be able to hold activation min/max value.

8 bits could hold. But what the value output_min / output_max is ? I think @jnorwood want to express this point. Because we can not just simply use out_dtype to decide what the value range is. But if we insert clip op in frontend, I think it also could handle. Need some logic to calculate the min / max. see my next comment.

FrozenGene commented 5 years ago

@FrozenGene For the output_min and max, isn't the out_dtype enough? If its uint8, we can clamp at 0 and 255. If its int8, we can clamp at -128 and 127. I don't see any reason the values will be any different, unless you want to fuse the quantized relu in the quantized convolution from the starting itself. Please let me know if I am understanding something wrong. I think we should not fuse operators in the frontend and let Relay graph fusion take care of that.

Let's see what others think about this. @tqchen @yzhliu @ZihengJiang What are your thoughts on this?

I think it is ok. If we do this way, we should insert one clamp if we have activation. Like our tflite frontend

# If we have fused activations
if fused_activation_fn != ActivationFunctionType.NONE:
  if weight_tensor_type == TensorType.UINT8:
     # implement this function  
     output_min, output_max = self.calculate_activation_range_uint8(output_scale, output_zero_point, fused_activation_fn)
     # insert clip
     out = _op.clip(out, output_min, output_max)
  out = self.convert_fused_activation_function(out, fused_activation_fn)

anijain2305 commented 5 years ago

I think it is ok. If we do this way, we should insert one clamp if we have activation. Like our tflite frontend

Yes, I agree with that. That's exactly what I was thinking.

jnorwood commented 5 years ago

The min and max are not conditional on existence of activation operation in the original model. They are there to saturate the downscaled and offset adjusted 32 bit signed int accumulator to the min and max value of the uint8 quantized bit range.

Although the quantized conv result is held in uint8, it could be static casted to signed int8, or even fewer than 8 bit quantization. That would require both min and max saturations, as in the reference tflite quantized conv implementation.

anijain2305 commented 5 years ago

Although the quantized conv result is held in uint8, it could be static casted to signed int8, or even fewer than 8 bit quantization. That would require both min and max saturations, as in the reference tflite quantized conv implementation

Ah, I see. That finally makes sense. So, this is not about activation. This is about what representation one is using for storing the floating point values. For example, if it is 7-bits, we will need the output min/max saturations. Cool, I will add them into the API and add corresponding documentation.

zhenhuaw-me commented 5 years ago

So, this is not about activation.

Of course it comes from activation, and is related to zero point and scale. For this min/max activation:

They are even named with activation when used in computing kernel
The min/max is generated at the prepare stage of convolution
The function in 2 eventually calls CalculateActivationRangeQuantizedImpl
Min/max are set to the representable value range of a data type ONLY when there is no activation is found in the fused operator.

apache / tvm

[RFC][Quantization] Support quantized models from TensorflowLite #2351

Op quantize

Op quantized_conv2d

Op dequantize

Namespace for the tflite quantize style dialect

Requirements

Proposal

Pros

Cons