Starting from TFLite importer to relay sounds great. cc @jroesch @ajtulloch @yzhliu
If you want to support transforming quantized models, be careful to transform ops like quantize into small ops like multiply and add, so we can reuse kernels and optimizations like fusion.
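For illustration, a rough (untested) sketch of that decomposition using existing Relay elementwise ops; decompose_quantize is a made-up helper name, not an actual API:

from tvm import relay

def decompose_quantize(data, scale, zero_point):
    # Sketch: express quantize with existing elementwise Relay ops so that
    # the regular operator-fusion pass can fuse them into neighbouring ops.
    scaled = relay.multiply(data, relay.const(1.0 / scale, "float32"))
    shifted = relay.add(relay.round(scaled), relay.const(float(zero_point), "float32"))
    clipped = relay.clip(shifted, a_min=0.0, a_max=255.0)
    return relay.cast(clipped, "uint8")

x = relay.var("x", shape=(1, 3, 224, 224), dtype="float32")
q = decompose_quantize(x, scale=0.0235, zero_point=128)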
If you want to support transforming quantized models, be careful to transform ops like quantize into small ops like multiply and add, so we can reuse kernels and optimizations like fusion.
Thanks for the reminder. However, I don't fully understand it. Do you mean I should be careful with quantize, or with the multiply / add ops? If we import an existing quantized model like TFLite, we shouldn't see quantize ops any more.
Hi, I recently wrote some code to read in the tflite quantized examples and translate them to nnef output. Their operations are pretty similar to nnvm ops. I translated the two mobilenets and the four inception models. There's a cmake config that pulls down all the models and converts them. Please feel free to use whatever you want from it. I forked the NNEF Tools project, https://github.com/jnorwood and put the converter under the contrib/converters/tflite_converters/tflite_to_nnef
I only added processing for the ops I needed, and I only did quantized data. tflite uses uint8 quantization, btw, with offsets for both weights and features. Biases are int32. NNEF passes quantization configuration in a separate file from the graph. Also, note that tflite uses nhwc everywhere.
@FrozenGene I am interested in contributing to this Issue. Is it possible to share the progress?
Hey @anijain2305, thanks for your interest. Currently I am working on https://github.com/dmlc/tvm/pull/3141. After that, I will start it. BTW, our internal support is based on NNVM and is complete; we get the same results as TFLite and better performance than TFLite. However, I will have to spend some time translating it to Relay when I make the PR. I have to say that I am busy this month with our product development, and the code will then go through my company's open-source process. I will @ you when that PR is ready.
Thanks. Let's lay down the high-level API design for some of the quantized operators. A large portion of this is coming from the following relevant discussions. Thanks to @jackwish, @FrozenGene and @jnorwood for sharing their experiences with quantization, and also @shoubhik for helping design this RFC.
Other non-TVM related links that were used to understand quantization
Covered frameworks for now - TFLite and MxNet
Target network for now - Inception V3 from TFLite (I will create one for MxNet)
Target platforms for now - ARM and Intel (will create separate Issue as the project progresses)
List of required operators - quantize, quantized_conv2d, quantized_relu, quantized_pool2d, quantized_fully_connected, quantized_concat, dequantize
It will be good if we can agree on the Relay ops - their inputs/outputs and attributes. The initial proposal for the quantize, quantized_conv2d and dequantize ops is as follows (the other quantized_* operators will be along the same lines as quantized_conv2d).
def quantize(data, scale, zero_point, out_dtype):
"""
Quantize takes the scale and zero_point attributes and quantizes the
FP32 input data to int8/uint8 tensor.
Parameters
-----------
data: FP32 tensor
The input tensor in FP32.
scale: FP32 scalar (An attribute of the op)
The float scalar to scale the int8 values back to FP32.
zero_point: Int32 zero point (An attribute of the op)
The zero point of the distribution.
out_dtype: String
The dtype of the output. Can only be int8/uint8
Returns
-------
quantized_data: int8/uint8 tensor
The quantized tensor.
"""
Key points to discuss
def quantized_conv2d(quantized_data, quantized_kernel,
input_scale, input_zero_point,
kernel_scale, kernel_zero_point,
output_scale, output_zero_point,
out_dtype,
# All the old remaining ones from conv2d
strides=(1, 1),
padding=(0, 0),
dilation=(1, 1),
groups=1,
channels=None,
kernel_size=None,
data_layout="NCHW",
kernel_layout="OIHW",
out_layout=""):
"""
Quantized 2D convolution. The data and kernel inputs are quantized (int8/uint8)
tensors described by their scale and zero_point attributes. The scale and zero_point
calculations happen outside the relay graph, i.e., the framework parsers will have to
compute the scale and offset if only min and max are provided.
Parameters
-----------
quantized_data: int8/uint8 tensor
The quantized input tensor in int8/uint8.
quantized_kernel: int8/uint8 tensor
The quantized kernel tensor in int8/uint8.
input_scale: FP32 scalar (An attribute of the op)
The float scalar to scale the quantized_data int8 values back to FP32.
input_zero_point: Int32 zero point (An attribute of the op)
The zero point of the quantized_data distribution.
kernel_scale: FP32 scalar (An attribute of the op)
The float scalar to scale the quantized_kernel int8 values back to FP32.
kernel_zero_point: Int32 zero point (An attribute of the op)
The zero point of the quantized_kernel distribution.
output_scale: FP32 scalar (An attribute of the op)
The output scale is set during the quantization process using training/calibration.
The float scalar to scale the quantized_output int8 values back to FP32.
output_zero_point: Int32 zero point (An attribute of the op)
The output zero point is set during the quantization process using training/calibration.
The zero point of the quantized_output distribution.
out_dtype: String
The dtype of the quantized_output. Can only be int8/uint8.
The requantization from int32 to int8/uint8 is a part of the op compute.
..... Other attributes are same as before.
Returns
-------
quantized_output: int8/uint8 tensor
The quantized tensor.
"""
Key points to discuss further
Dequantization is required while connecting a quantized operator and an FP32 operator. This might be a temporary stage where we do not have a quantized implementation of the second op. Dequantization might also be required at the end of the network to keep the output of the graph in FP32.
def dequantize(quantized_data, scale, zero_point, out_dtype):
"""
Dequantize takes the scale and zero_point attributes and dequantizes the
int8/uint8 tensor to FP32 tensor.
Parameters
-----------
quantized_data: int8/uint8 quantized input tensor
The input tensor in int8/uint8.
scale: FP32 scalar (An attribute of the op)
The float scalar to scale the int8 values back to FP32.
zero_point: Int32 zero point (An attribute of the op)
The zero point of the distribution.
out_dtype: String
The dtype of the output. Can only be float32.
Returns
-------
data: FP32 tensor
The dequantized tensor.
"""
@anijain2305 For the q_conv2d, we will add two more arguments: output_min=0, output_max=0. These will be used to restrict the output range, which can be calculated beforehand; see TFLite's CalculateActivationRangeUint8 function.
From my experience, we don't need q_relu, but we do need q_add / q_concatenate and so on. I suggest we use the MobilenetV2 quant model as the example, which is used very widely and has the common ops we should consider, for example depthwise convolution / add / pool and so on.
From my experience, we don't need q_relu, but we do need q_add / q_concatenate and so on. I suggest we use the MobilenetV2 quant model as the example,
Yes, I believe the MobilenetV2 relu_6 is effectively fused in by the downscale saturation. You might need it if you want to support their way of training, though.
Yes Mobilenet has the q_add, but I suggest the Inceptionv3 for q_concatenate, since it also has concat nodes feeding into concat nodes, and tflite also has to rescale inputs inside the concat operations.
Also, the MobilenetV2 q_add inputs require rescale... but in both q_concat and q_add you can recalculate the prior op downscale multipliers so you can eliminate the extra rescales.
Also, depending on your allocation capabilities, you can get rid of all concats.
Hi @anijain2305, regarding the requantization: if it is not going to be put in the conv op, the op should probably output FP32, otherwise the semantics are confusing. The requantization can then convert FP32 to INT8. The multiplier/shift based requantization approach introduced by TFLite is also adopted by Caffe2/QNNPACK.
And, maybe we can put the quantization parameters in the tensor, as the scale and zero point describe the INT8 tensor data rather than the op. The ops are supposed to read these parameters and get things done.
For the q_conv2d, we will add two more arguments: output_min=0, output_max=0. These will be used to restrict the output range, which can be calculated beforehand.
I see what you are saying, but I am not sure if this is the right approach. In my opinion, it will be better to put it out of conv. The reason we have these 2 extra min/maxes is because of fused activation in TFLite. It seems better to keep it separate so that both MxNet and TFLite can share quantized_conv2d. In case of TFLite, when we see a fused conv, we can add one more clamp operator in the sequence of ops at the end.
Yes, I believe the MobilenetV2 relu_6 is effectively fused in by the downscale saturation. You might need it if you want to support their way of training, though.
Yes Mobilenet has the q_add, but I suggest the Inceptionv3 for q_concatenate, since it also has concat nodes feeding into concat nodes, and tflite also has to rescale inputs inside the concat operations.
Makes sense. For now, I was thinking of not worrying about depthwise conv, so I decided to take Inception V3 into account. Given we are at the starting point, I don't have any strong inclination towards a particular network. My motive is to focus on getting the right infrastructure first and showcase it with one large network. The performance micro-optimizations can then be phased in.
Hi @anijain2305, regarding the requantization: if it is not going to be put in the conv op, the op should probably output FP32, otherwise the semantics are confusing. The requantization can then convert FP32 to INT8. The multiplier/shift based requantization approach introduced by TFLite is also adopted by Caffe2/QNNPACK.
Makes sense. Does it make sense to add accumulator_dtype as one of the attributes of quantized_conv2d? This will be set to int32 for TFLite, Caffe2 and QNNPACK. But if some network needs accumulation in FP32, then it will support that as well.
And, maybe we can put the quantization parameters in the tensor, as the scale and zero point describe the INT8 tensor data rather than the op. The ops are supposed to read these parameters and get things done.
Not sure about this. The good thing is that the conv2d relay operator can be shared across FP32 and quantized tensor types. The bad thing is that the compute then depends on the quantized tensor type. This might require new Relay optimizations, preventing us from fully using the existing infrastructure.
For the q_conv2d, we will add two more arguments: output_min=0, output_max=0. These will be used to restrict the output range, which can be calculated beforehand.
I see what you are saying, but I am not sure if this is the right approach. In my opinion, it will be better to put it out of conv. The reason we have these 2 extra min/maxes is because of fused activation in TFLite. It seems better to keep it separate so that both MxNet and TFLite can share quantized_conv2d. In case of TFLite, when we see a fused conv, we can add one more clamp operator in the sequence of ops at the end.
No matter whether we have a fused activation function, we always need output_min / output_max. We get a conv int32 result but we need a uint8 result, so we must restrict int32 to uint8. If we don't have a fused activation function (and in many quantized TFLite models we don't), output_min / output_max will be 0 / 255 to restrict the int32 result. If we have relu6, output_min / output_max will be 0 / 6. So I think we had better put these two into the conv arguments. We can then avoid producing another clamp; the restriction is just computed in conv2d's int32 -> uint8 requantize step, which is natural.
No matter whether we have a fused activation function, we always need output_min / output_max. We get a conv int32 result but we need a uint8 result, so we must restrict int32 to uint8. If we don't have a fused activation function (and in many quantized TFLite models we don't), output_min / output_max will be 0 / 255 to restrict the int32 result. If we have relu6, output_min / output_max will be 0 / 6. So I think we had better put these two into the conv arguments. We can then avoid producing another clamp; the restriction is just computed in conv2d's int32 -> uint8 requantize step, which is natural.
In the case the activation is not fused, the values have to be clamped to 0/255, i.e. the uint8 range, which is basically given by out_dtype. So we do not need any extra information for quantized_conv2d to go back to uint8/int8, other than out_dtype. Correct?
Now, if the activation is fused, I agree that we will have two clamps: one inside quantized_conv2d (0/255), and one for the relu6 (0/6). I think this is fine. We can also write a Relay pass that replaces two back-to-back clamps with one clamp operator.
The reason I am saying this is that TFLite chooses one way to handle things, which other frameworks might not. So it is necessary to come up with the right abstractions first. The performance can then be achieved by writing Relay passes.
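A rough, untested sketch of what such a clip-merging pass might look like (MergeClip is a made-up name; the exact ExprMutator and attribute access may differ across TVM versions):

from tvm import relay
from tvm.relay.expr_functor import ExprMutator

class MergeClip(ExprMutator):
    # Fold clip(clip(x, a, b), c, d) into clip(x, max(a, c), min(b, d)).
    # Valid when the two ranges overlap, which holds in the quantization case
    # where the fused-activation range is a sub-range of the dtype range.
    def visit_call(self, call):
        new_call = super().visit_call(call)
        if getattr(new_call.op, "name", None) == "clip":
            inner = new_call.args[0]
            if isinstance(inner, relay.Call) and getattr(inner.op, "name", None) == "clip":
                a_min = max(float(new_call.attrs.a_min), float(inner.attrs.a_min))
                a_max = min(float(new_call.attrs.a_max), float(inner.attrs.a_max))
                return relay.clip(inner.args[0], a_min, a_max)
        return new_call

# usage (sketch): new_body = MergeClip().visit(func.body)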
In the case the activation is not fused, the values have to be clamped to 0/255, i.e. the uint8 range, which is basically given by out_dtype. So we do not need any extra information for quantized_conv2d to go back to uint8/int8, other than out_dtype. Correct?
Now, if the activation is fused, I agree that we will have two clamps: one inside quantized_conv2d (0/255), and one for the relu6 (0/6). I think this is fine. We can also write a Relay pass that replaces two back-to-back clamps with one clamp operator.
The reason I am saying this is that TFLite chooses one way to handle things, which other frameworks might not. So it is necessary to come up with the right abstractions first. The performance can then be achieved by writing Relay passes.
Yes, I agree that when we don't have an activation we don't need anything extra. However, there is another thing we should consider: how do we integrate with other libraries, such as QNNPACK? QNNPACK also needs output min / output max: https://github.com/pytorch/QNNPACK/blob/master/include/qnnpack.h#L62-L63
Here are some points to discuss:
Some of the discussions involve fusion, and that is something where TVM might be able to help. For example, in the current symmetric scheme, clip, relu6, and subsequent downcasting ops are automatically fused into the conv2d op, while the conv2d op itself can simply output int32 (because the follow-up ops will get fused).
I agree with @anijain2305 that we could try to get something minimum that is working, then start thinking about possible rewriting rules to get to some useful patterns if we decide that manual intervention is necessary.
Ideally, we should have a generic schedule template that works for any fused patterns, just as those in the current symmetric version, so we do not need to have all the different variants of fused conv2d ops
also cc @vinx13 @ZihengJiang
I want to point out that the min and max values you mentioned are not related to the activation range in the original model. They are saturation values. In the case of mobilenet, for example, which uses relu_6 everywhere, I'm printing out the min and max activation values from the tflite mobilenet V2 below. The model uses a uint8 downscale between layers, and uses the min and max values to clamp/saturate the values to 0..255 for all layers in that model. What it could be used for (but isn't here) is more or fewer quantization bits or signed int quantization, but tflite is using all uint8 quantization for MobilenetV2.
the amin and amax values below are tflite output_activation_min, output_activation_max from their quantized reference ops for conv and dw_conv.
(base) jay@jay-desktop:~/tensorflow/tensorflow/lite/dbg$ grep conv mod2.log
---------conv in_h=224, in_w=224,out_h=112,out_w=112,f_h=3,f_w=3,mpy=1992157658,shft=-7,amin=0, amax=255
-------dwconv in_h=112, in_w=112,out_h=112,out_w=112,f_h=3,f_w=3,mpy=1254985768,shft=-1,amin=0, amax=255
---------conv in_h=112, in_w=112,out_h=112,out_w=112,f_h=1,f_w=1,mpy=2090511665,shft=-5,amin=0, amax=255
-------dwconv in_h=112, in_w=112,out_h=56,out_w=56,f_h=3,f_w=3,mpy=1729896231,shft=-1,amin=0, amax=255
---------conv in_h=56, in_w=56,out_h=56,out_w=56,f_h=1,f_w=1,mpy=2081950125,shft=-6,amin=0, amax=255
-------dwconv in_h=56, in_w=56,out_h=56,out_w=56,f_h=3,f_w=3,mpy=2080045879,shft=-4,amin=0, amax=255
---------conv in_h=56, in_w=56,out_h=56,out_w=56,f_h=1,f_w=1,mpy=1890535782,shft=-6,amin=0, amax=255
-------dwconv in_h=56, in_w=56,out_h=28,out_w=28,f_h=3,f_w=3,mpy=1151606277,shft=-5,amin=0, amax=255
---------conv in_h=28, in_w=28,out_h=28,out_w=28,f_h=1,f_w=1,mpy=2089579858,shft=-7,amin=0, amax=255
-------dwconv in_h=28, in_w=28,out_h=28,out_w=28,f_h=3,f_w=3,mpy=1410648286,shft=-4,amin=0, amax=255
---------conv in_h=28, in_w=28,out_h=28,out_w=28,f_h=1,f_w=1,mpy=1767908551,shft=-7,amin=0, amax=255
-------dwconv in_h=28, in_w=28,out_h=14,out_w=14,f_h=3,f_w=3,mpy=1850037283,shft=-6,amin=0, amax=255
---------conv in_h=14, in_w=14,out_h=14,out_w=14,f_h=1,f_w=1,mpy=1260482936,shft=-6,amin=0, amax=255
-------dwconv in_h=14, in_w=14,out_h=14,out_w=14,f_h=3,f_w=3,mpy=1269068532,shft=-4,amin=0, amax=255
---------conv in_h=14, in_w=14,out_h=14,out_w=14,f_h=1,f_w=1,mpy=1456865727,shft=-7,amin=0, amax=255
-------dwconv in_h=14, in_w=14,out_h=14,out_w=14,f_h=3,f_w=3,mpy=1464063813,shft=-4,amin=0, amax=255
---------conv in_h=14, in_w=14,out_h=14,out_w=14,f_h=1,f_w=1,mpy=1364297475,shft=-7,amin=0, amax=255
-------dwconv in_h=14, in_w=14,out_h=14,out_w=14,f_h=3,f_w=3,mpy=1948805937,shft=-5,amin=0, amax=255
---------conv in_h=14, in_w=14,out_h=14,out_w=14,f_h=1,f_w=1,mpy=2136047634,shft=-7,amin=0, amax=255
-------dwconv in_h=14, in_w=14,out_h=14,out_w=14,f_h=3,f_w=3,mpy=1671906928,shft=-5,amin=0, amax=255
---------conv in_h=14, in_w=14,out_h=14,out_w=14,f_h=1,f_w=1,mpy=1327474777,shft=-6,amin=0, amax=255
-------dwconv in_h=14, in_w=14,out_h=14,out_w=14,f_h=3,f_w=3,mpy=1330877207,shft=-5,amin=0, amax=255
---------conv in_h=14, in_w=14,out_h=14,out_w=14,f_h=1,f_w=1,mpy=1497258311,shft=-7,amin=0, amax=255
-------dwconv in_h=14, in_w=14,out_h=7,out_w=7,f_h=3,f_w=3,mpy=1076915935,shft=-6,amin=0, amax=255
---------conv in_h=7, in_w=7,out_h=7,out_w=7,f_h=1,f_w=1,mpy=1124144746,shft=-6,amin=0, amax=255
-------dwconv in_h=7, in_w=7,out_h=7,out_w=7,f_h=3,f_w=3,mpy=1083785823,shft=-2,amin=0, amax=255
---------conv in_h=7, in_w=7,out_h=7,out_w=7,f_h=1,f_w=1,mpy=1240259613,shft=-5,amin=0, amax=255
---------conv in_h=1, in_w=1,out_h=1,out_w=1,f_h=1,f_w=1,mpy=1553319078,shft=-10,amin=0, amax=255
Similarly, for the tflite quantized inception v3 model, all those output_activation_min, output_activation_max are 0 and 255. I'll attach a zip file with the log: inv3.zip
To explain a little further: during training they determine the range of input values, and they determine the downscale multiplier that will shrink the observed range to 0..255 (for the uint8 quantization). The FP downscale multiplier is converted to integer mpy and right-shift constants, which are the mpy and shft values in my log. At inference time, the downscaled accumulator (after applying the downscale) may be outside the uint8 quantization range, and so they clamp/saturate to that range. In these current models they are using uint8 quantization, so the range is 0..255, but it appears to me they are providing the min and max to support other numbers of bits in the quantization. I have seen support for several 4-bit GPU implementations recently, so maybe this is to support something like that.
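For reference, a hedged Python sketch of how such a floating-point downscale multiplier can be split into an integer multiplier and shift pair, in the spirit of TFLite's QuantizeMultiplier (not the exact implementation):

import math

def quantize_multiplier(real_multiplier):
    # Split real_multiplier (0 < m < 1) into a 31-bit fixed-point multiplier
    # and a right shift so that
    #   acc * real_multiplier ~= (acc * quantized_multiplier) >> (31 + right_shift)
    assert 0.0 < real_multiplier < 1.0
    mantissa, exponent = math.frexp(real_multiplier)  # real = mantissa * 2**exponent, 0.5 <= mantissa < 1
    quantized_multiplier = int(round(mantissa * (1 << 31)))
    right_shift = -exponent
    if quantized_multiplier == (1 << 31):  # handle rounding up to 2**31
        quantized_multiplier //= 2
        right_shift -= 1
    return quantized_multiplier, right_shift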
Some comments for @anijain2305 's reply :)
Hi @anijain2305, regarding the requantization: if it is not going to be put in the conv op, the op should probably output FP32, otherwise the semantics are confusing. The requantization can then convert FP32 to INT8. The multiplier/shift based requantization approach introduced by TFLite is also adopted by Caffe2/QNNPACK.
Makes sense. Does it make sense to add accumulator_dtype as one of the attributes of quantized_conv2d? This will be set to int32 for TFLite, Caffe2 and QNNPACK. But if some network needs accumulation in FP32, then it will support that as well.
A network uses operators (or layers, or whatever we'd like to call them) regardless of the accumulation format. The format is part of a software system's mechanism. So, I guess we don't need an accumulator_dtype, and out_dtype is what we want. The discussion is about whether we put requantization inside the conv2d op.
And, maybe we can put the quantization parameters in the tensor, as the scale and zero point describe the INT8 tensor data rather than the op. The ops are supposed to read these parameters and get things done.
Not sure about this. The good thing is that the conv2d relay operator can be shared across FP32 and quantized tensor types. The bad thing is that the compute then depends on the quantized tensor type. This might require new Relay optimizations, preventing us from fully using the existing infrastructure.
I was suggesting extending the existing tensor rather than introducing a new tensor type. I assume that this won't lead to new Relay optimizations :)
EDIT: Btw, channel-wise quantization parameters are likely to be included in TensorFlow/TFLite, and they are also on the TVM stack roadmap. In that case, it could be easier to manage tensor-described parameters.
Regarding @jnorwood 's comments on output min/max of conv2d.
Your observations about the values of the output min/max are correct, but they are still activations. One thing I always try to convey is that the INT8 values in quantization are a representation of the original FP32 values.
When we talk about ReLU6 activations, it means that in FP32 format the op outputs FP32 values in the range [0, 6]. For INT8 quantization, the INT8 data is a representation of the FP32 values, which means the output min/max (typically [0, 255] of INT8 type in the pre-provided quantized MobileNet) represents [0, 6] of FP32 type: the INT8 0/255 is actually FP32 0/6. Try the output scale (0.023528477177023888) with the activation min/max, and we get a value range like [0, 5.999761581420898] (from the output of the first conv of the pre-provided quantized MobileNet).
Conclusions can easily be drawn once we have this in mind :)
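To make the arithmetic concrete (assuming a zero point of 0, which is what a [0, 6] range implies):

scale = 0.023528477177023888        # output scale of the first conv
zero_point = 0                      # assumed: a [0, 6] FP32 range implies zp = 0
print(scale * (0 - zero_point))     # 0.0    -> FP32 lower bound
print(scale * (255 - zero_point))   # ~5.9998 -> FP32 upper bound, i.e. the ReLU6 limit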
I would suggest to design the infrastructure that supports both symmetric/asymmetric quantization. We can certainly start with symmetric to flush the flow, while keeping in mind that we can share as much infrastructure as possible between them.
- namespace for the tflite quantize style dialect
I think this is required for both asymmetric and symmetric quantization. These ops will be rewritten to low-level instructions by a Relay pass. How about using relay.op._quantization as the namespace? So, the operations can be relay.op._quantization.conv2d or relay.op._quantization.quantize.
- List of ops that might need tvm's compute declaration
I am not sure yet. The only unknowns to me are the special rounding operations that are used in converting the floating-point multiplication into integer multiplication when scaling the quantized conv matrix. But they might already be covered by the current low-level ops.
- set of possible passes that lower the rest into the core ops
I was hoping to re-use the FForwardRewrite infrastructure to lower the ops. Do you anticipate more passes here?
We can certainly start with symmetric to flush the flow, while keeping in mind that we can share as much infrastructure as possible between them.
All the tflite quantized models I've tested use the asymmetric uint8 quantization. If you are planning to use those as examples, it will be hard to debug if you throw in the change to symmetric.
We can certainly start with symmetric to flush the flow, while keeping in mind that we can share as much infrastructure as possible between them.
All the tflite quantized models I've tested use the asymmetric uint8 quantization. If you are planning to use those as examples, it will be hard to debug if you throw in the change to symmetric.
TensorFlow quantization-aware training supports both asymmetric and symmetric quantization. We are seeing asymmetric models because it is the default. If we'd like to start from the symmetric approach, just set symmetric and go on, which requires extra effort, I think...
This is most probably out of the context of the issue, but is it possible for all of the people commenting here to join a conference call for an hour and figure out the next steps? I can take notes and document them here for everybody else to see. I think it will be more productive.
re "conference calls". I totally agree that calling or in person sync will speed up reaching consensus. Doing most of the development in the public archivable process is preferred https://docs.tvm.ai/contribute/committer_guide.html#public-archive-principle
We do need to acknowledge the overhead of the asynchronous communication, but should also acknowledge the gains we get by leaving a trace for the broader community. I would encourage us to try to rely more on asynchronous communication in public channels first. The main bottleneck of asynchronous discussion is the overhead of latency and a good way to improve it is to
Here is a possible proposal:
We could also use Slack for semi-sync chats, but please note that everything related to design decisions needs to be properly sent back to the public channel. I understand that there is more overhead in this approach, but I believe it is a price worth paying to get more people involved.
TensorFlow quantization-aware training supports both asymmetric and symmetric quantization. We are seeing asymmetric models because it is the default. If we'd like to start from the symmetric approach, just set symmetric and go on, which requires extra effort, I think...
You might also consider symmetric signed int8 for weights, and unsigned uint8 for source and destination, since uint8 will give an extra bit of precision following activations. Intel appears to preferentially support this form in their examples, and their new DLBoost avx512 vector instructions also appear to preferentially support this form.
https://intel.github.io/mkl-dnn/ex_int8_simplenet.html
https://www.intel.ai/nervana/wp-content/uploads/sites/53/2018/05/Lower-Numerical-Precision-Deep-Learning-Inference-Training.pdf
These instructions enable lower precision multiplies with higher precision accumulates. Multiplying two 8-bit values and accumulating the result to 32 bits requires 3 instructions, and requires one of the 8-bit vectors to be in unsigned int8 (u8) format and the other in signed int8 (s8) format, with the accumulation in signed int32 (s32) format.
You might also consider symmetric signed int8 for weights, and unsigned uint8 for source and destination, since uint8 will give an extra bit of precision following activations. Intel appears to preferentially support this form in their examples, and their new DLBoost avx512 vector instructions also appear to preferentially support this form.
https://intel.github.io/mkl-dnn/ex_int8_simplenet.html
https://www.intel.ai/nervana/wp-content/uploads/sites/53/2018/05/Lower-Numerical-Precision-Deep-Learning-Inference-Training.pdf
These instructions enable lower precision multiplies with higher precision accumulates. Multiplying two 8-bit values and accumulating the result to 32 bits requires 3 instructions, and requires one of the 8-bit vectors to be in unsigned int8 (u8) format and the other in signed int8 (s8) format, with the accumulation in signed int32 (s32) format.
I am sorry, but I fail to get the reasoning between your comment that uint8 will give an extra bit of precision following activations and the material you listed. Would you please make it a bit clearer? AFAIK, uint8 and int8 have the same value capacity, so there could be no extra precision.
@jackwish If relu activations are used, there is no need to use half of the representation space for negative values; thus the extra bit of precision.
This makes sense.
Best Regards Zhenhua
OK, let's try to finalize the high-level design points. Let's first discuss the namespace:
How about using relay.op._quantization as the namespace? So, the operations can be relay.op._quantization.conv2d or relay.op._quantization.quantize.
Let me know your thoughts on this. As we achieve consensus, I can start prototyping these operators with stubbing implementation.
@FrozenGene @jackwish can you also try to send a proposal? It would be great to have a global picture of what is in everyone's mind.
@tqchen We are very busy with an internal project at the moment. I will talk with @jackwish next Monday. However, sending the proposal may have to wait until we finish this project. Sorry for that.
After NCHW support was removed from tflite.py three weeks ago in #3141, TFLite models can no longer be compiled for ARM CPU and Mali GPU.
@tqchen @FrozenGene @jackwish
I have added a prototype patch. I think it will be helpful to use that patch to drive the discussion further.
@anijain2305 I looked at the code quickly and I understand your thought (combining operators to complete q_conv2d). However, as commented before, how do we integrate with QNNPACK when we don't have output_min / output_max? I think we could have these two arguments; if MXNet doesn't have them, we could leave them at their default values.
@FrozenGene Thanks for replying. I might be wrong, but I don't think it is a good design to take one codegen backend like QNNPACK and make changes all the way into Relay APIs to make the connection. In my opinion, APIs must be minimal.
But your point about using QNNPACK is completely valid. I have been thinking about that myself, dreading the painful experience of writing tensorized kernels for Intel x86 and hoping to somehow use OpenVINO/MKLDNN. But, similarly, I don't think adding MKLDNN/OpenVINO arguments to the Relay API would be the right choice either.
One way to handle this is to separate the Relay operator API that we are discussing from the infrastructure to use external codegen like QNNPACK. I think it is entirely possible to write Relay passes for each codegen backend and then rewrite/fuse the Relay ops in a manner that the codegen backend can understand. In this case, we do not let backend-specific idiosyncrasies creep into the Relay op API, while also having a well-defined infrastructure that shows how to add external codegens.
@anijain2305 I understand your thought, and I agree we should make the API minimal. However, no matter what, q_conv2d's int32 output should be clamped into the uint8 range. If you don't pass min / max, you still need output = std::max(output, 0) and output = std::min(output, 255) before returning. So why not set the defaults output_min = 0 / output_max = 255 and make the computation output = std::max(output, output_min) and output = std::min(output, output_max), which will be suitable for TFLite / MXNet / QNNPACK and so on. API design is very important; we should consider as much as we can (TFLite / MXNet, and other libraries too: QNNPACK is a very high-performance library on ARM CPU, and we cannot avoid discussing it in my opinion), otherwise we will have to do tricky workarounds in the future. This is the point I wished to express before.
@FrozenGene a clarifying question to your above comment. If we pass in the output scale and shift can we not compute int32-> int8 by simply adding more nodes in the graph.
@FrozenGene For the output_min and max, isn't the out_dtype enough? If its uint8, we can clamp at 0 and 255. If its int8, we can clamp at -128 and 127. I don't see any reason the values will be any different, unless you want to fuse the quantized relu in the quantized convolution from the starting itself. Please let me know if I am understanding something wrong. I think we should not fuse operators in the frontend and let Relay graph fusion take care of that.
Let's see what others think about this. @tqchen @yzhliu @ZihengJiang What are your thoughts on this?
The tflite quantized convolution reference implementation passes in both limits as int32 values. It appears to me this would let them simulate smaller than 8 bit quantizations, if that is something you want to support.
this is from tensorflow/lite/kernels/internal/reference/conv.h
acc = MultiplyByQuantizedMultiplier(acc, output_multiplier, output_shift);
acc += output_offset;
acc = std::max(acc, output_activation_min);
acc = std::min(acc, output_activation_max);
It appears to me this would let them simulate smaller than 8 bit quantizations.
If simulating smaller than 8 bits is the case, 8 bits should be able to hold the activation min/max values.
@FrozenGene a clarifying question to your above comment. If we pass in the output scale and shift can we not compute int32-> int8 by simply adding more nodes in the graph.
I don't understand your comment fully. Do you mean we could avoid the int32 -> int8 computation? If so, I think we cannot. We need the requant operation (int32 -> int8): https://github.com/tensorflow/tensorflow/blob/v2.0.0-beta1/tensorflow/lite/kernels/internal/reference/conv.h#L171
It appears to me this would let them simulate smaller than 8 bit quantizations.
If simulating smaller than 8 bits is the case, 8 bits should be able to hold the activation min/max values.
8 bits could hold it. But what should the values of output_min / output_max be? I think that is the point @jnorwood wants to express: we cannot simply use out_dtype to decide what the value range is. But if we insert a clip op in the frontend, I think that could also handle it; we just need some logic to calculate the min / max. See my next comment.
@FrozenGene For the output_min and max, isn't the out_dtype enough? If its uint8, we can clamp at 0 and 255. If its int8, we can clamp at -128 and 127. I don't see any reason the values will be any different, unless you want to fuse the quantized relu in the quantized convolution from the starting itself. Please let me know if I am understanding something wrong. I think we should not fuse operators in the frontend and let Relay graph fusion take care of that.
Let's see what others think about this. @tqchen @yzhliu @ZihengJiang What are your thoughts on this?
I think it is OK. If we do it this way, we should insert one clamp when we have an activation, like in our TFLite frontend:
# If we have fused activations
if fused_activation_fn != ActivationFunctionType.NONE:
    if weight_tensor_type == TensorType.UINT8:
        # implement this function
        output_min, output_max = self.calculate_activation_range_uint8(
            output_scale, output_zero_point, fused_activation_fn)
        # insert clip to realize the fused activation in the quantized domain
        out = _op.clip(out, output_min, output_max)
    else:
        # fall back to the regular FP32 activation conversion
        out = self.convert_fused_activation_function(out, fused_activation_fn)
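The helper marked "# implement this function" above could look roughly like the sketch below. It mirrors TFLite's CalculateActivationRangeUint8 (quantize the FP32 activation bounds into the uint8 domain); the exact enum handling is an assumption, not a final implementation.

from tflite.ActivationFunctionType import ActivationFunctionType  # flatbuffer-generated enum

def calculate_activation_range_uint8(output_scale, output_zero_point, fused_activation_fn):
    # Map the FP32 activation bounds into quantized uint8 bounds.
    def quantize(x):
        return int(round(x / output_scale)) + output_zero_point

    qmin, qmax = 0, 255
    if fused_activation_fn == ActivationFunctionType.RELU:
        return max(qmin, quantize(0.0)), qmax
    if fused_activation_fn == ActivationFunctionType.RELU6:
        return max(qmin, quantize(0.0)), min(qmax, quantize(6.0))
    if fused_activation_fn == ActivationFunctionType.RELU_N1_TO_1:
        return max(qmin, quantize(-1.0)), min(qmax, quantize(1.0))
    return qmin, qmax  # no activation: full uint8 range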
I think it is ok. If we do this way, we should insert one clamp if we have activation. Like our tflite frontend
Yes, I agree with that. That's exactly what I was thinking.
The min and max are not conditional on existence of activation operation in the original model. They are there to saturate the downscaled and offset adjusted 32 bit signed int accumulator to the min and max value of the uint8 quantized bit range.
Although the quantized conv result is held in uint8, it could be static casted to signed int8, or even fewer than 8 bit quantization. That would require both min and max saturations, as in the reference tflite quantized conv implementation.
Although the quantized conv result is held in uint8, it could be static casted to signed int8, or even fewer than 8 bit quantization. That would require both min and max saturations, as in the reference tflite quantized conv implementation
Ah, I see. That finally makes sense. So, this is not about activation. This is about what representation one is using for storing the floating point values. For example, if it is 7-bits, we will need the output min/max saturations. Cool, I will add them into the API and add corresponding documentation.
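For illustration only, a tiny hypothetical helper showing how such saturation bounds would differ for narrower bit widths (not part of the proposal):

def saturation_bounds(num_bits, signed=False):
    # Saturation bounds when simulating a narrower quantization while still
    # storing values in an 8-bit container.
    if signed:
        return -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    return 0, 2 ** num_bits - 1

print(saturation_bounds(8))  # (0, 255) -- implied by out_dtype alone
print(saturation_bounds(7))  # (0, 127) -- needs explicit output_min / output_max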
So, this is not about activation.
Of course it comes from the activation, and it is related to the zero point and scale. For this min/max activation, see CalculateActivationRangeQuantizedImpl.
Let me reference @ajtulloch's comment about the quantization workflow first:
However, we have frameworks that can do steps 1 -> 5 well, like TensorFlow. For example, TensorFlow has quantization-aware training, which does step 2 and achieves good accuracy in the end.
In industry development, one common scenario is that a company divides the algorithm and the engine / framework work into two different teams. The algorithm team just sends a model to the engine team to boost the performance. So if the algorithm team can use TensorFlow's quantization-aware training, they will know the accuracy before delivering the model to the engine team. The engine team is only responsible for boosting the performance.
For the above reasons, I will make several PRs to support importing existing quantized models (TFLite INT8 models) into TVM. This is not a replacement for https://github.com/dmlc/tvm/pull/2116; it is just a supplement to TVM's quantization.
After initial investigation and effort, on the Mobilenet V1 model INT8 gets about a 30% speedup compared with FP32 on an ARM CPU.
[x] Support TFLite FP32 Relay frontend. PR: https://github.com/dmlc/tvm/pull/2365
[ ] Support TFLite INT8 Relay frontend
[ ] Extend the attribute of the convolution and related ops to support quantization
[ ] Auto-TVM on ARM CPU can work with INT8
Any feedback is welcome.