Closed juncaipeng closed 2 years ago
Hi! We've received your issue; please be patient while waiting for a response. We will arrange for technicians to answer your questions as soon as possible. Please check again that you have provided a clear problem description, reproduction code, environment and version, and error messages. You may also check the API documentation, FAQ, GitHub issues and AI community to find an answer. Have a nice day!
@juncaipeng ,
I am investigating the issue presently. Below are my comments to your questions.
It is we who decide which INT8 range is used ([-128, 127] or [0, 255]). In the script `python/paddle/fluid/contrib/slim/quantization/quant2_int8_mkldnn_pass.py`, the class `Quant2Int8MkldnnPass` has a member `_var_quant_scales`. It is a map of the form
string -> (bool, tensor)
variable_name -> ( use_unsigned_int, scale_tensor )
Assuming `variable_name` is the name of a `conv2d` op's output, if `use_unsigned_int` equals `True`, then the output of the `conv2d` will be quantized to the [0, 255] range. Otherwise it is quantized to [-128, 127].
To quantize everything to the signed int8 (s8) range ([-128, 127]), make sure `use_unsigned_int` is set to `False` for all variables in the methods `_gather_output_scales_from_attr()`, `_gather_input_scales_from_fake()` and `_update_relu_output_scales()`. Afterwards, the `_var_quant_scales` map is passed to the `cpu_quantize_pass` pass, which performs the quantization to the desired range. Keep in mind that if no quantized op follows a `conv2d` (or `fc`) op, that op will have the `force_fp32_output` attribute set to `true` and its output will be of fp32 type.
oneDNN convolution and inner_product (used in FC kernel) primitives can accept u8/s8/s32/f32 bias with s8/u8 input and s8 weights.
@juncaipeng ,
The problem seems to be with the transformation from the Quant model to the FP32 model before the quantization is applied. The FP32 model obtained there is faulty: it gives 0.0 accuracy. The problem looks similar to the one we investigated some time ago, namely that the fake-quantized weights cannot be dequantized properly using the scales stored in the `Scales` input of `fake_dequantize_*` operators.
Still looking into it.
@wojtuss
I commented out the line `graph = self._update_relu_output_scales(graph)` and generated the real int8 model again, so `use_unsigned_int=False` for every entry in `( use_unsigned_int, scale_tensor )`. The intermediate tensors are shown in the following picture. For dequantize/in/1 and dequantize/in/3, the max value is greater than 127 and the dtype is uint8. For dequantize/in/5, the min value is less than -127.
For the quantize op and int8 ops (conv2d, fc, etc.) in PaddleLite, the range of output tensors is [-127, 127] on ARM CPU. The difference in the quantized tensors' range may be the main problem. Does oneDNN decide the output range? Can we fix the difference?
In QAT and PaddleLite, the quantization formula is `int8 = clip(fp32, -threshold, +threshold) * 127 / threshold`. The quantization in the quantize op and in quantized ops (conv2d, fc, etc.) keeps the output range at [-127, 127].
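The formula above can be sketched in NumPy as follows. This is a minimal illustration, not the PaddleLite or QAT implementation; the function name `quantize_symmetric` is hypothetical, and rounding to nearest is an assumption because the formula in the thread does not state a rounding mode.

```python
import numpy as np

def quantize_symmetric(x, threshold, qmax=127):
    """int8 = clip(fp32, -threshold, +threshold) * qmax / threshold, then rounded."""
    x = np.asarray(x, dtype=np.float32)
    clipped = np.clip(x, -threshold, threshold)
    # Assumption: round to nearest; the thread's formula leaves rounding unspecified.
    return np.rint(clipped * qmax / threshold).astype(np.int8)

x = np.array([-3.0, -1.5, 0.0, 0.75, 1.5, 3.0], dtype=np.float32)
q = quantize_symmetric(x, threshold=1.5)
print(q.min(), q.max())  # both ends land on +/-127, never on -128
```

Because the input is clipped to the symmetric interval before scaling, the value -128 is never produced, which is the [-127, 127] behaviour described for ARM CPU.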
Can you give users an option to enable or disable quantizing the bias? For some models, quantizing the bias may lead to an accuracy drop.
@wojtuss The FP32 model transformed from the fake int8 model gives 0.0 accuracy because the quantized model is generated with PACT, a recently proposed quantization algorithm. PACT adds a clip operation to the activations before quantization is applied in the QAT training stage. However, when PaddleLite is used to deploy this fake int8 model on ARM CPU, it gives the same accuracy as the FP32 model. Therefore, I think the FP32 model transformed from the fake int8 model doesn't have errors.
Maybe we should first resolve the range difference.
@juncaipeng
Please correct me if my understanding is wrong:
[...] `Quant2Int8MkldnnPass` gives correct accuracy.
[...] `Quant2Int8MkldnnPass` gives totally incorrect accuracy.
My comments and questions:
[...] `paddle/fluid/framework/ir/mkldnn/cpu_quantize_pass.cc` and make sure that the arguments `is_input_unsigned`/`are_inputs_unsigned`/`is_output_unsigned` are set to `false` inside the methods `QuantizeInput`/`QuantizeInputs`/`DequantizeOutput`. I have done that, but it didn't help for accuracy.
@wojtuss
[...] `Quant2Int8MkldnnPass` to transform the fake int8 model to the real int8 model, deploys the real int8 model and gives totally incorrect accuracy.
Comparing PaddleLite's optimization module with `Quant2Int8MkldnnPass`, the main difference is how conv+bn is fused. The former multiplies the alpha and beta of the bn layer into the scale of conv2d, so it doesn't change the quantized weights of conv2d. As you know, the latter dequantizes the weights of conv2d, fuses the conv and bn, and calculates the new scale of the weights. I don't think this difference is the main reason.
The clipping is applied in fake quantize ops and isn't applied in fake dequantize ops.
Since you are not familiar with PaddleLite and QAT, and running the fake int8 model with the Paddle executor gives correct accuracy, you can focus on the deployment difference between the fake int8 model and the real int8 model transformed by `Quant2Int8MkldnnPass`.
In the next picture, the left is the fake int8 model and the right is the real int8 model. In the fake int8 model, all fake quantize ops use the formula `int8 = clip(fp32, -threshold, +threshold) * 127 / threshold`, so the range of tensors A and C is [-127, 127]. For the real int8 model, the range of tensors B and D should also be [-127, 127]. We must ensure that the outputs of the quantize op and the quantized ops (conv2d, fc, etc.) have the range [-127, 127].
@juncaipeng , Thank you. I am investigating the case with all the fusions turned off, so that the real INT8 model is as similar as possible to the fake quant model. The conv+bn fuse is disabled. No success so far. With oneDNN clipping to the [-128, 127] range is done automatically by casting to uint8_t type. I will try to manually enforce the [-127, 127] range.
@wojtuss
With oneDNN clipping to the [-128, 127] range is done automatically by casting to uint8_t type. I will try manually enforce the [-127, 127] range.
Thank you. You can add a clip post-process to the quantize op and the quantized ops (conv2d, fc, etc.) to enforce the [-127, 127] range.
Besides, please provide an option to enable quantizing the bias, and another option to use uint8_t quantization for the output tensor of Relu. That way, users can set these options for different quantized models.
@juncaipeng So far I have made a couple of changes to fix the accuracy for INT8. Now the accuracy is good when quantization is applied to the unoptimized FP32 model. However, when the conv+bn fuse is applied before quantization, the INT8 accuracy drops significantly.
@wojtuss Can you give a PR to show the changes? Thanks.
Comparing PaddleLite's optimization module with Quant2Int8MkldnnPass, the main difference is how conv+bn is fused. The former multiplies the alpha and beta of the bn layer into the scale of conv2d, so it doesn't change the quantized weights of conv2d. As you know, the latter dequantizes the weights of conv2d, fuses the conv and bn, and calculates the new scale of the weights.
As described above, if the conv+bn fuse affects the accuracy of the INT8 model, maybe PaddleInference should also use the PaddleLite method to fuse conv+bn before quantization.
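The two conv+bn fusing strategies described above can be contrasted with a small NumPy sketch. Everything here is an assumption for illustration: the convention `w_fp32 = w_int8 * w_scale`, the 1x1 kernel, the random data, and the names `alpha`, `folded_scale` and `new_scale`. It is not the PaddleLite pass or `Quant2Int8MkldnnPass`.

```python
import numpy as np

rng = np.random.default_rng(0)
out_ch, in_ch = 4, 3

# Hypothetical per-output-channel quantized conv weights (1x1 kernel for brevity).
w_int8 = rng.integers(-127, 128, size=(out_ch, in_ch)).astype(np.int8)
w_scale = rng.uniform(0.01, 0.05, size=out_ch)  # assumed: w_fp32 = w_int8 * w_scale

# Batch-norm parameters (random, for illustration only).
gamma = rng.uniform(0.5, 1.5, size=out_ch)
beta = rng.normal(size=out_ch)
mean = rng.normal(size=out_ch)
var = rng.uniform(0.5, 2.0, size=out_ch)
eps = 1e-5

alpha = gamma / np.sqrt(var + eps)  # multiplicative part of bn
bias = beta - mean * alpha          # additive part of bn

# Strategy A: keep the int8 weights untouched, fold alpha into the scale.
folded_scale = w_scale * alpha

# Strategy B: dequantize, fuse, then recompute the scale and requantize.
w_fused = w_int8.astype(np.float64) * w_scale[:, None] * alpha[:, None]
new_scale = np.abs(w_fused).max(axis=1) / 127.0
w_requant = np.rint(w_fused / new_scale[:, None]).astype(np.int8)

x = rng.normal(size=in_ch)  # fp32 input, to isolate the weight handling
y_ref = w_fused @ x + bias
y_a = (w_int8.astype(np.float64) @ x) * folded_scale + bias
y_b = (w_requant.astype(np.float64) @ x) * new_scale + bias

print("max error, folded scale:", np.abs(y_a - y_ref).max())
print("max error, requantized :", np.abs(y_b - y_ref).max())
```

In this sketch strategy A reproduces the fused fp32 result up to floating-point error, while strategy B adds one extra rounding step on the weights.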
The pass of fusing quantized_conv+bn in PaddleLite. The main steps:
@juncaipeng
In my opinion there is a discrepancy between the scales that come from the `out_threshold` attributes of some ops and the scales that come from `fake_quantize_*` ops (these are the two sources of scales for activation tensors). Below I explain why I think so.
As we discussed some time ago (https://github.com/PaddlePaddle/Paddle/pull/23928), when collecting scales the highest priority is given to scales from `fake_quantize_*` ops.
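The priority rule described above can be sketched as a dictionary merge that also reports where the two sources disagree. The function `merge_scales`, the tolerance and the variable names are hypothetical; this is not the code of `Quant2Int8MkldnnPass`.

```python
def merge_scales(fake_quantize_scales, out_threshold_scales, rel_tol=1e-3):
    """Merge two {var_name: scale} maps; fake_quantize_* scales win on conflict."""
    merged = dict(out_threshold_scales)
    merged.update(fake_quantize_scales)  # higher-priority source overwrites
    conflicts = {}
    for name in fake_quantize_scales.keys() & out_threshold_scales.keys():
        a, b = fake_quantize_scales[name], out_threshold_scales[name]
        if abs(a - b) > rel_tol * max(abs(a), abs(b)):
            conflicts[name] = (a, b)
    return merged, conflicts

# Hypothetical variable names and values, for illustration only.
fake = {"conv_out": 0.5, "relu_out": 1.0}
attr = {"conv_out": 0.5, "relu_out": 2.0, "scale_out": 3.0}
merged, conflicts = merge_scales(fake, attr)
print(merged)     # relu_out keeps the fake_quantize value
print(conflicts)  # variables where the two sources disagree
```

Listing the conflicts is a cheap way to locate the tensors where the two scale sources coincide in the graph but carry different values.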
For what I describe here I kept the `mul` op always fake-quantized, focused on quantization of `conv2d` operators only, and forced quantization to signed int8.
1. [...] `fake_quantize_*` ops and output scales were taken from the `conv2d`'s `out_threshold` attribute. Accuracy was good (top1 0.78, top5 0.91). Here the scales obtained from the `conv2d`'s `out_threshold` attribute work fine. Weight scales calculated after removing fake ops were correct (accuracy was exactly the same when the weight scales were recalculated or when original fake-quantized weights were kept and only turned into int8). Additional clipping of the outputs of `quantize` and `conv2d` ops to the [-127, 127] range lowered the accuracy a little bit, so I skipped that later. When squashing `conv2d+dequantize->conv2d(force_fp32_output=true)` was enabled, the output scales were totally ignored by the `conv2d` ops and accuracy was 0.73. The small drop most probably comes from aliasing of quantization (the FP32 output from `conv2d` and the FP32 output from `dequantize` after `conv2d` are a little bit different due to the way oneDNN convolution and their postops work).
2. [...] `out_threshold` attribute. There are places in the graph where the two scales coincide and should be equal, e.g. [...] `scale` op's `out_threshold` attribute should be the same as from the `fake_quantize` op. Unfortunately, the scales are different and the accuracy dropped to 0.0.
3. [...] `fake_quantize` ops. I also turned on the `conv2d+bn` fuse before quantization. Then the output scale for `conv2d` came either from `batch_norm`'s `out_threshold` attribute or from the `fake_quantize` that came after `batch_norm`. Input scales still came from `fake_quantize` ops. Accuracy: 0.44. When a squash `conv2d+dequantize -> conv2d (force_fp32_output=true)` was enabled (and again the output scales were totally ignored by the `conv2d` ops), accuracy was 0.76.
4. [...] `conv2d+relu` fuse. Then the output scale for `conv2d` came either from the `fake_quantize` op coming after `relu` (the highest priority) or from the `relu`'s `out_threshold` attribute (lower priority, but in some cases the only source of the output scale). [...] `conv2d+dequantize` squash accuracy was 0.74. After applying the `dequantize+quantize` and `conv2d+dequantize` squashes accuracy was 0.76.
5. [...] `elementwise_add`. `elementwise_add` operators use input and output scales from `fake_quantize` ops. Accuracy was 0.54. With squashes accuracy was 0.78.
6. [...] `conv+elementwise_add` fuse. [...] `conv2d` ops. The situation with scales is similar to cases 4. and 5. Accuracy was 0.5. With squashes accuracy was 0.74.

In my opinion the symptoms indicate that the scales collected from the `out_threshold` attribute are somehow inconsistent with the scales from the `fake_quantize` operators, and this is something we cannot fix for the QAT->INT8 transformation in Paddle.
@juncaipeng I do not have a fix ready to be merged yet. I will prepare a PR soon. The scale propagation algorithm needs a fix. Also, when only signed int8 is used, accuracy is 0.79. When unsigned int8 is also used, the accuracy is 0.76. I will add an option to disable using unsigned int8.
@juncaipeng
A fix for the issue is submitted: https://github.com/PaddlePaddle/Paddle/pull/31783
Please verify it. On the small dataset attached to this issue INT8 accuracy is 0.8 (0.94 top5) on my i9 (SKX-like) machine (still using `uint8_t` where appropriate).
@wojtuss 👌
@wojtuss
When QAT applies PACT, the input tensor of the fake_quantize ops is clipped by a pre-process function, so the output scales that come from 'out_threshold' are different from the output scales that come from the fake_quantize ops. When QAT doesn't apply PACT, the output scales are the same. (⊙o⊙)…
In #31783, you fixed the scale propagation bug of the scale op, so the accuracy of mobilenetv3 is now correct.
There is another model for which the real int8 model gives incorrect results. Please check again; the details are described in the readme. (Link:https://dubox.com/s/1NKjpX8atMhX7BzixGlIDgQ Password:ix25) This model is also generated by QAT and PACT, so we cannot use the output scales from 'out_threshold'. Besides, the model has a hard_swish op. Can you fuse conv+hard_swish?
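The effect of the PACT clipping described above on the two scale sources can be illustrated with NumPy. This is a sketch under stated assumptions: the clip is symmetric with a fixed, hypothetical threshold `pact_alpha` (PACT learns it during training), thresholds are abs-max values, and the 'out_threshold' value is assumed to be recorded on the unclipped tensor, which the thread suggests but does not confirm.

```python
import numpy as np

rng = np.random.default_rng(0)
act = rng.normal(scale=2.0, size=10_000).astype(np.float32)  # synthetic activations

pact_alpha = 3.0  # hypothetical clipping threshold
clipped = np.clip(act, -pact_alpha, pact_alpha)  # what the fake_quantize op would see

threshold_unclipped = float(np.abs(act).max())      # assumed source of 'out_threshold'
threshold_clipped = float(np.abs(clipped).max())    # assumed fake_quantize threshold

print("threshold before clip:", threshold_unclipped)
print("threshold after clip :", threshold_clipped)
```

With the clip in place the two thresholds differ, so a pass that mixes both sources would quantize the same tensor with two different scales.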
@juncaipeng I have verified that PR https://github.com/PaddlePaddle/Paddle/pull/31820 fixes that problem. Please, confirm that it works for you as well.
@wojtuss PR #31820 does not fix the problem. The outputs of the real int8 model are still different from those of the fake int8 model and the fp32 model.
The output image of the fp32 model:
The output image of the fake int8 model:
The output image of the real int8 model:
However, if I comment out the line `graph = self._gather_output_scales_from_attr(graph)` in quant2_int8_mkldnn_pass.py and generate a new real int8 model, the output scales that come from 'out_threshold' are not used. The output image of the new real int8 model is shown below; it is more similar to the output image of the fake int8 model.
Considering the output images, there is still a small difference between the fp32 model, the fake int8 model and the new real int8 model. The quantization error on Intel CPU evidently causes the output difference. The quantization error needs to be fixed.
@juncaipeng
Commenting out `self._gather_output_scales_from_attr(graph)` disabled quantization of some `elementwise_add` operators. You can also do that by adding the option `--ops_to_quantize "conv2d,concat"` to the `save_quant_model.py` script call, so that only `conv2d` and `concat` operators are quantized. I would disable quantization of `elementwise_add` (and possibly `concat`) because quantizing them adds extra quantization and dequantization to the flow (due to the floating-point `nearest_interp` ops in between), which hurts both accuracy and performance. Also, the Quant model was not tuned for quantization of `elementwise_add` and `concat` ops. When quantization of `elementwise_add` is disabled, the real int8 picture looks good.
Does it make sense to you?
@juncaipeng Recently I found a problem in the quant lstm model where `self._gather_output_scales_from_attr(graph)` lowered accuracy. This was because the function marks the variable as uint8, but the scales for that variable are still computed for signed data. I solved it by adjusting the scales to the uint8 range in `_gather_output_scales_from_attr`, multiplying them by 2. In the lstm model it improved accuracy from 50% to 93%.
I prepared a PR with the fix: https://github.com/PaddlePaddle/Paddle/pull/35599 Maybe it will also solve the problem you described.
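Why a scale computed for signed data is too small for a variable marked as uint8 can be seen in a short NumPy sketch. The helper `quantize`, the multiplicative scale convention (`q = round(x * scale)`) and the sample data are assumptions for illustration; the code in the PR may define scales differently.

```python
import numpy as np

def quantize(x, scale, unsigned):
    """Quantize with a multiplicative scale into u8 [0, 255] or s8 [-128, 127]."""
    lo, hi, dtype = (0, 255, np.uint8) if unsigned else (-128, 127, np.int8)
    x = np.asarray(x, dtype=np.float64)
    return np.clip(np.rint(x * scale), lo, hi).astype(dtype)

x = np.linspace(0.0, 6.0, 7)     # non-negative data, e.g. after a relu-like op
threshold = 6.0
signed_scale = 127.0 / threshold  # scale computed as if the data were signed

q_mismatch = quantize(x, signed_scale, unsigned=True)       # marked u8, signed scale
q_adjusted = quantize(x, 2.0 * signed_scale, unsigned=True)  # scale multiplied by 2

print(q_mismatch.max(), q_adjusted.max())
```

With the signed scale only the lower half of the unsigned range is used; doubling the scale spreads the same data over nearly the whole [0, 255] range, which halves the quantization step.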
I have added a PR that might also help with this issue. It is a fix for scale calculation of quantized convolution + activation. Previously output scale was applied before activation instead of after. https://github.com/PaddlePaddle/Paddle/pull/38331
Notes from the 5/20 meeting: @jiangjiajun will check internally and close this ticket.
Download demo (Link:https://dubox.com/s/1S3PAyHFeBtyk-Xj-jeB-0Q Password:9gt7).
Refer to the readme or the following.
Problem
The fake int8 model is generated by PaddleSlim and the real int8 model is the model optimized by `save_quant_model.py`.
With the same input data, we find that the results of the fake int8 model and the real int8 model differ numerically. For most models, the numerical difference doesn't affect the statistical accuracy over many input samples. For specific models, the numerical difference leads to completely incorrect results.
For mobilenetv2:
For the mobilenetv3 model, we applied the original QAT algorithm to generate a fake int8 model, but the accuracy of the fake int8 model is lower than that of the fp32 model. Therefore, we use PACT in the QAT algorithm, which adds a clip operation before fake_quantize_op, and the accuracy of the fake int8 model is then the same as the fp32 model. PaddleLite deploys the fake int8 model on ARM CPU and the accuracy is the same. However, the fake int8 model deployed on Intel CPU by PaddleInference gives completely incorrect results, and the fake int8 model deployed on NV GPU by PaddleInference has a 10% accuracy drop.
Note that we skip quantizing the se_block in mobilenetv3 and set `--ops_to_quantize='conv2d,fc'` for `save_quant_model.py`. For 100 imgs, the statistical accuracy is as follows:
Analysis
After comparing the int8 model deployment on ARM CPU and Intel CPU, I found two main differences so far.
For the quantize op and int8 ops (conv2d, fc, etc.), the range of output tensors is [-127, 127] on ARM CPU, but it is [-128, 127] or [0, 255] on Intel CPU. The difference between [-127, 127] and [-128, 127] may be the main problem. Does oneDNN decide the output range? Can we fix the difference? When the int8 op is followed by a relu or relu6 op, quantizing to [0, 255] may decrease the quantization loss. In order to run some tests, I want to know how to set the output range to [-128, 127] for all int8 ops. Is that also decided by oneDNN?
On ARM CPU, PaddleLite doesn't quantize the bias of Conv and FC. On Intel CPU, PaddleInference quantizes the bias to int32. Does oneDNN support using an fp32 bias in an int8 kernel?
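For reference, quantizing a bias to int32 is commonly done by scaling it with the product of the input scale and the per-channel weight scale, so that it can be added directly to the int32 accumulator. The sketch below follows that common convention with multiplicative scales (`q = round(x * scale)`); whether PaddleInference does exactly this is an assumption, and the function name and values are hypothetical.

```python
import numpy as np

def quantize_bias_int32(bias, input_scale, weight_scales):
    """bias_q = round(bias * input_scale * weight_scale), per output channel."""
    bias = np.asarray(bias, dtype=np.float64)
    weight_scales = np.asarray(weight_scales, dtype=np.float64)
    return np.rint(bias * input_scale * weight_scales).astype(np.int32)

bias = np.array([0.5, -0.25])
input_scale = 127.0 / 4.0                            # input threshold 4.0
weight_scales = np.array([127.0 / 0.5, 127.0 / 0.25])  # per-channel weight thresholds

bias_q = quantize_bias_int32(bias, input_scale, weight_scales)
bias_back = bias_q / (input_scale * weight_scales)   # dequantized bias
print(bias_q, bias_back)
```

The round trip shows the size of the rounding error introduced by bias quantization; keeping the bias in fp32, as PaddleLite is said to do above, avoids this step entirely.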
Compare intermediate tensor
Use Netron to load the fp32 or int8 model to find the intermediate tensor names.
`python run_infer.py model_path tensor_name1 tensor_name2...` runs the model and fetches the intermediate tensors.
I have compared some intermediate tensors for the fake int8 mobilenetv3 and the real int8 mobilenetv3.
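Once the intermediate tensors are fetched as NumPy arrays, a small helper can report their dtype and value range, which is the comparison discussed in this issue. The helper `summarize` and the sample values are hypothetical; the tensor names only mirror the style of names mentioned in the thread.

```python
import numpy as np

def summarize(name, tensor):
    """Report dtype, min, max and whether values fit the [-127, 127] range."""
    t = np.asarray(tensor)
    lo, hi = t.min().item(), t.max().item()
    return {
        "name": name,
        "dtype": str(t.dtype),
        "min": lo,
        "max": hi,
        "in_symmetric_s8_range": -127 <= lo and hi <= 127,
    }

# Made-up tensors standing in for fetched intermediate results.
fake_int8 = np.array([-127, -5, 0, 64, 127], dtype=np.float32)
real_int8 = np.array([0, 130, 255], dtype=np.uint8)

for name, tensor in [("fake/in/1", fake_int8), ("dequantize/in/1", real_int8)]:
    print(summarize(name, tensor))
```

Running the helper over the same tensor names in both models makes range mismatches such as uint8 values above 127 easy to spot.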