PaddlePaddle / Paddle

PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (PaddlePaddle core framework: high-performance single-machine and distributed training and cross-platform deployment for deep learning and machine learning)
http://www.paddlepaddle.org/
Apache License 2.0

Fake int8 model and real int8 model have different outputs on Intel CPU #31103

Closed juncaipeng closed 2 years ago

juncaipeng commented 3 years ago

Download demo (Link:https://dubox.com/s/1S3PAyHFeBtyk-Xj-jeB-0Q Password:9gt7).

Refer to the readme or the following.

Problem

The fake int8 model is generated by PaddleSlim, and the real int8 model is the model optimized by save_quant_model.py.

With the same input data, we find that the results of the fake int8 model and the real int8 model differ numerically. For most models, the numerical difference doesn't affect the statistical accuracy over many input samples. For specific models, however, the numerical difference leads to completely incorrect results.

For mobilenetv2:

# Test 100 imgs, compare statistical accuracy.

python run_eval.py --model_path models/mobilenetv2_fp32
# test_acc1: 0.78, test_acc5: 0.95

python run_eval.py --model_path models/mobilenetv2_fake_int8
# test_acc1: 0.77, test_acc5: 0.93

python run_eval.py --model_path models/mobilenetv2_real_int8
# test_acc1: 0.77, test_acc5: 0.96
# Test 1 img, compare numerical difference.

python run_infer.py models/mobilenetv2_fp32
# max value: 0.868, arg_max: 65

python run_infer.py models/mobilenetv2_fake_int8
# max value: 0.835, arg_max: 65

python run_infer.py models/mobilenetv2_real_int8
# max value: 0.902, arg_max: 65

For the mobilenetv3 model, we apply the original QAT algorithm to generate a fake int8 model, but its accuracy is lower than that of the fp32 model. Therefore, we use PACT in the QAT algorithm, which adds a clip operation before the fake_quantize_op, and the accuracy of the fake int8 model becomes the same as the fp32 model. PaddleLite deploys this fake int8 model on ARM CPU with the same accuracy. However, the fake int8 model deployed on Intel CPU by PaddleInference gives completely incorrect results, and the fake int8 model deployed on NV GPU by PaddleInference has a 10% accuracy drop.

Note that we skip quantizing the se_block in mobilenetv3 and set --ops_to_quantize='conv2d,fc' for save_quant_model.py. For 100 imgs, the statistical accuracy is as follows:

Analysis

After comparing the int8 model deployment on ARM CPU and Intel CPU, I have found two main differences so far.

For the quantize op and the int8 ops (conv2d, fc, etc.), the range of output tensors is [-127, 127] on ARM CPU, but it is [-128, 127] or [0, 255] on Intel CPU. The difference between [-127, 127] and [-128, 127] may be the main problem. Does oneDNN decide the output range? Can we fix the difference? When the int8 op is followed by a relu or relu6 op, quantizing to [0, 255] may decrease the quantization loss. In order to carry out some tests, I want to know how to set the output range to [-128, 127] for all int8 ops. Is this also decided by oneDNN?
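
For reference, a minimal stand-alone sketch (not Paddle or oneDNN code) of the three output ranges discussed above; the threshold `th` plays the role of the activation scale, and the unsigned variant is assumed to map the same threshold onto [0, 255], consistent with the roughly doubled values seen in the intermediate-tensor table below:

```python
import numpy as np

def quant_s8_lite(x, th):    # PaddleLite / ARM CPU: symmetric, clipped to [-127, 127]
    return np.clip(np.round(x * 127.0 / th), -127, 127).astype(int)

def quant_s8_full(x, th):    # signed int8 using the full [-128, 127] range
    return np.clip(np.round(x * 127.0 / th), -128, 127).astype(int)

def quant_u8(x, th):         # unsigned int8 after relu-like ops, mapping th -> 255
    return np.clip(np.round(x * 255.0 / th), 0, 255).astype(int)

x = np.array([-1.01, -1.0, -0.5, 0.0, 0.7, 1.0])
print(quant_s8_lite(x, 1.0))  # [-127 -127  -64    0   89  127]
print(quant_s8_full(x, 1.0))  # [-128 -127  -64    0   89  127]
print(quant_u8(x, 1.0))       # [   0    0    0    0  178  255]
```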

On ARM CPU, PaddleLite doesn't quantize the bias of Conv and FC. On Intel CPU, PaddleInference quantizes the bias to int32. Does oneDNN support using an fp32 bias in the int8 kernel?
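
For context, a hedged sketch of the common int32 bias quantization scheme (an assumption about the general approach, not a quote of the oneDNN/Paddle implementation), where the bias shares the accumulator scale input_scale * weight_scale:

```python
import numpy as np

def quantize_bias_int32(bias_fp32, input_scale, weight_scale):
    # input_scale and weight_scale are assumed to be of the form max_int / threshold,
    # so the int32 bias shares the int32 accumulator scale and needs no extra
    # per-bias scale at inference time.
    return np.round(bias_fp32 * input_scale * weight_scale).astype(np.int32)
```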

Compare intermediate tensor

Use Netron to load the fp32 or int8 model to see the intermediate tensor names.

python run_infer.py model_path tensor_name1 tensor_name2... can run the model and fetch the intermediate tensors.

I have compared some intermediate tensors of the fake int8 mobilenetv3 and the real int8 mobilenetv3.

| Tensor name in the fake int8 mobilenetv3 | Tensor name in the real int8 mobilenetv3 | Tensor info in the fake int8 mobilenetv3 | Tensor info in the real int8 mobilenetv3 |
| --- | --- | --- | --- |
| image.quantized | quantize/out/0 | avg: 59.359283, min: -85.0, max: 97.0, arg_max: 134929 | avg: 59.35928199404762, min: -85, max: 97, arg_max: 134929 |
| batch_norm_0.tmp_2 | batch_norm_0.tmp_2 | avg: 2.2439551, min: -10.029446, max: 14.263074, arg_max: 193725 | avg: 2.2423842, min: -9.997816, max: 14.297043, arg_max: 193725 |
| tmp_2 | tmp_2 | avg: 2.2459602, min: -0.37499997, max: 14.263075, arg_max: 193725 | avg: 2.2411668, min: -0.375, max: 14.297043, arg_max: 193725 |
| relu_0.tmp_0.quantized | dequantize/in/1 | avg: 31.141064, min: 0.0, max: 127.0, arg_max: 18106 | avg: 62.13507453762755, min: 0, max: 255, arg_max: 18106 |
| relu_1.tmp_0.quantized | dequantize/in/2 | avg: 31.885214, min: 0.0, max: 127.0, arg_max: 12432 | avg: 65.45506816007654, min: 0, max: 255, arg_max: 12432 |
| elementwise_add_0.tmp_0.quantized | dequantize/in/38 | avg: 14.022595, min: -49.0, max: 105.0, arg_max: 159600 | avg: 14.208042689732142, min: -48, max: 113, arg_max: 157920 |
| relu_2.tmp_0.quantized | dequantize/in/3 | avg: 14.235955, min: 0.0, max: 127.0, arg_max: 30068 | avg: 29.650545081313776, min: 0, max: 255, arg_max: 30068 |
| relu_3.tmp_0.quantized | dequantize/in/4 | avg: 20.625692, min: 0.0, max: 127.0, arg_max: 40085 | avg: 42.55670539700255, min: 0, max: 255, arg_max: 16739 |
| batch_norm_6.tmp_2.quantized | dequantize/in/5 | avg: 0.56964815, min: -127.0, max: 127.0, arg_max: 45407 | avg: -0.604405824829932, min: -128, max: 127, arg_max: 3965 |
paddle-bot-old[bot] commented 3 years ago


Hi! We've received your issue; please be patient while we respond. We will arrange technicians to answer your questions as soon as possible. Please make sure that you have posted enough information to describe your request. You may also check out the API docs, FAQ, Github Issues and the AI community to get an answer. Have a nice day!

wojtuss commented 3 years ago

@juncaipeng ,

I am investigating the issue presently. Below are my comments on your questions.

It is we who decide which INT8 range is used ([-128, 127] or [0, 255]). In the script

python/paddle/fluid/contrib/slim/quantization/quant2_int8_mkldnn_pass.py

the class Quant2Int8MkldnnPass has a member _var_quant_scales. It is a map of the form

string -> (bool, tensor)
variable_name -> ( use_unsigned_int, scale_tensor )

Assuming variable_name is the name of a conv2d op's output: if use_unsigned_int equals True, the output of the conv2d will be quantized to the [0, 255] range; otherwise it is quantized to [-128, 127]. To make all quantizations use the signed int8 (s8) range ([-128, 127]), make sure use_unsigned_int is set to False in the methods _gather_output_scales_from_attr(), _gather_input_scales_from_fake() and _update_relu_output_scales() for all variables. Afterwards, the _var_quant_scales map is passed to the cpu_quantize_pass pass, which performs the quantization to the desired range. Keep in mind that if no quantized op follows a conv2d (or fc), that conv2d (or fc) op will have the force_fp32_output attribute set to true and its output will be of fp32 type.
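
If it helps, a minimal sketch of how the gathered map could be post-processed to force the signed range everywhere (the helper name below is ours, not part of Quant2Int8MkldnnPass):

```python
def force_signed_int8(var_quant_scales):
    """Return a copy of the map with use_unsigned_int cleared for every variable,
    so cpu_quantize_pass quantizes all of them to [-128, 127]."""
    return {
        name: (False, scale_tensor)
        for name, (_, scale_tensor) in var_quant_scales.items()
    }
```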

oneDNN convolution and inner_product (used in FC kernel) primitives can accept u8/s8/s32/f32 bias with s8/u8 input and s8 weights.

wojtuss commented 3 years ago

@juncaipeng , The problem seems to be with the transformation from the Quant model to the FP32 model before quantization is applied. The FP32 model obtained there is faulty; it gives 0.0 accuracy. The problem looks similar to one we investigated some time ago, namely that the fake-quantized weights cannot be dequantized properly using the scales stored in the Scales input of the fake_dequantize_* operators. Still looking into it.

juncaipeng commented 3 years ago

@wojtuss

I commented out graph = self._update_relu_output_scales(graph) and generated the real int8 model again, so use_unsigned_int=False for all entries of ( use_unsigned_int, scale_tensor ). The intermediate tensors are shown in the following picture. For dequantize/in/1 and dequantize/in/3, the max value is greater than 127 and the dtype is uint8. For dequantize/in/5, the min value is less than -127.

For the quantize op and the int8 ops (conv2d, fc, etc.) in PaddleLite, the range of output tensors is [-127, 127] on ARM CPU. The difference in the quantized tensors' ranges may be the main problem. Does oneDNN decide the output range? Can we fix the difference?

image

In QAT and PaddleLite, the quantization formula is int8 = clip(fp32, -threshold, +threshold) * 127 / threshold. The quantization in the quantize op and in the quantized ops (conv2d, fc, etc.) keeps the output range as [-127, 127].

image
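
To make that simulated behaviour concrete, here is a minimal stand-alone sketch (our illustration, not Paddle code) of what a fake_quantize/fake_dequantize pair computes during QAT: the tensor stays fp32, but only takes values representable on the [-127, 127] grid.

```python
import numpy as np

def fake_quant_dequant(x, threshold):
    # clip, quantize to [-127, 127], then dequantize back to fp32
    q = np.clip(np.round(np.clip(x, -threshold, threshold) * 127.0 / threshold), -127, 127)
    return q * threshold / 127.0   # fp32 value passed on to the next layer
```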

Can you give users an option to enable or disable quantizing the bias? For some models, quantizing the bias may lead to an accuracy drop.

juncaipeng commented 3 years ago

@wojtuss The FP32 model transformed from the fake int8 model gives 0.0 accuracy because the quantized model is generated by PACT, a newly proposed quantization algorithm. PACT adds a clip operation to the activations before applying quantize in the QAT training stage. However, when PaddleLite deploys this fake int8 model on ARM CPU, it gives the same accuracy as the FP32 model. Therefore, I think the FP32 model transformed from the fake int8 model doesn't have errors.
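
For clarity, a rough stand-alone sketch of the PACT-style training-time computation as described here (symmetric clipping assumed, matching the formula above; the actual PaddleSlim implementation may differ):

```python
import numpy as np

def pact_fake_quant(x, alpha, threshold):
    """PACT pre-clip with a learned bound alpha, followed by the usual
    fake quantize/dequantize simulation."""
    x = np.clip(x, -alpha, alpha)                          # clip added by PACT
    q = np.clip(np.round(x * 127.0 / threshold), -127, 127)
    return q * threshold / 127.0                           # fp32 value seen by the next layer
```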

Maybe we should solve the range difference first.

wojtuss commented 3 years ago

@juncaipeng

Please correct me if my understanding is wrong:

  1. Fake INT8 model deployed on ARM CPU using PaddleLite gives correct accuracy.
  2. Fake INT8 model deployed on Intel CPU using PaddleInference gives totally incorrect accuracy.
  3. Fake INT8 model deployed on NV GPU using PaddleInference gives 10% worse accuracy.
  4. Real INT8 model obtained on ARM CPU using Quant2Int8MkldnnPass gives correct accuracy.
  5. Real INT8 model obtained on Intel CPU using Quant2Int8MkldnnPass gives totally incorrect accuracy.

My comments and questions:

  1. Clipping is always symmetric [-a, +a], so I assume real quantization should also be to the signed INT8 range. I have enforced that in my scripts, but still got 0.0 accuracy. The weights seem to be quantized correctly.
  2. Where is the clipping applied during QAT training? Before/after fake quantize? Before/after fake dequantize?
  3. To definitely turn off using unsigned int8 for inputs and outputs (weights are always quantized to signed int8), please go to paddle/fluid/framework/ir/mkldnn/cpu_quantize_pass.cc and make sure that the arguments is_input_unsigned/are_inputs_unsigned/is_output_unsigned are set to false inside the methods QuantizeInput/QuantizeInputs/DequantizeOutput. I have done that, but it didn't help with accuracy.
juncaipeng commented 3 years ago

@wojtuss

  1. For ARM CPU, PaddleLite has an optimization module that transforms the fake int8 model into a real int8 model. It deploys the real int8 model and gives correct accuracy.
  2. For Intel CPU, PaddleInference uses Quant2Int8MkldnnPass to transform the fake int8 model into a real int8 model, deploys the real int8 model, and gives totally incorrect accuracy.
  3. Using the Paddle executor to run the fake int8 model directly also gives correct accuracy.

Comparing PaddleLite's optimization module with Quant2Int8MkldnnPass, the main difference is the conv+bn fusion. The former multiplies the alpha and beta of the bn layer into the scale of conv2d, so it doesn't change the quantized weights of conv2d. As you know, the latter dequantizes the weights of conv2d, fuses the conv and bn, and calculates new weight scales. I don't think this difference is the main reason, though.
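
A hedged sketch of the two conv+bn folding strategies described above (illustrative names and shapes, not the actual PaddleLite or Paddle pass code):

```python
import numpy as np

def fold_bn_into_fp32_weights(w, b, gamma, beta, mean, var, eps=1e-5):
    """Quant2Int8MkldnnPass-style: dequantize the weights, fold BN into them,
    then recompute the weight scales from the folded fp32 weights."""
    alpha = gamma / np.sqrt(var + eps)            # per-output-channel factor
    w_folded = w * alpha[:, None, None, None]     # w: [out_c, in_c, kh, kw]
    b_folded = (b - mean) * alpha + beta
    return w_folded, b_folded

def fold_bn_into_weight_scales(weight_scales, b, gamma, beta, mean, var, eps=1e-5):
    """PaddleLite-style: keep the quantized int8 weights untouched and fold the
    BN factor into the per-output-channel weight scales (and the bias) instead."""
    alpha = gamma / np.sqrt(var + eps)
    return weight_scales * alpha, (b - mean) * alpha + beta
```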

The clipping is applied in fake quantize ops and isn't applied in fake dequantize ops.

As you are not familiar with PaddleLite and QAT, and since using the Paddle executor to run the fake int8 model gives correct accuracy, you can focus on the deployment difference between the fake int8 model and the real int8 model transformed by Quant2Int8MkldnnPass.

In the next picture, the left is the fake int8 model and the right is the real int8 model. In the fake int8 model, all fake quantize ops use the formula int8 = clip(fp32, -threshold, +threshold) * 127 / threshold, so the range of tensors A and C is [-127, 127]. For the real int8 model, the range of tensors B and D should also be [-127, 127]. We must ensure the outputs of the quantize op and the quantized ops (conv2d, fc, etc.) have the range [-127, 127].

image

wojtuss commented 3 years ago

@juncaipeng , Thank you. I am investigating the case with all the fusions turned off, so that the real INT8 model is as similar as possible to the fake quant model. The conv+bn fuse is disabled. No success so far. With oneDNN, clipping to the [-128, 127] range is done automatically by casting to the int8_t type. I will try to manually enforce the [-127, 127] range.

juncaipeng commented 3 years ago

@wojtuss

With oneDNN, clipping to the [-128, 127] range is done automatically by casting to the int8_t type. I will try to manually enforce the [-127, 127] range.

Thank you. You can add a clip post-process in the quantize op and the quantized ops (conv2d, fc, etc.) to enforce the [-127, 127] range.

Besides, please add an option to enable quantizing the bias, and another option to use uint8_t quantization for the output tensor of relu, so that users can set these options for different quantized models.

wojtuss commented 3 years ago

@juncaipeng So far I have made a couple of changes to fix the INT8 accuracy. Now the accuracy is good when quantization is applied to the unoptimized FP32 model. However, when the conv+bn fuse is applied before quantization, the INT8 accuracy drops significantly.

juncaipeng commented 3 years ago

@wojtuss Can you give a PR to show the changes? Thanks.

Comparing PaddleLite's optimization module with Quant2Int8MkldnnPass, the main difference is the conv+bn fusion. The former multiplies the alpha and beta of the bn layer into the scale of conv2d, so it doesn't change the quantized weights of conv2d. As you know, the latter dequantizes the weights of conv2d, fuses the conv and bn, and calculates new weight scales.

As described above, if the conv+bn fuse affects the accuracy of the INT8 model, maybe PaddleInference should also use PaddleLite's method to fuse conv+bn before quantization.

The pass of fusing quantized_conv+bn in PaddleLite. The main steps:

wojtuss commented 3 years ago

@juncaipeng In my opinion, there is a discrepancy between the scales that come from the out_threshold attributes of some ops and the scales that come from the fake_quantize_* ops (these are the two sources of scales for activation tensors). Below I explain why I think so.

As we discussed some time ago (https://github.com/PaddlePaddle/Paddle/pull/23928), when collecting scales the highest priority is given to scales from fake_quantize_* ops. For what I describe here, I kept the mul op always fake-quantized, focused on quantization of conv2d operators only, and forced quantization to signed int8.

  1. In the first approach I disabled all the optimization fuses. Input scales were taken from the fake_quantize_* ops and output scales were taken from the conv2d's out_threshold attribute. Accuracy was good (top1 0.78, top5 0.91), so here the scales obtained from the conv2d's out_threshold attribute work fine. Weight scales calculated after removing the fake ops were correct (accuracy was exactly the same whether the weight scales were recalculated or the original fake-quantized weights were kept and only turned into int8). Additionally clipping the outputs of quantize and conv2d ops to the [-127, 127] range lowered the accuracy a little bit, so I skipped that later. When squashing conv2d+dequantize -> conv2d (force_fp32_output=true) was enabled, the output scales were totally ignored by the conv2d ops and accuracy was 0.73. The small drop most probably comes from aliasing of quantization (the FP32 output from conv2d and the FP32 output from dequantize after conv2d differ a little due to the way oneDNN convolution and its post-ops work).
  2. Then I put the highest priority on scales from the out_threshold attribute. There are places in the graph where the two scales coincide and should be equal, e.g. [image]. Here the scale obtained from the scale op's out_threshold attribute should be the same as the one from the fake_quantize op. Unfortunately, the scales are different and the accuracy dropped to 0.0.
  3. I changed the highest priority back to scales from fake_quantize ops. I also turned on the conv2d+bn fuse before quantization. Then the output scale for conv2d came either from batch_norm's out_threshold attribute or from the fake_quantize op that came after batch_norm; input scales still came from fake_quantize ops. Accuracy: 0.44. When the squash conv2d+dequantize -> conv2d (force_fp32_output=true) was enabled (and again the output scales were totally ignored by the conv2d ops), accuracy was 0.76. [images]
  4. I added the conv2d+relu fuse. Then the output scale for conv2d came either from the fake_quantize op coming after relu (the highest priority) or from the relu's out_threshold attribute (lower priority, but in some cases the only source of the output scale). [images] Accuracy: 0.52. After applying the conv2d+dequantize squash, accuracy was 0.74. After applying the dequantize+quantize and conv2d+dequantize squashes, accuracy was 0.76.
  5. I additionally added quantization of elementwise_add. The elementwise_add operators use input and output scales from fake_quantize ops. Accuracy was 0.54. With squashes, accuracy was 0.78. [image]
  6. I added the conv+elementwise_add fuse. [image] Now residual connections are present in some conv2d ops. The situation with scales is similar to cases 4 and 5. Accuracy was 0.5. With squashes, accuracy was 0.74.

In my opinion, the symptoms indicate that the scales collected from the out_threshold attribute are somehow inconsistent with the scales from the fake_quantize operators, and this is not something we can fix in the QAT->INT8 transformation in Paddle.

wojtuss commented 3 years ago

@juncaipeng I do not have any fix ready to be merged yet. I will prepare a PR soon. The scale propagation algorithm needs a fix. Also, when only signed int8 is used, accuracy is 0.79; when unsigned int8 is also used, the accuracy is 0.76. I will add an option to disable using unsigned int8.

wojtuss commented 3 years ago

@juncaipeng A fix for the issue is submitted: https://github.com/PaddlePaddle/Paddle/pull/31783 Please verify it. On the small dataset attached to this issue, INT8 accuracy is 0.8 (0.94 top5) on my i9 (SKX-like) machine (still using uint8_t where appropriate).

juncaipeng commented 3 years ago

@wojtuss 👌

juncaipeng commented 3 years ago

@wojtuss
When QAT applies PACT, the input tensor of the fake_quantize ops is clipped by a pre-process function, so the output scales that come from 'out_threshold' differ from the output scales that come from the fake_quantize ops. When QAT doesn't apply PACT, the output scales are the same. (⊙o⊙)…

In #31783, you have fixed the scale propagation bug of the scale op, so the accuracy of mobilenetv3 is now correct.

There is another model for which the real int8 model gives incorrect results. Please check again; the details are described in the readme (Link:https://dubox.com/s/1NKjpX8atMhX7BzixGlIDgQ Password:ix25). This model is also generated by QAT with PACT, so we cannot use the output scales from 'out_threshold'. Besides, the model has a hard_swish op; can you fuse conv+hard_swish?

wojtuss commented 3 years ago

@juncaipeng I have verified that PR https://github.com/PaddlePaddle/Paddle/pull/31820 fixes that problem. Please confirm that it works for you as well.

juncaipeng commented 3 years ago

@wojtuss #31820 does not fix the problem. The outputs of the real int8 model are still different from those of the fake int8 model and the fp32 model.

The output image of the fp32 model: image

The output image of the fake int8 model: image

The output image of the real int8 model: image

However, if we comment out graph = self._gather_output_scales_from_attr(graph) in quant2_int8_mkldnn_pass.py and generate a new real int8 model (which means not using the output scales that come from 'out_threshold'), the output image of the new real int8 model is as follows, and it is more similar to the output image of the fake int8 model. image

Considering the output images, there is only a small difference between the fp32 model, the fake int8 model and the new real int8 model. It is obvious that the quantization error on Intel CPU results in the output difference. The quantization error needs to be fixed.

wojtuss commented 3 years ago

@juncaipeng Commenting out self._gather_output_scales_from_attr(graph) disabled quantization of some elementwise_add operators. You can achieve the same by adding the option --ops_to_quantize "conv2d,concat" to the save_quant_model.py script call, making only conv2d and concat operators quantized. I would disable quantization of elementwise_add (and possibly concat), because quantizing them adds additional quantization and dequantization to the flow (due to the floating point nearest_interp ops in between), which is unfavorable to accuracy and performance. Also, the Quant model was not tuned for quantization of elementwise_add and concat ops. When quantization of elementwise_add is disabled, the real int8 picture looks good. Does that make sense to you?

wozna commented 2 years ago

@juncaipeng Recently I found a problem in the quant lstm model where self._gather_output_scales_from_attr(graph) lowered accuracy. This was because this function marks a variable as uint8, but the scales for that variable are still computed for signed data. I solved it by adding, in _gather_output_scales_from_attr, an adjustment of the scales to the uint8 range by multiplying the scales by 2. In the lstm model it improved accuracy from 50% to 93%. I have prepared a PR with the fix: https://github.com/PaddlePaddle/Paddle/pull/35599 Maybe it will also solve the problem you described.
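
A minimal sketch of that adjustment, assuming the stored scales have the form max_int / threshold so that doubling a scale maps the same threshold onto the uint8 range; this is an illustration, not the code from the PR:

```python
def adjust_scales_for_unsigned(var_quant_scales):
    """For every variable marked as unsigned, double its scale so it matches the
    uint8 range; signed variables are left untouched."""
    return {
        name: (is_unsigned, scale * 2.0 if is_unsigned else scale)
        for name, (is_unsigned, scale) in var_quant_scales.items()
    }
```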

sfraczek commented 2 years ago

I have added a PR that might also help with this issue. It is a fix for the scale calculation of quantized convolution + activation. Previously the output scale was applied before the activation instead of after. https://github.com/PaddlePaddle/Paddle/pull/38331

yaomichael commented 2 years ago

Notes from the 5/20 meeting: @jiangjiajun will check internally and close this ticket.