ARM-software / armnn

Arm NN ML Software. The code here is a read-only mirror of https://review.mlplatform.org/admin/repos/ml/armnn
https://developer.arm.com/products/processors/machine-learning/arm-nn
MIT License

uint8 quantized model runs slower than fp32 model #667

Closed: liamsun2019 closed this issue 1 year ago

liamsun2019 commented 1 year ago

Hi author, I encountered an issue while running inference on a Cortex-A55 (aarch64) with CpuAcc as the backend. There are two models: one is fp32 and the other is uint8 quantized. My tests showed that the fp32 model ran even faster than the uint8 quantized one. I am just curious why this would happen. Please refer to the attachment for the two models. In addition, both the C++ parser mode and the delegate mode show the same issue. I would appreciate your suggestions. Thanks. test.zip

liamsun2019 commented 1 year ago

ReduceFp32ToFp16 is set to True in my tests.
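For reference, this is roughly how I pass that option through the C++ API (a minimal sketch assuming the armnnTfLiteParser path and the standard OptimizerOptions struct; my actual sample code differs in the details):

#include <armnn/ArmNN.hpp>
#include <armnnTfLiteParser/ITfLiteParser.hpp>

int main()
{
    // Parse the fp32 TFLite model (file name taken from the attachment in this issue).
    auto parser = armnnTfLiteParser::ITfLiteParser::Create();
    armnn::INetworkPtr network = parser->CreateNetworkFromBinaryFile("fp32.tflite");

    // Create the runtime and optimize for CpuAcc, with CpuRef as fallback.
    armnn::IRuntime::CreationOptions runtimeOptions;
    armnn::IRuntimePtr runtime = armnn::IRuntime::Create(runtimeOptions);

    armnn::OptimizerOptions optOptions;
    optOptions.m_ReduceFp32ToFp16 = true; // the option mentioned above

    std::vector<armnn::BackendId> backends = { armnn::Compute::CpuAcc, armnn::Compute::CpuRef };
    armnn::IOptimizedNetworkPtr optNet =
        armnn::Optimize(*network, backends, runtime->GetDeviceSpec(), optOptions);

    armnn::NetworkId netId;
    runtime->LoadNetwork(netId, std::move(optNet));
    // ... bind inputs/outputs and call runtime->EnqueueWorkload(netId, ...)
    return 0;
}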

catcor01 commented 1 year ago

Hi @liamsun2019,

I am getting two warnings for GATHER and TRANSPOSE when running your models with CpuAcc, as seen in your issue #666. I just want to confirm these are still present for you, so I can comment correctly on the results.

Running the models on CpuAcc with the following commands, I can confirm the same regression (~245ms vs ~263ms):

./ExecuteNetwork -m u8l.tflite -v -f tflite-binary -c CpuAcc,CpuRef -i X.1 -o 2180 --number-of-threads 1 --iterations 10
./ExecuteNetwork -m fp32.tflite -v -f tflite-binary -c CpuAcc,CpuRef -i input.55 -o 1456 --number-of-threads 1 --iterations 10

From a quick look, I cannot see any single operator that runs faster in the fp32 model compared to the uint8 model. The profiling output is quite extensive, so I will spend some time looking through it and come back if I find something.

Kind Regards, Cathal.

catcor01 commented 1 year ago

f32.txt u8.txt

catcor01 commented 1 year ago

One thing I have noticed: average pooling (used only once in your model) is not supported in CpuAcc for uint8, and therefore the operation falls back to CpuRef.

Time cost: ~4000us (CpuRef, uint8) vs ~117us (fp32), a difference of ~3900us (~3.9ms). There is also a time cost for falling back to CpuRef due to a memory copy before and after the operation, but it is negligible in comparison: ~9us plus ~5us, at most ~15us.

@morgolock, you might have an idea on whether uint8 support for average pooling 2d can be added to the Compute Library (it seems uint8 support for max pooling 2d is already there). Perhaps it cannot be added because of some kind of padding? The warning message is:

Warning: WARNING: Layer of type Pooling2d is not supported on requested backend CpuAcc for input data type QAsymmU8 and output data type QAsymmU8 (reason: in validate_arguments src/cpu/kernels/CpuPool2dKernel.cpp:185: exclude_padding equal false is not supported for AVG Pooling with padding on quantized types), falling back to the next backend.
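If you want to pull the same per-workload timings from your own C++ application rather than from ExecuteNetwork, a rough sketch using the Arm NN profiler API would look like the following (the helper name is just illustrative, and it assumes a network already optimized and loaded as usual):

#include <iostream>
#include <armnn/ArmNN.hpp>
#include <armnn/IProfiler.hpp>

// Assumes 'runtime' (armnn::IRuntimePtr) and 'netId' (armnn::NetworkId) were
// set up via Optimize()/LoadNetwork() beforehand.
void PrintPerWorkloadTimings(armnn::IRuntime& runtime, armnn::NetworkId netId)
{
    // Turn on the per-workload profiler for this network.
    std::shared_ptr<armnn::IProfiler> profiler = runtime.GetProfiler(netId);
    if (profiler)
    {
        profiler->EnableProfiling(true);
    }

    // ... run inference here via runtime.EnqueueWorkload(netId, inputs, outputs) ...

    // Dump the per-layer, per-backend timings.
    if (profiler)
    {
        profiler->Print(std::cout);
    }
}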

catcor01 commented 1 year ago

Along with the above, here is what I have discovered:

catcor01 commented 1 year ago

Hello @liamsun2019,

Falling back to CpuRef is very much degrading your performance. Unfortunately, because many of the transpose and gather operations are not supported on CpuAcc, fallback is inevitable. We do not guarantee that uint8 performance in CpuRef is better than fp32 (it will more than likely be slower because of how it is implemented in Arm NN), which is why you are seeing worse uint8 performance. However, by using the delegate you can fall back to the TfLite runtime instead of CpuRef, which should have efficient uint8 performance compared to float32. You can do that by running the following:

./ExecuteNetwork -m u8l.tflite -f tflite-binary --tflite-executor delegate -c CpuAcc -i X.1 -o 2180 --number-of-threads 1 --iterations 10
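If you are integrating through the TfLite C++ API rather than ExecuteNetwork, applying the delegate looks roughly like the sketch below (a minimal outline assuming the classic armnnDelegate API; adapt the file name and input handling to your sample code):

#include <tensorflow/lite/interpreter.h>
#include <tensorflow/lite/kernels/register.h>
#include <tensorflow/lite/model.h>
#include <armnn_delegate.hpp>
#include <DelegateOptions.hpp>

int main()
{
    // Load the quantized model (file name from this thread).
    auto model = tflite::FlatBufferModel::BuildFromFile("u8l.tflite");
    tflite::ops::builtin::BuiltinOpResolver resolver;
    std::unique_ptr<tflite::Interpreter> interpreter;
    tflite::InterpreterBuilder(*model, resolver)(&interpreter);

    // Build the Arm NN delegate for CpuAcc only; anything Arm NN cannot
    // handle stays with the TfLite runtime kernels.
    std::vector<armnn::BackendId> backends = { armnn::Compute::CpuAcc };
    armnnDelegate::DelegateOptions delegateOptions(backends);
    std::unique_ptr<TfLiteDelegate, decltype(&armnnDelegate::TfLiteArmnnDelegateDelete)>
        armnnDelegatePtr(armnnDelegate::TfLiteArmnnDelegateCreate(delegateOptions),
                         armnnDelegate::TfLiteArmnnDelegateDelete);
    interpreter->ModifyGraphWithDelegate(armnnDelegatePtr.get());

    interpreter->AllocateTensors();
    // ... fill input tensors, then:
    interpreter->Invoke();
    return 0;
}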

I hope this will improve the performance of your uint8 model.

Kind Regards, Cathal.

catcor01 commented 1 year ago

I have tried to run your model with the delegate and it fails due to the following error:

Warning: WARNING: Layer of type Pooling2d is not supported on requested backend CpuAcc for input data type QAsymmU8 and output data type QAsymmU8 (reason: in validate_arguments src/cpu/kernels/CpuPool2dKernel.cpp:185: exclude_padding equal false is not supported for AVG Pooling with padding on quantized types), falling back to the next backend.
Warning: ERROR: Layer of type Pooling2d is not supported on any preferred backend [CpuAcc ]
terminate called after throwing an instance of 'armnn::Exception'
  what():  TfLiteArmnnDelegate: Exception (Failed to assign a backend to each layer) caught from optimize.

@SadikARM provided me with the following explanation of what is happening: "I believe the reason it is not falling back to the TfLite runtime is that IsLayerSupported() first returns true for the Pooling2d layer, which means the layer has already been delegated to Arm NN; then somewhere in the flow (it seems during optimization) CpuPool2dKernel::validate_arguments() is called and throws the error. At the optimization stage it is too late to fall back to the TfLite runtime because the graph has already been delegated to Arm NN." I will look into this and make a patch.

liamsun2019 commented 1 year ago

Hi @catcor01,

Many thanks for your time and such a detailed analysis. In my case, I ran these two models based on the sample code. I made some modifications while building it, e.g. -DUSE_ARMNN_DELEGATE=0/1, to apply either the delegate or the parser to the sample code. I also noticed that there are many transpose/gather operations in the model, which I think contributes some overhead to the inference time. In delegate mode, I have not encountered the errors you listed. I will spend some time conducting more tests.

Thanks B.R Liam

catcor01 commented 1 year ago

Hello @liamsun2019,

A patch has been submitted to master (soon to be changed to main) fixing the above failure for CpuAcc. Your model should now be able to fully run using CpuAcc without the above error being thrown.

Kind Regards, Cathal.

liamsun2019 commented 1 year ago

Hi @catcor01,

Sorry for the late reply. I have been focusing on some other work recently. I will try this patch ASAP. Thanks for your kind help.

keidav01 commented 1 year ago

@liamsun2019 could you let us know if this patch has fixed your issue? I will close this ticket otherwise. Thank you very much

liamsun2019 commented 1 year ago

Hi @keidav01 ,

There has been no progress on my side, since my attention has been taken up by other things so far. You can close it for now, and I will verify the patch as soon as possible. Thanks for your help.

B.R Liam

keidav01 commented 1 year ago

Thank you @liamsun2019, closing