VeriSilicon / TIM-VX

VeriSilicon Tensor Interface Module

Model running slow on tflite delegate when there is quantize operation in graph #316

Closed zye1996 closed 2 years ago

zye1996 commented 2 years ago

Hi,

I am facing an inference performance issue when there are quantize/dequantize operations in the tflite graph. The attached MobileNetV2 model takes 60 ms per run according to the benchmark, but the model downloaded from the TensorFlow website has no such issue. Am I doing anything wrong, or is there a setting to change?

./_deps/tensorflow-build/tools/benchmark/benchmark_model --external_delegate_path=libvx_delegate.so --graph=/home/khadas/Downloads/mobilenet_v2_224_dm05_full_integer_quant.tflite --enable_op_profiling=true
STARTING!
Log parameter values verbosely: [0]
Graph: [/home/khadas/Downloads/mobilenet_v2_224_dm05_full_integer_quant.tflite]
Enable op profiling: [1]
External delegate path: [libvx_delegate.so]
Loaded model /home/khadas/Downloads/mobilenet_v2_224_dm05_full_integer_quant.tflite
Vx delegate: allowed_cache_mode set to 0.
Vx delegate: allowed_builtin_code set to 0.
Vx delegate: error_during_init set to 0.
Vx delegate: error_during_prepare set to 0.
Vx delegate: error_during_invoke set to 0.
EXTERNAL delegate created.
Explicitly applied EXTERNAL delegate, and the model graph will be completely executed by the delegate.
The input model file size (MB): 2.26644
Initialized session in 11.237ms.
Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
W [HandleLayoutInfer:268]Op 162: default layout inference pass.
count=1 curr=1341624

Running benchmark for at least 50 iterations and at least 1 seconds but terminate if exceeding 150 seconds.
count=50 first=60177 curr=60442 min=60016 max=60574 avg=60392.5 std=99

Inference timings in us: Init: 11237, First inference: 1341624, Warmup (avg): 1.34162e+06, Inference (avg): 60392.5
Note: as the benchmark tool itself affects memory footprint, the following is only APPROXIMATE to the actual memory footprint of the model at runtime. Take the information at your discretion.
Memory footprint delta from the start of the tool (MB): init=9.94531 overall=90.1406
Profiling Info for Benchmark Initialization:
============================== Run Order ==============================
                 [node type]              [start]     [first]    [avg ms]        [%]      [cdf%]      [mem KB]  [times called]  [Name]
     ModifyGraphWithDelegate                0.000       0.757       0.757    90.767%     90.767%         0.000          1   ModifyGraphWithDelegate/0
             AllocateTensors                0.737       0.075       0.038     9.233%    100.000%         0.000          2   AllocateTensors/0

============================== Top by Computation Time ==============================
                 [node type]              [start]     [first]    [avg ms]        [%]      [cdf%]      [mem KB]  [times called]  [Name]
     ModifyGraphWithDelegate                0.000       0.757       0.757    90.767%     90.767%         0.000          1   ModifyGraphWithDelegate/0
             AllocateTensors                0.737       0.075       0.038     9.233%    100.000%         0.000          2   AllocateTensors/0

Number of nodes executed: 2
============================== Summary by node type ==============================
                 [Node type]      [count]     [avg ms]      [avg %]     [cdf %]   [mem KB]  [times called]
     ModifyGraphWithDelegate            1        0.757      90.767%     90.767%      0.000          1
             AllocateTensors            1        0.077       9.233%    100.000%      0.000          2

Timings (microseconds): count=1 curr=834
Memory (bytes): count=0
2 nodes observed

Operator-wise Profiling Info for Regular Benchmark Runs:
============================== Run Order ==============================
                 [node type]              [start]     [first]    [avg ms]        [%]      [cdf%]      [mem KB]  [times called]  [Name]
                 Vx Delegate                0.012      60.151      60.357   100.000%    100.000%         0.000          1   [MobilenetV2/Predictions/Reshape_1]:68

============================== Top by Computation Time ==============================
                 [node type]              [start]     [first]    [avg ms]        [%]      [cdf%]      [mem KB]  [times called]  [Name]
                 Vx Delegate                0.012      60.151      60.357   100.000%    100.000%         0.000          1   [MobilenetV2/Predictions/Reshape_1]:68

Number of nodes executed: 1
============================== Summary by node type ==============================
                 [Node type]      [count]     [avg ms]      [avg %]     [cdf %]   [mem KB]  [times called]
                 Vx Delegate            1       60.357     100.000%    100.000%      0.000          1

Timings (microseconds): count=50 first=60151 curr=60403 min=60000 max=60536 avg=60357.4 std=93
Memory (bytes): count=0
1 nodes observed

mobilenet_v2_224_dm05_full_integer_quant.zip

sunshinemyson commented 2 years ago

@zye1996 ,

Thanks for your feedback.

"The attached mobilenetv2 model takes 60ms each run according to the benchmark but the model downloaded from tensorflow website has no issue."

Do you mean the official model gets better performance than your own MobileNetV2 model? Which platform did you use for the benchmark? And could you double-check whether the official model is per-tensor quantized or per-channel quantized?

Thanks

zye1996 commented 2 years ago

My model is per-channel quantized.

Yes, the official model from TensorFlow runs faster. I am on the VIM3 (A311D) platform.

The official one is attached below, although it is a MobileNet v1. It took 2 ms on the NPU, so I believe that speed is reasonable. It should be per-tensor quantized.

mobilenet_v1_0.25_224_quant.tflite.zip
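One way to double-check a model's quantization granularity is to inspect each tensor's scale array: per-tensor quantization stores exactly one scale for the whole tensor, while per-channel stores one scale per output channel. A minimal sketch of that check (the `get_tensor_details()` usage in the trailing comment assumes TensorFlow's `tf.lite.Interpreter` Python API):

```python
def quantization_granularity(scales):
    """Classify a tensor's quantization from the length of its scale array.

    TFLite stores one scale for per-tensor quantization and one scale per
    output channel for per-channel quantization; an empty array means the
    tensor is not quantized.
    """
    if len(scales) == 0:
        return "unquantized"
    return "per-tensor" if len(scales) == 1 else "per-channel"

# Usage against a real model (requires TensorFlow; not run here):
#   interpreter = tf.lite.Interpreter(model_path="model.tflite")
#   for t in interpreter.get_tensor_details():
#       scales = t["quantization_parameters"]["scales"]
#       print(t["name"], quantization_granularity(scales))
```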

zye1996 commented 2 years ago

Sorry, I found the reason: I did not disable per-channel quantization during model conversion. After that flag is disabled, the performance makes sense.
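For reference, disabling per-channel quantization during conversion can be sketched as below. `_experimental_disable_per_channel` is a private, experimental attribute of `tf.lite.TFLiteConverter`, so treat this as an assumption that may change between TensorFlow releases; the helper is written against a converter-like object rather than TensorFlow itself:

```python
# Hedged sketch: request per-tensor quantization on a TFLiteConverter-like
# object by setting the (private, experimental) attribute
# `_experimental_disable_per_channel`. This may be renamed or removed in
# future TensorFlow releases.

def disable_per_channel(converter):
    """Ask the converter to emit per-tensor instead of per-channel scales."""
    converter._experimental_disable_per_channel = True
    return converter

# Usage with the real converter (requires TensorFlow; not run here):
#   converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
#   converter.optimizations = [tf.lite.Optimize.DEFAULT]
#   converter.representative_dataset = representative_data_gen
#   disable_per_channel(converter)
#   tflite_model = converter.convert()
```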

Svenhu123 commented 1 year ago

Dear zye1996,

Can you share the versions of the tflite vx delegate and TIM-VX running on your A311D?

Thanks very much.