Closed: GGGGxxxxxxxxr closed this issue 1 year ago
Hi @GGGGxxxxxxxxr
Please try using v22.11 and build for arm64-v8.2-a,
which should improve int8 performance by using the dot product instructions.
I can't see the code that measures the reshaping in the code you shared.
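For example, taking the build command you posted and changing only the arch flag (please double-check the accepted arch values with scons -h on the v22.11 tree, as option names can vary between releases):

```
scons Werror=0 -j8 debug=0 asserts=1 neon=1 opencl=1 benchmark_examples=1 os=android arch=arm64-v8.2-a
```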
Hope this helps.
Hi! I am trying to implement a quantized WDSR model with ACL on an Android CPU-based device. Here is my code for the WDSR implementation:
The data layout of the int8 tensors here is DataLayout::NCHW.
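For context, a minimal sketch of how one quantized convolution layer can be configured against ACL's Neon runtime; the shapes, scales and offsets below are placeholders for illustration, not the actual WDSR values:

```cpp
// Minimal sketch of a single quantized conv layer on the Neon backend.
// Shapes, scales and offsets are placeholders, not the real WDSR values.
#include "arm_compute/runtime/NEON/NEFunctions.h"
#include "arm_compute/runtime/Tensor.h"

using namespace arm_compute;

int main()
{
    Tensor src, weights, biases, dst;

    // NCHW shapes in ACL are ordered (W, H, C, N): 64x64x3 input, 32 output channels, 3x3 kernel.
    TensorInfo src_info(TensorShape(64U, 64U, 3U), 1, DataType::QASYMM8_SIGNED);
    src_info.set_data_layout(DataLayout::NCHW).set_quantization_info(QuantizationInfo(0.02f, 0));

    TensorInfo wei_info(TensorShape(3U, 3U, 3U, 32U), 1, DataType::QASYMM8_SIGNED);
    wei_info.set_data_layout(DataLayout::NCHW).set_quantization_info(QuantizationInfo(0.01f, 0));

    TensorInfo bia_info(TensorShape(32U), 1, DataType::S32); // biases are S32 for quantized conv

    TensorInfo dst_info(TensorShape(64U, 64U, 32U), 1, DataType::QASYMM8_SIGNED);
    dst_info.set_data_layout(DataLayout::NCHW).set_quantization_info(QuantizationInfo(0.05f, 0));

    src.allocator()->init(src_info);
    weights.allocator()->init(wei_info);
    biases.allocator()->init(bia_info);
    dst.allocator()->init(dst_info);

    // This function is the one dispatched to cpu::CpuGemmConv2d (im2col + int8 GEMM + col2im).
    NEGEMMConvolutionLayer conv;
    conv.configure(&src, &weights, &biases, &dst, PadStrideInfo(1, 1, 1, 1));

    src.allocator()->allocate();
    weights.allocator()->allocate();
    biases.allocator()->allocate();
    dst.allocator()->allocate();

    // Fill src/weights/biases with real data here, then:
    conv.run();
    return 0;
}
```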
My command for compiling the library is: scons Werror=0 -j8 debug=0 asserts=1 neon=1 opencl=1 benchmark_examples=1 os=android arch=arm64-v8a
My test device is a Samsung S10.
I have added a timer to CpuGemmConv2d.cpp to measure the time cost of each stage.
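Roughly, the timer I added is just a small std::chrono scope guard around each stage (a hypothetical ScopedTimer helper shown here; the exact placement inside CpuGemmConv2d::run() depends on the ACL version):

```cpp
// Hypothetical per-stage timer, std::chrono only, so it can be dropped into
// CpuGemmConv2d::run() around the im2col, int8 GEMM and col2im stages.
#include <chrono>
#include <cstdio>

class ScopedTimer
{
public:
    explicit ScopedTimer(const char *name)
        : _name(name), _start(std::chrono::steady_clock::now())
    {
    }
    ~ScopedTimer()
    {
        const auto end = std::chrono::steady_clock::now();
        const auto us  = std::chrono::duration_cast<std::chrono::microseconds>(end - _start).count();
        std::printf("time for %s: %lld us\n", _name, static_cast<long long>(us));
    }

private:
    const char                           *_name;
    std::chrono::steady_clock::time_point _start;
};

// Usage inside run(), one scope per stage (stage boundaries depend on the ACL version):
// { ScopedTimer t("input reshaping"); /* im2col kernel  */ }
// { ScopedTimer t("Int8Gemm");        /* int8 GEMM      */ }
// { ScopedTimer t("OutputReshaping"); /* col2im kernel  */ }
```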
Here is my performance benchmark:
| Conv layer | Input reshaping (Im2Col) | Int8 GEMM | Output reshaping (Col2Im) |
| --- | --- | --- | --- |
| Conv1 | 7322 us | 1736 us | 3959 us |
| Conv2 | 4327 us | 4310 us | 22391 us |
| Conv3 | 7178 us | 2166 us | 3163 us |
| Conv4 | 7519 us | 1201 us | 4045 us |

Total time cost: 79522 us
I have run over 200 iterations of the execution, so 80 ms is the average time cost for one model inference.
My question is: why is most of the time in the ConvolutionLayer spent in the Im2Col and Col2Im kernels? For Conv2, the Col2Im kernel alone takes around 20 ms, nearly 1/4 of the inference latency of the whole model.
I have also tried the NHWC layout, which seems even worse for the int8 data type.
Thanks,
LEI