CpuGemmConv2d optimization affects performance on Apple M2/M2 Pro

alvoron commented 4 months ago

PR https://review.mlplatform.org/c/ml/ComputeLibrary/+/10526 makes CpuGemmConv2d slower on Apple M2 / M2 Pro.

The numbers below were collected on M2 Pro. On mobilenet-v2-1.0-224, CpuGemmConv2d takes 3.18 ms before the PR and 4.12 ms after it was merged. On resnet-50-pytorch, it takes 16.37 ms before the PR and 19.67 ms after.

So we see a 20-30% performance degradation on CNNs.
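
For reference, the relative slowdown implied by these timings can be checked with a few lines of Python. This is only a sketch of the arithmetic; the helper name `slowdown_pct` is made up for illustration and is not part of ACL or OpenVINO.

```python
def slowdown_pct(before_ms: float, after_ms: float) -> float:
    """Relative increase in execution time, as a percentage of the 'before' value."""
    return (after_ms - before_ms) / before_ms * 100.0


# CpuGemmConv2d timings on M2 Pro, copied from the report above: (before, after) in ms.
timings_ms = {
    "mobilenet-v2-1.0-224": (3.18, 4.12),
    "resnet-50-pytorch": (16.37, 19.67),
}

for model, (before, after) in timings_ms.items():
    print(f"{model}: {slowdown_pct(before, after):.1f}% slower")
```

This gives roughly 29.6% for mobilenet-v2-1.0-224 and 20.2% for resnet-50-pytorch.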

@sicong-li-arm @gunes-arm @aniraj01

morgolock commented 4 months ago

Hi @alvoron

Thanks for reporting this.

Would you please let us know how many inferences/iterations you are running?

alvoron commented 4 months ago

I run each model for 30 seconds and calculate the average execution time of each operation type. That gives 7319 iterations of mobilenet-v2-1.0-224 and 1536 iterations of resnet-50-pytorch.

gunes-arm commented 4 months ago

Hi @alvoron

The mentioned patch should affect only the start-up time, i.e. the first iteration. I wonder whether your runs call configure() on every iteration, or call configure() only in the first iteration and run() in the remaining ones.
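
As a rough sanity check of the "first iteration only" scenario against the reported averages: if configure() ran once, its extra cost would be spread over every iteration in the average. The sketch below uses the mobilenet-v2-1.0-224 timings and iteration count quoted in this thread. The function name is hypothetical, and it assumes the reported figure is a plain arithmetic mean over all iterations, which the thread does not confirm.

```python
def one_off_cost_needed_ms(avg_before_ms: float, avg_after_ms: float, iterations: int) -> float:
    """Extra one-time cost that would shift a plain mean by the observed amount."""
    return (avg_after_ms - avg_before_ms) * iterations


# mobilenet-v2-1.0-224: 3.18 ms -> 4.12 ms, averaged over 7319 iterations.
extra_ms = one_off_cost_needed_ms(3.18, 4.12, 7319)
print(f"required one-off cost: {extra_ms / 1000:.1f} s")
```

Under that assumption, a single slow first iteration would have to cost several extra seconds of a 30-second run to explain the shift on its own.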

alvoron commented 4 months ago

OpenVINO uses oneDNN to call ACL's convolution. It seems oneDNN calls configure once via acl_gemm_convolution_fwd_t::create_resource() method: https://github.com/openvinotoolkit/oneDNN/blob/f82148befdbdc9576ec721c9d500155ee4de8060/src/cpu/acl/acl_gemm_convolution.hpp#L44

morgolock commented 4 months ago

Hi @alvoron

I ran ACL's benchmark_graph_mobilenet_v2 on a device with M2 but I could not see a significant performance degradation.

Below is the run that includes the patch you mentioned:

% ./build/tests/benchmark_graph_mobilenet_v2 --iterations=1000  --example_args='--threads=1,--target=NEON,--type=F32'
Version = arm_compute_version=v0.0-unreleased Build options: {'neon': '1', 'opencl': '0', 'benchmark_tests': '0', 'examples': '0', 'benchmark_examples': '1', 'os': 'macos', 'arch': 'armv8a', 'multi_isa': '0', 'logging': '0', 'asserts': '0', 'standalone': '0', 'validation_tests': '0', 'build': 'native'} Git hash=b'c5ab4df0c11dc66db47f2070edc719923af3367e'
CommandLine = ./build/tests/benchmark_graph_mobilenet_v2 --iterations=1000 --example_args=--threads=1,--target=NEON,--type=F32 
Iterations = 1000
Running [0] 'Examples/benchmark_graph_mobilenet_v2'
Threads : 1
Target : Neon
Data type : F32
Data layout : NHWC
Tuner enabled? : false
Cache enabled? : false
Tuner mode : Normal
Tuner file : 
MLGO file : 
Fast math enabled? : false

  Wall clock/Wall clock time:    AVG=6620.1732 us, STDDEV=2.62 %, MIN=6594.0000 us, MAX=10888.0000 us, MEDIAN=6608.0000 us
Executed 1 test(s) (1 passed, 0 expected failures, 0 failed, 0 crashed, 0 disabled) in 6 second(s)

And this is the run without the patch:

ComputeLibrary % ./build/tests/benchmark_graph_mobilenet_v2_reverted --iterations=1000  --example_args='--threads=1,--target=NEON,--type=F32' 
Version = arm_compute_version=v0.0-unreleased Build options: {'neon': '1', 'opencl': '0', 'benchmark_tests': '0', 'examples': '0', 'benchmark_examples': '1', 'os': 'macos', 'arch': 'armv8a', 'multi_isa': '0', 'logging': '0', 'asserts': '0', 'standalone': '0', 'validation_tests': '0', 'build': 'native'} Git hash=b'4a9dbedfbfa66c2612c7461e60cd867b8aea825b'
CommandLine = ./build/tests/benchmark_graph_mobilenet_v2_reverted --iterations=1000 --example_args=--threads=1,--target=NEON,--type=F32 
Iterations = 1000
Running [0] 'Examples/benchmark_graph_mobilenet_v2_reverted'
Threads : 1
Target : Neon
Data type : F32
Data layout : NHWC
Tuner enabled? : false
Cache enabled? : false
Tuner mode : Normal
Tuner file : 
MLGO file : 
Fast math enabled? : false

  Wall clock/Wall clock time:    AVG=6600.4505 us, STDDEV=0.88 %, MIN=6581.0000 us, MAX=8123.0000 us, MEDIAN=6596.0000 us
Executed 1 test(s) (1 passed, 0 expected failures, 0 failed, 0 crashed, 0 disabled) in 6 second(s)

6620.1732 us - 6600.4505 us = 19.7227 us
19.7227 us / 6620.1732 us = 0.003, i.e. a difference of about 0.3%
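
The same comparison can be reproduced directly from the two "Wall clock time" summary lines. This is a minimal sketch: the regular expression and the helper name `parse_wall_clock` are my own and only assume the `NAME=value` layout visible in the output above.

```python
import re


def parse_wall_clock(line: str) -> dict:
    """Extract the NAME=value pairs from a 'Wall clock time' summary line."""
    return {key: float(value) for key, value in re.findall(r"([A-Z]+)=([0-9.]+)", line)}


with_patch = parse_wall_clock(
    "Wall clock/Wall clock time:    AVG=6620.1732 us, STDDEV=2.62 %, "
    "MIN=6594.0000 us, MAX=10888.0000 us, MEDIAN=6608.0000 us"
)
without_patch = parse_wall_clock(
    "Wall clock/Wall clock time:    AVG=6600.4505 us, STDDEV=0.88 %, "
    "MIN=6581.0000 us, MAX=8123.0000 us, MEDIAN=6596.0000 us"
)

# Compare the statistics that are least sensitive to a single slow outlier run.
for stat in ("AVG", "MIN", "MEDIAN"):
    delta = with_patch[stat] - without_patch[stat]
    print(f"{stat}: {delta:+.4f} us ({delta / with_patch[stat]:.2%})")
```

AVG, MIN and MEDIAN all differ by well under 1%, so this benchmark does not show the regression.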

Would you please confirm if you experience the problem on other devices? Can you please share the models you are running? Are there tflite files?

jondea commented 4 months ago

> OpenVINO uses oneDNN to call ACL's convolution. It seems oneDNN calls configure once via acl_gemm_convolution_fwd_t::create_resource() method: https://github.com/openvinotoolkit/oneDNN/blob/f82148befdbdc9576ec721c9d500155ee4de8060/src/cpu/acl/acl_gemm_convolution.hpp#L44

With DNNL_VERBOSE enabled, is OpenVINO recreating the resource or is it getting oneDNN cache hits? Some frameworks have their own caching mechanisms

alvoron commented 4 months ago

It seems the issue can be reproduced via benchdnn, without OpenVINO.

ACL build command: scons neon=1 opencl=0 openmp=0 cppthreads=1 os=macos data_layout_support=all arch=arm64-v8.2-a build=native --jobs=8 os=macos build=native compiler_cache=ccache compiler_prefix="/Library/Developer/CommandLineTools/usr/bin/" --silent fixed_format_kernels=True

oneDNN configure command (run in the oneDNN root dir): ACL_ROOT_DIR=$PWD/../ComputeLibrary cmake -B build -DCMAKE_BUILD_TYPE=Release -DDNNL_USE_ACL=ON -DCMAKE_RULE_MESSAGES=OFF -DACL_LIBRARY=$PWD/../ComputeLibrary/build/libarm_compute.dylib -DACL_CORE_LIBRARY=$PWD/../ComputeLibrary/build/libarm_compute_core.dylib -DACL_GRAPH_LIBRARY=$PWD/../ComputeLibrary/build/libarm_compute_graph.dylib

benchdnn build command: cmake --build build --target benchdnn --parallel 7

The reproducer: DYLD_LIBRARY_PATH=$PWD/../ComputeLibrary/build ./build/tests/benchdnn/benchdnn --mode=P --conv --reset --allow-enum-tags-only=0 --engine=cpu --dir=FWD_B --alg=direct --dt=f32:f32:f32 --stag=acdb --wtag=any --dtag=acdb --attr-scratchpad=user mb1_ic1280oc1001_ih1oh1kh1sh1dh0ph0_iw1ow1kw1sw1dw0pw0
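
To make the problem shape in the reproducer easier to read, here is a small sketch that splits the descriptor into its fields. The helper `parse_conv_desc` is hypothetical and not part of benchdnn. The field meanings assumed here are mb = minibatch, ic/oc = input/output channels, ih/oh and iw/ow = input/output height and width, kh/kw = kernel size, sh/sw = stride, dh/dw = dilation, ph/pw = padding.

```python
import re


def parse_conv_desc(desc: str) -> dict:
    """Split a benchdnn-style convolution descriptor into integer fields."""
    return {name: int(value) for name, value in re.findall(r"([a-z]+)(\d+)", desc)}


p = parse_conv_desc("mb1_ic1280oc1001_ih1oh1kh1sh1dh0ph0_iw1ow1kw1sw1dw0pw0")

# Multiply-accumulates for a dense (non-grouped) forward convolution.
macs = p["mb"] * p["oc"] * p["oh"] * p["ow"] * p["ic"] * p["kh"] * p["kw"]

print(p)
print(f"MACs per forward pass: {macs}")
```

So the reproducer is a 1x1 convolution on a 1x1 input with 1280 input and 1001 output channels, which is effectively a matrix-vector product.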

On M2 Pro I've got min(ms):0.255333 avg(ms):0.357945 on ACL SHA c5ab4df0c11dc66db47f2070edc719923af3367e and min(ms):0.042875 avg(ms):0.0624329 on SHA 4a9dbedfbfa66c2612c7461e60cd867b8aea825b.

@morgolock could you please try to repeat these steps?

UPD: A couple of comments:

- Please use the oneDNN fork that OpenVINO uses: https://github.com/openvinotoolkit/oneDNN (SHA 4e29b771fcdfab5bdb219a495e694d6206e52b67)
- You need to apply 2 small changes to oneDNN to adapt it to the new version of ACL: https://github.com/openvinotoolkit/oneDNN/compare/19bb9f2d95ea5eca877dbf68334aaf49d24b5b4d...d76046a54ece83cef4add49c843a205b58fddd2b
- I also reproduced the issue using benchdnn on a Mac mini M1: total perf: min(ms):0.273542 avg(ms):0.309104 on c5ab4df0c11dc66db47f2070edc719923af3367e and total perf: min(ms):0.0366251 avg(ms):0.0638425 on 4a9dbedfbfa66c2612c7461e60cd867b8aea825b
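
To put the benchdnn numbers from this comment side by side, here is a small sketch that computes the slowdown factor per device. The dictionary layout and the shortened SHA keys are only for illustration; the timings are copied from the results above.

```python
# (min_ms, avg_ms) reported by benchdnn, keyed by device and abbreviated ACL SHA.
results = {
    "M2 Pro": {"c5ab4df": (0.255333, 0.357945), "4a9dbed": (0.042875, 0.0624329)},
    "M1 mini": {"c5ab4df": (0.273542, 0.309104), "4a9dbed": (0.0366251, 0.0638425)},
}

for device, runs in results.items():
    slow_min, slow_avg = runs["c5ab4df"]
    fast_min, fast_avg = runs["4a9dbed"]
    print(f"{device}: min x{slow_min / fast_min:.1f}, avg x{slow_avg / fast_avg:.1f}")
```

For this single 1x1 convolution, c5ab4df is roughly 5x to 7x slower than 4a9dbed on both devices, a much larger gap than the 20-30% seen on the full models.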

morgolock commented 3 months ago

Hi @alvoron

Thanks for reporting this performance regression and providing so much detail.

We have merged a patch fixing the problem into the main development branch and we will do a patch release of 24.02 including the fix mentioned above.

Hope this helps

morgolock commented 3 months ago

Hi @alvoron

Closing this as it was fixed in 24.02.1

Please reopen if you require further assistance.

ARM-software / ComputeLibrary

CpuGemmConv2d optimization affects performance on Apple M2/M2 Pro #1092