ARM-software / ComputeLibrary

The Compute Library is a set of computer vision and machine learning functions optimised for both Arm CPUs and GPUs using SIMD technologies.
MIT License
2.81k stars, 774 forks

Depthwise convolution fp16 performance drop #1109

Closed: alvoron closed this issue 4 months ago

alvoron commented 4 months ago

Output of 'strings libarm_compute.so | grep arm_compute_version':
arm_compute_version=v24.04
Build options: {'neon': '1', 'opencl': '0', 'openmp': '0', 'cppthreads': '1', 'os': 'linux', 'data_layout_support': 'all', 'arch': 'arm64-v8.2-a', 'build': 'native', 'fixed_format_kernels': 'True'}
Git hash=b'4fda7a803eaadf00ba36bd532481a33c18952089'

Platform: Ampere

Operating System: Ubuntu 22.04.4 LTS

Problem description: In some cases fp16 convolution takes more time than the same fp32 convolution:

f16 benchdnn reproducer

benchdnn --max-ms-per-prb=3e3 --mode=P --conv --reset --allow-enum-tags-only=0 --engine=cpu --dir=FWD_B --alg=direct --dt=f16:f16:f16 --stag=acdb --wtag=any --dtag=acdb --attr-scratchpad=user g1152mb1_ic1152oc1152_ih7oh7kh5sh1dh0ph2_iw7ow7kw5sw1dw0pw2

f32 benchdnn reproducer (same arguments apart from --dt; note that --max-ms-per-prb is also raised from 3e3 to 10e3)

benchdnn --max-ms-per-prb=10e3 --mode=P --conv --reset --allow-enum-tags-only=0 --engine=cpu --dir=FWD_B --alg=direct --dt=f32:f32:f32 --stag=acdb --wtag=any --dtag=acdb --attr-scratchpad=user g1152mb1_ic1152oc1152_ih7oh7kh5sh1dh0ph2_iw7ow7kw5sw1dw0pw2

The f16 benchdnn command gives me 0.074-0.079 ms; the f32 benchdnn command gives me 0.045-0.047 ms.
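To put the gap in one number, here is a minimal Python sketch that compares the two timing ranges reported above. The use of range midpoints and the helper names (`midpoint`, `slowdown`) are my own choices for illustration, not something benchdnn reports.

```python
# Timings in milliseconds, copied from the report above as (min, max)
# of the observed range. Using the midpoint is an assumption made for
# illustration; benchdnn itself does not print this value.
f16_ms = (0.074, 0.079)
f32_ms = (0.045, 0.047)


def midpoint(time_range):
    """Return the middle of a (low, high) timing range."""
    low, high = time_range
    return (low + high) / 2.0


def slowdown(slow_range, fast_range):
    """Ratio of the two midpoints; values above 1.0 mean 'slow' is slower."""
    return midpoint(slow_range) / midpoint(fast_range)


ratio = slowdown(f16_ms, f32_ms)
print(f"f16 / f32 time ratio: {ratio:.2f}")
```

A ratio above 1.0 means the f16 run took longer than the f32 run on the same problem, which is the opposite of what one would normally expect from halving the element size.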

alvoron commented 4 months ago

Another reproducer, f16 (avg 0.037 ms)

benchdnn --max-ms-per-prb=3e3 --mode=P --conv --reset --allow-enum-tags-only=0 --engine=cpu --dir=FWD_B --alg=direct --dt=f16:f16:f16 --stag=acdb --wtag=any --dtag=acdb --attr-scratchpad=user g480mb1_ic480oc480_ih14oh14kh3sh1dh0ph1_iw14ow14kw3sw1dw0pw1

f32 (avg 0.031 ms)

benchdnn --max-ms-per-prb=3e3 --mode=P --conv --reset --allow-enum-tags-only=0 --engine=cpu --dir=FWD_B --alg=direct --dt=f32:f32:f32 --stag=acdb --wtag=any --dtag=acdb --attr-scratchpad=user g480mb1_ic480oc480_ih14oh14kh3sh1dh0ph1_iw14ow14kw3sw1dw0pw1
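For readers unfamiliar with the problem descriptors used in these commands, the following Python sketch splits one into named fields and checks that it describes a depthwise convolution (groups equal to input and output channels). This is not benchdnn's own parser; it is a simplified illustration that assumes every field is spelled out explicitly, as it is in the two descriptors from this thread. The multiply-accumulate count uses the usual grouped-convolution formula and ignores the bias.

```python
import re


def parse_conv_desc(desc):
    """Split a descriptor such as 'g480mb1_ic480oc480_...' into {name: int}.

    Assumption: each field is a lowercase name followed by digits, and
    no field is omitted. Underscores are treated as plain separators.
    """
    return {name: int(value) for name, value in re.findall(r"([a-z]+)(\d+)", desc)}


def is_depthwise(p):
    """Depthwise means one input and one output channel per group."""
    return p["g"] == p["ic"] == p["oc"]


def macs(p):
    """Multiply-accumulates for a grouped 2D convolution (bias excluded)."""
    return (p["mb"] * p["oc"] * p["oh"] * p["ow"]
            * p["kh"] * p["kw"] * (p["ic"] // p["g"]))


descriptors = [
    "g1152mb1_ic1152oc1152_ih7oh7kh5sh1dh0ph2_iw7ow7kw5sw1dw0pw2",
    "g480mb1_ic480oc480_ih14oh14kh3sh1dh0ph1_iw14ow14kw3sw1dw0pw1",
]

for desc in descriptors:
    p = parse_conv_desc(desc)
    print(f"{desc}: depthwise={is_depthwise(p)}, MACs={macs(p)}")
```

Both problems are small (on the order of a million multiply-accumulates), so fixed per-call overheads such as thread scheduling can plausibly account for a large share of the measured time.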

morgolock commented 4 months ago

Hi @alvoron

I've noticed that your build uses cppthreads=1 openmp=0. I'd suggest switching to cppthreads=0 openmp=1. Could you please try this and see whether it helps?

alvoron commented 4 months ago

For some reason I can't reproduce my initial results. These are the results I get now:

OMP:
f16 results: 0.067 ms
f32 results: 0.126 ms

cppthreads:
f16 results: 0.076 ms
f32 results: 0.135 ms

TBB:
f16 results: 0.136 ms
f32 results: 0.191 ms
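As a quick way to read the numbers above, this Python sketch computes the f16 speedup over f32 for each threading runtime and picks the runtime with the lowest f16 time. The dictionary layout and helper name are illustrative assumptions; the timings are copied from the list above.

```python
# Timings in milliseconds, copied from the results listed above.
results_ms = {
    "OMP": {"f16": 0.067, "f32": 0.126},
    "cppthreads": {"f16": 0.076, "f32": 0.135},
    "TBB": {"f16": 0.136, "f32": 0.191},
}


def f16_speedup(entry):
    """f32 time divided by f16 time; above 1.0 means f16 is faster."""
    return entry["f32"] / entry["f16"]


speedups = {name: f16_speedup(entry) for name, entry in results_ms.items()}
fastest_f16 = min(results_ms, key=lambda name: results_ms[name]["f16"])

for name, value in speedups.items():
    print(f"{name}: f16 is {value:.2f}x faster than f32")
print(f"Lowest f16 time: {fastest_f16}")
```

In this set of measurements f16 is faster than f32 under all three runtimes, which is why the original regression no longer shows up.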

Please give me some time to try to reproduce this issue (if there is one) again.

alvoron commented 4 months ago

I think I'll close this ticket for now. I was able to reproduce the issue with benchdnn, but it is related to TBB: for some reason TBB produces latency spikes for f16 (latency 10-15 times higher than the usual value).
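Spikes of this kind are easy to miss when only an average is reported. The Python sketch below flags individual runs whose latency is at least a given multiple of the median. The function name, the default factor of 10 (taken from the lower end of the 10-15x figure above), and the sample values are all assumptions for illustration; the samples are synthetic and are not measurements from this issue.

```python
from statistics import median


def find_spikes(samples_ms, factor=10.0):
    """Return (index, value) pairs for samples at least `factor` times the median.

    The median is used as the 'usual' latency because, unlike the mean,
    it is barely affected by a few very slow runs.
    """
    baseline = median(samples_ms)
    return [(i, t) for i, t in enumerate(samples_ms) if t >= factor * baseline]


# Synthetic per-run latencies in milliseconds, for illustration only.
samples = [0.07, 0.07, 0.08, 0.07, 0.90, 0.07, 0.08, 0.07]

for index, value in find_spikes(samples):
    print(f"run {index}: {value} ms looks like a spike")
```

Applying a check like this to per-run timings from each threading runtime would show whether the slow average comes from a few outliers or from uniformly slower runs.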