Depthwise convolution fp16 performance drop

alvoron commented 4 months ago

Output of 'strings libarm_compute.so | grep arm_compute_version':

arm_compute_version=v24.04 Build options: {'neon': '1', 'opencl': '0', 'openmp': '0', 'cppthreads': '1', 'os': 'linux', 'data_layout_support': 'all', 'arch': 'arm64-v8.2-a', 'build': 'native', 'fixed_format_kernels': 'True'} Git hash=b'4fda7a803eaadf00ba36bd532481a33c18952089'

Platform: Ampere

Operating System: Ubuntu 22.04.4 LTS

Problem description: In some cases fp16 convolution takes more time than the same fp32 convolution:

f16 benchdnn reproducer

benchdnn --max-ms-per-prb=3e3 --mode=P --conv --reset --allow-enum-tags-only=0 --engine=cpu --dir=FWD_B --alg=direct --dt=f16:f16:f16 --stag=acdb --wtag=any --dtag=acdb --attr-scratchpad=user g1152mb1_ic1152oc1152_ih7oh7kh5sh1dh0ph2_iw7ow7kw5sw1dw0pw2

f32 benchdnn reproducer (same arguments apart from --dt and --max-ms-per-prb)

benchdnn --max-ms-per-prb=10e3 --mode=P --conv --reset --allow-enum-tags-only=0 --engine=cpu --dir=FWD_B --alg=direct --dt=f32:f32:f32 --stag=acdb --wtag=any --dtag=acdb --attr-scratchpad=user g1152mb1_ic1152oc1152_ih7oh7kh5sh1dh0ph2_iw7ow7kw5sw1dw0pw2

f16 benchdnn command gives me 0.074-0.079 ms. f32 benchdnn command gives me 0.045-0.047 ms.
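
For reference, the slowdown implied by these two ranges can be worked out directly. The snippet below is a minimal sketch that only does arithmetic on the figures quoted above; the helper name `slowdown_range` is made up for illustration and is not part of benchdnn or Compute Library.

```python
# Latency ranges in milliseconds, copied from the report above.
f16_ms = (0.074, 0.079)
f32_ms = (0.045, 0.047)


def slowdown_range(slow, fast):
    """Return the (smallest, largest) ratio slow/fast over two (min, max) ranges."""
    # Smallest ratio: best slow time against worst fast time, and vice versa.
    return slow[0] / fast[1], slow[1] / fast[0]


lo, hi = slowdown_range(f16_ms, f32_ms)
print(f"f16 is between {lo:.2f}x and {hi:.2f}x slower than f32")
```

By this estimate the f16 run is roughly 1.6 to 1.8 times slower than the f32 run for this shape.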

alvoron commented 4 months ago

Another reproducer, f16 (avg 0.037 ms)

benchdnn --max-ms-per-prb=3e3 --mode=P --conv --reset --allow-enum-tags-only=0 --engine=cpu --dir=FWD_B --alg=direct --dt=f16:f16:f16 --stag=acdb --wtag=any --dtag=acdb --attr-scratchpad=user g480mb1_ic480oc480_ih14oh14kh3sh1dh0ph1_iw14ow14kw3sw1dw0pw1

f32 (avg 0.031 ms)

benchdnn --max-ms-per-prb=3e3 --mode=P --conv --reset --allow-enum-tags-only=0 --engine=cpu --dir=FWD_B --alg=direct --dt=f32:f32:f32 --stag=acdb --wtag=any --dtag=acdb --attr-scratchpad=user g480mb1_ic480oc480_ih14oh14kh3sh1dh0ph1_iw14ow14kw3sw1dw0pw1
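
The problem descriptors at the end of these commands encode the convolution shape. The sketch below splits a descriptor into its fields and checks that groups, input channels and output channels are all equal, which is what makes these cases depthwise. It is not benchdnn's own parser: the helper names are hypothetical, and the output-size formula assumes that a dilation of 0 means a dense kernel and that padding is symmetric.

```python
import re


def parse_conv_desc(desc):
    """Split a benchdnn-style conv descriptor into {field: int} pairs."""
    # Each field is a run of lowercase letters followed by digits, e.g. "ic1152".
    return {name: int(value) for name, value in re.findall(r"([a-z]+)(\d+)", desc)}


def out_size(i, k, s, d, p):
    """Output extent along one axis (assumes dilation 0 = dense, symmetric padding)."""
    return (i + 2 * p - (k - 1) * (d + 1) - 1) // s + 1


descs = [
    "g1152mb1_ic1152oc1152_ih7oh7kh5sh1dh0ph2_iw7ow7kw5sw1dw0pw2",
    "g480mb1_ic480oc480_ih14oh14kh3sh1dh0ph1_iw14ow14kw3sw1dw0pw1",
]

parsed = [parse_conv_desc(d) for d in descs]
for p in parsed:
    depthwise = p["g"] == p["ic"] == p["oc"]
    oh = out_size(p["ih"], p["kh"], p["sh"], p["dh"], p["ph"])
    print(f"channels={p['ic']} kernel={p['kh']}x{p['kw']} "
          f"depthwise={depthwise} computed_oh={oh} declared_oh={p['oh']}")
```

Under those assumptions, the computed output height matches the declared `oh` for both shapes (7 and 14).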

morgolock commented 4 months ago

Hi @alvoron

I've noticed your build is using cppthreads=1 openmp=0. I'd suggest switching to cppthreads=0 openmp=1. Can you please try this to see if it helps?

alvoron commented 4 months ago

For some reason I can't reproduce my initial results. Now I get the following:

OMP:
f16 results: 0.067 ms
f32 results: 0.126 ms

cppthreads:
f16 results: 0.076 ms
f32 results: 0.135 ms

TBB:
f16 results: 0.136 ms
f32 results: 0.191 ms

Please give me some time to try to reproduce the issue (if there is one).

alvoron commented 4 months ago

I think I'll close this ticket for now. I was able to reproduce the issue with benchdnn; however, it is related to TBB: for some reason it produces latency spikes for f16 (latency 10-15 times higher than the usual value).
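
Spikes of this kind are easy to miss when only an average is reported. The sketch below shows one possible way to flag them, assuming per-run latencies have been collected by some means. The function name, the threshold factor of 5 and the sample values are all hypothetical and chosen only for illustration; they are not measurements from this issue.

```python
from statistics import median


def find_spikes(samples_ms, factor=5.0):
    """Return indices of samples more than `factor` times the median latency."""
    typical = median(samples_ms)
    return [i for i, t in enumerate(samples_ms) if t > factor * typical]


# Hypothetical per-run latencies in ms (made up, not taken from the thread).
samples = [0.067, 0.068, 0.066, 0.900, 0.067]

spikes = find_spikes(samples)
print("median (ms):", median(samples))
print("spike indices:", spikes)
```

The median is used as the baseline because a few large outliers barely move it, whereas they can noticeably inflate the mean.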

ARM-software / ComputeLibrary

Depthwise convolution fp16 performance drop #1109