ARM-software / ComputeLibrary

The Compute Library is a set of computer vision and machine learning functions optimised for both Arm CPUs and GPUs using SIMD technologies.

f16 convolution gives the same performance as f32 #1130

Closed. alvoron closed this issue 2 weeks ago.

alvoron commented 3 months ago

ACL version: 24.07

ACL build command:

scons neon=1 opencl=0 openmp=1 cppthreads=0 os=linux data_layout_support=all arch=arm64-v8.2-a build=native --jobs=64 --silent fixed_format_kernels=True Werror=0

benchdnn build command:

ACL_ROOT_DIR=$PWD/../ComputeLibrary cmake -B build -DCMAKE_BUILD_TYPE=Release -DDNNL_USE_ACL=ON -DCMAKE_RULE_MESSAGES=OFF -DACL_LIBRARY=$PWD/../ComputeLibrary/build/libarm_compute.so -DACL_CORE_LIBRARY=$PWD/../ComputeLibrary/build/libarm_compute.so -DACL_GRAPH_LIBRARY=$PWD/../ComputeLibrary/build/libarm_compute_graph.so -DDNNL_CPU_RUNTIME=OMP
cmake --build build --target benchdnn --parallel $(nproc)

Reproducer commands:

taskset -c 0 ./benchdnn --max-ms-per-prb=10e3 --mode=P --conv --reset --allow-enum-tags-only=0 --engine=cpu --dir=FWD_I --alg=direct --dt=f16:f16:f16 --stag=acdb --wtag=any --dtag=acdb --attr-scratchpad=user mb1_ic4oc12_ih70oh70kh3sh1dh0ph1_iw70ow70kw3sw1dw0pw1
taskset -c 0 ./benchdnn --max-ms-per-prb=10e3 --mode=P --conv --reset --allow-enum-tags-only=0 --engine=cpu --dir=FWD_I --alg=direct --dt=f32:f32:f32 --stag=acdb --wtag=any --dtag=acdb --attr-scratchpad=user mb1_ic4oc12_ih70oh70kh3sh1dh0ph1_iw70ow70kw3sw1dw0pw1

The NHWC layout recommended by ACL is used in the reproducer. taskset is used to force single-threaded execution and avoid threading effects.

The 1st command (f16 convolution) gives 0.267766 ms and the 2nd (f32 convolution) gives 0.273554 ms on Ampere, i.e. f16 is essentially no faster than f32. I'd expect noticeably better f16 convolution performance.

If the reproducer command is run with DNNL_VERBOSE=1, we observe 2 convolution primitives being executed in the f16 case:

onednn_verbose,primitive,exec,cpu,convolution,indirect_gemm:acl,forward_inference,src_f16::blocked:acdb::f0 wei_f16:ap:blocked:Acdb8a::f0 bia_undef::undef::: dst_f16::blocked:acdb::f0,attr-scratchpad:user,alg:convolution_direct,mb1_ic4oc12_ih70oh70kh3sh1dh0ph1_iw70ow70kw3sw1dw0pw1,0.501953
onednn_verbose,primitive,exec,cpu,convolution,indirect_gemm:acl,forward_inference,src_f32:a:blocked:acdb::f0 wei_f32:a:blocked:Acdb4a::f0 bia_undef::undef::: dst_f32:a:blocked:acdb::f0,attr-scratchpad:user,alg:convolution_direct,mb1_ic4oc12_ih70oh70kh3sh1dh0ph1_iw70ow70kw3sw1dw0pw1,0.444824

and only 1 convolution in the f32 case:

onednn_verbose,primitive,exec,cpu,convolution,indirect_gemm:acl,forward_inference,src_f32::blocked:acdb::f0 wei_f32:a:blocked:Acdb4a::f0 bia_undef::undef::: dst_f32::blocked:acdb::f0,attr-scratchpad:user,alg:convolution_direct,mb1_ic4oc12_ih70oh70kh3sh1dh0ph1_iw70ow70kw3sw1dw0pw1,0.112061

The purpose of the 2nd convolution in the f16 case is unclear (moreover, it is an f32 convolution). This is probably an issue with the ACL integration in oneDNN rather than an ACL issue.
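For reference, one way to list the distinct convolution primitives executed is to strip the trailing per-call timing field from the verbose lines and deduplicate (a sketch based on the f16 reproducer above):

DNNL_VERBOSE=1 taskset -c 0 ./benchdnn --max-ms-per-prb=10e3 --mode=P --conv --reset --allow-enum-tags-only=0 --engine=cpu --dir=FWD_I --alg=direct --dt=f16:f16:f16 --stag=acdb --wtag=any --dtag=acdb --attr-scratchpad=user mb1_ic4oc12_ih70oh70kh3sh1dh0ph1_iw70ow70kw3sw1dw0pw1 2>&1 | grep ',exec,cpu,convolution,' | sed 's/,[^,]*$//' | sort -u

This should print 2 unique lines in the f16 case and 1 in the f32 case, matching the logs above.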

alvoron commented 2 months ago

@morgolock I double-checked the issue description and I don't think I can provide a standalone ACL reproducer. Perhaps this issue needs to be reviewed from the oneDNN integration point of view, since oneDNN calls 2 convolution primitives in the f16 case and only 1 primitive in the f32 case. So it is probably not an ACL issue but an issue with the ACL integration into oneDNN. Should we ask Milos to take a look at this?

alvoron commented 1 month ago

As Milos mentioned, the shape used in the reproducer is too small to be vectorized effectively. The next step is to enlarge the shape and then compare fp32 and fp16 performance again, for example as below.
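A possible enlarged reproducer, keeping everything except the channel counts from the original command (the ic64oc64 values here are only an illustrative assumption, not an agreed shape):

taskset -c 0 ./benchdnn --max-ms-per-prb=10e3 --mode=P --conv --reset --allow-enum-tags-only=0 --engine=cpu --dir=FWD_I --alg=direct --dt=f16:f16:f16 --stag=acdb --wtag=any --dtag=acdb --attr-scratchpad=user mb1_ic64oc64_ih70oh70kh3sh1dh0ph1_iw70ow70kw3sw1dw0pw1

with the same command run again with --dt=f32:f32:f32 for the fp32 baseline.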

EgorDuplensky commented 1 month ago

Just trying to clarify the point regarding the insufficient shape size, specifically the channel size of 12. If a kernel's inner loop runs only over the channel dimension, then in the fp16 case we need to process 12 * 2 bytes = 24 bytes of data (192 bits), i.e. at least 2 SIMD registers of 128 bits each. In the fp32 case: 12 * 4 bytes = 48 bytes of data (384 bits), i.e. at least 3 SIMD registers of 128 bits each. So some performance difference should still be expected. Or maybe I am missing something?
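A rough back-of-the-envelope count along the same lines, assuming the kernel vectorizes only over the channel dimension with plain 128-bit NEON registers (an assumption about the kernel, not something verified against the ACL source):

f16: ceil(12 * 2 B / 16 B per vector) = ceil(24 / 16) = 2 vector iterations, the 2nd vector only half full (24 of 32 bytes used, 75% utilisation overall)
f32: ceil(12 * 4 B / 16 B per vector) = ceil(48 / 16) = 3 vector iterations, fully utilised

Under this assumption the best-case f16/f32 ratio at ic=12 is 3/2 = 1.5x rather than the ~2x one would expect for large channel counts, and tail-handling overhead can easily eat the rest.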

alvoron commented 2 weeks ago

Indeed, increasing the channel count makes a difference between f32 and f16 inference: 1.57 ms vs 0.8 ms.