ARM-software / ComputeLibrary

The Compute Library is a set of computer vision and machine learning functions optimised for both Arm CPUs and GPUs using SIMD technologies.
MIT License

Convolution performance issues: NCHW slower than NHWC #1079

Closed. alvoron closed this issue 6 months ago

alvoron commented 7 months ago

I have observed 2 convolution performance issues while using ACL Convolution via oneDNN 3.2.

  1. NCHW is much less efficient than NHWC. I have got about 145% geomean latency loss across ~100 models.
  2. f16 convolution is much less efficient than f32 convolution. Are any optimizations for f16 convolution planned?

morgolock commented 7 months ago

Hi @alvoron

Could you please share:

  1. What are the build options you used to compile ACL?
  2. On which device are you running these models?

morgolock commented 7 months ago

Hi @alvoron

> NCHW is much less efficient than NHWC. I have got about 145% geomean latency loss across ~100 models.

This is expected in ACL.

NCHW was the first layout the library supported. When we introduced support for NHWC, we decided to stop maintaining and optimizing NCHW, because NHWC is better suited to the memory access patterns of most operators in the library.

We recommend using NHWC (the fastest-changing dimension is the channel dimension), as this layout is better suited to the memory access patterns of most operators in the library. Choose NHWC over NCHW: optimisation efforts in ACL target only NHWC, and NCHW remains only for compatibility.

Hope this helps.
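
For readers unfamiliar with the two layouts, here is a minimal sketch in plain Python of what "fastest changing dimension is the channels" means for memory addressing. It does not use ACL or oneDNN; the helper names (`nchw_offset`, `nhwc_offset`, `channel_stride`, `nchw_to_nhwc`) and the example shape are hypothetical and assume a dense, unpadded buffer. It only illustrates that in NHWC all channels of one pixel sit next to each other, whereas in NCHW they are a full image plane apart.

```python
def nchw_offset(n, c, h, w, C, H, W):
    """Flat element index of (n, c, h, w) in a dense NCHW buffer (W changes fastest)."""
    return ((n * C + c) * H + h) * W + w


def nhwc_offset(n, c, h, w, C, H, W):
    """Flat element index of (n, c, h, w) in a dense NHWC buffer (C changes fastest)."""
    return ((n * H + h) * W + w) * C + c


def channel_stride(offset_fn, C, H, W):
    """Distance in elements between channel 0 and channel 1 of the same pixel."""
    return offset_fn(0, 1, 0, 0, C, H, W) - offset_fn(0, 0, 0, 0, C, H, W)


def nchw_to_nhwc(data, N, C, H, W):
    """Reorder a flat NCHW list into a flat NHWC list (illustrative, not optimised)."""
    out = [None] * (N * C * H * W)
    for n in range(N):
        for c in range(C):
            for h in range(H):
                for w in range(W):
                    src = nchw_offset(n, c, h, w, C, H, W)
                    dst = nhwc_offset(n, c, h, w, C, H, W)
                    out[dst] = data[src]
    return out


if __name__ == "__main__":
    C, H, W = 64, 56, 56  # hypothetical feature-map shape
    print("NCHW channel stride:", channel_stride(nchw_offset, C, H, W))
    print("NHWC channel stride:", channel_stride(nhwc_offset, C, H, W))
```

With the assumed shape, stepping to the next channel of the same pixel moves `H * W` elements in NCHW but a single element in NHWC, so a kernel that accumulates over input channels reads contiguous memory in NHWC. In a real pipeline the reordering would normally be left to the framework's own reorder or permute facilities rather than a Python loop.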