Convolution performance issues: NCHW slower than NHWC

alvoron commented 7 months ago

I have observed 2 convolution performance issues while using ACL Convolution via oneDNN 3.2.

1. NCHW is much less efficient than NHWC. I have got about 145% geomean latency loss across ~100 models.
2. f16 Convolution is much less efficient than f32 Convolution. Are any optimizations for f16 Convolution planned?
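For readers unfamiliar with the metric, here is a minimal sketch of one way a "geomean latency loss" figure could be computed. It assumes the loss is the geometric mean of per-model NCHW/NHWC latency ratios, minus one, expressed as a percentage. The thread does not state the exact formula, and the latencies below are made-up placeholders, not measurements.

```python
import math

def geomean(values):
    """Geometric mean of positive numbers, computed via logs for stability."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Placeholder per-model latencies in milliseconds (NOT real measurements).
nchw_ms = [12.0, 30.0, 8.0]
nhwc_ms = [6.0, 10.0, 4.0]

# Per-model slowdown of NCHW relative to NHWC.
ratios = [slow / fast for slow, fast in zip(nchw_ms, nhwc_ms)]

slowdown = geomean(ratios)            # geometric mean of the slowdown factors
loss_pct = (slowdown - 1.0) * 100.0   # assumed definition of "latency loss"
print(f"geomean slowdown: {slowdown:.2f}x, loss: {loss_pct:.0f}%")
```

Under this assumed definition, a 145% loss would correspond to NCHW being about 2.45x slower than NHWC in the geometric mean.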

morgolock commented 7 months ago

Hi @alvoron

Could you please share:

What are the build options you used to compile ACL?
On which device are you running these models?

morgolock commented 7 months ago

Hi @alvoron

> NCHW is much less efficient than NHWC. I have got about 145% geomean latency loss across ~100 models.

This is expected in ACL.

NCHW was the first layout the library supported. When we introduced support for NHWC, we decided to stop maintaining and optimizing NCHW, because NHWC is better suited to the memory access patterns of most operators in the library.
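The memory access point can be illustrated with a small sketch. It is not ACL code: the helper names and the tensor shape are hypothetical, and it only shows how far apart neighbouring channels of the same pixel sit in a dense buffer under each layout.

```python
def nchw_offset(n, c, h, w, C, H, W):
    """Flat element offset of (n, c, h, w) in a dense NCHW buffer."""
    return ((n * C + c) * H + h) * W + w

def nhwc_offset(n, c, h, w, C, H, W):
    """Flat element offset of (n, c, h, w) in a dense NHWC buffer."""
    return ((n * H + h) * W + w) * C + c

# Illustrative shape (an assumption, not taken from the thread).
C, H, W = 64, 56, 56

# Distance in elements between channel 0 and channel 1 of the same pixel.
nchw_channel_stride = nchw_offset(0, 1, 0, 0, C, H, W) - nchw_offset(0, 0, 0, 0, C, H, W)
nhwc_channel_stride = nhwc_offset(0, 1, 0, 0, C, H, W) - nhwc_offset(0, 0, 0, 0, C, H, W)

print(nchw_channel_stride)  # 3136, i.e. H * W
print(nhwc_channel_stride)  # 1, channels of one pixel are contiguous
```

A convolution sums over all input channels at each pixel. With NHWC those values are adjacent in memory and can be read with contiguous vector loads, whereas with NCHW each channel step jumps a whole H * W plane.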

We recommend using NHWC (channels are the fastest-changing dimension). Optimisation efforts in ACL target only NHWC; NCHW is kept only for compatibility.
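If existing data is stored as NCHW, it has to be permuted before it can be consumed as NHWC. The following is a minimal pure-Python sketch of that permutation on a flat buffer. The function name is hypothetical and this is not how ACL or oneDNN implement it; a real pipeline would use the framework's own reorder or transpose facilities (for example, `x.transpose(0, 2, 3, 1)` on a NumPy array).

```python
def nchw_to_nhwc(src, N, C, H, W):
    """Return a new flat list holding the NCHW buffer `src` in NHWC order."""
    dst = [0] * (N * C * H * W)
    for n in range(N):
        for c in range(C):
            for h in range(H):
                for w in range(W):
                    src_idx = ((n * C + c) * H + h) * W + w   # NCHW offset
                    dst_idx = ((n * H + h) * W + w) * C + c   # NHWC offset
                    dst[dst_idx] = src[src_idx]
    return dst

# One image, 2 channels, 2x2 pixels: channel 0 holds 0..3, channel 1 holds 10..13.
src = [0, 1, 2, 3, 10, 11, 12, 13]
nhwc = nchw_to_nhwc(src, N=1, C=2, H=2, W=2)
print(nhwc)  # [0, 10, 1, 11, 2, 12, 3, 13]
```

After the permutation, the two channel values of each pixel sit next to each other, which matches the "fastest changing dimension is the channels" description above.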

Hope this helps.

ARM-software / ComputeLibrary

Convolution performance issues: NCHW slower than NHWC #1079