alvoron closed this issue 2 weeks ago.
@morgolock I double-checked the issue description and I think I can't provide a standalone ACL reproducer. Perhaps this issue needs to be reviewed from the oneDNN integration point of view, since oneDNN calls 2 convolution primitives in the fp16 case and only 1 primitive in the fp32 case. So it's probably not an ACL issue, but an issue with the ACL integration into oneDNN. Should we ask Milos to take a look at this?
As Milos mentioned, the shape used in the reproducer is too small to be vectorized effectively. The next step is to make the shape larger and then compare fp32 and fp16 performance.
Just to clarify the point regarding the insufficient shape size, specifically a channel size of 12: if a kernel's internal loop runs only over the channel dimension, then in the fp16 case we need to process 12 × 2 bytes = 24 bytes of data (192 bits), i.e. at least 2 SIMD registers of 128 bits each. In the fp32 case: 12 × 4 bytes = 48 bytes of data (384 bits), i.e. at least 3 SIMD registers of 128 bits each. So some performance difference is expected. Or maybe I am missing something?
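The register-count estimate above can be sketched in a few lines. This is an illustrative calculation only (the function name and the 128-bit NEON vector width assumption are mine, not from the thread):

```python
import math

SIMD_BITS = 128  # assumed NEON vector width

def simd_registers(channels: int, bytes_per_elem: int) -> int:
    """Minimum number of 128-bit vectors needed to cover one channel row."""
    total_bits = channels * bytes_per_elem * 8
    return math.ceil(total_bits / SIMD_BITS)

# fp16: 12 channels * 2 bytes = 192 bits -> 2 vectors
print(simd_registers(12, 2))
# fp32: 12 channels * 4 bytes = 384 bits -> 3 vectors
print(simd_registers(12, 4))
```

With only 2 vs 3 vectors per row, the fp16 path saves little work, which is consistent with the small measured gap; a larger channel count amortizes better.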
Indeed, increasing the channel count makes a difference between f32 and f16 inference: 1.57 ms vs 0.8 ms.
ACL 24.07

ACL build command:

benchdnn build command:

Reproducer commands:

The NHWC layout recommended by ACL is used in the reproducer. `taskset` is used to force single-thread mode and avoid threading issues. The 1st command (f16 convolution) gives 0.267766 ms, the 2nd one (f32 convolution) gives 0.273554 ms on Ampere. I'd expect better f16 convolution performance. If the reproducer command is called with `DNNL_VERBOSE=1`, then we observe 2 convolutions in the f16 case and 1 convolution in the fp32 case.
It's not clear what the purpose of the 2nd convolution in the f16 case is (moreover, it's an f32 convolution). Probably it's an issue with the ACL integration in oneDNN rather than an ACL issue itself.