Closed: anthony-linaro closed this 8 months ago
A kind request for feedback from @atafra :)
Thanks for your efforts but unfortunately using a reference implementation which is 250x slower does not meet the standards of Open Image Denoise, which aims to provide high performance regardless of platform. DirectML wouldn't be a good solution either because it would be limited to Windows and would still not enable ARM CPUs.
At this point I think the only acceptable solution would be writing a custom, reasonably fast convolution kernel using NEON, instead of using oneDNN. ISPC would probably be the most straightforward way to implement this; the other kernels are already implemented in ISPC.
Closing in favour of an ISPC-based implementation, which I will make shortly.
This PR adds ARM64 support by way of the reference implementations.
NOTE: This goes in tandem with https://github.com/OpenImageDenoise/mkl-dnn/pull/1
Performance is... not great.
From oidnBenchmark.exe:
Windows ARM64 (using clang-cl 16.0.6):
RT.hdr_alb_nrm.1920x1080 ... 1.04168e+06 msec/image
Linux ARM64 (using GCC 12.2.0):
RT.hdr_alb_nrm.1920x1080 ... 635797 msec/image
Windows emulated x64 (using the release from GitHub):
RT.hdr_alb_nrm.1920x1080 ... 4174.18 msec/image
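The slowdown mentioned further down follows directly from the figures above; as a quick sanity check (plain arithmetic on the reported msec/image numbers):

```python
# Slowdown of the reference implementation relative to emulated x64,
# using the oidnBenchmark msec/image figures reported above.
win_arm64 = 1.04168e6   # Windows ARM64, clang-cl 16.0.6
lin_arm64 = 635797      # Linux ARM64, GCC 12.2.0
emu_x64   = 4174.18     # Windows emulated x64 release

print(round(win_arm64 / emu_x64))  # ~250x slower
print(round(lin_arm64 / emu_x64))  # ~152x slower
```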
Command lines used:
Windows (within a vcvarsall ARM64 window, VS has clang installed as part of it):
Linux:
And then both built with:
Clearly, there is a serious need for improvement here (a roughly 250x slowdown!). What are the options? Would a PR enabling, say, DirectML on Windows be acceptable?
I tried enabling Arm Compute Library (Arm CL), which (after some faff) compiled and linked in, but none of its implementations matched: execution always fell back to the reference implementation, and if that fallback was disabled, a segfault occurred.
The other ARM64 JIT implementations present are not applicable in this case, as they require SVE-512, which is not yet widely implemented outside a few select recent CPUs.