RenderKit / oidn

Intel® Open Image Denoise library
https://www.openimagedenoise.org/
Apache License 2.0

Arm support #172

Closed anthony-linaro closed 8 months ago

anthony-linaro commented 12 months ago

This PR adds ARM64 support by way of the reference implementations.

NOTE: This goes in tandem with https://github.com/OpenImageDenoise/mkl-dnn/pull/1

Performance is... not great.

From oidnBenchmark.exe:

Windows ARM64 (using clang-cl 16.0.6): RT.hdr_alb_nrm.1920x1080 ... 1.04168e+06 msec/image

Linux ARM64 (using GCC 12.2.0): RT.hdr_alb_nrm.1920x1080 ... 635797 msec/image

Windows Emulated x64 (using the release from GitHub): RT.hdr_alb_nrm.1920x1080 ... 4174.18 msec/image

Command lines used:

Windows (inside a vcvarsall ARM64 prompt; clang is installed as part of Visual Studio):

cmake -G"Ninja" .. -DTBB_ROOT=<path\to\oneTBB\compiled\for\ARM64> -DISPC_EXECUTABLE=<path\to\ISPC> -DCMAKE_BUILD_TYPE=Release -DCMAKE_C_COMPILER="clang-cl" -DCMAKE_CXX_COMPILER="clang-cl"

Linux:

cmake -G"Ninja" .. -DTBB_ROOT=<path/to/oneTBB/compiled/for/ARM64> -DISPC_EXECUTABLE=<path/to/ISPC> -DCMAKE_BUILD_TYPE=Release

And then both built with:

cmake --build .

Clearly, there is serious need for improvement here (a ~250x slowdown!). What are the options? Would a PR adding, say, DirectML enablement for Windows be acceptable?

I tried enabling Arm CL (the Arm Compute Library), which (after some faff) compiled and linked in, but none of its implementations matched - it always fell back to the reference implementation, and if that was disabled, it segfaulted.

The other ARM64 JIT implementations present are not applicable here, as they require SVE with 512-bit vectors (which is not widely implemented yet outside of a few select/recent CPUs).

anthony-linaro commented 12 months ago

A kind request for feedback from @atafra :)

atafra commented 12 months ago

Thanks for your efforts, but unfortunately a reference implementation that is 250x slower does not meet the bar for Open Image Denoise, which aims to provide high performance on every supported platform. DirectML wouldn't be a good solution either: it would be limited to Windows and still wouldn't enable ARM CPUs.

At this point I think the only acceptable solution would be writing a custom, reasonably fast convolution kernel using NEON, instead of using oneDNN. ISPC would probably be the most straightforward way to implement it, since the other kernels are already implemented in ISPC.
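For what it's worth, the kernel in question is a direct convolution. A minimal scalar sketch in C (the function name, CHW layout, and 3x3/pad-1 shape are illustrative assumptions, not OIDN's actual internal API) shows the loop nest such a NEON or ISPC kernel would vectorize:

```c
/* Hypothetical scalar direct convolution: 3x3 kernel, zero padding,
 * CHW layout. This is the reference-style loop nest that a NEON or
 * ISPC kernel would replace with vectorized inner loops. */
static void conv3x3(const float *src, const float *weights, const float *bias,
                    float *dst, int C_in, int C_out, int H, int W)
{
    for (int oc = 0; oc < C_out; ++oc)
        for (int y = 0; y < H; ++y)
            for (int x = 0; x < W; ++x) {
                float acc = bias[oc];
                for (int ic = 0; ic < C_in; ++ic)
                    for (int ky = -1; ky <= 1; ++ky)
                        for (int kx = -1; kx <= 1; ++kx) {
                            int sy = y + ky, sx = x + kx;
                            if (sy < 0 || sy >= H || sx < 0 || sx >= W)
                                continue; /* implicit zero padding */
                            acc += src[(ic * H + sy) * W + sx] *
                                   weights[((oc * C_in + ic) * 3 + (ky + 1)) * 3
                                           + (kx + 1)];
                        }
                dst[(oc * H + y) * W + x] = acc;
            }
}
```

The x (or ic) loop is the natural place for NEON lanes (float32x4_t accumulators with vfmaq_f32) or an ISPC foreach over pixels; everything else is bookkeeping.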

anthony-linaro commented 8 months ago

Closing in favour of an ISPC-based PR I will open shortly.