RenderKit / oidn

Intel® Open Image Denoise library
https://www.openimagedenoise.org/
Apache License 2.0
1.73k stars 160 forks source link

Arm support via ISPC #183

Closed anthony-linaro closed 5 months ago

anthony-linaro commented 8 months ago

This PR adds ARM64 support by way of the an ISPC kernel.

Command line used for windows: cmake -G"Ninja" .. -DTBB_ROOT=<path\to\oneTBB\compiled\for\ARM64> -DISPC_EXECUTABLE=<path\to\ISPC> -DCMAKE_BUILD_TYPE=Release

This uses (on a native ARM64 windows machine) MSVC as the compiler.

Currently it is ~7x slower than emulated x64 via DNNL, which I suspect is entirely down to memory locality issues, as the current implementation uses CHW, rather than Chw8c.

The reason for this, is that it seems to me, that the memory was never reordered when I specified Chw8c, for some reason.

I have attached the test PFM file I am using to debug this with oidnDenoise.exe (which is just a blank orange image) here: https://drive.google.com/file/d/1NTdI-id87VZVK2fiN6llBHGLu-M-wHWW/view?usp=sharing.

First 8 float values (sequentially accessed from start of array) of src in ispc_conv.cpp with tensorLayout = TensorLayout::Chw8c and tensorBlockC = 8:

0.248, 0.248, 0.248, 0.248, 0.248, 0.248, 0.248, 0.248

And the first 8 values accessed the same way with tensorLayout = TensorLayout::chw:

0.248, 0.248, 0.248, 0.248, 0.248, 0.248, 0.248, 0.248

Now clearly this isn't right, as Chw8c should read like this when sequentially accessed, according to this:

0.248, 0.206, 0.139, 0.000, 0.000, 0.000, 0.000, 0.000

I checked, and CPUInputProcess::submit() is called, which should have reordered it, but it doesn't seem to have done.

Any ideas? Did I miss something obvious?

anthony-linaro commented 8 months ago

Once again a kind request for feedback from @atafra :)

anthony-linaro commented 6 months ago

@atafra Any feedback? Even a response of "I'm looking at it" would be nice, rather than radio silence :) I tried to reach out via Xavier H through blender also

atafra commented 6 months ago

Sorry, I haven't had a chance to look at it in detail yet due to other priorities. But a 7x slowdown is massive, and I'm seeing some design issues as well. CHW is definitely not the right layout here. I'll get back to you after spending some time on this.

anthony-linaro commented 6 months ago

@atafra Thanks for the response - should I continue trying with this (ie, trying to figure out why Chw8c isn't working), or do you intend on looking in the near future?

Not having OIDN is ...not ideal for Blender.

atafra commented 6 months ago

I'll take it from here, so you don't need to spend more time on this. Thanks! I'll try to get this done for the next release.

atafra commented 5 months ago

A quick update: I've been working on an optimized convolution kernel, which turned out to be much faster than x86 emulation. Actually, I think it's quite close to optimal performance. However, it isn't based on ISPC, so I didn't use any code from this PR. Anyway, thanks for your efforts! This will be part of the upcoming OIDN release.

atafra commented 5 months ago

ARM64 support is now available in the devel branch. Contrary to my original plan, I switched to ISPC for the final solution because some C++ compilers generate too slow/inefficient code. This new optimized code became the default on x86 as well, replacing oneDNN while also improving performance. Of course NEON performance should also be very good. Please let me know if you encounter any issues.

I'm closing this PR because it got replaced with this new optimized code written from scratch.