H-uru / Plasma

Cyan Worlds's Plasma game engine
http://h-uru.github.io/Plasma/
GNU General Public License v3.0

Initial support for Accelerate for matrix mult #1334

Closed · colincornaby closed 1 year ago

colincornaby commented 1 year ago

This adds support for Accelerate for matrix multiplication in Plasma. I'm not quite sure how we would want to patch this in, since it doesn't quite fit the previous model of capability checks.

Accelerate automatically supports:

colincornaby commented 1 year ago

Is there ever a case where Accelerate isn't available on an Apple platform and we might need to detect that and fall back to CPU multiplication?

Do we need to link against Accelerate.framework or anything like that?

I'm looking. I thought SIMD was a much older library, but the docs state it's only available in 10.13+. Humph.

We do need to link against Accelerate. If SIMD specifically is that new then it would need to be made optional for older systems. That might require fitting it into the algorithm picker system.

colincornaby commented 1 year ago

Ahhhh, here's the source of my confusion. vDSP_mmul is the older function, and that works back to 10.2 on PPC. https://developer.apple.com/documentation/accelerate/1449984-vdsp_mmul?language=objc

I wonder if I could use that function instead.
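
For reference, roughly what a 4x4 multiply looks like when mapped onto vDSP_mmul (a sketch with made-up names, not the PR's actual code; row-major float storage assumed):

```cpp
// Sketch only: assumes row-major 4x4 float matrices and Accelerate.framework.
#include <Accelerate/Accelerate.h>

// result = a * b, where a, b, and result are row-major 4x4 matrices.
void MatMul44_vDSP(const float a[16], const float b[16], float result[16])
{
    // vDSP_mmul(A, strideA, B, strideB, C, strideC, M, N, P) multiplies
    // an M x P matrix A by a P x N matrix B into an M x N matrix C.
    vDSP_mmul(a, 1, b, 1, result, 1, 4, 4, 4);
}
```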

dpogue commented 1 year ago

hmm, simd_transpose claims that it's available in 10.9, but I only see it (and simd/simd.h) in 10.10 and newer...

That is probably fine for now, and when someone is feeling bored they could look at adding support for the older methods
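
For comparison, roughly what the simd/simd.h route looks like (a hedged sketch assuming row-major float storage, not code from this PR); the transposes around the multiply are the extra step discussed below:

```cpp
// Sketch only: simd_float4x4 is column-major, so row-major data needs
// transposing on the way in and out (simd/simd.h requires 10.10+ per the SDK headers).
#include <simd/simd.h>
#include <cstring>

void MatMul44_simd(const float a[16], const float b[16], float result[16])
{
    simd_float4x4 sa, sb;
    std::memcpy(&sa, a, sizeof(sa));  // columns of sa now hold a's rows
    std::memcpy(&sb, b, sizeof(sb));

    // Transpose so the simd matrices match the row-major data, multiply,
    // then transpose back before storing the row-major result.
    simd_float4x4 sc = simd_transpose(
        simd_mul(simd_transpose(sa), simd_transpose(sb)));
    std::memcpy(result, &sc, sizeof(sc));
}
```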

colincornaby commented 1 year ago

Just implemented vDSP_mmul and it works great, goes back to 10.2 (and, sigh, PPC), and avoids the transpose. I'll update the PR.

Hoikas commented 1 year ago

There is also, IIRC, IMatrixMul34 in plCoordinateInterface and plDrawableSpans that needs to be optimized in this way. I wonder if it might be better to pitch the current manual SSE3 optimization and use the appropriate platform libraries, e.g. Accelerate on macOS and DirectXMath on Windows, falling back to the standard floating point version on other platforms. I assume on Linux we could probably just be more aggressive with the compile flags.

colincornaby commented 1 year ago

The macOS build might have some trouble here. The full Mac client links against Accelerate, but there isn't enough of that stub here to make that work. I can try carrying over enough of that CMake.

colincornaby commented 1 year ago

> IMatrixMul34

Interesting. It looks like this is FPU only right now? I don't see any vectorization code.

dpogue commented 1 year ago

> The full Mac client links against Accelerate, but there isn't enough of that stub here to make that work. I can try carrying over enough of that CMake.

Just add "$<$<PLATFORM_ID:Darwin>:-framework Accelerate>" here?

colincornaby commented 1 year ago

FWIW - Right now on macOS under the "Zandi Stress Test" 4.6% of the time is being spent in IMatrixMul34.

Hoikas commented 1 year ago

> Interesting. It looks like this is FPU only right now? I don't see any vectorization code.

Right. I'm pretty sure this lies in the code path that I was suggesting to run in a thread pool, though. That's a more aggressive optimization, though ;)

colincornaby commented 1 year ago

> Interesting. It looks like this is FPU only right now? I don't see any vectorization code.
>
> Right. I'm pretty sure this lies in the code path that I was suggesting to run in a thread pool, though. That's a more aggressive optimization, though ;)

I'm testing out an Accelerate path currently, and will likely add it to the PR. That entire path is extremely expensive right now, so it could be that both threading and vectorization needs to be done.

colincornaby commented 1 year ago

Alright - I've added the Accelerate import to CMake, and the Accelerate version of IMatrixMul34 (which I checked for correctness.)
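
For the 3x4 case, one way this can map onto vDSP_mmul (a hedged sketch, not necessarily what the PR does) is to treat the affine matrices as the top 3x4 block of the row-major 4x4 storage and only compute those rows:

```cpp
// Sketch only: assumes row-major 4x4 storage where both inputs are affine
// transforms whose bottom row is (0, 0, 0, 1).
#include <Accelerate/Accelerate.h>

void MatMul34_vDSP(const float a[16], const float b[16], float result[16])
{
    // The top three rows of a are contiguous in row-major storage, so they
    // form a 3x4 matrix A; multiplying A (3x4) by b (4x4) yields the top
    // three rows of the full product.
    vDSP_mmul(a, 1, b, 1, result, 1, 3, 4, 4);

    // The bottom row of an affine product stays (0, 0, 0, 1).
    result[12] = 0.0f; result[13] = 0.0f; result[14] = 0.0f; result[15] = 1.0f;
}
```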

colincornaby commented 1 year ago

DirectXMath does look very nice. It's tagged with NEON, but I'm still going through it looking for NEON code. NEON isn't mentioned in the README.

https://github.com/microsoft/DirectXMath

Hoikas commented 1 year ago

Yeah, the benefit of DirectXMath is that it's included with the Windows SDK, so it can be used immediately (and it already is being used in pl3DPipeline). The downside is that it will only use one instruction set, so no dynamic dispatching. But it might give better results than what the MSVC auto-vectorizer comes up with.
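
To illustrate the DirectXMath side (a sketch against the SDK API, not existing Plasma code; row-major float storage assumed):

```cpp
// Sketch only: DirectXMath ships with the Windows SDK; the instruction set
// it targets is fixed at compile time rather than dispatched at runtime.
#include <DirectXMath.h>

void MatMul44_DXMath(const float a[16], const float b[16], float result[16])
{
    using namespace DirectX;
    // XMFLOAT4X4 is a plain struct of 16 floats, so row-major float arrays
    // can be loaded into SIMD registers and stored back directly.
    XMMATRIX ma = XMLoadFloat4x4(reinterpret_cast<const XMFLOAT4X4*>(a));
    XMMATRIX mb = XMLoadFloat4x4(reinterpret_cast<const XMFLOAT4X4*>(b));
    XMStoreFloat4x4(reinterpret_cast<XMFLOAT4X4*>(result),
                    XMMatrixMultiply(ma, mb));
}
```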

colincornaby commented 1 year ago

> The downside is that it will only use one instruction set, so no dynamic dispatching. But it might give better results than what the MSVC auto-vectorizer comes up with.

That's a shame. Seems more ideal for consoles (which it seems like DirectXMath was originally intended for.)

I'm wondering if maybe we need to consider two utility classes or headers:

colincornaby commented 1 year ago

I'm also not opposed to just using Apple's LibDispatch everywhere for thread pools, but I'm guessing there may not be agreement on that. 😛 libDispatch is a C library, and doesn't support C++ lambdas out of the box, which is annoying.

https://github.com/apple/swift-corelibs-libdispatch

Hoikas commented 1 year ago

> I'm also not opposed to just using Apple's LibDispatch everywhere for thread pools, but I'm guessing there may not be agreement on that. 😛 libDispatch is a C library, and doesn't support C++ lambdas out of the box, which is annoying.

Yeah, whatever we do should probably accept std::function so that we don't have to deal with low-level state.
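
Something along these lines, perhaps (a hypothetical sketch; plParallel::ForEach is a made-up name, and a macOS backend could forward the same std::function to libdispatch instead of the std::thread fallback shown here):

```cpp
// Sketch only: a portable fallback that splits work across std::threads.
// A platform backend (libdispatch, the Windows thread pool, etc.) could
// expose the same signature.
#include <algorithm>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

namespace plParallel
{
    // Runs fn(i) for every i in [0, count), spread across hardware threads.
    inline void ForEach(size_t count, const std::function<void(size_t)>& fn)
    {
        const size_t numThreads =
            std::max<size_t>(1, std::thread::hardware_concurrency());
        std::vector<std::thread> workers;
        for (size_t t = 0; t < numThreads; ++t) {
            workers.emplace_back([t, numThreads, count, &fn] {
                for (size_t i = t; i < count; i += numThreads)
                    fn(i);
            });
        }
        for (auto& w : workers)
            w.join();
    }
}
```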

> I'm wondering if maybe we need to consider two utility classes or headers:

  • Vectorization, which could have several implementations including Accelerate, DXMath, and whatever ends up being used on Linux.

We added hsCpuId back in like 2013 when we discovered that there were people who were playing Uru with CPUs that didn't support SSE 3 (!!!). From what I can tell, SSE 3 was introduced in 2004, so those CPUs would have been about 9 years old at that point. It's now 2023 (10 years later), and those CPUs have doubled in age. I wonder if we should just forget the whole dynamic dispatching thing and assume that everyone running x86 has at least SSE 3, if not SSE 4.

At this point, Visual Studio defaults to /arch:SSE2, even for 32-bit builds, so VS is very kindly auto-vectorizing our FPU codepaths. In #1336, I found that the SSE3 skinning path is only about 0.01ms faster with one avatar than what VS is auto-vectorizing for us. So, I'm not sure that dealing with the headache of dynamic dispatching is even worth it.

I do, however, think that having some manually vectorized code paths is still relevant along with the FPU path. DirectXMath can be told that the CPU supports SSE1, SSE4, AVX1, and AVX2. I do know of users who have experienced crashes due to libopus accidentally emitting AVX instructions, so assuming people have AVX is probably not a good idea.

Anyway, it might be worth simplifying the code in this way, but it doesn't have to (and probably shouldn't) be in this PR.

dpogue commented 1 year ago

Supposedly DirectXMath will work on Linux, but requires providing your own sal.h file.

colincornaby commented 1 year ago

> In #1336, I found that the SSE3 skinning path is only about 0.01ms faster with one avatar than what VS is auto-vectorizing for us.

That's interesting. Clang also supports auto-vectorization, but I noticed a significant improvement switching over to Accelerate for the 4x4 matrix mults (on Intel). Not sure what instruction set it's using; this is an AVX512 CPU.

The 4x3 matrix mult only saw a moderate speedup. Wondering if that's because Plasma only supports 4x4 storage, so the matrices aren't packed in a way that would allow a bigger improvement.

Hoikas commented 1 year ago

Clang might be, similarly to VS, only auto-vectorizing to something like SSE2, while Apple's Accelerate library is using AVX instructions. Apple can do that because of their aggressive hardware policy. In theory, with AVX, a 4x4 matrix could be multiplied in two "rounds" of operations (rows 1, 2; rows 3, 4) while a 4x3 matrix still requires two "rounds" (rows 1, 2; row 3). (The big speedup from skipping the fourth row was probably only in x87 mode, with no parallel processing.)