Aligned allocation for faster SIMD

antonfirsov commented 3 years ago

Problem

Intel's optimization manual places strong emphasis on the importance of alignment of memory addresses SIMD registers are stored/loaded from. Significant performance penalties occur when loads/stores are crossing (64 byte) cache line boundaries.

Managed (array pool) arrays give us no guarantees on the alignment of the &arrray[0] base address, which means that all ImageSharp SIMD code suffers from these penalties globally.

Idea

dotnet/runtime#27146 initially proposed a mechanism to control the alignment of objects on the pinned heap, but that feature did not get into the final implementation.

Considering that the earliest release where runtime support might be available is .NET 6.0, and it won't be available in erlier releases anyways, I'm proposing a library-specific solution:

Introduce AllocationOptions.AlignedPinned
When the flag is set, return pinned or umanaged buffers with a "hacked" base address (idea taken from https://github.com/dotnet/runtime/issues/33244#issuecomment-595848832):

IntPtr baseAddress = GetBaseAddressOfPinnedArrayOrAllocateUnmanagedBuffer();

// Align the pointer
IntPtr alignedAddress = (IntPtr)((nint)(baseAddress + sizeof(IntPtr) + (alignment - 1)) & ~(alignment - 1));

// Store the pointer to the memory baseAddress to free right before the aligned pointer 
*(((IntPtr*)alignedAddress) - 1) = baseAddress;

return AlignedBuffer(alignedAddress);

The question I can't figure out

I want to make sure there are no unwanted side-effects for the GC. What buffers should we use with this trick, considering that the (mid to large) buffers we want to align will be always temporary? Note that max lifetime of these buffers is the run of an image codec or a processing operation, pinned allocations will be typically bound to these kind of big, expensive operations and won't happen within individual primitive steps.

Should we just pin the arrays of our pool?
Or use Marshal.Alloc?
Should we prefer to use the Pinned Heap when it's available?

@tannergooding @saucecontrol any thoughts/concerns?

Trying to also summon @benaadams if it's not too impolite.

benaadams commented 3 years ago

Current CPUs aren't too fussed about unaligned loads; but they are more fussy about aligned stores; i.e. you should prefer aligned stores over aligned loads if you need to make a choice (this even applies to non-SIMD, as you'll loose the performance benefit of store forwarding if you then read back an unaligned store)

You could do a best endevor alignment ; e.g. over request the size from the ArrayPool by 16/32 bytes, get the pointer of the ref Unsafe.AsPointer(ref &array[0]) then Slice off the misaligned start and work over Span/Memory. It will remain aligned subject to a GC happening, but then if a GC happens that relocates it that will take more time than the misalignment penalty, so probably isn't worth worrying about too much?

Unsafe.AsPointer is safe here as you aren't using or dereferencing the returned pointer (which could cause a GC hole), just using its value to calculate alignment

saucecontrol commented 3 years ago

That's the exact approach I use in MagicScaler. I make an effort to grab a SIMD-aligned section of a rented array here: https://github.com/saucecontrol/PhotoSauce/blob/master/src/MagicScaler/Core/PixelBuffer.cs#L179-L190

If the GC moves the array, which becomes unlikely once the pooled arrays get moved into gen2, the worst case is a single operation runs on unaligned memory. You'll also find my 'guardrails' feature in there, in which I mark the areas outside my aligned-offset buffer segment with a byte pattern and then check that those bytes weren't touched when the buffer is returned. It's a super quick way to check for overruns when you're going unsafe.

antonfirsov commented 3 years ago

Thanks @benaadams @saucecontrol for the suggestion! I also considered this idea, but discarded it, since we can not use explicit aligned Load/Store intrinsics without guaranteed alignment. I guess it's no longer important on today's CPU's? What about non-temporal stores which also need a guarantee on alignment?

benaadams commented 3 years ago

What about non-temporal stores which also need a guarantee on alignment?

Are they useful here? They update main memory but don't update cache; so would be good for data that becomes idle after store, however I assume it would be often read back to flow into next stage so it would be good to have the data in cache?

antonfirsov commented 3 years ago

Our internal primitives usually process data in batches, for example: https://github.com/SixLabors/ImageSharp/blob/4aed9f41d6feecd43b78f93db8b3343ee22f8d72/src/ImageSharp/Common/Helpers/SimdUtils.cs#L98

Such an batch is typically called for a pixel row (which is a lot of data), then the buffer in dest is passed to the next operation in the pipeline. We do the stores into dest, and we do not read them back before the batch is finished for the whole pixel row.

From this I assumed they might be useful for us, but I'm not sure, most of this stuff is new for me.

benaadams commented 3 years ago

L2 Cache is 256KB+ per core and L3 is 9MB+; so would be a lot of data for a row?

saucecontrol commented 3 years ago

NT stores are one of those things where you really need to be sure you're smarter than the processor before you do it. Sort of the same as prefetch these days... the processor will do better than you at managing the cache 99% of the time. You're looking at latencies in the neighborhood of 400 cycles for an NT store, which only works out if the processor is filling the cache with something else as you're storing straight to memory.

antonfirsov commented 3 years ago

Ah right! I completely forgot for a moment that L2 & L3 is also a thing 😅

antonfirsov commented 3 years ago

@saucecontrol btw GUARDRAILS is a great trick, we should adapt it in debug runs, I wonder how many security bugs would that unveil.

saucecontrol commented 3 years ago

I'm sure you would have heard about it by now if ImageSharp had any overruns 😆

It's more a time saver because it gives you instant feedback when you screw up and gives you a really solid idea where. With the array pool handing out bigger buffers than you ask for, those bugs can hide for a while. It's saved me a ton of debug time since I implemented it.

SixLabors / ImageSharp