Open antonfirsov opened 3 years ago
Current CPUs aren't too fussed about unaligned loads; but they are more fussy about aligned stores; i.e. you should prefer aligned stores over aligned loads if you need to make a choice (this even applies to non-SIMD, as you'll loose the performance benefit of store forwarding if you then read back an unaligned store)
You could do a best endevor alignment ; e.g. over request the size from the ArrayPool by 16/32 bytes, get the pointer of the ref
Unsafe.AsPointer(ref &array[0])
then Slice off the misaligned start and work over Span/Memory. It will remain aligned subject to a GC happening, but then if a GC happens that relocates it that will take more time than the misalignment penalty, so probably isn't worth worrying about too much?
Unsafe.AsPointer
is safe here as you aren't using or dereferencing the returned pointer (which could cause a GC hole), just using its value to calculate alignment
That's the exact approach I use in MagicScaler. I make an effort to grab a SIMD-aligned section of a rented array here: https://github.com/saucecontrol/PhotoSauce/blob/master/src/MagicScaler/Core/PixelBuffer.cs#L179-L190
If the GC moves the array, which becomes unlikely once the pooled arrays get moved into gen2, the worst case is a single operation runs on unaligned memory. You'll also find my 'guardrails' feature in there, in which I mark the areas outside my aligned-offset buffer segment with a byte pattern and then check that those bytes weren't touched when the buffer is returned. It's a super quick way to check for overruns when you're going unsafe.
Thanks @benaadams @saucecontrol for the suggestion! I also considered this idea, but discarded it, since we can not use explicit aligned Load/Store intrinsics without guaranteed alignment. I guess it's no longer important on today's CPU's? What about non-temporal stores which also need a guarantee on alignment?
What about non-temporal stores which also need a guarantee on alignment?
Are they useful here? They update main memory but don't update cache; so would be good for data that becomes idle after store, however I assume it would be often read back to flow into next stage so it would be good to have the data in cache?
Our internal primitives usually process data in batches, for example: https://github.com/SixLabors/ImageSharp/blob/4aed9f41d6feecd43b78f93db8b3343ee22f8d72/src/ImageSharp/Common/Helpers/SimdUtils.cs#L98
Such an batch is typically called for a pixel row (which is a lot of data), then the buffer in dest
is passed to the next operation in the pipeline. We do the stores into dest
, and we do not read them back before the batch is finished for the whole pixel row.
From this I assumed they might be useful for us, but I'm not sure, most of this stuff is new for me.
L2 Cache is 256KB+ per core and L3 is 9MB+; so would be a lot of data for a row?
NT stores are one of those things where you really need to be sure you're smarter than the processor before you do it. Sort of the same as prefetch these days... the processor will do better than you at managing the cache 99% of the time. You're looking at latencies in the neighborhood of 400 cycles for an NT store, which only works out if the processor is filling the cache with something else as you're storing straight to memory.
Ah right! I completely forgot for a moment that L2 & L3 is also a thing 😅
@saucecontrol btw GUARDRAILS
is a great trick, we should adapt it in debug runs, I wonder how many security bugs would that unveil.
I'm sure you would have heard about it by now if ImageSharp had any overruns 😆
It's more a time saver because it gives you instant feedback when you screw up and gives you a really solid idea where. With the array pool handing out bigger buffers than you ask for, those bugs can hide for a while. It's saved me a ton of debug time since I implemented it.
Problem
Intel's optimization manual places strong emphasis on the importance of alignment of memory addresses SIMD registers are stored/loaded from. Significant performance penalties occur when loads/stores are crossing (64 byte) cache line boundaries.
Managed (array pool) arrays give us no guarantees on the alignment of the
&arrray[0]
base address, which means that all ImageSharp SIMD code suffers from these penalties globally.Idea
dotnet/runtime#27146 initially proposed a mechanism to control the alignment of objects on the pinned heap, but that feature did not get into the final implementation.
Considering that the earliest release where runtime support might be available is .NET 6.0, and it won't be available in erlier releases anyways, I'm proposing a library-specific solution:
AllocationOptions.AlignedPinned
The question I can't figure out
I want to make sure there are no unwanted side-effects for the GC. What buffers should we use with this trick, considering that the (mid to large) buffers we want to align will be always temporary? Note that max lifetime of these buffers is the run of an image codec or a processing operation, pinned allocations will be typically bound to these kind of big, expensive operations and won't happen within individual primitive steps.
Marshal.Alloc
?@tannergooding @saucecontrol any thoughts/concerns?
Trying to also summon @benaadams if it's not too impolite.