A mechanism for specifying alignment on a field or struct should be supported.

tannergooding commented 7 years ago

Rationale

In certain high performance or specialized data structures/algorithms, it is desirable to enforce an alignment for structs, fields, or locals.

Today, CoreFX provides several specialized data structures that either get special alignment handling from the runtime (System.Numerics.Vector) or carry their own specialized padding (https://github.com/dotnet/corefx/pull/22724).

As such, the framework/runtime should provide a mechanism for enforcing a specified alignment for structs and fields. Locals should also be included if that is feasible (I'm not sure that is readily possible today, given that attributes cannot be specified on locals).

Additional Thoughts

It might be worthwhile to additionally expose this on the existing StructLayoutAttribute as an Alignment property.

An alignment of 0 should be treated as "Automatic" (the current behavior of letting the runtime decide alignment).

A mechanism for aligning to the cache would be ideal (https://github.com/dotnet/corefx/pull/22724#issuecomment-319075196). This could perhaps be a special value that would otherwise be invalid (such as Alignment=-1). Other special alignments could also be allowed in a similar manner.

If a field specifies an alignment less than that of its struct type, it should be aligned to the alignment of that type. For example, if you specify Alignment=8 on a Vector4 field (which has an Alignment=16), the field should be treated as Alignment=16.
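The two rules above (0 means "Automatic", and a field never ends up less aligned than its type) can be sketched as a small resolution function. This is an illustration in C, not a .NET API: the name `effective_alignment` and its parameters are hypothetical, and both alignments are assumed to be powers of two.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of any runtime: resolves the alignment a
   field ends up with. requested == 0 stands for "Automatic", meaning the
   natural alignment of the field's type is used. */
static size_t effective_alignment(size_t requested, size_t natural)
{
    /* Assumption: alignments are powers of two (requested may also be 0). */
    assert(natural != 0 && (natural & (natural - 1)) == 0);
    assert((requested & (requested - 1)) == 0);

    if (requested == 0)
        return natural;

    /* A request below the type's own alignment is raised to that alignment. */
    return requested > natural ? requested : natural;
}
```

Under these assumptions, `effective_alignment(8, 16)` yields 16, matching the Vector4 example, while a larger request such as 32 is honored as-is.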

[Design Decision] If a struct specifies an alignment less than that of its first field, it should either:
A. Align the struct as specified and add the appropriate padding so that the first field is also aligned as specified -or-
B. Align the struct as per the requirements of the first field

[EDIT] Make reference to the PR a link by @karelz

JonHanna commented 7 years ago

> This could perhaps be a special value that would otherwise be invalid (such as Alignment=-1). Other special alignments could also be allowed in a similar manner.

Perhaps -4 to mean 4 octets from the end of the cache line?

tannergooding commented 7 years ago

This probably deserves/requires input from some runtime folks as well, given that they have the best understanding of how determining alignment works today.

@karelz, do you know who should be tagged?

tannergooding commented 7 years ago

Going to tag @jkotas, @fiigii, and @mellinoe right now.

This will be very useful for ensuring the backing data structures are properly aligned when they are used in combination with the Hardware Intrinsics feature.

jkotas commented 7 years ago

There are two aspects of this:

1. Controlling field offset alignment within type
2. Controlling alignment when the storage for the type is allocated (on GC heap, on stack, ...)

Is this issue about 1, 2 or both?

tannergooding commented 7 years ago

The second (controlling alignment for the entire type).

If that is provided, the first can be achieved by aligning the whole type and using the appropriate field offset attributes on the individual members.

tannergooding commented 7 years ago

Although, it does somewhat extend into both when dealing with types that have an alignment but are also members of another type. I mentioned in the original comment some scenarios where this may come up.

jkotas commented 7 years ago

The second is a GC feature. At its core, it is a non-pay-for-play GC feature: the GC would need to look at the alignment of every object in various situations (which slows it down everywhere), but only a few situations benefit. It would need to be prototyped, and we would need to be convinced that it is a good tradeoff to make. There was some work done on this earlier; the code is under FEATURE_STRUCTALIGN ifdefs.

tannergooding commented 7 years ago

@jkotas, thanks for the reference (going to take a look through this when I have some time)!

I'm guessing the issue isn't in the first allocation, since that can be considered "trivial". That is, you just need to allocate, at most, Size + (Alignment - 1) bytes and return the first address with the correct alignment.
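A minimal sketch of that over-allocation idea, written in C rather than against the GC: allocate `size + (alignment - 1)` bytes and round the returned address up to the next multiple of `alignment`. The names `aligned_block` and `alloc_aligned` are hypothetical, `alignment` is assumed to be a power of two, and overflow of the size computation is not checked.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical result type: 'base' is what must be passed to free(),
   'aligned' is the first suitably aligned address inside the allocation. */
typedef struct {
    void *base;
    void *aligned;
} aligned_block;

static aligned_block alloc_aligned(size_t size, size_t alignment)
{
    aligned_block block = { NULL, NULL };

    /* Assumption: alignment is a non-zero power of two. */
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);

    /* Worst case, the allocator returns an address one byte past an aligned
       one, so at most alignment - 1 bytes are skipped at the front. */
    block.base = malloc(size + (alignment - 1));
    if (block.base == NULL)
        return block;

    /* Round the address up to the next multiple of alignment. */
    uintptr_t addr = (uintptr_t)block.base;
    uintptr_t rounded = (addr + (alignment - 1)) & ~(uintptr_t)(alignment - 1);
    block.aligned = (void *)rounded;
    return block;
}
```

The caller uses `aligned` for the data and releases the memory with `free(base)`. Keeping both pointers is the cost of this approach: the aligned address alone is not enough to free the block.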

So, I think the hardest part for the GC probably comes into play when the heap is compacted or when objects are otherwise moved, since alignment limits where an object can be moved to. I'm wondering, however, if this can be done without bringing in too much cost.

I would think (possibly naively) that the GC would set a flag indicating whether an object is aligned (or maybe keep a separate tree containing these objects, or something similar). Most objects are not expected to be aligned, so nothing extra needs to be done for them. The few objects that are aligned need to be relocated to an address that is still aligned. The destination can be any free block that is between Size and Size + (Alignment - 1) bytes in length (where Size is enough for a block that starts perfectly aligned and Size + (Alignment - 1) covers a block with "worst case" alignment).
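The relocation constraint can be made concrete with a small fit check. This is only a sketch of the arithmetic, not how any real GC decides placement: `fits_aligned` is a hypothetical name, and `alignment` is assumed to be a power of two.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical check: can the free block [start, start + length) hold an
   object of 'size' bytes at an address that is a multiple of 'alignment'? */
static bool fits_aligned(uintptr_t start, size_t length, size_t size, size_t alignment)
{
    /* Assumption: alignment is a non-zero power of two. */
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);

    /* Bytes skipped to reach the next aligned address (0 if already aligned). */
    size_t padding = (size_t)((alignment - (start % alignment)) % alignment);

    return length >= padding && length - padding >= size;
}
```

A block of exactly `size` bytes fits only when it already starts on an aligned address, whereas a block of `size + (alignment - 1)` bytes fits regardless of where it starts.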

jkotas commented 7 years ago

That is the 10,000 ft view of how this may work. You can tell from the 37 FEATURE_STRUCTALIGN ifdefs left over in the GC from the previous attempt to implement this that it is not exactly trivial to implement. Also, I would expect that the implementation itself is not where most of the work would be; most of the work would be in both functional and performance testing.

tannergooding commented 7 years ago

I put a bit of thought into why the feature is requested (feel free to correct me if you disagree)...

On modern computers, unaligned reads/writes are (generally speaking) as fast as aligned reads/writes. The exception to this is when the load/store crosses a cache-line boundary (or worse, a page boundary).

Looking at the "Intel Optimization Manual", a load/store that crosses a cache-line boundary can take ~4.5x more cycles on modern CPUs and even more on older ones (this is assuming I didn't miss a section that says something different for even newer processors).
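Whether a given access pays that penalty is a matter of simple address arithmetic. The sketch below assumes a 64-byte cache line, which is common on current x86 processors but is not guaranteed; `CACHE_LINE_SIZE` and `crosses_cache_line` are names chosen here for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64-byte cache lines; real code should query the hardware. */
#define CACHE_LINE_SIZE 64u

/* Returns true if an access of 'size' bytes starting at 'address' touches
   more than one cache line. */
static bool crosses_cache_line(uintptr_t address, size_t size)
{
    assert(size > 0);

    uintptr_t first_line = address / CACHE_LINE_SIZE;
    uintptr_t last_line = (address + size - 1) / CACHE_LINE_SIZE;
    return first_line != last_line;
}
```

A 16-byte load at offset 48 stays inside one line, while the same load at offset 56 straddles two. A load whose size is a power of two no larger than the line size never crosses when its address is aligned to its own size, which is the guarantee alignment buys here.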

The most commonly used alignments will likely be:

- 16 (SIMD128; SSE/SSE2)
- 32 (SIMD256; AVX, AVX2)
- 64 (SIMD512; AVX512)
- Cache Line Size (also beneficial for concurrent code; although I didn't touch on that in depth)
- Page Size (also beneficial for large blocks of memory, such as file reads; although I didn't touch on that in depth)

Other alignments (those between cache line size and page size), as far as I can tell, do not provide any real performance benefit. This is because there is no register which can read the data all at once and because it won't provide any additional guarantees of not crossing a cache-line or page boundary.

If getting the GC to support custom aligned types is hard (and we are not likely to get this feature any time soon), then is there a reasonable workaround for the near or long term? For example:

- Providing a 'high-performance' API for allocating aligned blocks of memory not tracked by the GC
- Providing a set of 'high-performance' APIs for manual memory management (allocating/freeing/zeroing/copying heaps/pages/blocks/etc)
  - There are some issues with the existing memory management functions in the Marshal class, some of which are probably fixable

On the other hand, has any consideration been given to supporting custom aligned types, but with certain limitations? For example:

- custom alignment is supported, but only for specific sizes
- custom alignment is supported, but only for arrays
  - I can't, at least this late at night, think of any real-world use cases for heap-allocated single objects that require specific alignment; all the use cases that come to mind involve arrays and multiple reads/writes
  - For single value-type objects (if required), custom alignment that is respected on the stack would likely work, even if it were not respected on the heap
  - Custom alignment that is respected only on the stack won't work for large arrays, since those can easily cause a stack overflow

vermorel commented 5 years ago

> custom alignment is supported, but only for arrays.

Yes, arrays are our only use case. Actually, we would not even need all arrays; gaining control over byte[] arrays alone would already be sufficient, thanks to MemoryMarshal.Cast.

saucecontrol commented 5 years ago

If https://github.com/dotnet/coreclr/issues/19936 is implemented, you'll at least be able to roll your own aligned buffers with the knowledge the GC will never move them.

benaadams commented 5 years ago

My interest would be for CMPXCHG16B with an object reference + tag type struct
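For context, CMPXCHG16B operates on a 16-byte memory operand that must be 16-byte aligned, which is why a reference + tag pair needs an alignment guarantee. A rough C11 analog of such a struct is sketched below, with a plain pointer standing in for the object reference. The type name `tagged_ref` and the helper are hypothetical, and the sketch shows only the layout, not the atomic operation itself.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference + tag pair. alignas on the first member raises the
   alignment of the whole struct to 16 bytes. */
typedef struct {
    alignas(16) void *reference; /* stands in for a managed object reference */
    uint64_t tag;                /* e.g. a version counter to avoid ABA issues */
} tagged_ref;

/* Compile-time checks of the layout a 16-byte compare-exchange needs. */
static_assert(alignof(tagged_ref) == 16, "tagged_ref must be 16-byte aligned");
static_assert(sizeof(tagged_ref) == 16, "tagged_ref must be exactly 16 bytes");

/* Runtime check that a particular instance sits on a 16-byte boundary. */
static bool is_16_byte_aligned(const tagged_ref *value)
{
    return ((uintptr_t)value % 16) == 0;
}
```

In C the compiler enforces this for stack, static, and struct-member storage; the feature requested in this issue is what would give a managed struct the equivalent guarantee.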

vermorel commented 5 years ago

@saucecontrol A memory mapped file will already give you aligned buffers. However, it's an IDisposable object to deal with. To make aligned memory convenient, we need support from the GC.

saucecontrol commented 5 years ago

The issue I linked is specifically about adding GC support. It doesn't handle the alignment, but it solves the problem of the GC potentially moving something after you've found an aligned section to work with.

There's also https://github.com/dotnet/corefx/issues/31787, which addresses aligned allocation of arrays.

dotnet / runtime

A mechanism for specifying alignment on a field or struct should be supported. #22990
