SIMD store coalescing and atomicity guarantees

SingleAccretion commented 2 years ago

Reading the new memory model document, we can see:

Memory accesses to properly aligned data of primitive types are always atomic. The value that is observed is always a result of complete read and write operations.

However, take this sample:

[MethodImpl(MethodImplOptions.NoInlining)]
private static void Problem(float[] p, Vector4 a)
{
    p[0] = a.X;
    p[1] = a.Y;
    p[2] = a.Z;
    p[3] = a.W;
}

It can emit a store 128 bits in size:

IN0003: 00000D vmovupd  xmm0, xmmword ptr [rdx]
IN0004: 000011 vmovupd  xmmword ptr [rcx+10H], xmm0

x64/x86 does not guarantee these stores will be atomic, so it is questionable whether this transformation is correct. Of course, it is a bit hard to imagine this being an actual problem in practice.

Bonus bug: the transform in question doesn't look at volatility, so it will coalesce volatile stores, dropping any barriers (compiler or harware) in the process, though this is a bit harder to reproduce in C#.

category:cq theme:vector-codegen skill-level:expert cost:medium impact:medium

ghost commented 2 years ago

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch See info in area-owners.md if you want to be subscribed.

Issue Details

Reading the new [memory model document](https://github.com/dotnet/runtime/pull/75790), we can see: > Memory accesses to *properly aligned* data of primitive types are always atomic. The value that is observed is always a result of complete read and write operations. However, take this sample: ```cs [MethodImpl(MethodImplOptions.NoInlining)] private static void Problem(float[] p, Vector4 a) { p[0] = a.X; p[1] = a.Y; p[2] = a.Z; p[3] = a.W; } ``` It can emit a store 128 bits in size: ```asm IN0003: 00000D vmovupd xmm0, xmmword ptr [rdx] IN0004: 000011 vmovupd xmmword ptr [rcx+10H], xmm0 ``` x64/x86 does not guarantee these stores will be atomic, so it is questionable whether this transformation is correct. Of course, it is a bit hard to imagine this being an actual problem in practice. Bonus bug: the transform in question doesn't look at volatility, so it will coalesce volatile stores, dropping any barriers (compiler or harware) in the process, though this is a bit harder to reproduce in C#.

Author:	SingleAccretion
Assignees:	-
Labels:	`area-CodeGen-coreclr`
Milestone:	-

EgorBo commented 2 years ago

JIT doesn't do store coalescing today except, I guess, this SIMD case but I still think there are lots of hidden perf fruits there so it will be sad to give up on that due to this hard requirement.

tannergooding commented 2 years ago

For reference, from Intel® 64 and IA-32 Architectures Software Developer’s Manual - 8.1.1 Guaranteed Atomic Operation:

The Intel486 processor (and newer processors since) guarantees that the following basic memory operations will always be carried out atomically: • Reading or writing a byte. • Reading or writing a word aligned on a 16-bit boundary. • Reading or writing a doubleword aligned on a 32-bit boundary.

The Pentium processor (and newer processors since) guarantees that the following additional memory operations will always be carried out atomically: • Reading or writing a quadword aligned on a 64-bit boundary. • 16-bit accesses to uncached memory locations that fit within a 32-bit data bus.

The P6 family processors (and newer processors since) guarantee that the following additional memory operation will always be carried out atomically: • Unaligned 16-, 32-, and 64-bit accesses to cached memory that fit within a cache line.

Processors that enumerate support for Intel® AVX (by setting the feature flag CPUID.01H:ECX.AVX[bit 28]) guarantee that the 16-byte memory operations performed by the following instructions will always be carried out atomically: • MOVAPD, MOVAPS, and MOVDQA. • VMOVAPD, VMOVAPS, and VMOVDQA when encoded with VEX.128. • VMOVAPD, VMOVAPS, VMOVDQA32, and VMOVDQA64 when encoded with EVEX.128 and k0 (masking disabled).

(Note that these instructions require the linear addresses of their memory operands to be 16-byte aligned.) Accesses to cacheable memory that are split across cache lines and page boundaries are not guaranteed to be atomic by the Intel Core 2 Duo, Intel Atom, Intel Core Duo, Pentium M, Pentium 4, Intel Xeon, P6 family, Pentium, and Intel486 processors. The Intel Core 2 Duo, Intel Atom, Intel Core Duo, Pentium M, Pentium 4, Intel Xeon, and P6 family processors provide bus control signals that permit external memory subsystems to make split accesses atomic; however, nonaligned data accesses will seriously impact the performance of the processor and should be avoided

Except as noted above, an x87 instruction or an SSE instruction that accesses data larger than a quadword may be implemented using multiple memory accesses. If such an instruction stores to memory, some of the accesses may complete (writing to memory) while another causes the operation to fault for architectural reasons (e.g., due an page-table entry that is marked “not present”). In this case, the effects of the completed accesses may be visible to software even though the overall instruction caused a fault. If TLB invalidation has been delayed (see Section 4.10.4.4), such page faults may occur even if all accesses are to the same page

Since accessing an underlying T is atomic, I believe the implication is that the per-element modifications should also be atomic, but the entire 128-bit load/store may not be if it wasn't 16-byte aligned

That being said, we can likely ask our partners at Intel/AMD for confirmation.

AndyAyersMS commented 2 years ago

Somewhat related: #51638

That being said, we can likely ask our partners at Intel/AMD for confirmation.

We did this when looking into #51638, though it doesn't hurt to ask again. My recollection is that nothing is guaranteed other than what's in the spec above.

jkotas commented 2 years ago

cc @VSadov

dotnet / runtime

SIMD store coalescing and atomicity guarantees #76503