Open SingleAccretion opened 2 years ago
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch See info in area-owners.md if you want to be subscribed.
Author: | SingleAccretion |
---|---|
Assignees: | - |
Labels: | `area-CodeGen-coreclr` |
Milestone: | - |
JIT doesn't do store coalescing today except, I guess, this SIMD case but I still think there are lots of hidden perf fruits there so it will be sad to give up on that due to this hard requirement.
For reference, from Intel® 64 and IA-32 Architectures Software Developer’s Manual - 8.1.1 Guaranteed Atomic Operation
:
The Intel486 processor (and newer processors since) guarantees that the following basic memory operations will always be carried out atomically: • Reading or writing a byte. • Reading or writing a word aligned on a 16-bit boundary. • Reading or writing a doubleword aligned on a 32-bit boundary.
The Pentium processor (and newer processors since) guarantees that the following additional memory operations will always be carried out atomically: • Reading or writing a quadword aligned on a 64-bit boundary. • 16-bit accesses to uncached memory locations that fit within a 32-bit data bus.
The P6 family processors (and newer processors since) guarantee that the following additional memory operation will always be carried out atomically: • Unaligned 16-, 32-, and 64-bit accesses to cached memory that fit within a cache line.
Processors that enumerate support for Intel® AVX (by setting the feature flag CPUID.01H:ECX.AVX[bit 28]) guarantee that the 16-byte memory operations performed by the following instructions will always be carried out atomically: • MOVAPD, MOVAPS, and MOVDQA. • VMOVAPD, VMOVAPS, and VMOVDQA when encoded with VEX.128. • VMOVAPD, VMOVAPS, VMOVDQA32, and VMOVDQA64 when encoded with EVEX.128 and k0 (masking disabled).
(Note that these instructions require the linear addresses of their memory operands to be 16-byte aligned.) Accesses to cacheable memory that are split across cache lines and page boundaries are not guaranteed to be atomic by the Intel Core 2 Duo, Intel Atom, Intel Core Duo, Pentium M, Pentium 4, Intel Xeon, P6 family, Pentium, and Intel486 processors. The Intel Core 2 Duo, Intel Atom, Intel Core Duo, Pentium M, Pentium 4, Intel Xeon, and P6 family processors provide bus control signals that permit external memory subsystems to make split accesses atomic; however, nonaligned data accesses will seriously impact the performance of the processor and should be avoided
Except as noted above, an x87 instruction or an SSE instruction that accesses data larger than a quadword may be implemented using multiple memory accesses. If such an instruction stores to memory, some of the accesses may complete (writing to memory) while another causes the operation to fault for architectural reasons (e.g., due an page-table entry that is marked “not present”). In this case, the effects of the completed accesses may be visible to software even though the overall instruction caused a fault. If TLB invalidation has been delayed (see Section 4.10.4.4), such page faults may occur even if all accesses are to the same page
Since accessing an underlying T
is atomic, I believe the implication is that the per-element modifications should also be atomic, but the entire 128-bit load/store may not be if it wasn't 16-byte aligned
That being said, we can likely ask our partners at Intel/AMD for confirmation.
Somewhat related: #51638
That being said, we can likely ask our partners at Intel/AMD for confirmation.
We did this when looking into #51638, though it doesn't hurt to ask again. My recollection is that nothing is guaranteed other than what's in the spec above.
cc @VSadov
Reading the new memory model document, we can see:
However, take this sample:
It can emit a store 128 bits in size:
x64/x86 does not guarantee these stores will be atomic, so it is questionable whether this transformation is correct. Of course, it is a bit hard to imagine this being an actual problem in practice.
Bonus bug: the transform in question doesn't look at volatility, so it will coalesce volatile stores, dropping any barriers (compiler or harware) in the process, though this is a bit harder to reproduce in C#.
category:cq theme:vector-codegen skill-level:expert cost:medium impact:medium