dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.

Consider using SIMD registers for "hot" local variables instead of placing them on the stack when out of free GP registers #10444

Open voinokin opened 6 years ago

voinokin commented 6 years ago

The idea is intuitive, though I'm not sure it was ever raised in the context of the CLR JIT - why not use X/Y/ZMM registers for "hot" local variables to avoid stack memory accesses, just as common GP registers are used to load/store the values? (I'm not talking here about operations other than the load/store MOVQ/MOVD, because that is a much deeper topic which may include auto-vectorization and other funny stuff.)

There are always at least 6 volatile SIMD registers available, and the number of regs used may be increased up to the size of the SIMD register file. With more complex techniques this may provide up to 8 regs for x86/SSE+, up to 16 regs for x64/SSE+, and up to 32 regs for x64/AVX-512 (future). These numbers may be achievable in the context of the CLR because at the moment little code in the system assemblies uses vectors, and to my understanding SIMD registers are otherwise used only for FP operations.

Even taking into account the store-forwarding mechanisms implemented in modern CPUs for memory accesses, a significant speed-up could be achieved. One extra point is that on HyperThreaded CPUs the register files are independent of each other, whereas the memory access circuitry is mostly shared by (sub-)cores.

category:design theme:register-allocator skill-level:expert cost:large impact:large

tannergooding commented 6 years ago

@voinokin, the new Hardware Intrinsics feature (and the existing System.Numerics.Vector feature) makes heavy use of SIMD instructions.

voinokin commented 6 years ago

@tannergooding, indeed it does (and my respect for participating in all this!). Still, it is the system or user code developer who chooses whether or not to use such intrinsics in a particular place, according to the app's needs. So this does not contradict:

These numbers may be achievable in the context of the CLR because at the moment little code in the system assemblies uses vectors, and to my understanding SIMD registers are otherwise used only for FP operations.

What I mean by logging this issue is that there are lots of places in the system assemblies which DO NOT use any SIMD facilities, at least for now, and in very many cases there WILL NEVER BE any relation to SIMD. In such cases, adding the ability to trade stack memory accesses for accesses to (unused!) SIMD regs will improve performance.

tannergooding commented 6 years ago

However, this may also have costs and drawbacks of its own.

If something like this were to be done, there would need to be an initial prototype clearly showing the gains it would provide and any drawbacks it would incur.

voinokin commented 6 years ago

Here is a live example. At the moment I'm developing a tool intended to sort large sets of data (up to 100 GB). Some steps of the algorithm showed a perf improvement when I manually placed some of the local variables in SIMD registers, available through HW intrinsics, which the JIT would otherwise have allocated on the stack. Just by putting 2 local vars into SIMD regs, the tool gained a 10-20% improvement in throughput, because the steps I mention are on the critical path (well, most of the code is on the critical path when one's talking about sorting algorithms ;-) ).
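
For illustration only, here is a minimal C# sketch of the kind of manual trick described above - parking a hot 64-bit local in an XMM register via the MOVQ-style intrinsics instead of letting it live in a stack slot. The method and loop are invented, not the tool's actual code, and in such a tiny loop the JIT would keep the total in a GP register anyway; the trick only pays off in methods where GP register pressure would otherwise force a spill.

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class XmmParkedLocalSketch
{
    // Hypothetical sketch; requires Sse2.X64.IsSupported (x64 with SSE2).
    public static long Sum(long[] data)
    {
        // "Park" the running total in an XMM register: movq xmm, r64.
        Vector128<long> parked = Sse2.X64.ConvertScalarToVector128Int64(0);

        for (int i = 0; i < data.Length; i++)
        {
            // Bring the value back into a GP register (movq r64, xmm), update it...
            long total = Sse2.X64.ConvertToInt64(parked);
            total += data[i];
            // ...and park it again instead of letting it live in a stack slot.
            parked = Sse2.X64.ConvertScalarToVector128Int64(total);
        }

        return Sse2.X64.ConvertToInt64(parked);
    }
}
```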

mikedn commented 6 years ago

In theory it's a good idea. In practice it may be difficult to prove that this is a consistent improvement.

AFAIR someone suggested this to the VC++ guys years ago, but I don't think they implemented it. Whether that's because they didn't have time or because there are problems associated with this idea, I do not know.

One extra point is that on HyperThreaded CPUs the register files are independent of each other, whereas the memory access circuitry is mostly shared by (sub-)cores.

Hmm, last time I checked, recent CPUs (e.g. Skylake) had 2 load ports, so memory loads technically have a reciprocal throughput of 0.5 cycles. Instructions such as movd have a reciprocal throughput of only 1 cycle.

Recently I played a bit with movd to "reinterpret" float as int and the results didn't seem too promising. Going through memory seemed faster, at least in some scenarios.
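
For context, a hedged sketch of the two routes being compared - the method names are mine, and the code mikedn actually benchmarked is not shown in the thread:

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class FloatReinterpretSketch
{
    // Register route: float stays in an XMM reg, bits move via movd r32, xmm.
    // Requires Sse2.IsSupported.
    public static int ViaMovd(float value) =>
        Sse2.ConvertToInt32(Vector128.CreateScalarUnsafe(value).AsInt32());

    // Memory route: taking the address makes the value address-exposed, so the
    // bits typically travel through a stack slot (FP store, then integer reload).
    public static unsafe int ViaMemory(float value) => *(int*)&value;
}
```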

voinokin commented 6 years ago

If something like this were to be done, there would need to be an initial prototype clearly showing the gains it would provide and any drawbacks it would incur.

True - the scope is large.

In theory it's a good idea. In practice it may be difficult to prove that this is a consistent improvement.

I'm willing to participate in this, but that depends on my spare time, unfortunately. Maybe in some time I will prepare perf test results comparing current vs. suggested codegen for some common cases (I will try to copy the machine code into an asm harness, replacing the stack accesses with SIMD reg accesses).

mikedn commented 6 years ago

At the moment I'm developing a tool intended to sort large sets of data (up to 100 GB).

I wonder if what helps in that scenario is the fact that the variables are kept in registers or the fact that perhaps you're freeing a bit of CPU cache memory.

voinokin commented 6 years ago

At the moment I'm developing a tool intended to sort large sets of data (up to 100 GB). I wonder if what helps in that scenario is the fact that the variables are kept in registers or the fact that perhaps you're freeing a bit of CPU cache memory.

The variables' footprint is just 16 bytes, which is no more than 1 cache line.

voinokin commented 6 years ago

BTW, the issue dotnet/runtime#10394 I logged a while ago is related exactly to the attempt to calculate and compare performance w.r.t. different microarchitectures. Could you please suggest a way to do this? Do I need to make my own modified build of the JIT and experiment with it, or is there a better way?

RussKeldorph commented 6 years ago

@dotnet/jit-contrib

ArtBlnd commented 6 years ago

I don't think this really improves general scenarios. A hot memory access means the variable is used more often than usual, so it will be cached in a CPU cache line. (Storing a hot variable on the stack? That sounds like the JIT made a poor register selection.)

Also, loading and storing via SIMD registers will cause stalls unless you are going to use SIMD instructions on that hot data anyway.

If there are any cases where a hot variable ends up on the stack, we need to take a closer look at them.

voinokin commented 6 years ago

@ArtBlnd I'm aware of reg read stalls; still, a cache line access may be slower and it occupies a load or store port. IIRC the numbers involved are: 3 cycles for a reg read stall vs. 4 cycles when reading from the L1D cache. I will confirm these numbers later, providing the source, and will also confirm the instruction encoding lengths for both cases.

Meanwhile, according to the tables from both Agner Fog and Intel themselves, the numbers for an unpenalized READ on Intel CPUs from Nehalem through Skylake are:

- MOV r, [m] - L=2c, T=0.5c, 1 uop, runs on p2/3 (these load ports had better be busy with something more useful :-) )
- MOVD r32/r64, x/ymm - L=2c, T=1c, 1 uop, runs on p0 (L=1c before Skylake, overall better before Haswell)
- MOVQ r64, x/ymm - same as above

Unpenalized WRITE:

- MOV [m], r - L=2c, T=1c, 2 uops, runs on p2/3/[7] + p4 (L=3c, T=1c, 1 uop before Skylake)
- MOVD x/ymm, r32/r64 - L=2c, T=1c, 1 uop, runs on p5 (L=1c before Skylake, overall better before Haswell)
- MOVQ x/ymm, r64 - same as above

Also, with the suggested feature in place, it would become possible to transfer data directly between local vars - it's quite a common case to see sequences of MOVs between two stack locations in the code the CLR JIT produces from IL. When the vars are in SIMD regs it's enough to do MOVAPS / MOVQ / MOVDQA, and these instructions already have very good numbers - L=0-1c, T=0.25-0.33c, 1 uop, run on p0/1/5 - basically the same as a common MOV r, r. This is way faster than loading the value from the stack into a GP reg and then storing it back. There is also some benefit when it comes to int <-> float conversions, currently performed with CVTxx2yy - the source value is already in a SIMD reg, and it's not uncommon for the resulting value to be kept in another local var for some time.

Maybe some time later I will add numbers for AMD CPUs (I have no deep experience with those).

A side note - I doubt that storing more than one local var in a single SIMD reg is a good idea, due to the timing of the PINSRx/PEXTRx instructions - L=3c, T=2c, 2 uops.
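
To make the side note concrete, here is a hedged sketch (invented names, using the portable Vector128 helpers rather than raw PINSRQ/PEXTRQ) of what packing two hot 64-bit locals into one XMM register would look like; every access to the second lane goes through a PEXTRQ/PINSRQ-class operation, which is where the unfavorable timings above bite:

```csharp
using System.Runtime.Intrinsics;

static class PackedLocalsSketch
{
    // Two hot 64-bit "locals" packed into the lanes of one XMM register.
    public static long Run(int iterations)
    {
        Vector128<long> packed = Vector128.Create(0L, 0L); // lane 0 = counterA, lane 1 = counterB

        for (int i = 0; i < iterations; i++)
        {
            long a = packed.GetElement(0);      // lane 0 extract (movq-class, cheap)
            long b = packed.GetElement(1);      // lane 1 extract (pextrq-class, the costly part)

            a += i;
            b += 2 * i;

            packed = packed.WithElement(0, a)   // lane 0 insert
                           .WithElement(1, b);  // lane 1 insert (pinsrq-class, the costly part)
        }

        return packed.GetElement(0) + packed.GetElement(1);
    }
}
```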

voinokin commented 6 years ago

Here are numbers for L1D cache access taken from https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf

(Screenshot of the L1D cache access latency table from the Intel optimization manual.)

voinokin commented 6 years ago

Regarding reg read stalls: the same manual says in section "3.5.2.1 ROB Read Port Stalls":

As a micro-op is renamed, it determines whether its source operands have executed and been written to the reorder buffer (ROB), or whether they will be captured “in flight” in the RS or in the bypass network. Typically, the great majority of source operands are found to be “in flight” during renaming. Those that have been written back to the ROB are read through a set of read ports. Since the Intel Core microarchitecture is optimized for the common case where the operands are “in flight”, it does not provide a full set of read ports to enable all renamed micro-ops to read all sources from the ROB in the same cycle. When not all sources can be read, a micro-op can stall in the rename stage until it can get access to enough ROB read ports to complete renaming the micro-op. This stall is usually short-lived. Typically, a micro-op will complete renaming in the next cycle, but it appears to the application as a loss of rename bandwidth. [...skipped, then:...] Starting with Intel microarchitecture code name Sandy Bridge, ROB port stall no longer applies because data is read from the physical register file.

From Agner Fog's manual http://www.agner.org/optimize/microarchitecture.pdf section "9 Sandy Bridge and Ivy Bridge pipeline":

9.9 Register read stalls
Register read stalls has been a serious, and often neglected, bottleneck in previous processors since the Pentium Pro. All Intel processors based on the P6 microarchitecture and its successors, the Pentium M, Core and Nehalem microarchitectures have a limitation of two or three reads from the permanent register file per clock cycle. This bottleneck has now finally been removed in the Sandy Bridge and Ivy Bridge. In my experiments, I have found no practical limit to the number of register reads.

ArtBlnd commented 6 years ago

@voinokin Okay, then this makes sense.

Anyway, can I have some cases? It would be great if you could attach assemblies or source code (whether it's C, C++, or C#) to help this out.

voinokin commented 6 years ago

@ArtBlnd I'll come back later with some real-life examples so they are close to "general cases", hopefully with perf measurements.

voinokin commented 6 years ago

A couple of thoughts and observations I've gathered on this topic while working on my high-perf multithreaded app.

  1. When variable round-trips between a GP reg and a stack location are replaced with GP reg <-> SIMD reg round-trips, my overall impression is that although the code I updated this way runs at more or less the same speed (maybe slightly slower on Nehalem, which still has the reg read stall bottleneck mentioned a couple of posts earlier), OTHER methods started running slightly faster according to VTune. My first guess is that the pressure on the load/store CPU ports and the overall memory bus pressure coming from such methods lessens, giving more breathing room to other methods that access memory. I'll try to confirm this and provide numbers some time later. If this idea is right, then replacing stack accesses with SIMD reg accesses may turn out to be even more beneficial on HyperThreaded CPUs, since their execution units - which of course include the load/store units - live on the same physical core and are shared among the logical cores (this is how I understand it). I'm not sure about the GP <-> SIMD reg transfer units, but I suppose they are independent among logical cores since the register files are independent.

  2. (Related) When implementing C++-style iterators in C#, with their data stored as value types on the stack and the code fully inlineable, it appears that there is a significant amount of stack location <-> GP reg round-trips. Such iterators are often used in tight loops, so such round-trips are not a good thing. I'll try to model replacing them with SIMD reg <-> GP reg round-trips to confirm the benefits. I'll also investigate this more closely for typical C# iteration loops over array elements, including the ubiquitous for loop. (A minimal sketch of such an iterator is at the end of this comment.)

P.S. This is again related to Intel CPUs; I can't say anything about AMD for now, since I have no experience with their uarch yet and it's not a primary priority for me at the moment.
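
Here is the minimal sketch referenced in point 2 - the types and names are invented, and the comments describe what may happen under register pressure rather than guaranteed codegen:

```csharp
// C++-style value-type iterator pattern: the iterator's state lives in a struct
// local, and when the struct is spilled, its fields bounce between stack slots
// and GP registers inside the inlined loop body.
struct ArrayCursor
{
    private readonly int[] _data;
    private int _index;

    public ArrayCursor(int[] data) { _data = data; _index = -1; }

    public bool MoveNext() => ++_index < _data.Length;
    public int Current => _data[_index];
}

static class ArrayCursorSketch
{
    public static long Sum(int[] data)
    {
        long sum = 0;
        var cursor = new ArrayCursor(data);   // struct local; its fields may get stack slots
        while (cursor.MoveNext())             // when inlined, _index may round-trip stack <-> GP reg
            sum += cursor.Current;            // each access may reload _data/_index from the stack
        return sum;
    }
}
```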

tannergooding commented 5 years ago

It might be worth noting that the current Intel® 64 and IA-32 Architectures Optimization Reference Manual (April 2019) actually suggests spilling general-purpose registers to XMM registers (I also see this suggestion in the April 2018 edition, but didn't dive further back).

(Screenshot of the Intel optimization manual's recommendation to spill general-purpose registers to XMM registers.)

briansull commented 5 years ago

We can't spill general-purpose registers that hold GC refs or byrefs into XMM registers, since we don't currently support reporting such registers in the GC info (and it would be a lot of work to add such support).

EgorBo commented 1 year ago

Related article: https://shipilev.net/jvm/anatomy-quarks/20-fpu-spills/