Open tannergooding opened 2 years ago
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch See info in area-owners.md if you want to be subscribed.
Author: | tannergooding |
---|---|
Assignees: | - |
Labels: | `area-CodeGen-coreclr` |
Milestone: | - |
CC. @kunalspathak
Thanks @tannergooding. I will look into this, but I don't think it will happen in .NET 7. Marking this for Future.
It'd be nice to have a simulated case showing that this makes sense and has measurable benefits. I suspect it might already be handled under the hood by register renaming, mov elimination, etc.
Even with register renaming and micro-code caching, we've seen that the additional 2-4 bytes (2 for general-purpose, 4 for SIMD) have a real cost for code and can negatively impact alignment, size, and other things.
We have several methods where "zero" is initialized multiple times or repeatedly initialized, so this would be an opportunity to reduce that while also simplifying the necessary spill/restore logic, since we know zero can be special cased.
Overall it's a general problem with our CSE, which is afraid of all constants (on x86/x64 only) and thinks it's better to re-materialize them than to run out of registers and spill some important loop-dependent variable, or to spill/restore a constant because it is "live across call". This can be changed with `DOTNET_JitConstCSE=3`.
I agree 0 is the most popular one, but if we could properly solve it for the general case it'd be even better.
One of the issues is we explicitly don't want to CSE `Zero`, or if we did, we need some special `LCL_CNS` or other way that we can observe the original value was a constant and what that constant was.
This is because there are many places in lowering where we check for a constant as `op2` and contain/specialize if it is (namely if it is `Zero`). Today if we CSE those, then we just see a `LCL_VAR` instead and we can no longer convert a `movmsk reg, ...; cmp ..., ...` into a `ptest ..., ...` instead.
> Today if we CSE those, then we just see a `LCL_VAR` instead and we can no longer convert a `movmsk`
Well, that LCL_VAR is going to have a "constant" VN that we can use to get the original value - but it most likely won't survive till lowering 😞
Yep. I think it would be good if we had a better way to handle that and ensure we get all the benefits from both ends. Maybe a `LCL_CNS` node or flag on `LCL_VAR` or other special pattern here would be "goodness" to ensure we can always CSE constants while also still always being able to do specialized containment/etc.
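To make the trade-off discussed in this exchange concrete, here is a toy Python model. It is only a sketch and not the JIT's actual IR or lowering code: `Const`, `LclVar`, `known_const`, `constant_value`, and `lower_mask_compare` are all hypothetical names invented for illustration. It shows a lowering check that specializes when `op2` is a constant zero, how that check stops firing once the constant has been CSE'd into a plain local, and how a `LCL_CNS`-style annotation could keep the specialization available.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Const:
    """A constant node; lowering can inspect its value directly."""
    value: int


@dataclass(frozen=True)
class LclVar:
    """A local variable node, e.g. what a CSE'd constant turns into."""
    num: int
    # Hypothetical "LCL_CNS"-style annotation: the constant the local is
    # known to hold, or None when nothing is known.
    known_const: Optional[int] = None


def constant_value(node) -> Optional[int]:
    """Return the constant a node is known to hold, if any."""
    if isinstance(node, Const):
        return node.value
    if isinstance(node, LclVar):
        return node.known_const
    return None


def lower_mask_compare(op2) -> str:
    """Toy lowering: choose a sequence for comparing a SIMD mask with op2."""
    if constant_value(op2) == 0:
        return "ptest"          # specialized form, only valid for zero
    return "movmsk + cmp"       # general form


direct = lower_mask_compare(Const(0))                      # constant still visible
plain_cse = lower_mask_compare(LclVar(3))                  # CSE hid the constant
tagged_cse = lower_mask_compare(LclVar(3, known_const=0))  # annotation keeps it

print(direct, plain_cse, tagged_cse, sep=" | ")
```

In this model the annotated local gets the same specialized sequence as the raw constant, which is the "benefits from both ends" outcome described above. Whether a flag, a new node kind, or surviving value numbers is the right mechanism in the real JIT is left open by the thread.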
How would we prevent it from being clobbered by interop code? Would this add overhead to interop calls? Or would we just rely on the register saving convention?
If we use a callee-saved register, then interop code is expected to save it in its prologue and restore it in its epilogue. The problem is that there are not so many callee-saved registers (it depends on the ABI), and I guess we already rely on some of them for different things.
I happened to investigate the performance of DMath and noticed that we create zeros many times in the hot blocks IG03 through IG06.
For reference, C++ compilers store it in a register at the beginning. https://godbolt.org/z/osTrMfooj
> If we use a callee-saved register, then interop code is expected to save it in its prologue and restore it in its epilogue. The problem is that there are not so many callee-saved registers (it depends on the ABI), and I guess we already rely on some of them for different things.
Right. This is why I think a `caller saved` register is likely better. It means that the method can do the "most efficient" thing: it can avoid spilling (floating-point and SIMD constants are either single instructions or `CLS_VAR` constants with a dedicated location to restore from) and easily rematerialize the value if it's still needed.
Platforms like Arm64 have a dedicated "zero register" and this means that zero is nearly always easily and trivially accessible to codegen. Platforms like Arm32, x86 and x64 however do not directly expose the concept of a zero register. x86 and x64 in particular instead support the concept internally via the register renamer which is not exposed to assembly. Likewise Arm64 doesn't have a dedicated zero register for SIMD even though one for general purpose registers does exist.
Due to using SIMD to zero stack locals and the frequent need to use or compare against zero in many functions, it is often the case that at least one register is zeroed. On the other hand, not many functions are complex enough to utilize all 16 of a given set of registers. Because of this, I believe it would be beneficial at least on x64 (where 16 general purpose and 16 SIMD registers are available, this jumps to ~32 SIMD registers for AVX-512) to "soft reserve" a register to represent zero. The register allocator would have special support for assigning zero into this register and for making it the "least preferenced" register for other values (therefore it is likely the last caller save register) to ensure it can stay zero for as long as possible.
For most methods, this will ensure we initialize no more than once for each register kind and in the off-chance it needs to be "spilled", we do not actually have to incur the cost of storing the value to the stack and can trivially reconstitute it when the call returns. For methods which use many registers, there will ideally be no overall difference to what is generated today as the desired register will be unavailable and so it will fall back to whatever is available instead.
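The behavior described in the two paragraphs above can be sketched with a toy allocator in Python. This is an illustrative model under stated assumptions, not how the JIT's register allocator is implemented: `ToyAllocator`, its methods, and the choice of `r11` as the soft-reserved register are all hypothetical. The register list is the set of volatile (caller-saved) general purpose registers in the Windows x64 calling convention.

```python
class ToyAllocator:
    """Toy model of a "soft reserved" zero register (hypothetical design)."""

    def __init__(self, regs, zero_reg):
        self.zero_reg = zero_reg
        # Other values try every other register first, so the zero
        # register is the least preferred one for them.
        self.preference = [r for r in regs if r != zero_reg] + [zero_reg]
        self.contents = {}   # register -> name of the value it holds
        self.log = []        # "instructions" the toy model would emit

    def allocate(self, name):
        """Assign a non-zero value to the most preferred free register."""
        for reg in self.preference:
            if reg not in self.contents:
                self.contents[reg] = name
                return reg
        # Under pressure, take over the zero register. Nothing is stored
        # to the stack because zero can always be recreated.
        if self.contents.get(self.zero_reg) == "zero":
            self.contents[self.zero_reg] = name
            return self.zero_reg
        raise RuntimeError("toy model: a real allocator would spill here")

    def zero(self):
        """Return a register holding zero, materializing it only if needed."""
        for reg, name in self.contents.items():
            if name == "zero":
                return reg                      # reuse, nothing emitted
        if self.zero_reg not in self.contents:
            reg = self.zero_reg
            self.contents[reg] = "zero"
        else:
            reg = self.allocate("zero")         # fall back to any free register
        self.log.append(f"zero {reg}")
        return reg

    def call(self):
        """Model a call that clobbers every caller-saved register."""
        for reg, name in list(self.contents.items()):
            if name == "zero":
                del self.contents[reg]          # drop it; rematerialize on demand
            else:
                self.log.append(f"spill/reload {name}")


ra = ToyAllocator(["rax", "rcx", "rdx", "r8", "r9", "r10", "r11"], zero_reg="r11")
first = ra.zero()
second = ra.zero()        # reuses the register, no second "zero" instruction
a_reg = ra.allocate("a")  # ordinary values avoid the zero register
ra.call()                 # "a" is spilled and reloaded, zero is simply dropped
third = ra.zero()         # zero is recreated instead of reloaded from the stack
print(ra.log)
```

The point of the sketch is the asymmetry in `call`: an ordinary live value costs a store and a load, while zero costs at most one cheap re-zeroing instruction, and only if it is used again.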
This may also be profitable on x86, but given it has half the number of registers (8 general purpose and 8 SIMD regardless of AVX2 or AVX-512 support), this likely needs more testing and consideration than x64.
category:proposal theme:register-allocator