Open joshudson opened 3 years ago
In case it isn't clear: this is just a straight-up ABI change everywhere (it requires regenerating any pre-jitted assemblies) for everything not declared `extern`. The benefit applies to delegate calls, but the calling convention change cannot hurt performance or stack depth elsewhere, as it only passes one more argument in a register than before.
Tagging subscribers to this area: @JulieLeeMSFT See info in area-owners.md if you want to be subscribed.
Author: | joshudson |
---|---|
Assignees: | - |
Labels: | `tenet-performance`, `area-CodeGen-coreclr`, `untriaged` |
Milestone: | - |
CC @dotnet/jit-contrib
Description
An optimization is available in all 64-bit calling conventions: use fastthis to make static and member delegates equally fast.
Some code in our support library:
Some discussion on review was that this is a bad reflection on Visual Basic. That's not the issue. By rights the code should have been
but it cannot be because the code generation has to be bad. This is not a compiler issue; the code generation is just as bad in C# because you can't do any better in IL. Fundamentally the problem relates to the calling convention. Either calling a delegate to a static member or a delegate to an instance member has to be slow because the calling convention is poor. I've been over the calling convention documents, and this is a totally fixable issue.
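The cost described above can be modelled roughly in C. This is only a sketch under stated assumptions, not the runtime's actual layout: the `Delegate` struct, its field names, and the helpers `bind_instance`, `bind_static`, and `invoke` are all hypothetical. It assumes the invoke site always passes the delegate's target as a hidden first argument, so a static method has to be reached through an extra forwarding step that discards that argument and moves the real arguments down one slot.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model, not CoreCLR's real layout: a delegate is a target
   pointer plus the code address that the invoke site calls. */
typedef struct Delegate Delegate;
typedef int (*InvokeFn)(void *hidden_this, int x);
typedef int (*StaticFn)(int x);

struct Delegate {
    void    *target;    /* the instance, or the delegate itself for static methods */
    InvokeFn method;    /* always called as method(target, x) */
    StaticFn static_fn; /* real destination, read only by the thunk */
};

typedef struct { int base; } Counter;

/* Instance method: the hidden first argument is already its `this`. */
static int counter_add(void *hidden_this, int x) {
    const Counter *self = hidden_this;
    return self->base + x;
}

/* Static method: no `this`, so x lives in the first argument slot. */
int twice(int x) { return 2 * x; }

/* Stand-in for the extra hop: drop the hidden argument, move x down
   one slot, and forward the call to the real static method. */
static int static_thunk(void *hidden_this, int x) {
    const Delegate *d = hidden_this;
    return d->static_fn(x);
}

void bind_instance(Delegate *d, Counter *c) {
    d->target = c;
    d->method = counter_add;
    d->static_fn = NULL;
}

void bind_static(Delegate *d, StaticFn fn) {
    d->target = d;            /* the thunk needs the delegate to find fn */
    d->method = static_thunk;
    d->static_fn = fn;
}

/* One invoke sequence serves both kinds of delegate. */
int invoke(const Delegate *d, int x) {
    assert(d->method != NULL);
    return d->method(d->target, x);
}
```

In this model a delegate bound with `bind_instance` reaches `counter_add` in a single indirect call, while one bound with `bind_static` pays for the additional hop through `static_thunk`. Passing the target in a register outside the normal argument sequence, as proposed in the per-platform sections, would let both kinds of method be entered without rearranging arguments.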
x86:
The calling convention does not need to be adjusted. The trampoline* should generate
x64 Windows:
To avoid the problem we need to pass the this pointer in a register; `rax`, `r10`, and `r11` are the only choices that I know of. The trampoline generation prefers `rax`.
trampoline generation:
If the method would call `_chkstk` you need to spill the register. There are always four slots so we can always spill it.
x64 System V (covers Linux, Mac, BSD):
To avoid the problem we need to pass the this pointer in a register, but `rax` is not available; `r10` is available because we do not use nested functions in a way that requires a link register (alternate interpretation: the this pointer is the link register).
trampoline generation:
ARM64:
x9-x15 are available. I don't know enough ARM assembly to write down the trampoline.
ARM32:
The optimization is not available. There's no free register in the calling convention.
Regression?
No
Analysis
This is my analysis: The human cost of maintaining the long forms of these where people hit or think they hit this performance drain is getting large and will continue to get larger. My estimate is we have already crossed the threshold where fixing the code generation is cheaper than not doing it.
A simpler optimization that would work some fraction of the time is to adjust closure resolution so that if a closure would close only over `this`, it becomes a private class member instead of a class member on an inner closure class of one variable.
category:cq theme:register-allocator