dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
15.16k stars 4.71k forks source link

[fastthis] Delegate performance being eaten up by not passing `this` in dedicated register #58955

Open joshudson opened 3 years ago

joshudson commented 3 years ago

Description

Optimization is available in all 64 bit calling conventions to use fastthis to make static and member delegates equally fast.

Some code in our support library:

        Public Function Identity(Of T)() As Func(Of T, T)
            Return IdentitySource(Of T).Instance
        End Function

        Private NotInheritable Class IdentitySource(Of T)
            Private Sub New()
            End Sub

            Private Function Identity(ByVal tvalue As T) As T
                Return tvalue
            End Function

            'Delegates to instance methods are faster than static methods
            'https://github.com/dotnet/roslyn/issues/20315#issuecomment-309583018
            'Most of the time we don't worry about this, but when the function is this trivial getting it right saves most of the cost of the call.
            Public Shared Instance As Func(Of T, T) = AddressOf New IdentitySource(Of T)().Identity
        End Class

Some discussion on review was that this is a bad reflection on Visual Basic. That's not the issue. By rights the code should have been

        Public Function Identity(Of T)() As Func(Of T, T)
            Static Func(Of T, T) source = Static Function(x) x
            Return source
        End Function

but it cannot be because the code generation has to be bad. This is not a compiler issue; the code generation is just as bad in C# because you can't do any better in IL. Fundamentally the problem relates to the calling convention. Either calling a delegate to a static member or a delegate to an instance member has to be slow because the calling convention is poor. I've been over the calling convention documents, and this is a totally fixable issue.

x86:

The calling convention does not need to be adjusted. The trampoline* should generate

    push dword thisptr
    jmp method

x64 Windows:

To avoid the problem we need to pass the this pointer in a register; rax, r10, and r11 are the only choices and I know of. The trampoline generation prefers rax

trampoline generation:

    mov rax, qword thisptr
    jmp method

If the method would call _chkstk you need to spill the register. There are always four slots so we can always spill it.

x64 System V (covers Linux, Mac, BSD):

To avoid the problem we need to pass the this pointer in a register but rax is not available; r10 is available because we do not use nested functions in a way that requires a link register (alternate interpretation: the this pointer is the link register)

trampoline generation:

    mov r10, rax                 ; Or generate an xchg here for one byte less (possible perf loss is hard to guess)
    mov rax, qword thisptr
    xchg rax, r10
    jmp method

ARM64:

x9-x15 are available. I don't know enough arm assembly to write down the trampoline.

ARM32:

the optimization is not available. There's no free space in the calling convention.

Regression?

No

Analysis

This is my analysis: The human cost of maintaining the long forms of these where people hit or think they hit this performance drain is getting large and will continue to get larger. My estimate is we have already crossed the threshold where fixing the code generation is cheaper than not doing it.

A simpler optimization that would work in some fraction of the time is to adjust closure resolution so that if it would close only over this; make it a private class member instead of a class member on an inner closure class of one variable.

category:cq theme:register-allocator

dotnet-issue-labeler[bot] commented 3 years ago

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

joshudson commented 3 years ago

In case it isn't clear; this is just a straight up ABI change everywhere (requires regenerating any pre-jitted assemblies) for everything not declared extern. The benefit is applied to delegate calls but the calling convention change cannot hurt performance nor stack depth elsewhere as it's passing one more argument in a register than it was before.

ghost commented 3 years ago

Tagging subscribers to this area: @JulieLeeMSFT See info in area-owners.md if you want to be subscribed.

Issue Details
### Description Optimization is available in all 64 bit calling conventions to use fastthis to make static and member delegates equally fast. Some code in our support library: ```` Public Function Identity(Of T)() As Func(Of T, T) Return IdentitySource(Of T).Instance End Function Private NotInheritable Class IdentitySource(Of T) Private Sub New() End Sub Private Function Identity(ByVal tvalue As T) As T Return tvalue End Function 'Delegates to instance methods are faster than static methods 'https://github.com/dotnet/roslyn/issues/20315#issuecomment-309583018 'Most of the time we don't worry about this, but when the function is this trivial getting it right saves most of the cost of the call. Public Shared Instance As Func(Of T, T) = AddressOf New IdentitySource(Of T)().Identity End Class ```` Some discussion on review was that this is a bad reflection on Visual Basic. That's not the issue. By rights the code should have been ```` Public Function Identity(Of T)() As Func(Of T, T) Static Func(Of T, T) source = Static Function(x) x Return source End Function ```` but it cannot be because the code generation has to be bad. This is not a compiler issue; the code generation is just as bad in C# because you can't do any better in IL. Fundamentally the problem relates to the calling convention. Either calling a delegate to a static member or a delegate to an instance member has to be slow because the calling convention is poor. I've been over the calling convention documents, and this is a totally fixable issue. x86: The calling convention does not need to be adjusted. The trampoline* should generate ```` push dword thisptr jmp method ```` * I'm not going to write down the Invoke() implementations; they're obvious if given the trampolines. x64 Windows: To avoid the problem we need to pass the this pointer in a register; `rax`, `r10`, and `r11` are the only choices and I know of. The trampoline generation prefers `rax` trampoline generation: ```` mov rax, qword thisptr jmp method ```` If the method would call `_chkstk` you need to spill the register. There are always four slots so we can always spill it. x64 System V (covers Linux, Mac, BSD): To avoid the problem we need to pass the this pointer in a register but `rax` is not available; `r10` is available because we do not use nested functions in a way that requires a link register (alternate interpretation: the this pointer *is* the link register) trampoline generation: ```` mov r10, rax ; Or generate an xchg here for one byte less (possible perf loss is hard to guess) mov rax, qword thisptr xchg rax, r10 jmp method ```` ARM64: x9-x15 are available. I don't know enough arm assembly to write down the trampoline. ARM32: the optimization is not available. There's no free space in the calling convention. ### Regression? No ### Analysis This is my analysis: The human cost of maintaining the long forms of these where people hit or think they hit this performance drain is getting large and will continue to get larger. My estimate is we have already crossed the threshold where fixing the code generation is cheaper than not doing it. A simpler optimization that would work in some fraction of the time is to adjust closure resolution so that if it would close only over `this`; make it a private class member instead of a class member on an inner closure class of one variable.
Author: joshudson
Assignees: -
Labels: `tenet-performance`, `area-CodeGen-coreclr`, `untriaged`
Milestone: -
JulieLeeMSFT commented 3 years ago

CC @dotnet/jit-contrib