dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

Inline ManagedThreadId #91205

Open EgorBo opened 1 year ago

EgorBo commented 1 year ago

Currently, Environment.CurrentManagedThreadId and Thread.CurrentThread.ManagedThreadId (special-cased in the JIT) each emit a single helper call to get the thread ID. These APIs seem to be perf-sensitive, and since we now have TLS expansion we can easily optimize them by:

1) Disabling the JIT optimization for Thread.CurrentThread.ManagedThreadId (where it recognizes the two calls and folds them into a helper call).
2) Making Environment.CurrentManagedThreadId simply return Thread.CurrentThread.ManagedThreadId, like it used to.
3) Surfacing Thread.m_InternalThread to managed land, so that it becomes a single mov operation.

My quick experiments show 2x-3x perf improvements

Example:

[MethodImpl(MethodImplOptions.NoInlining)]
static int ThreadId()
{
    return Thread.CurrentThread.ManagedThreadId;
}

Current codegen:

; Assembly listing for method Program:ThreadId():int (Tier1)
       sub      rsp, 40
       call     CORINFO_HELP_GETCURRENTMANAGEDTHREADID
       nop      
       add      rsp, 40
       ret      

Expected codegen:


; Assembly listing for method Program:ThreadId():int (Tier1)
       sub      rsp, 40
       mov      rax, qword ptr GS:[0x0058]
       mov      rax, qword ptr [rax+0x30]
       cmp      dword ptr [rax+0x70], 2
       jl       SHORT G_M10143_IG07
       mov      rax, qword ptr [rax+0x78]
       mov      rax, qword ptr [rax+0x10]
       test     rax, rax
       je       SHORT G_M10143_IG07
       mov      rax, bword ptr [rax]
       add      rax, 16
G_M10143_IG03:
       mov      rcx, gword ptr [rax+0x18]
       test     rcx, rcx
       jne      SHORT G_M10143_IG05
       ;; this call should be colder?
       call     [System.Threading.Thread:InitializeCurrentThread():System.Threading.Thread]
       mov      rcx, rax
G_M10143_IG05:
       cmp      dword ptr [rcx], ecx
       mov      eax, dword ptr [rcx+0x..] ;; access m_InternalThread
       add      rsp, 40
       ret      
G_M10143_IG07:
       mov      ecx, 2
       call     CORINFO_HELP_GETSHARED_GCTHREADSTATIC_BASE_NOCTOR_OPTIMIZED
       jmp      SHORT G_M10143_IG03
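
For readers less used to JIT listings, the shape of the expected codegen above corresponds roughly to the managed pattern below. This is only a sketch under stated assumptions: FakeThread, t_current, InitializeCurrent and _managedThreadId are hypothetical stand-ins and not the runtime's actual Thread implementation. It shows a thread-static slot that is read on the hot path, a cold initializer taken once per thread, and the ID being read as an ordinary field.

```csharp
#nullable enable
using System;
using System.Threading;

// Hypothetical stand-in for the runtime's Thread type; all names are illustrative.
sealed class FakeThread
{
    // Thread-local slot. With TLS expansion, a read like this can be inlined
    // instead of going through a helper call on the hot path.
    [ThreadStatic]
    private static FakeThread? t_current;

    private static int s_nextId;

    // Plain instance field: once the object is in a register, reading it is one load.
    private readonly int _managedThreadId;

    private FakeThread(int id) => _managedThreadId = id;

    // Hot path: TLS read plus a null check. The initializer is the cold branch.
    public static FakeThread Current => t_current ?? InitializeCurrent();

    // Cold path, taken at most once per thread.
    private static FakeThread InitializeCurrent()
    {
        var thread = new FakeThread(Interlocked.Increment(ref s_nextId));
        t_current = thread;
        return thread;
    }

    public int ManagedThreadId => _managedThreadId;
}

static class FakeThreadDemo
{
    public static void Main()
    {
        int first = FakeThread.Current.ManagedThreadId;
        int second = FakeThread.Current.ManagedThreadId;

        int other = 0;
        var worker = new Thread(() => other = FakeThread.Current.ManagedThreadId);
        worker.Start();
        worker.Join();

        Console.WriteLine(first == second); // same thread, same ID
        Console.WriteLine(first != other);  // different thread, different ID
    }
}
```

The null check followed by a call in the listing plays the role of the `??` fallback here, which is why that call is a candidate for being moved to a colder block.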

Concerns: Mono and NativeAOT.

ghost commented 1 year ago

Tagging subscribers to this area: @mangod9
See info in area-owners.md if you want to be subscribed.

Issue Details
Author: EgorBo
Assignees: -
Labels: `area-System.Threading`
Milestone: -
ghost commented 1 year ago

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details
Author: EgorBo
Assignees: -
Labels: `area-CodeGen-coreclr`
Milestone: -
EgorBo commented 1 year ago

Prototype: https://github.com/EgorBo/runtime-1/commit/5b1fcb8713ae5b3b985845c85f58ffd5a337782c

[Benchmark]
public bool ThreadId1() => Thread.CurrentThread.ManagedThreadId == 42;

[Benchmark]
public bool ThreadId2() => Environment.CurrentManagedThreadId == 42;

Windows-x64, Ryzen 7950X:

|    Method |                   Toolchain |      Mean |
|---------- |---------------------------- |----------:|
| ThreadId1 | \Core_Root_base\corerun.exe | 0.7658 ns |
| ThreadId2 | \Core_Root_base\corerun.exe | 0.7457 ns |
|           |                             |           |
| ThreadId1 |   \Core_Root_PR\corerun.exe | 0.3993 ns |
| ThreadId2 |   \Core_Root_PR\corerun.exe | 0.3928 ns |

NativeAOT will likely regress, so I didn't clean this up for NAOT here and kept its current behavior.
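
For anyone who wants a quick local sanity check without setting up BenchmarkDotNet, a rough Stopwatch-based sketch along these lines may help. It is not the harness used for the numbers above and will be much noisier. The class name, the Measure helper and the iteration count are assumptions made for illustration. It first checks that both APIs report the same ID on the calling thread, then times each one in a loop.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

// Rough timing sketch; all names here are illustrative and not part of the prototype.
static class ThreadIdTiming
{
    private const int Iterations = 100_000_000; // arbitrary choice

    // Runs the given ID getter in a loop and returns approximate nanoseconds per call.
    // The delegate call adds overhead, so treat the results as relative, not absolute.
    public static double Measure(Func<int> getId)
    {
        int hits = 0;
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < Iterations; i++)
        {
            if (getId() == 42) hits++;
        }
        sw.Stop();
        GC.KeepAlive(hits); // keep the result observable
        return sw.ElapsedTicks * 1e9 / Stopwatch.Frequency / Iterations;
    }

    public static void Main()
    {
        // Both APIs are expected to agree for the calling thread.
        Console.WriteLine(Environment.CurrentManagedThreadId == Thread.CurrentThread.ManagedThreadId);

        // Warm up so that later measurements run optimized code.
        Measure(() => Thread.CurrentThread.ManagedThreadId);
        Measure(() => Environment.CurrentManagedThreadId);

        double viaThread = Measure(() => Thread.CurrentThread.ManagedThreadId);
        double viaEnvironment = Measure(() => Environment.CurrentManagedThreadId);

        Console.WriteLine($"Thread.CurrentThread.ManagedThreadId: {viaThread:F4} ns/call");
        Console.WriteLine($"Environment.CurrentManagedThreadId:   {viaEnvironment:F4} ns/call");
    }
}
```

Because each call goes through a delegate, the absolute numbers will be higher than the sub-nanosecond means in the table. Only the relative difference between a baseline runtime and a patched one is meaningful here.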