Open EgorBot opened 3 weeks ago
Arm64
BenchmarkDotNet v0.14.0, Ubuntu 24.04 LTS (Noble Numbat)
Arm64
Job-MELZVS : .NET 10.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
Job-YBPLAN : .NET 10.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
EnvironmentVariables=DOTNET_JitDisasm=TestInner
| Method | Toolchain | Mean | Error | Ratio |
|---|---|---|---|---|
| Test | Main | 1.025 ns | 0.0006 ns | 1.00 |
| Test | PR | 1.082 ns | 0.0007 ns | 1.06 |
Bench_Test profiles (Main vs PR): flame graphs, Speedscope, hot asm, hot functions, counters
Intel
BenchmarkDotNet v0.14.0, Ubuntu 24.04 LTS (Noble Numbat)
Intel Xeon Platinum 8488C, 1 CPU, 8 logical and 4 physical cores
Job-FXRMOH : .NET 10.0.0 (42.42.42.42424), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
Job-TQFOQG : .NET 10.0.0 (42.42.42.42424), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
EnvironmentVariables=DOTNET_JitDisasm=TestInner
| Method | Toolchain | Mean | Error | Ratio |
|---|---|---|---|---|
| Test | Main | 4.0075 ns | 0.0021 ns | 1.00 |
| Test | PR | 0.6158 ns | 0.0013 ns | 0.15 |
Bench_Test profiles (Main vs PR): flame graphs, Speedscope, hot asm, hot functions, counters
@AndyAyersMS not sure why perf is the same on arm64; perhaps the GDV check is expensive there? The JitDisasm.asm output (see BDN_Artifacts.zip) does show that the PR has different codegen:
arm64:
Main:
; Assembly listing for method Bench:TestInner(System.Collections.Generic.ICollection`1[System.String]):int:this (Tier1)
; Emitting BLENDED_CODE for generic ARM64 - Unix
; Tier1 code
; optimized code
; fp based frame
; partially interruptible
G_M000_IG01: ;; offset=0x0000
stp fp, lr, [sp, #-0x10]!
mov fp, sp
G_M000_IG02: ;; offset=0x0008
mov x0, x1
movz x11, #0x5B0
; ............................... 32B boundary ...............................
movk x11, #0xB805 LSL #16
movk x11, #0xFAC8 LSL #32
ldr xip0, [x11]
blr xip0
G_M000_IG03: ;; offset=0x0020
ldp fp, lr, [sp], #0x10
ret lr
; Total bytes of code 40
PR:
; Assembly listing for method Bench:TestInner(System.Collections.Generic.ICollection`1[System.String]):int:this (Tier1)
; Emitting BLENDED_CODE for generic ARM64 - Unix
; Tier1 code
; optimized code
; fp based frame
; partially interruptible
; 0 inlinees with PGO data; 1 single block inlinees; 0 inlinees without PGO data
G_M000_IG01: ;; offset=0x0000
stp fp, lr, [sp, #-0x10]!
mov fp, sp
G_M000_IG02: ;; offset=0x0008
ldr x0, [x1]
movz x11, #0x9088
; ............................... 32B boundary ...............................
movk x11, #0x1D1C LSL #16
movk x11, #0xE000 LSL #32
cmp x0, x11
bne G_M000_IG05
G_M000_IG03: ;; offset=0x0020
ldr w0, [x1, #0x08]
G_M000_IG04: ;; offset=0x0024
ldp fp, lr, [sp], #0x10
ret lr
G_M000_IG05: ;; offset=0x002C
mov x0, x1
; ............................... 32B boundary ...............................
movz x11, #0x5B0
movk x11, #0x1C02 LSL #16
movk x11, #0xE000 LSL #32
ldr xip0, [x11]
blr xip0
b G_M000_IG04
; Total bytes of code 72
x64:
Main:
; Assembly listing for method Bench:TestInner(System.Collections.Generic.ICollection`1[System.String]):int:this (Tier1)
; Emitting BLENDED_CODE for X64 with AVX512 - Unix
; Tier1 code
; optimized code
; rbp based frame
; partially interruptible
G_M000_IG01: ;; offset=0x0000
push rbp
mov rbp, rsp
G_M000_IG02: ;; offset=0x0004
mov rdi, rsi
mov r11, 0x79AD0B0605B0
call [r11]System.Collections.Generic.ICollection`1[System.__Canon]:get_Count():int:this
nop
G_M000_IG03: ;; offset=0x0015
pop rbp
ret
; Total bytes of code 23
PR:
; Assembly listing for method Bench:TestInner(System.Collections.Generic.ICollection`1[System.String]):int:this (Tier1)
; Emitting BLENDED_CODE for X64 with AVX512 - Unix
; Tier1 code
; optimized code
; rbp based frame
; partially interruptible
; 0 inlinees with PGO data; 1 single block inlinees; 0 inlinees without PGO data
G_M000_IG01: ;; offset=0x0000
push rbp
mov rbp, rsp
G_M000_IG02: ;; offset=0x0004
mov rdi, 0x7714F98998B0
cmp qword ptr [rsi], rdi
jne SHORT G_M000_IG05
G_M000_IG03: ;; offset=0x0013
mov eax, dword ptr [rsi+0x08]
G_M000_IG04: ;; offset=0x0016
pop rbp
ret
G_M000_IG05: ;; offset=0x0018
mov rdi, rsi
mov r11, 0x7714F88505B0
; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (mov: 5) 32B boundary ...............................
call [r11]System.Collections.Generic.ICollection`1[System.__Canon]:get_Count():int:this
jmp SHORT G_M000_IG04
; Total bytes of code 42
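For readers less familiar with GDV (guarded devirtualization), the PR listings above correspond roughly to the C# shape below. This is only a sketch under assumptions: the guessed type `string[]` is inferred from the benchmark passing a `string[]` and from the 32-bit load at offset 8 (the array length), and the names `GdvSketch`, `CountViaInterface` and `CountWithGuard` are hypothetical, not anything from the PR.

```cs
using System;
using System.Collections.Generic;

public static class GdvSketch
{
    // Shape of the Main codegen: an unconditional interface call.
    public static int CountViaInterface(ICollection<string> c) => c.Count;

    // Rough shape of the PR codegen: compare the object's exact type against
    // one guessed type, use an inlined fast path on a match, else fall back.
    public static int CountWithGuard(ICollection<string> c)
    {
        if (c.GetType() == typeof(string[]))   // guard: method-table compare
            return ((string[])c).Length;       // fast path: load the array length
        return c.Count;                        // slow path: original interface call
    }

    public static void Main()
    {
        ICollection<string> array = new string[512];
        ICollection<string> list = new List<string> { "a", "b" };

        // Both paths must agree whether or not the guard succeeds.
        Console.WriteLine(CountWithGuard(array) == CountViaInterface(array));
        Console.WriteLine(CountWithGuard(list) == CountViaInterface(list));
    }
}
```

The fast path is only a win if the guard itself is cheap, which is the question raised for arm64.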
> @AndyAyersMS not sure why perf is the same on arm64, perhaps GDV check is expensive on arm64?
Yeah, seems like it might be the cost of forming the constant for the type.
Also interesting that we can't tail call ... need to investigate that. With the advent of CET/CFG, tail calling is probably becoming more valuable than it used to be (one less return, anyway).
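One way to picture the cost of forming the type constant: in the arm64 PR listing the 48-bit method-table address is assembled from 16-bit pieces (`movz` plus two `movk`), while the x64 listing encodes it as a single `mov` immediate. A small sketch using the constant from the arm64 PR listing; `SplitHalfwords` is a hypothetical helper written for illustration, not JIT code, and it ignores other encodings the JIT may choose for other values.

```cs
using System;
using System.Collections.Generic;

public static class ConstantSketch
{
    // Split a 64-bit constant into its non-zero 16-bit halfwords, lowest first.
    // Each entry corresponds to one movz/movk in a simple materialization sequence.
    public static List<(ushort Bits, int Shift)> SplitHalfwords(ulong value)
    {
        var parts = new List<(ushort Bits, int Shift)>();
        for (int shift = 0; shift < 64; shift += 16)
        {
            ushort bits = (ushort)(value >> shift);
            if (bits != 0) parts.Add((bits, shift));
        }
        return parts;
    }

    public static void Main()
    {
        // Constant built by: movz #0x9088, movk #0x1D1C LSL #16, movk #0xE000 LSL #32.
        ulong guessedType = 0xE0001D1C9088;
        foreach (var (bits, shift) in SplitHalfwords(guessedType))
            Console.WriteLine($"0x{bits:X4} LSL #{shift}");
    }
}
```

Counting from the listings, the arm64 guard is `ldr`, `movz`, `movk`, `movk`, `cmp`, `bne` (six instructions) against `mov`, `cmp`, `jne` on x64 (three).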
Processing https://github.com/dotnet/runtime/pull/109209#issuecomment-2439624121 command:
Command
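-intel -arm64 -profiler --envvars DOTNET_JitDisasm:TestInner

```cs
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Collections.Generic;
using System.Runtime.CompilerServices;

BenchmarkSwitcher.FromAssembly(typeof(Bench).Assembly).Run(args);

public class Bench
{
    static string[] Data = new string[512];

    [Benchmark]
    public int Test() => TestInner(Data);

    // The source is cut off after "ICollection". The parameter type and the
    // Count call are reconstructed from the disassembly above
    // (TestInner(ICollection`1[System.String]):int calling get_Count);
    // the parameter name "c" is a placeholder.
    [MethodImpl(MethodImplOptions.NoInlining)]
    int TestInner(ICollection<string> c) => c.Count;
}
```

(EgorBot will reply in this issue)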