EgorBot for AndyAyersMS in #109209

EgorBot commented 3 weeks ago

Processing https://github.com/dotnet/runtime/pull/109209#issuecomment-2439624121 command:

Command

-intel -arm64 -profiler --envvars DOTNET_JitDisasm:TestInner

```cs
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Collections.Generic;
using System.Runtime.CompilerServices;

BenchmarkSwitcher.FromAssembly(typeof(Bench).Assembly).Run(args);

public class Bench
{
    static string[] Data = new string[512];

    [Benchmark]
    public int Test() => TestInner(Data);

    // NoInlining keeps TestInner a separate method so DOTNET_JitDisasm can capture it.
    [MethodImpl(MethodImplOptions.NoInlining)]
    int TestInner(ICollection<string> c) => c.Count;
}
```

(EgorBot will reply in this issue)

EgorBot commented 3 weeks ago

Benchmark results on `Arm64`

BenchmarkDotNet v0.14.0, Ubuntu 24.04 LTS (Noble Numbat)
Arm64
  Job-MELZVS : .NET 10.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
  Job-YBPLAN : .NET 10.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
EnvironmentVariables=DOTNET_JitDisasm=TestInner

| Method | Toolchain | Mean | Error | Ratio |
|--------|-----------|---------:|----------:|------:|
| Test | Main | 1.025 ns | 0.0006 ns | 1.00 |
| Test | PR | 1.082 ns | 0.0007 ns | 1.06 |

BDN_Artifacts.zip

Profile for `Bench_Test`:

Flame graphs: Main vs PR 🔥 | Speedscope: Main vs PR | Hot asm: Main vs PR | Hot functions: Main vs PR | Counters: Main vs PR

EgorBot commented 3 weeks ago

cc @AndyAyersMS (logs)

EgorBot commented 3 weeks ago

Benchmark results on `Intel`

BenchmarkDotNet v0.14.0, Ubuntu 24.04 LTS (Noble Numbat)
Intel Xeon Platinum 8488C, 1 CPU, 8 logical and 4 physical cores
  Job-FXRMOH : .NET 10.0.0 (42.42.42.42424), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  Job-TQFOQG : .NET 10.0.0 (42.42.42.42424), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
EnvironmentVariables=DOTNET_JitDisasm=TestInner

| Method | Toolchain | Mean | Error | Ratio |
|--------|-----------|----------:|----------:|------:|
| Test | Main | 4.0075 ns | 0.0021 ns | 1.00 |
| Test | PR | 0.6158 ns | 0.0013 ns | 0.15 |

BDN_Artifacts.zip

Profile for `Bench_Test`:

Flame graphs: Main vs PR 🔥 | Speedscope: Main vs PR | Hot asm: Main vs PR | Hot functions: Main vs PR | Counters: Main vs PR

EgorBot commented 3 weeks ago

cc @AndyAyersMS (logs)

EgorBo commented 3 weeks ago

@AndyAyersMS not sure why perf is the same on arm64; perhaps the GDV check is expensive on arm64? The JitDisasm.asm output (see BDN_Artifacts.zip) does show that the PR has different codegen:

Main:

; Assembly listing for method Bench:TestInner(System.Collections.Generic.ICollection`1[System.String]):int:this (Tier1)
; Emitting BLENDED_CODE for generic ARM64 - Unix
; Tier1 code
; optimized code
; fp based frame
; partially interruptible

G_M000_IG01:                ;; offset=0x0000
            stp     fp, lr, [sp, #-0x10]!
            mov     fp, sp

G_M000_IG02:                ;; offset=0x0008
            mov     x0, x1
            movz    x11, #0x5B0
; ............................... 32B boundary ...............................
            movk    x11, #0xB805 LSL #16
            movk    x11, #0xFAC8 LSL #32
            ldr     xip0, [x11]
            blr     xip0

G_M000_IG03:                ;; offset=0x0020
            ldp     fp, lr, [sp], #0x10
            ret     lr

; Total bytes of code 40

PR:

; Assembly listing for method Bench:TestInner(System.Collections.Generic.ICollection`1[System.String]):int:this (Tier1)
; Emitting BLENDED_CODE for generic ARM64 - Unix
; Tier1 code
; optimized code
; fp based frame
; partially interruptible
; 0 inlinees with PGO data; 1 single block inlinees; 0 inlinees without PGO data

G_M000_IG01:                ;; offset=0x0000
            stp     fp, lr, [sp, #-0x10]!
            mov     fp, sp

G_M000_IG02:                ;; offset=0x0008
            ldr     x0, [x1]
            movz    x11, #0x9088
; ............................... 32B boundary ...............................
            movk    x11, #0x1D1C LSL #16
            movk    x11, #0xE000 LSL #32
            cmp     x0, x11
            bne     G_M000_IG05

G_M000_IG03:                ;; offset=0x0020
            ldr     w0, [x1, #0x08]

G_M000_IG04:                ;; offset=0x0024
            ldp     fp, lr, [sp], #0x10
            ret     lr

G_M000_IG05:                ;; offset=0x002C
            mov     x0, x1
; ............................... 32B boundary ...............................
            movz    x11, #0x5B0
            movk    x11, #0x1C02 LSL #16
            movk    x11, #0xE000 LSL #32
            ldr     xip0, [x11]
            blr     xip0
            b       G_M000_IG04

; Total bytes of code 72

EgorBo commented 3 weeks ago

x64:

Main:

; Assembly listing for method Bench:TestInner(System.Collections.Generic.ICollection`1[System.String]):int:this (Tier1)
; Emitting BLENDED_CODE for X64 with AVX512 - Unix
; Tier1 code
; optimized code
; rbp based frame
; partially interruptible

G_M000_IG01:                ;; offset=0x0000
       push     rbp
       mov      rbp, rsp

G_M000_IG02:                ;; offset=0x0004
       mov      rdi, rsi
       mov      r11, 0x79AD0B0605B0
       call     [r11]System.Collections.Generic.ICollection`1[System.__Canon]:get_Count():int:this
       nop      

G_M000_IG03:                ;; offset=0x0015
       pop      rbp
       ret      

; Total bytes of code 23

PR:

; Assembly listing for method Bench:TestInner(System.Collections.Generic.ICollection`1[System.String]):int:this (Tier1)
; Emitting BLENDED_CODE for X64 with AVX512 - Unix
; Tier1 code
; optimized code
; rbp based frame
; partially interruptible
; 0 inlinees with PGO data; 1 single block inlinees; 0 inlinees without PGO data

G_M000_IG01:                ;; offset=0x0000
       push     rbp
       mov      rbp, rsp

G_M000_IG02:                ;; offset=0x0004
       mov      rdi, 0x7714F98998B0
       cmp      qword ptr [rsi], rdi
       jne      SHORT G_M000_IG05

G_M000_IG03:                ;; offset=0x0013
       mov      eax, dword ptr [rsi+0x08]

G_M000_IG04:                ;; offset=0x0016
       pop      rbp
       ret      

G_M000_IG05:                ;; offset=0x0018
       mov      rdi, rsi
       mov      r11, 0x7714F88505B0
; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (mov: 5) 32B boundary ...............................
       call     [r11]System.Collections.Generic.ICollection`1[System.__Canon]:get_Count():int:this
       jmp      SHORT G_M000_IG04

; Total bytes of code 42
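
For readers less used to JIT listings, here is a rough source-level sketch of what the PR codegen above appears to do. It assumes the guard compares the object's exact type against `string[]` and that the load at offset `0x08` is the array length. `GdvSketch` and `CountWithGuard` are hypothetical names used only for illustration; they are not part of the runtime or the PR.

```cs
using System.Collections.Generic;

public static class GdvSketch
{
    // Hypothetical C# equivalent of the guarded shape seen in the PR listing.
    public static int CountWithGuard(ICollection<string> c)
    {
        // Guard: exact type test, analogous to comparing the method table
        // pointer at [rsi] / [x1] against a constant.
        if (c.GetType() == typeof(string[]))
        {
            // Fast path: read the array length directly, no interface call.
            return ((string[])c).Length;
        }

        // Slow path: ordinary interface dispatch, as in the Main listing.
        return c.Count;
    }
}
```

With the benchmark's `new string[512]` input the guard succeeds every time, so only the fast path runs; any other `ICollection<string>` (for example a `List<string>`) takes the interface call.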

AndyAyersMS commented 3 weeks ago

> @AndyAyersMS not sure why perf is the same on arm64, perhaps GDV check is expensive on arm64?

Yeah, seems like it might be the cost of forming the constant for the type.
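
As a rough illustration of that cost, the sketch below counts how many `movz`/`movk` instructions a 64-bit constant needs under a simplified model: one instruction per non-zero 16-bit halfword. This is an assumption for illustration only. Real arm64 encoders have other options (such as `movn` or bitmask immediates), and `ConstantCost` / `MovzMovkCount` are hypothetical names, not JIT APIs.

```cs
using System;

public static class ConstantCost
{
    // Simplified model: one movz/movk per non-zero 16-bit halfword.
    // This is not the JIT's actual algorithm, only an approximation.
    public static int MovzMovkCount(ulong value)
    {
        int count = 0;
        for (int shift = 0; shift < 64; shift += 16)
        {
            if (((value >> shift) & 0xFFFF) != 0)
            {
                count++;
            }
        }

        // Even an all-zero constant needs one instruction to materialize.
        return Math.Max(count, 1);
    }
}
```

For the type constant in the arm64 PR listing, `ConstantCost.MovzMovkCount(0xE0001D1C9088UL)` returns 3, which matches the `movz` plus two `movk` instructions there. The x64 listing loads its guard constant with a single `mov rdi, imm64`.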

Also interesting that we can't tail call ... need to investigate that. With the advent of CET/CFG, tail calling is probably becoming more valuable than it used to be (one less return anyways).
