dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

for loop performance regression #104655

Open SystematicChaos012 opened 3 weeks ago

SystematicChaos012 commented 3 weeks ago

Description

While benchmarking the sample from "Loop Optimizations: IV Widening", I found a performance regression between .NET 8 and .NET 9.

// Summary

BenchmarkDotNet v0.13.12, Windows 11 (10.0.22631.3737/23H2/2023Update/SunValley3)
12th Gen Intel Core i5-12500, 1 CPU, 12 logical and 6 physical cores
.NET SDK 9.0.100-preview.6.24328.19
  [Host]     : .NET 9.0.0 (9.0.24.32707), X64 RyuJIT AVX2
  Job-SMYUOB : .NET 8.0.7 (8.0.724.31311), X64 RyuJIT AVX2
  Job-VNQOKB : .NET 9.0.0 (9.0.24.32707), X64 RyuJIT AVX2

OutlierMode=DontRemove MemoryRandomization=True

Method Runtime Count Mean Error StdDev
LoopOptimizations .NET 8.0 100 35.57 ns 0.159 ns 0.149 ns
LoopOptimizations .NET 9.0 100 37.21 ns 0.379 ns 0.354 ns
LoopOptimizations .NET 8.0 1000 283.27 ns 3.790 ns 3.545 ns
LoopOptimizations .NET 9.0 1000 306.69 ns 2.008 ns 1.878 ns
LoopOptimizations .NET 8.0 10000 2,742.63 ns 9.215 ns 8.620 ns
LoopOptimizations .NET 9.0 10000 3,013.27 ns 36.646 ns 34.278 ns
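
For scale, the gap between the two runtimes can be expressed as a percentage of the .NET 8 mean. The following is a minimal sketch that only reuses the means from the table above; `Slowdown` and `Percent` are illustrative names and are not part of the benchmark project.

```csharp
using System;

public static class Slowdown
{
    // Relative slowdown of the candidate versus the baseline, in percent.
    public static double Percent(double baselineNs, double candidateNs)
        => (candidateNs / baselineNs - 1.0) * 100.0;

    public static void Main()
    {
        // (Count, .NET 8 mean in ns, .NET 9 mean in ns), copied from the table above.
        (int Count, double Net8, double Net9)[] rows =
        {
            (100, 35.57, 37.21),
            (1000, 283.27, 306.69),
            (10000, 2742.63, 3013.27),
        };

        foreach (var (count, net8, net9) in rows)
        {
            Console.WriteLine($"Count={count}: .NET 9 is {Percent(net8, net9):F1}% slower");
        }
    }
}
```

With these inputs the slowdown works out to roughly 5%, 8%, and 10% for the three array sizes, so the gap grows with the iteration count rather than being a fixed per-call cost.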

Reproduction Steps

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Environments;
using BenchmarkDotNet.Jobs;
using BenchmarkDotNet.Running;

var config = DefaultConfig.Instance
    .AddJob(Job.Default.WithRuntime(CoreRuntime.Core80))
    .AddJob(Job.Default.WithRuntime(CoreRuntime.Core90));

BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args, config);

[MemoryRandomization]
public class Benchmark
{
    private int[] _arr = null!;

    [Params(100, 1000, 10000)]
    public int Count { get; set; }

    [GlobalSetup]
    public void GlobalSetup()
    {
        _arr = new int[Count]; 
    }

    [Benchmark]
    public void LoopOptimizations()
    {
        int[] arr = _arr;

        Sum(arr);
    }

    private static int Sum(int[] arr)
    {
        int sum = 0;
        for (int i = 0; i < arr.Length; i++)
        {
            sum += arr[i];
        }

        return sum;
    }
}

Expected behavior

.NET 9 should be faster or at least as fast as .NET 8.

Actual behavior

.NET 9 is slower than .NET 8.

dotnet-policy-service[bot] commented 3 weeks ago

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

huoyaoyuan commented 3 weeks ago

Note that such tight loops are sensitive to code/data alignment. You can add [MemoryRandomization] to see whether the results change.

SystematicChaos012 commented 3 weeks ago

> Note that such tight loops are sensitive to code/data alignment. You can add [MemoryRandomization] to see whether the results change.

Yes, I retested using the method you provided, and the results are still the same. I have updated the comment.

jakobbotsch commented 3 weeks ago

I was about to make the same comment as @huoyaoyuan. There are likely micro-architectural effects here causing the difference. E.g. on my laptop I get

BenchmarkDotNet v0.13.12, Windows 11 (10.0.22635.3858)
Intel Core i9-10885H CPU 2.40GHz, 1 CPU, 16 logical and 8 physical cores
.NET SDK 9.0.100-preview.6.24328.19
  [Host]     : .NET 8.0.6 (8.0.624.26715), X64 RyuJIT AVX2
  Job-SQBFGS : .NET 8.0.6 (8.0.624.26715), X64 RyuJIT AVX2
  Job-MCCPZY : .NET 9.0.0 (9.0.24.32707), X64 RyuJIT AVX2

Method Runtime Count Mean Error StdDev
LoopOptimizations .NET 8.0 100 35.33 ns 0.615 ns 0.821 ns
LoopOptimizations .NET 9.0 100 36.55 ns 0.590 ns 0.552 ns
LoopOptimizations .NET 8.0 1000 298.34 ns 5.909 ns 9.873 ns
LoopOptimizations .NET 9.0 1000 294.27 ns 5.840 ns 7.594 ns
LoopOptimizations .NET 8.0 10000 2,792.73 ns 45.144 ns 37.697 ns
LoopOptimizations .NET 9.0 10000 2,818.92 ns 35.169 ns 31.177 ns

But I can also see some bimodality with the benchmark:

-------------------- Histogram --------------------
[2.714 us ; 2.816 us) | @@@@@@@@@@
[2.816 us ; 2.895 us) | @@@
---------------------------------------------------

(not as large of a difference as yours, but still a ~7% difference in perf from run-to-run)

One thing I noticed looking at the disassembly is that we have additional prolog in .NET 9, which affects the loop's relative starting offset.

Method Program:Sum(int[]):int (FullOpts)
 ; Emitting BLENDED_CODE for X64 with AVX - Windows
 ; FullOpts code
@@ -7,8 +166,9 @@
 ; No PGO data

 G_M000_IG01:                ;; offset=0x0000
+       sub      rsp, 40

-G_M000_IG02:                ;; offset=0x0000
+G_M000_IG02:                ;; offset=0x0004
        xor      eax, eax
        xor      edx, edx
        mov      r8d, dword ptr [rcx+0x08]
@@ -16,15 +176,16 @@ G_M000_IG02:                ;; offset=0x0000
        jle      SHORT G_M000_IG04
        align    [0 bytes for IG03]

-G_M000_IG03:                ;; offset=0x000D
+G_M000_IG03:                ;; offset=0x0011
-       mov      r10d, edx
-       add      eax, dword ptr [rcx+4*r10+0x10]
+       add      eax, dword ptr [rcx+4*rdx+0x10]
        inc      edx
        cmp      r8d, edx
        jg       SHORT G_M000_IG03

 G_M000_IG04:                ;; offset=0x001C
+       add      rsp, 40
        ret      

-; Total bytes of code 29
+; Total bytes of code 33

I opened #104658 for that.
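
To make the placement change concrete, the sketch below checks whether each loop body straddles a 16-byte or 32-byte block. It uses only the IG03 (loop head) and IG04 (first byte after the loop) offsets from the diff above. It assumes the method itself starts on a block boundary, which the thread does not state, and `LoopPlacement` and `CrossesBoundary` are illustrative names.

```csharp
using System;

public static class LoopPlacement
{
    // True when the half-open byte range [start, end) spans more than one block
    // of the given size. ASSUMPTION: offset 0 (the method start) is block-aligned.
    public static bool CrossesBoundary(int start, int end, int blockSize)
        => start / blockSize != (end - 1) / blockSize;

    public static void Main()
    {
        // Offsets relative to the method start, taken from the disassembly diff:
        // loop head at IG03, loop end at IG04.
        (string Name, int Start, int End)[] loops =
        {
            (".NET 8", 0x0D, 0x1C),
            (".NET 9", 0x11, 0x1C),
        };

        foreach (var (name, start, end) in loops)
        {
            Console.WriteLine(
                $"{name}: loop bytes [0x{start:X2}, 0x{end:X2}), " +
                $"crosses 16B: {CrossesBoundary(start, end, 16)}, " +
                $"crosses 32B: {CrossesBoundary(start, end, 32)}");
        }
    }
}
```

Under that alignment assumption, the .NET 8 loop straddles a 16-byte boundary and the .NET 9 loop does not. Whether that matters depends on the CPU's fetch and decode behaviour, which is exactly the kind of micro-architectural effect in question here.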

jakobbotsch commented 3 weeks ago

Can you try measuring the following version of the loop on your CPU?

    private static int Sum(int[] arr)
    {
        int sum = 0;
        for (nint i = 0; i < arr.Length; i++)
        {
            sum += arr[i];
        }

        return sum;
    }

(I changed i from int to nint)

SystematicChaos012 commented 3 weeks ago

@jakobbotsch Yes, the test results show that .NET 9 is faster than .NET 8.

// Summary

BenchmarkDotNet v0.13.12, Windows 11 (10.0.22631.3737/23H2/2023Update/SunValley3)
12th Gen Intel Core i5-12500, 1 CPU, 12 logical and 6 physical cores
.NET SDK 9.0.100-preview.6.24328.19
  [Host]     : .NET 9.0.0 (9.0.24.32707), X64 RyuJIT AVX2
  Job-BEMAVR : .NET 8.0.7 (8.0.724.31311), X64 RyuJIT AVX2
  Job-TZXKAF : .NET 9.0.0 (9.0.24.32707), X64 RyuJIT AVX2

OutlierMode=DontRemove MemoryRandomization=True

Method Runtime Count Mean Error StdDev
LoopOptimizations .NET 8.0 100 34.87 ns 0.590 ns 0.552 ns
LoopOptimizations .NET 9.0 100 34.08 ns 0.225 ns 0.210 ns
LoopOptimizations .NET 8.0 1000 232.28 ns 1.543 ns 1.443 ns
LoopOptimizations .NET 9.0 1000 231.85 ns 1.279 ns 1.196 ns
LoopOptimizations .NET 8.0 10000 2,254.78 ns 44.108 ns 49.026 ns
LoopOptimizations .NET 9.0 10000 2,244.34 ns 33.817 ns 31.632 ns

jakobbotsch commented 3 weeks ago

Can you also compare to the following?

private static int Sum3(int[] arr)
{
    int sum = 0;
    for (nint i = 0; i < arr.Length; i++)
    {
        sum += Unsafe.Add(ref MemoryMarshal.GetArrayDataReference(arr), i);
    }

    return sum;
}

SystematicChaos012 commented 3 weeks ago

@jakobbotsch

// Summary

BenchmarkDotNet v0.13.12, Windows 11 (10.0.22631.3737/23H2/2023Update/SunValley3)
12th Gen Intel Core i5-12500, 1 CPU, 12 logical and 6 physical cores
.NET SDK 9.0.100-preview.6.24328.19
  [Host]     : .NET 9.0.0 (9.0.24.32707), X64 RyuJIT AVX2
  Job-RMLLKT : .NET 8.0.7 (8.0.724.31311), X64 RyuJIT AVX2
  Job-OAQRLB : .NET 9.0.0 (9.0.24.32707), X64 RyuJIT AVX2

OutlierMode=DontRemove MemoryRandomization=True

Method Runtime Count Mean Error StdDev
LoopOptimizations_nint .NET 8.0 100 35.00 ns 0.576 ns 0.538 ns
LoopOptimizations_nint_Unsafe .NET 8.0 100 27.53 ns 0.329 ns 0.308 ns
LoopOptimizations_nint .NET 9.0 100 34.48 ns 0.444 ns 0.415 ns
LoopOptimizations_nint_Unsafe .NET 9.0 100 27.47 ns 0.342 ns 0.320 ns
LoopOptimizations_nint .NET 8.0 1000 234.83 ns 3.291 ns 3.078 ns
LoopOptimizations_nint_Unsafe .NET 8.0 1000 219.10 ns 3.507 ns 3.280 ns
LoopOptimizations_nint .NET 9.0 1000 232.04 ns 1.633 ns 1.528 ns
LoopOptimizations_nint_Unsafe .NET 9.0 1000 217.27 ns 0.859 ns 0.803 ns
LoopOptimizations_nint .NET 8.0 10000 2,223.50 ns 43.215 ns 44.379 ns
LoopOptimizations_nint_Unsafe .NET 8.0 10000 2,205.57 ns 12.639 ns 11.823 ns
LoopOptimizations_nint .NET 9.0 10000 2,227.77 ns 21.852 ns 20.440 ns
LoopOptimizations_nint_Unsafe .NET 9.0 10000 2,196.13 ns 6.441 ns 6.025 ns

It's faster.

[Benchmark]
public void LoopOptimizations_nint()
{
    int[] arr = _arr;

    Sum_nint(arr);
}

[Benchmark]
public void LoopOptimizations_nint_Unsafe()
{
    int[] arr = _arr;

    Sum_nint_Unsafe(arr);
}

private static int Sum_nint(int[] arr)
{
    int sum = 0;
    for (nint i = 0; i < arr.Length; i++)
    {
        sum += arr[i];
    }

    return sum;
}

private static int Sum_nint_Unsafe(int[] arr)
{
    int sum = 0;
    for (nint i = 0; i < arr.Length; i++)
    {
        sum += Unsafe.Add(ref MemoryMarshal.GetArrayDataReference(arr), i);
    }

    return sum;
}

jakobbotsch commented 3 weeks ago

Thank you for measuring! Those are some interesting results. Here's another couple of variants if you don't mind:

private static int Sum_nint_unsafe_unwidened(int[] arr)
{
    int sum = 0;
    for (nint i = 0; i < arr.Length; i = (nint)((nuint)((uint)i + 1)))
    {
        sum += Unsafe.Add(ref MemoryMarshal.GetArrayDataReference(arr), i);
    }

    return sum;
}

private static int Sum_nint_unsafe_unwidened_unwidened(int[] arr)
{
    int sum = 0;
    for (nint i = 0; (uint)i < (uint)arr.Length; i = (nint)((nuint)((uint)i + 1)))
    {
        sum += Unsafe.Add(ref MemoryMarshal.GetArrayDataReference(arr), i);
    }

    return sum;
}

private static int Sum_nint_unsafe_unwidened_widened(int[] arr)
{
    int sum = 0;
    for (nint i = 0; (uint)i < (uint)arr.Length; i++)
    {
        sum += Unsafe.Add(ref MemoryMarshal.GetArrayDataReference(arr), i);
    }

    return sum;
}

Here are some properties of the codegen of each version:

| Version | i widened | i++ widened | i < arr.Length widened | Bounds check | Relative loop offset |
|---|---|---|---|---|---|
| int | Yes | No | No | No | 0x11 |
| nint | Yes | Yes | Yes | Yes | 0x20 (aligned by JIT) |
| nint_unsafe | Yes | Yes | Yes | No | 0xD |
| nint_unsafe_unwidened | Yes | No | Yes | No | 0xD |
| nint_unsafe_unwidened_unwidened | Yes | No | No | No | 0xD |
| nint_unsafe_unwidened_widened | Yes | Yes | No | No | 0xD |

I am trying to figure out whether your CPU is benefiting from the specific alignment of the loop (starting at 0xD) or from the widened i++/compare in the loop. IV widening does not widen the i++ or compare operations of the loop, so if that is what your CPU is benefiting from, then we should consider having the JIT widen those operations when possible, even though this increases the size of the generated code.
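
For readers unfamiliar with the `(nint)((nuint)((uint)i + 1))` pattern in the "unwidened" variants: it forces a 32-bit increment followed by a zero-extension, whereas a plain `i++` on an `nint` is a native-width increment. The sketch below shows that the two agree for any index an array loop can reach and differ only at the 32-bit wraparound. It assumes a 64-bit process, and `IncrementDemo`, `IncrementWide`, and `IncrementNarrow` are illustrative names that do not appear in the benchmarks.

```csharp
using System;

public static class IncrementDemo
{
    // Native-width increment: what `i++` does on an nint.
    public static nint IncrementWide(nint i) => i + 1;

    // 32-bit increment followed by zero-extension, as in the "unwidened" variants.
    // `unchecked` makes the intended wraparound explicit regardless of compiler settings.
    public static nint IncrementNarrow(nint i)
        => unchecked((nint)((nuint)((uint)i + 1)));

    public static void Main()
    {
        if (IntPtr.Size != 8)
        {
            Console.WriteLine("This comparison assumes a 64-bit process.");
            return;
        }

        // The last sample sits at the 32-bit wraparound point.
        long[] samples = { 0, 41, int.MaxValue, uint.MaxValue };

        foreach (long sample in samples)
        {
            nint i = (nint)sample;
            Console.WriteLine($"i={i}: wide={IncrementWide(i)}, narrow={IncrementNarrow(i)}");
        }
    }
}
```

Since array lengths never approach `uint.MaxValue`, the variants compute the same sums. They exist only to steer which instruction widths the JIT emits for the increment and the compare.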

SystematicChaos012 commented 3 weeks ago

@jakobbotsch No problem, this is the latest test result.

// Summary

BenchmarkDotNet v0.13.12, Windows 11 (10.0.22631.3737/23H2/2023Update/SunValley3)
12th Gen Intel Core i5-12500, 1 CPU, 12 logical and 6 physical cores
.NET SDK 9.0.100-preview.6.24328.19
  [Host]     : .NET 9.0.0 (9.0.24.32707), X64 RyuJIT AVX2
  Job-DLCMFL : .NET 8.0.7 (8.0.724.31311), X64 RyuJIT AVX2
  Job-EKWPPY : .NET 9.0.0 (9.0.24.32707), X64 RyuJIT AVX2

OutlierMode=DontRemove MemoryRandomization=True

Method Runtime Count Mean Error StdDev
LoopOptimizations_int .NET 8.0 100 36.94 ns 0.392 ns 0.366 ns
LoopOptimizations_nint .NET 8.0 100 35.23 ns 0.124 ns 0.116 ns
LoopOptimizations_nint_unsafe .NET 8.0 100 28.00 ns 0.148 ns 0.138 ns
LoopOptimizations_nint_unsafe_unwidened .NET 8.0 100 38.26 ns 0.166 ns 0.156 ns
LoopOptimizations_nint_unsafe_unwidened_unwidened .NET 8.0 100 38.44 ns 0.271 ns 0.254 ns
LoopOptimizations_int .NET 9.0 100 37.45 ns 0.306 ns 0.286 ns
LoopOptimizations_nint .NET 9.0 100 35.03 ns 0.216 ns 0.202 ns
LoopOptimizations_nint_unsafe .NET 9.0 100 28.23 ns 0.258 ns 0.242 ns
LoopOptimizations_nint_unsafe_unwidened .NET 9.0 100 38.26 ns 0.224 ns 0.209 ns
LoopOptimizations_nint_unsafe_unwidened_unwidened .NET 9.0 100 38.29 ns 0.178 ns 0.167 ns
LoopOptimizations_int .NET 8.0 1000 289.70 ns 1.718 ns 1.607 ns
LoopOptimizations_nint .NET 8.0 1000 239.52 ns 2.086 ns 1.951 ns
LoopOptimizations_nint_unsafe .NET 8.0 1000 222.10 ns 0.598 ns 0.560 ns
LoopOptimizations_nint_unsafe_unwidened .NET 8.0 1000 311.14 ns 2.234 ns 2.090 ns
LoopOptimizations_nint_unsafe_unwidened_unwidened .NET 8.0 1000 310.26 ns 0.891 ns 0.833 ns
LoopOptimizations_int .NET 9.0 1000 311.64 ns 1.064 ns 0.995 ns
LoopOptimizations_nint .NET 9.0 1000 252.78 ns 4.829 ns 4.517 ns
LoopOptimizations_nint_unsafe .NET 9.0 1000 223.81 ns 4.492 ns 4.411 ns
LoopOptimizations_nint_unsafe_unwidened .NET 9.0 1000 320.45 ns 5.811 ns 5.436 ns
LoopOptimizations_nint_unsafe_unwidened_unwidened .NET 9.0 1000 311.03 ns 1.275 ns 1.193 ns
LoopOptimizations_int .NET 8.0 10000 2,834.42 ns 14.483 ns 13.547 ns
LoopOptimizations_nint .NET 8.0 10000 2,310.24 ns 30.116 ns 28.170 ns
LoopOptimizations_nint_unsafe .NET 8.0 10000 2,244.80 ns 5.507 ns 5.152 ns
LoopOptimizations_nint_unsafe_unwidened .NET 8.0 10000 3,047.49 ns 9.025 ns 8.442 ns
LoopOptimizations_nint_unsafe_unwidened_unwidened .NET 8.0 10000 3,044.17 ns 7.777 ns 7.275 ns
LoopOptimizations_int .NET 9.0 10000 3,044.72 ns 10.149 ns 9.493 ns
LoopOptimizations_nint .NET 9.0 10000 2,307.03 ns 11.474 ns 10.732 ns
LoopOptimizations_nint_unsafe .NET 9.0 10000 2,250.03 ns 11.906 ns 11.137 ns
LoopOptimizations_nint_unsafe_unwidened .NET 9.0 10000 3,046.68 ns 9.379 ns 8.773 ns
LoopOptimizations_nint_unsafe_unwidened_unwidened .NET 9.0 10000 3,050.47 ns 8.364 ns 7.823 ns

[Benchmark]
public void LoopOptimizations_int()
{
    int[] arr = _arr;

    Sum_int(arr);
}

[Benchmark]
public void LoopOptimizations_nint()
{
    int[] arr = _arr;

    Sum_nint(arr);
}

[Benchmark]
public void LoopOptimizations_nint_unsafe()
{
    int[] arr = _arr;

    Sum_nint_unsafe(arr);
}

[Benchmark]
public void LoopOptimizations_nint_unsafe_unwidened()
{
    int[] arr = _arr;

    Sum_nint_unsafe_unwidened(arr);
}

[Benchmark]
public void LoopOptimizations_nint_unsafe_unwidened_unwidened()
{
    int[] arr = _arr;

    Sum_nint_unsafe_unwidened_unwidened(arr);
}

private static int Sum_int(int[] arr)
{
    int sum = 0;
    for (int i = 0; i < arr.Length; i++)
    {
        sum += arr[i];
    }

    return sum;
}

private static int Sum_nint(int[] arr)
{
    int sum = 0;
    for (nint i = 0; i < arr.Length; i++)
    {
        sum += arr[i];
    }

    return sum;
}

private static int Sum_nint_unsafe(int[] arr)
{
    int sum = 0;
    for (nint i = 0; i < arr.Length; i++)
    {
        sum += Unsafe.Add(ref MemoryMarshal.GetArrayDataReference(arr), i);
    }

    return sum;
}

private static int Sum_nint_unsafe_unwidened(int[] arr)
{
    int sum = 0;
    for (nint i = 0; i < arr.Length; i = (nint)((nuint)((uint)i + 1)))
    {
        sum += Unsafe.Add(ref MemoryMarshal.GetArrayDataReference(arr), i);
    }

    return sum;
}

private static int Sum_nint_unsafe_unwidened_unwidened(int[] arr)
{
    int sum = 0;
    for (nint i = 0; (uint)i < (uint)arr.Length; i = (nint)((nuint)((uint)i + 1)))
    {
        sum += Unsafe.Add(ref MemoryMarshal.GetArrayDataReference(arr), i);
    }

    return sum;
}

jakobbotsch commented 3 weeks ago

Interesting -- it seems your CPU really benefits from having the i++ and compare widened as well. Actually, I am not sure whether it is just the i++ or also the compare that should be widened -- do you mind measuring the following variant too? This one only widens the i++. Hopefully the last one this time :-)

private static int Sum_nint_unsafe_unwidened_widened(int[] arr)
{
    int sum = 0;
    for (nint i = 0; (uint)i < (uint)arr.Length; i++)
    {
        sum += Unsafe.Add(ref MemoryMarshal.GetArrayDataReference(arr), i);
    }

    return sum;
}

SystematicChaos012 commented 3 weeks ago

@jakobbotsch

I hope the following content helps you.

BenchmarkDotNet v0.13.12, Windows 11 (10.0.22631.3880/23H2/2023Update/SunValley3)
12th Gen Intel Core i5-12500, 1 CPU, 12 logical and 6 physical cores
.NET SDK 9.0.100-preview.6.24328.19
  [Host]     : .NET 9.0.0 (9.0.24.32707), X64 RyuJIT AVX2
  Job-HGUYGD : .NET 8.0.7 (8.0.724.31311), X64 RyuJIT AVX2
  Job-YOTCVU : .NET 9.0.0 (9.0.24.32707), X64 RyuJIT AVX2

OutlierMode=DontRemove MemoryRandomization=True

Method Runtime Count Mean Error StdDev
ForLoop_int .NET 8.0 100 36.08 ns 0.562 ns 0.526 ns
ForLoop_nint .NET 8.0 100 34.68 ns 0.687 ns 0.735 ns
ForLoop_nint_unsafe .NET 8.0 100 27.53 ns 0.260 ns 0.243 ns
ForLoop_nint_unsafe_unwidened .NET 8.0 100 37.74 ns 0.445 ns 0.417 ns
ForLoop_nint_unsafe_unwidened_unwidened .NET 8.0 100 38.11 ns 0.790 ns 0.776 ns
ForLoop_nint_unsafe_unwidened_widened .NET 8.0 100 27.69 ns 0.297 ns 0.278 ns
ForLoop_int .NET 9.0 100 37.92 ns 0.787 ns 0.967 ns
ForLoop_nint .NET 9.0 100 34.83 ns 0.526 ns 0.492 ns
ForLoop_nint_unsafe .NET 9.0 100 27.86 ns 0.463 ns 0.433 ns
ForLoop_nint_unsafe_unwidened .NET 9.0 100 37.47 ns 0.370 ns 0.346 ns
ForLoop_nint_unsafe_unwidened_unwidened .NET 9.0 100 38.08 ns 0.505 ns 0.472 ns
ForLoop_nint_unsafe_unwidened_widened .NET 9.0 100 27.49 ns 0.253 ns 0.236 ns
ForLoop_int .NET 8.0 1000 284.28 ns 2.108 ns 1.972 ns
ForLoop_nint .NET 8.0 1000 234.87 ns 3.662 ns 3.426 ns
ForLoop_nint_unsafe .NET 8.0 1000 220.50 ns 3.959 ns 3.703 ns
ForLoop_nint_unsafe_unwidened .NET 8.0 1000 306.49 ns 0.977 ns 0.914 ns
ForLoop_nint_unsafe_unwidened_unwidened .NET 8.0 1000 308.39 ns 3.871 ns 3.621 ns
ForLoop_nint_unsafe_unwidened_widened .NET 8.0 1000 218.25 ns 0.508 ns 0.476 ns
ForLoop_int .NET 9.0 1000 307.51 ns 1.491 ns 1.394 ns
ForLoop_nint .NET 9.0 1000 232.28 ns 0.539 ns 0.504 ns
ForLoop_nint_unsafe .NET 9.0 1000 219.25 ns 1.951 ns 1.825 ns
ForLoop_nint_unsafe_unwidened .NET 9.0 1000 309.28 ns 3.497 ns 3.271 ns
ForLoop_nint_unsafe_unwidened_unwidened .NET 9.0 1000 308.69 ns 5.287 ns 4.946 ns
ForLoop_nint_unsafe_unwidened_widened .NET 9.0 1000 220.08 ns 1.661 ns 1.554 ns
ForLoop_int .NET 8.0 10000 2,781.48 ns 19.513 ns 18.252 ns
ForLoop_nint .NET 8.0 10000 2,236.87 ns 22.613 ns 21.152 ns
ForLoop_nint_unsafe .NET 8.0 10000 2,294.69 ns 45.756 ns 48.959 ns
ForLoop_nint_unsafe_unwidened .NET 8.0 10000 3,014.57 ns 40.097 ns 37.507 ns
ForLoop_nint_unsafe_unwidened_unwidened .NET 8.0 10000 3,030.63 ns 44.037 ns 41.193 ns
ForLoop_nint_unsafe_unwidened_widened .NET 8.0 10000 2,216.69 ns 34.966 ns 32.707 ns
ForLoop_int .NET 9.0 10000 3,000.21 ns 10.061 ns 9.411 ns
ForLoop_nint .NET 9.0 10000 2,227.37 ns 12.335 ns 11.538 ns
ForLoop_nint_unsafe .NET 9.0 10000 2,214.29 ns 25.932 ns 24.257 ns
ForLoop_nint_unsafe_unwidened .NET 9.0 10000 3,002.29 ns 16.117 ns 15.076 ns
ForLoop_nint_unsafe_unwidened_unwidened .NET 9.0 10000 3,024.86 ns 40.873 ns 38.233 ns
ForLoop_nint_unsafe_unwidened_widened .NET 9.0 10000 2,212.96 ns 29.958 ns 28.023 ns

jakobbotsch commented 3 weeks ago

Thanks for running those measurements. I am still unable to reproduce the results, even on my own Intel CPU:

BenchmarkDotNet v0.13.12, Windows 11 (10.0.22635.3858)
Intel Core i9-10885H CPU 2.40GHz, 1 CPU, 16 logical and 8 physical cores
.NET SDK 9.0.100-preview.6.24328.19
  [Host]     : .NET 8.0.7 (8.0.724.31311), X64 RyuJIT AVX2
  Job-JKDKMJ : .NET 9.0.0 (9.0.24.32707), X64 RyuJIT AVX2

Runtime=.NET 9.0

Method Count Mean Error StdDev Ratio
LoopOptimizations_int 10000 2.766 us 0.0204 us 0.0160 us 1.00
LoopOptimizations_nint 10000 2.459 us 0.0356 us 0.0333 us 0.89
LoopOptimizations_nint_unsafe 10000 2.730 us 0.0350 us 0.0327 us 0.99
LoopOptimizations_nint_unsafe_unwidened 10000 2.732 us 0.0328 us 0.0307 us 0.99
LoopOptimizations_nint_unsafe_unwidened_unwidened 10000 2.754 us 0.0187 us 0.0166 us 1.00
LoopOptimizations_nint_unsafe_unwidened_widened 10000 2.739 us 0.0105 us 0.0087 us 0.99

LoopOptimizations_nint being faster is #104665.

I still suspect that there is some micro-architectural artifact here. I am going to see if I can run the benchmarks on some more CPUs.

jakobbotsch commented 3 weeks ago

@EgorBot -arm64 -intel -amd -commit 42b2b19e883f06af5771b5d85b26af263c62e781 --disasm

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args);

public class Benchmark
{
    private int[] _arr = null!;

    [Params(10000)]
    public int Count { get; set; }

    [GlobalSetup]
    public void GlobalSetup()
    {
        _arr = new int[Count];
    }

    [Benchmark(Baseline = true)]
    public void LoopOptimizations_int()
    {
        int[] arr = _arr;

        Sum_int(arr);
    }

    [Benchmark]
    public void LoopOptimizations_nint()
    {
        int[] arr = _arr;

        Sum_nint(arr);
    }

    [Benchmark]
    public void LoopOptimizations_nint_unsafe()
    {
        int[] arr = _arr;

        Sum_nint_unsafe(arr);
    }

    [Benchmark]
    public void LoopOptimizations_nint_unsafe_unwidened()
    {
        int[] arr = _arr;

        Sum_nint_unsafe_unwidened(arr);
    }

    [Benchmark]
    public void LoopOptimizations_nint_unsafe_unwidened_unwidened()
    {
        int[] arr = _arr;

        Sum_nint_unsafe_unwidened_unwidened(arr);
    }

    [Benchmark]
    public void LoopOptimizations_nint_unsafe_unwidened_widened()
    {
        int[] arr = _arr;

        Sum_nint_unsafe_unwidened_widened(arr);
    }

    private static int Sum_int(int[] arr)
    {
        int sum = 0;
        for (int i = 0; i < arr.Length; i++)
        {
            sum += arr[i];
        }

        return sum;
    }

    private static int Sum_nint(int[] arr)
    {
        int sum = 0;
        for (nint i = 0; i < arr.Length; i++)
        {
            sum += arr[i];
        }

        return sum;
    }

    private static int Sum_nint_unsafe(int[] arr)
    {
        int sum = 0;
        for (nint i = 0; i < arr.Length; i++)
        {
            sum += Unsafe.Add(ref MemoryMarshal.GetArrayDataReference(arr), i);
        }

        return sum;
    }

    private static int Sum_nint_unsafe_unwidened(int[] arr)
    {
        int sum = 0;
        for (nint i = 0; i < arr.Length; i = (nint)((nuint)((uint)i + 1)))
        {
            sum += Unsafe.Add(ref MemoryMarshal.GetArrayDataReference(arr), i);
        }

        return sum;
    }

    private static int Sum_nint_unsafe_unwidened_unwidened(int[] arr)
    {
        int sum = 0;
        for (nint i = 0; (uint)i < (uint)arr.Length; i = (nint)((nuint)((uint)i + 1)))
        {
            sum += Unsafe.Add(ref MemoryMarshal.GetArrayDataReference(arr), i);
        }

        return sum;
    }

    private static int Sum_nint_unsafe_unwidened_widened(int[] arr)
    {
        int sum = 0;
        for (nint i = 0; (uint)i < (uint)arr.Length; i++)
        {
            sum += Unsafe.Add(ref MemoryMarshal.GetArrayDataReference(arr), i);
        }

        return sum;
    }
}

jakobbotsch commented 3 weeks ago

@EgorBot -arm64 -intel -amd -commit 42b2b19e883f06af5771b5d85b26af263c62e781 vs c09ec6552f11b74a2e825cb63cb7c45f5552d3f2 --disasm

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args);

public class Benchmark
{
    private int[] _arr = null!;

    [Params(10000)]
    public int Count { get; set; }

    [GlobalSetup]
    public void GlobalSetup()
    {
        _arr = new int[Count];
    }

    [Benchmark(Baseline = true)]
    public void LoopOptimizations_int()
    {
        int[] arr = _arr;

        Sum_int(arr);
    }

    [Benchmark]
    public void LoopOptimizations_nint()
    {
        int[] arr = _arr;

        Sum_nint(arr);
    }

    [Benchmark]
    public void LoopOptimizations_nint_unsafe()
    {
        int[] arr = _arr;

        Sum_nint_unsafe(arr);
    }

    [Benchmark]
    public void LoopOptimizations_nint_unsafe_unwidened()
    {
        int[] arr = _arr;

        Sum_nint_unsafe_unwidened(arr);
    }

    [Benchmark]
    public void LoopOptimizations_nint_unsafe_unwidened_unwidened()
    {
        int[] arr = _arr;

        Sum_nint_unsafe_unwidened_unwidened(arr);
    }

    [Benchmark]
    public void LoopOptimizations_nint_unsafe_unwidened_widened()
    {
        int[] arr = _arr;

        Sum_nint_unsafe_unwidened_widened(arr);
    }

    private static int Sum_int(int[] arr)
    {
        int sum = 0;
        for (int i = 0; i < arr.Length; i++)
        {
            sum += arr[i];
        }

        return sum;
    }

    private static int Sum_nint(int[] arr)
    {
        int sum = 0;
        for (nint i = 0; i < arr.Length; i++)
        {
            sum += arr[i];
        }

        return sum;
    }

    private static int Sum_nint_unsafe(int[] arr)
    {
        int sum = 0;
        for (nint i = 0; i < arr.Length; i++)
        {
            sum += Unsafe.Add(ref MemoryMarshal.GetArrayDataReference(arr), i);
        }

        return sum;
    }

    private static int Sum_nint_unsafe_unwidened(int[] arr)
    {
        int sum = 0;
        for (nint i = 0; i < arr.Length; i = (nint)((nuint)((uint)i + 1)))
        {
            sum += Unsafe.Add(ref MemoryMarshal.GetArrayDataReference(arr), i);
        }

        return sum;
    }

    private static int Sum_nint_unsafe_unwidened_unwidened(int[] arr)
    {
        int sum = 0;
        for (nint i = 0; (uint)i < (uint)arr.Length; i = (nint)((nuint)((uint)i + 1)))
        {
            sum += Unsafe.Add(ref MemoryMarshal.GetArrayDataReference(arr), i);
        }

        return sum;
    }

    private static int Sum_nint_unsafe_unwidened_widened(int[] arr)
    {
        int sum = 0;
        for (nint i = 0; (uint)i < (uint)arr.Length; i++)
        {
            sum += Unsafe.Add(ref MemoryMarshal.GetArrayDataReference(arr), i);
        }

        return sum;
    }
}

EgorBot commented 3 weeks ago

❌ Not more than two CPU architectures at once, please.

jakobbotsch commented 3 weeks ago

@EgorBot -intel -amd -commit 42b2b19e883f06af5771b5d85b26af263c62e781 vs c09ec6552f11b74a2e825cb63cb7c45f5552d3f2 --disasm

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args);

public class Benchmark
{
    private int[] _arr = null!;

    [Params(10000)]
    public int Count { get; set; }

    [GlobalSetup]
    public void GlobalSetup()
    {
        _arr = new int[Count];
    }

    [Benchmark(Baseline = true)]
    public void LoopOptimizations_int()
    {
        int[] arr = _arr;

        Sum_int(arr);
    }

    [Benchmark]
    public void LoopOptimizations_nint()
    {
        int[] arr = _arr;

        Sum_nint(arr);
    }

    [Benchmark]
    public void LoopOptimizations_nint_unsafe()
    {
        int[] arr = _arr;

        Sum_nint_unsafe(arr);
    }

    [Benchmark]
    public void LoopOptimizations_nint_unsafe_unwidened()
    {
        int[] arr = _arr;

        Sum_nint_unsafe_unwidened(arr);
    }

    [Benchmark]
    public void LoopOptimizations_nint_unsafe_unwidened_unwidened()
    {
        int[] arr = _arr;

        Sum_nint_unsafe_unwidened_unwidened(arr);
    }

    [Benchmark]
    public void LoopOptimizations_nint_unsafe_unwidened_widened()
    {
        int[] arr = _arr;

        Sum_nint_unsafe_unwidened_widened(arr);
    }

    private static int Sum_int(int[] arr)
    {
        int sum = 0;
        for (int i = 0; i < arr.Length; i++)
        {
            sum += arr[i];
        }

        return sum;
    }

    private static int Sum_nint(int[] arr)
    {
        int sum = 0;
        for (nint i = 0; i < arr.Length; i++)
        {
            sum += arr[i];
        }

        return sum;
    }

    private static int Sum_nint_unsafe(int[] arr)
    {
        int sum = 0;
        for (nint i = 0; i < arr.Length; i++)
        {
            sum += Unsafe.Add(ref MemoryMarshal.GetArrayDataReference(arr), i);
        }

        return sum;
    }

    private static int Sum_nint_unsafe_unwidened(int[] arr)
    {
        int sum = 0;
        for (nint i = 0; i < arr.Length; i = (nint)((nuint)((uint)i + 1)))
        {
            sum += Unsafe.Add(ref MemoryMarshal.GetArrayDataReference(arr), i);
        }

        return sum;
    }

    private static int Sum_nint_unsafe_unwidened_unwidened(int[] arr)
    {
        int sum = 0;
        for (nint i = 0; (uint)i < (uint)arr.Length; i = (nint)((nuint)((uint)i + 1)))
        {
            sum += Unsafe.Add(ref MemoryMarshal.GetArrayDataReference(arr), i);
        }

        return sum;
    }

    private static int Sum_nint_unsafe_unwidened_widened(int[] arr)
    {
        int sum = 0;
        for (nint i = 0; (uint)i < (uint)arr.Length; i++)
        {
            sum += Unsafe.Add(ref MemoryMarshal.GetArrayDataReference(arr), i);
        }

        return sum;
    }
}

EgorBot commented 3 weeks ago

Benchmark results on Intel

```
BenchmarkDotNet v0.13.12, Ubuntu 22.04.4 LTS (Jammy Jellyfish)
Intel Xeon Platinum 8370C CPU 2.80GHz, 1 CPU, 8 logical and 4 physical cores
  Job-SJLUKK : .NET 9.0.0 (42.42.42.42424), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  Job-CLRDAW : .NET 9.0.0 (42.42.42.42424), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
```

| Method | Toolchain | Count | Mean | Error | Ratio | Code Size |
|-------------------------------------------------- |------------------------ |------ |---------:|----------:|------:|----------:|
| LoopOptimizations_int | Main | 10000 | 4.131 μs | 0.0102 μs | 1.00 | 44 B |
| LoopOptimizations_nint | Main | 10000 | 5.745 μs | 0.0015 μs | 1.39 | 58 B |
| LoopOptimizations_nint_unsafe | Main | 10000 | 4.125 μs | 0.0043 μs | 1.00 | 47 B |
| LoopOptimizations_nint_unsafe_unwidened | Main | 10000 | 4.122 μs | 0.0004 μs | 1.00 | 46 B |
| LoopOptimizations_nint_unsafe_unwidened_unwidened | Main | 10000 | 4.122 μs | 0.0006 μs | 1.00 | 44 B |
| LoopOptimizations_nint_unsafe_unwidened_widened | Main | 10000 | 4.125 μs | 0.0007 μs | 1.00 | 45 B |
| LoopOptimizations_int | PR | 10000 | 4.121 μs | 0.0006 μs | 1.00 | 44 B |
| LoopOptimizations_nint | PR | 10000 | 4.637 μs | 0.0032 μs | 1.12 | 58 B |
| LoopOptimizations_nint_unsafe | PR | 10000 | 4.122 μs | 0.0006 μs | 1.00 | 47 B |
| LoopOptimizations_nint_unsafe_unwidened | PR | 10000 | 4.122 μs | 0.0004 μs | 1.00 | 46 B |
| LoopOptimizations_nint_unsafe_unwidened_unwidened | PR | 10000 | 4.121 μs | 0.0007 μs | 1.00 | 44 B |
| LoopOptimizations_nint_unsafe_unwidened_widened | PR | 10000 | 4.122 μs | 0.0012 μs | 1.00 | 45 B |

[BDN_Artifacts.zip](https://telegafiles.blob.core.windows.net/telega/BDN_Artifacts_c13bf1a5.zip)

EgorBot commented 3 weeks ago

Benchmark results on Amd

```
BenchmarkDotNet v0.13.12, Ubuntu 22.04.4 LTS (Jammy Jellyfish)
AMD EPYC 7763, 1 CPU, 8 logical and 4 physical cores
  Job-OTZBXR : .NET 9.0.0 (42.42.42.42424), X64 RyuJIT AVX2
  Job-UQDHOF : .NET 9.0.0 (42.42.42.42424), X64 RyuJIT AVX2
```

| Method | Toolchain | Count | Mean | Error | Ratio | Code Size |
|-------------------------------------------------- |------------------------ |------ |---------:|----------:|------:|----------:|
| LoopOptimizations_int | Main | 10000 | 3.329 μs | 0.0105 μs | 1.00 | 44 B |
| LoopOptimizations_nint | Main | 10000 | 3.110 μs | 0.0007 μs | 0.93 | 58 B |
| LoopOptimizations_nint_unsafe | Main | 10000 | 3.212 μs | 0.0007 μs | 0.96 | 47 B |
| LoopOptimizations_nint_unsafe_unwidened | Main | 10000 | 3.362 μs | 0.0020 μs | 1.01 | 46 B |
| LoopOptimizations_nint_unsafe_unwidened_unwidened | Main | 10000 | 3.218 μs | 0.0018 μs | 0.97 | 44 B |
| LoopOptimizations_nint_unsafe_unwidened_widened | Main | 10000 | 3.217 μs | 0.0011 μs | 0.97 | 45 B |
| LoopOptimizations_int | PR | 10000 | 3.292 μs | 0.0010 μs | 0.99 | 44 B |
| LoopOptimizations_nint | PR | 10000 | 3.111 μs | 0.0004 μs | 0.93 | 58 B |
| LoopOptimizations_nint_unsafe | PR | 10000 | 3.315 μs | 0.0014 μs | 1.00 | 47 B |
| LoopOptimizations_nint_unsafe_unwidened | PR | 10000 | 3.296 μs | 0.0009 μs | 0.99 | 46 B |
| LoopOptimizations_nint_unsafe_unwidened_unwidened | PR | 10000 | 3.222 μs | 0.0015 μs | 0.97 | 44 B |
| LoopOptimizations_nint_unsafe_unwidened_widened | PR | 10000 | 3.218 μs | 0.0014 μs | 0.97 | 45 B |

[BDN_Artifacts.zip](https://telegafiles.blob.core.windows.net/telega/BDN_Artifacts_61ae7269.zip)

jakobbotsch commented 3 weeks ago

@EgorBot -arm64 -commit 42b2b19e883f06af5771b5d85b26af263c62e781 vs c09ec6552f11b74a2e825cb63cb7c45f5552d3f2 --disasm

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args);

public class Benchmark
{
    private int[] _arr = null!;

    [Params(10000)]
    public int Count { get; set; }

    [GlobalSetup]
    public void GlobalSetup()
    {
        _arr = new int[Count];
    }

    [Benchmark(Baseline = true)]
    public void LoopOptimizations_int()
    {
        int[] arr = _arr;

        Sum_int(arr);
    }

    [Benchmark]
    public void LoopOptimizations_nint()
    {
        int[] arr = _arr;

        Sum_nint(arr);
    }

    [Benchmark]
    public void LoopOptimizations_nint_unsafe()
    {
        int[] arr = _arr;

        Sum_nint_unsafe(arr);
    }

    [Benchmark]
    public void LoopOptimizations_nint_unsafe_unwidened()
    {
        int[] arr = _arr;

        Sum_nint_unsafe_unwidened(arr);
    }

    [Benchmark]
    public void LoopOptimizations_nint_unsafe_unwidened_unwidened()
    {
        int[] arr = _arr;

        Sum_nint_unsafe_unwidened_unwidened(arr);
    }

    [Benchmark]
    public void LoopOptimizations_nint_unsafe_unwidened_widened()
    {
        int[] arr = _arr;

        Sum_nint_unsafe_unwidened_widened(arr);
    }

    private static int Sum_int(int[] arr)
    {
        int sum = 0;
        for (int i = 0; i < arr.Length; i++)
        {
            sum += arr[i];
        }

        return sum;
    }

    private static int Sum_nint(int[] arr)
    {
        int sum = 0;
        for (nint i = 0; i < arr.Length; i++)
        {
            sum += arr[i];
        }

        return sum;
    }

    private static int Sum_nint_unsafe(int[] arr)
    {
        int sum = 0;
        for (nint i = 0; i < arr.Length; i++)
        {
            sum += Unsafe.Add(ref MemoryMarshal.GetArrayDataReference(arr), i);
        }

        return sum;
    }

    private static int Sum_nint_unsafe_unwidened(int[] arr)
    {
        int sum = 0;
        for (nint i = 0; i < arr.Length; i = (nint)((nuint)((uint)i + 1)))
        {
            sum += Unsafe.Add(ref MemoryMarshal.GetArrayDataReference(arr), i);
        }

        return sum;
    }

    private static int Sum_nint_unsafe_unwidened_unwidened(int[] arr)
    {
        int sum = 0;
        for (nint i = 0; (uint)i < (uint)arr.Length; i = (nint)((nuint)((uint)i + 1)))
        {
            sum += Unsafe.Add(ref MemoryMarshal.GetArrayDataReference(arr), i);
        }

        return sum;
    }

    private static int Sum_nint_unsafe_unwidened_widened(int[] arr)
    {
        int sum = 0;
        for (nint i = 0; (uint)i < (uint)arr.Length; i++)
        {
            sum += Unsafe.Add(ref MemoryMarshal.GetArrayDataReference(arr), i);
        }

        return sum;
    }
}
EgorBot commented 3 weeks ago
Benchmark results on Arm64
```
BenchmarkDotNet v0.13.12, Ubuntu 22.04.4 LTS (Jammy Jellyfish)
Unknown processor
  Job-LFHORP : .NET 9.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
  Job-YTDXRY : .NET 9.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
```
| Method | Toolchain | Count | Mean | Error | Ratio | Code Size |
|---|---|---|---:|---:|---:|---:|
| LoopOptimizations_int | Main | 10000 | 5.733 μs | 0.0006 μs | 1.00 | 104 B |
| LoopOptimizations_nint | Main | 10000 | 6.818 μs | 0.0007 μs | 1.19 | 120 B |
| LoopOptimizations_nint_unsafe | Main | 10000 | 5.617 μs | 0.0005 μs | 0.98 | 108 B |
| LoopOptimizations_nint_unsafe_unwidened | Main | 10000 | 8.958 μs | 0.0007 μs | 1.56 | 112 B |
| LoopOptimizations_nint_unsafe_unwidened_unwidened | Main | 10000 | 8.958 μs | 0.0004 μs | 1.56 | 112 B |
| LoopOptimizations_nint_unsafe_unwidened_widened | Main | 10000 | 5.617 μs | 0.0006 μs | 0.98 | 108 B |
| LoopOptimizations_int | PR | 10000 | 5.733 μs | 0.0006 μs | 1.00 | 104 B |
| LoopOptimizations_nint | PR | 10000 | 6.818 μs | 0.0008 μs | 1.19 | 120 B |
| LoopOptimizations_nint_unsafe | PR | 10000 | 5.618 μs | 0.0008 μs | 0.98 | 108 B |
| LoopOptimizations_nint_unsafe_unwidened | PR | 10000 | 8.959 μs | 0.0007 μs | 1.56 | 112 B |
| LoopOptimizations_nint_unsafe_unwidened_unwidened | PR | 10000 | 8.957 μs | 0.0005 μs | 1.56 | 112 B |
| LoopOptimizations_nint_unsafe_unwidened_widened | PR | 10000 | 5.617 μs | 0.0004 μs | 0.98 | 108 B |

[BDN_Artifacts.zip](https://telegafiles.blob.core.windows.net/telega/BDN_Artifacts_b01d82e6.zip)
SystematicChaos012 commented 3 weeks ago

@jakobbotsch

I don't know how else I can help you from my side. I tried using another computer with a 13600KF CPU, and these are my results.

BenchmarkDotNet v0.13.12, Windows 11 (10.0.22631.3880/23H2/2023Update/SunValley3)
13th Gen Intel Core i5-13600KF, 1 CPU, 20 logical and 14 physical cores
.NET SDK 9.0.100-preview.6.24328.19
  [Host]     : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2
  Job-YKBOTF : .NET 9.0.0 (9.0.24.32707), X64 RyuJIT AVX2

OutlierMode=DontRemove Runtime=.NET 9.0 MemoryRandomization=True

| Method | Count | Mean | Error | StdDev | Median | Ratio | RatioSD |
|---|---|---:|---:|---:|---:|---:|---:|
| LoopOptimizations_int | 10000 | 2.724 us | 0.0181 us | 0.0169 us | 2.720 us | 1.00 | 0.00 |
| LoopOptimizations_nint | 10000 | 3.188 us | 0.4768 us | 1.4057 us | 2.005 us | 1.01 | 0.48 |
| LoopOptimizations_nint_unsafe | 10000 | 1.983 us | 0.0074 us | 0.0070 us | 1.983 us | 0.73 | 0.00 |
| LoopOptimizations_nint_unsafe_unwidened | 10000 | 2.691 us | 0.0032 us | 0.0030 us | 2.689 us | 0.99 | 0.01 |
| LoopOptimizations_nint_unsafe_unwidened_unwidened | 10000 | 2.689 us | 0.0015 us | 0.0014 us | 2.689 us | 0.99 | 0.01 |
| LoopOptimizations_nint_unsafe_unwidened_widened | 10000 | 1.974 us | 0.0019 us | 0.0018 us | 1.973 us | 0.72 | 0.00 |
huoyaoyuan commented 3 weeks ago

> I tried using another computer with a 13600KF CPU, and these are my results.

A kind reminder to be aware of the asymmetric P/E cores. Keep your benchmark window in the foreground or set the process affinity.
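
For a standalone test app, affinity can also be set from code instead of from Task Manager. The following is only a sketch: `AffinityHelper`, `MaskForRange` and `PinCurrentProcess` are hypothetical names, and which logical CPU indices map to P-cores is machine-specific, so the range passed in is an assumption you need to verify on your own hardware.

```csharp
using System;
using System.Diagnostics;

static class AffinityHelper
{
    // Hypothetical helper: builds a bitmask selecting logical CPUs first..last (inclusive).
    public static long MaskForRange(int first, int last)
    {
        long mask = 0;
        for (int cpu = first; cpu <= last; cpu++)
        {
            mask |= 1L << cpu;
        }

        return mask;
    }

    // Restricts the current process to the given logical CPUs.
    // Process.ProcessorAffinity is only supported on Windows and Linux,
    // so this is a no-op elsewhere.
    public static void PinCurrentProcess(int first, int last)
    {
        if (OperatingSystem.IsWindows() || OperatingSystem.IsLinux())
        {
            Process.GetCurrentProcess().ProcessorAffinity = (IntPtr)MaskForRange(first, last);
        }
    }
}
```

Calling `AffinityHelper.PinCurrentProcess(0, 5)` at the start of `Main` would restrict the process to logical CPUs 0 through 5 (mask `0x3F`).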

SystematicChaos012 commented 3 weeks ago

Sorry, I set the affinity to CPU 0-5 and kept the window in the foreground. I ran the test again.

13600KF


BenchmarkDotNet v0.13.12, Windows 11 (10.0.22631.3880/23H2/2023Update/SunValley3)
13th Gen Intel Core i5-13600KF, 1 CPU, 20 logical and 14 physical cores
.NET SDK 9.0.100-preview.6.24328.19
  [Host]     : .NET 9.0.0 (9.0.24.32707), X64 RyuJIT AVX2
  Job-YUEOQM : .NET 9.0.0 (9.0.24.32707), X64 RyuJIT AVX2

OutlierMode=DontRemove  Runtime=.NET 9.0  MemoryRandomization=True

| Method | Count | Mean | Error | StdDev | Ratio |
|---|---|---:|---:|---:|---:|
| LoopOptimizations_int | 10000 | 2.704 us | 0.0139 us | 0.0130 us | 1.00 |
| LoopOptimizations_nint | 10000 | 1.990 us | 0.0036 us | 0.0034 us | 0.74 |
| LoopOptimizations_nint_unsafe | 10000 | 1.976 us | 0.0026 us | 0.0024 us | 0.73 |
| LoopOptimizations_nint_unsafe_unwidened | 10000 | 2.691 us | 0.0037 us | 0.0035 us | 1.00 |
| LoopOptimizations_nint_unsafe_unwidened_unwidened | 10000 | 2.693 us | 0.0046 us | 0.0043 us | 1.00 |
| LoopOptimizations_nint_unsafe_unwidened_widened | 10000 | 1.978 us | 0.0039 us | 0.0036 us | 0.73 |

12500 (Only put window in foreground)


BenchmarkDotNet v0.13.12, Windows 11 (10.0.22631.3880/23H2/2023Update/SunValley3)
12th Gen Intel Core i5-12500, 1 CPU, 12 logical and 6 physical cores
.NET SDK 9.0.100-preview.6.24328.19
  [Host]     : .NET 9.0.0 (9.0.24.32707), X64 RyuJIT AVX2
  Job-PVQIZI : .NET 9.0.0 (9.0.24.32707), X64 RyuJIT AVX2

OutlierMode=DontRemove  Runtime=.NET 9.0  MemoryRandomization=True

| Method | Count | Mean | Error | StdDev | Ratio |
|---|---|---:|---:|---:|---:|
| LoopOptimizations_int | 10000 | 3.071 us | 0.0138 us | 0.0129 us | 1.00 |
| LoopOptimizations_nint | 10000 | 2.328 us | 0.0283 us | 0.0265 us | 0.76 |
| LoopOptimizations_nint_unsafe | 10000 | 2.252 us | 0.0171 us | 0.0160 us | 0.73 |
| LoopOptimizations_nint_unsafe_unwidened | 10000 | 3.050 us | 0.0064 us | 0.0059 us | 0.99 |
| LoopOptimizations_nint_unsafe_unwidened_unwidened | 10000 | 3.045 us | 0.0101 us | 0.0095 us | 0.99 |
| LoopOptimizations_nint_unsafe_unwidened_widened | 10000 | 2.251 us | 0.0101 us | 0.0095 us | 0.73 |
jakobbotsch commented 2 weeks ago

@SystematicChaos012 Thanks a lot for helping to investigate this.

I'm curious, does the difference reproduce for you with a simple standalone app? Make sure to set the environment variable DOTNET_TieredCompilation=0 before running this.

using System;
using System.Diagnostics;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

class Program
{
    static void Main(string[] args)
    {
        int[] arr = new int[10000];
        Stopwatch timer = new Stopwatch();

        for (int i = 0; i < 10; i++)
        {
            timer.Restart();
            for (int j = 0; j < 100000; j++)
            {
                Sum_nint_unsafe_unwidened_widened(arr);
            }
            Console.WriteLine("Sum_nint_unsafe_unwidened_widened: {0:F2} us per invoc", timer.Elapsed.TotalMilliseconds * 1000 / 100000);
        }

        for (int i = 0; i < 10; i++)
        {
            timer.Restart();
            for (int j = 0; j < 100000; j++)
            {
                Sum_int(arr);
            }
            Console.WriteLine("Sum_int: {0:F2} us per invoc", timer.Elapsed.TotalMilliseconds * 1000 / 100000);
        }
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    private static int Sum_int(int[] arr)
    {
        int sum = 0;
        for (int i = 0; i < arr.Length; i++)
        {
            sum += arr[i];
        }

        return sum;
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    private static int Sum_nint_unsafe_unwidened_widened(int[] arr)
    {
        int sum = 0;
        for (nint i = 0; (uint)i < (uint)arr.Length; i++)
        {
            sum += Unsafe.Add(ref MemoryMarshal.GetArrayDataReference(arr), i);
        }

        return sum;
    }
}

If it does, and you wouldn't mind, it would be very helpful if you could compare the VTune microarchitectural profile traces of the two versions. The difference between those traces should help us figure out definitively whether the CPU is benefitting from the widening itself, or whether other effects are at play (like, say, just the code size differences).
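
It is easy to forget the environment variable when rerunning the app, so a small startup check may help. This is a minimal sketch, not part of the repro: `JitConfigCheck` and its methods are hypothetical names, and the check only looks at the `DOTNET_TieredCompilation` environment variable, not at any other way tiered compilation might be configured.

```csharp
using System;

static class JitConfigCheck
{
    // Hypothetical helper: true when the variable value requests that
    // tiered compilation be turned off.
    public static bool IsTieredCompilationDisabled(string? value) => value == "0";

    // Prints a warning if the environment variable is missing or not "0".
    public static void WarnIfTieredCompilationEnabled()
    {
        string? value = Environment.GetEnvironmentVariable("DOTNET_TieredCompilation");
        if (!IsTieredCompilationDisabled(value))
        {
            Console.Error.WriteLine(
                "Warning: DOTNET_TieredCompilation is not set to 0; timings may include unoptimized code.");
        }
    }
}
```

Calling `JitConfigCheck.WarnIfTieredCompilationEnabled()` as the first statement of `Main` would flag a run started without the setting.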

SystematicChaos012 commented 2 weeks ago

ForLoop.tar.gz

@jakobbotsch I have packaged the results. This is my first time using VTune, so I hope I haven't made any mistakes.

jakobbotsch commented 1 week ago

Thanks a lot for those results. I will try to take a look at them soon. However, I think the original loop perf should be fixed in preview 7 by virtue of strength reduction. Feel free to try that out once preview 7 is available.
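
For readers unfamiliar with the term, strength reduction replaces the per-iteration address computation (base plus index times element size) with a pointer that is simply advanced each iteration. The sketch below writes that shape out by hand purely to illustrate the idea; `StrengthReductionSketch` and `Sum_StrengthReduced` are hypothetical names, and this is not a claim about the exact code the JIT produces.

```csharp
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

static class StrengthReductionSketch
{
    // Hand-written equivalent of Sum_int in which the index variable is
    // replaced by a reference that moves through the array.
    public static int Sum_StrengthReduced(int[] arr)
    {
        int sum = 0;

        // p starts at the first element; end points one past the last element.
        ref int p = ref MemoryMarshal.GetArrayDataReference(arr);
        ref int end = ref Unsafe.Add(ref p, arr.Length);

        while (Unsafe.IsAddressLessThan(ref p, ref end))
        {
            sum += p;                      // load through the current reference
            p = ref Unsafe.Add(ref p, 1);  // advance by one element
        }

        return sum;
    }
}
```

With this shape there is no 32-bit index left in the loop body, so the question of whether to widen it no longer arises.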