Open SystematicChaos012 opened 3 weeks ago
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch See info in area-owners.md if you want to be subscribed.
Note that such intensive loops are sensitive to code/data alignment. You can add [MemoryRandomization]
to see if the result changes in any way.
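For reference, here is a minimal sketch of how the attribute could be applied, assuming BenchmarkDotNet 0.13.x. The class name LoopBenchmarks and the body of Sum are placeholders for illustration, not the code from the issue:

```csharp
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

// Hypothetical benchmark class; only the [MemoryRandomization] attribute is the point here.
// The attribute asks BenchmarkDotNet to vary memory placement between iterations,
// which can expose alignment-sensitive results.
[MemoryRandomization]
public class LoopBenchmarks
{
    private int[] _arr = null!;

    [Params(100, 1000, 10000)]
    public int Count { get; set; }

    [GlobalSetup]
    public void GlobalSetup() => _arr = new int[Count];

    // Returning the sum (instead of discarding it) lets BenchmarkDotNet consume the result.
    [Benchmark]
    public int LoopOptimizations() => Sum(_arr);

    private static int Sum(int[] arr)
    {
        int sum = 0;
        for (int i = 0; i < arr.Length; i++)
        {
            sum += arr[i];
        }
        return sum;
    }
}

public static class Program
{
    public static void Main() => BenchmarkRunner.Run<LoopBenchmarks>();
}
```

With the attribute applied, the job header in the summary should include MemoryRandomization=True.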
Yes, I retested using the method you provided, and the results are still the same. I have updated the comment.
I was about to make the same comment as @huoyaoyuan. There are likely micro-architectural effects here causing the difference. E.g. on my laptop I get
BenchmarkDotNet v0.13.12, Windows 11 (10.0.22635.3858)
Intel Core i9-10885H CPU 2.40GHz, 1 CPU, 16 logical and 8 physical cores
.NET SDK 9.0.100-preview.6.24328.19
[Host] : .NET 8.0.6 (8.0.624.26715), X64 RyuJIT AVX2
Job-SQBFGS : .NET 8.0.6 (8.0.624.26715), X64 RyuJIT AVX2
Job-MCCPZY : .NET 9.0.0 (9.0.24.32707), X64 RyuJIT AVX2
Method | Runtime | Count | Mean | Error | StdDev |
---|---|---|---|---|---|
LoopOptimizations | .NET 8.0 | 100 | 35.33 ns | 0.615 ns | 0.821 ns |
LoopOptimizations | .NET 9.0 | 100 | 36.55 ns | 0.590 ns | 0.552 ns |
LoopOptimizations | .NET 8.0 | 1000 | 298.34 ns | 5.909 ns | 9.873 ns |
LoopOptimizations | .NET 9.0 | 1000 | 294.27 ns | 5.840 ns | 7.594 ns |
LoopOptimizations | .NET 8.0 | 10000 | 2,792.73 ns | 45.144 ns | 37.697 ns |
LoopOptimizations | .NET 9.0 | 10000 | 2,818.92 ns | 35.169 ns | 31.177 ns |
But I can also see some bimodality with the benchmark:
-------------------- Histogram --------------------
[2.714 us ; 2.816 us) | @@@@@@@@@@
[2.816 us ; 2.895 us) | @@@
---------------------------------------------------
(not as large of a difference as yours, but still a ~7% difference in perf from run-to-run)
One thing I noticed looking at the disassembly is that we have additional prolog in .NET 9, which affects the loop's relative starting offset.
Method Program:Sum(int[]):int (FullOpts)
; Emitting BLENDED_CODE for X64 with AVX - Windows
; FullOpts code
@@ -7,8 +166,9 @@
; No PGO data
G_M000_IG01: ;; offset=0x0000
+ sub rsp, 40
-G_M000_IG02: ;; offset=0x0000
+G_M000_IG02: ;; offset=0x0004
xor eax, eax
xor edx, edx
mov r8d, dword ptr [rcx+0x08]
@@ -16,15 +176,16 @@ G_M000_IG02: ;; offset=0x0000
jle SHORT G_M000_IG04
align [0 bytes for IG03]
-G_M000_IG03: ;; offset=0x000D
+G_M000_IG03: ;; offset=0x0011
- mov r10d, edx
- add eax, dword ptr [rcx+4*r10+0x10]
+ add eax, dword ptr [rcx+4*rdx+0x10]
inc edx
cmp r8d, edx
jg SHORT G_M000_IG03
G_M000_IG04: ;; offset=0x001C
+ add rsp, 40
ret
-; Total bytes of code 29
+; Total bytes of code 33
I opened #104658 for that.
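To make the offset shift concrete: in the diff above, the loop body (IG03) runs from 0x0D to 0x1C in .NET 8 (15 bytes) and from 0x11 to 0x1C in .NET 9 (11 bytes). The sketch below checks whether such a byte range straddles an alignment boundary. It assumes the method itself starts on that boundary, which the thread does not state, and the helper name is made up for illustration. It says nothing about whether a crossing actually matters on a given CPU.

```csharp
using System;

public static class LoopPlacement
{
    // Returns true if the byte range [start, start + length) spans more than one
    // boundary-sized block. Offsets are relative to the method start, so this is
    // only meaningful under the assumption that the method start is itself aligned.
    public static bool CrossesBoundary(int start, int length, int boundary)
    {
        if (length <= 0) throw new ArgumentOutOfRangeException(nameof(length));
        if (boundary <= 0) throw new ArgumentOutOfRangeException(nameof(boundary));

        int lastByte = start + length - 1;
        return start / boundary != lastByte / boundary;
    }
}
```

Under that assumption, CrossesBoundary(0x0D, 15, 16) is true and CrossesBoundary(0x11, 11, 16) is false, while neither loop crosses a 32-byte boundary.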
Can you try measuring the following version of the loop on your CPU?
private static int Sum(int[] arr)
{
int sum = 0;
for (nint i = 0; i < arr.Length; i++)
{
sum += arr[i];
}
return sum;
}
(I changed i from int to nint.)
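As a quick sanity check that the change only affects the index type and not the result, the two loop shapes can be compared side by side. This is a sketch with illustrative names; it is not part of the benchmark:

```csharp
public static class IndexTypeSketch
{
    // Original shape: 32-bit induction variable.
    public static int SumWithIntIndex(int[] arr)
    {
        int sum = 0;
        for (int i = 0; i < arr.Length; i++)
        {
            sum += arr[i];
        }
        return sum;
    }

    // Hand-widened shape: native-width induction variable, so the index,
    // the increment and the comparison are all done at pointer width.
    public static int SumWithNintIndex(int[] arr)
    {
        int sum = 0;
        for (nint i = 0; i < arr.Length; i++)
        {
            sum += arr[i];
        }
        return sum;
    }
}
```

Both methods return the same value for any array; only the generated code for the loop differs.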
@jakobbotsch Yes, the test results show that .NET 9 is faster than .NET 8.
// Summary
BenchmarkDotNet v0.13.12, Windows 11 (10.0.22631.3737/23H2/2023Update/SunValley3)
12th Gen Intel Core i5-12500, 1 CPU, 12 logical and 6 physical cores
.NET SDK 9.0.100-preview.6.24328.19
[Host] : .NET 9.0.0 (9.0.24.32707), X64 RyuJIT AVX2
Job-BEMAVR : .NET 8.0.7 (8.0.724.31311), X64 RyuJIT AVX2
Job-TZXKAF : .NET 9.0.0 (9.0.24.32707), X64 RyuJIT AVX2
OutlierMode=DontRemove MemoryRandomization=True
Method | Runtime | Count | Mean | Error | StdDev |
---|---|---|---|---|---|
LoopOptimizations | .NET 8.0 | 100 | 34.87 ns | 0.590 ns | 0.552 ns |
LoopOptimizations | .NET 9.0 | 100 | 34.08 ns | 0.225 ns | 0.210 ns |
LoopOptimizations | .NET 8.0 | 1000 | 232.28 ns | 1.543 ns | 1.443 ns |
LoopOptimizations | .NET 9.0 | 1000 | 231.85 ns | 1.279 ns | 1.196 ns |
LoopOptimizations | .NET 8.0 | 10000 | 2,254.78 ns | 44.108 ns | 49.026 ns |
LoopOptimizations | .NET 9.0 | 10000 | 2,244.34 ns | 33.817 ns | 31.632 ns |
Can you also compare to the following?
private static int Sum3(int[] arr)
{
int sum = 0;
for (nint i = 0; i < arr.Length; i++)
{
sum += Unsafe.Add(ref MemoryMarshal.GetArrayDataReference(arr), i);
}
return sum;
}
@jakobbotsch // Summary
BenchmarkDotNet v0.13.12, Windows 11 (10.0.22631.3737/23H2/2023Update/SunValley3)
12th Gen Intel Core i5-12500, 1 CPU, 12 logical and 6 physical cores
.NET SDK 9.0.100-preview.6.24328.19
[Host] : .NET 9.0.0 (9.0.24.32707), X64 RyuJIT AVX2
Job-RMLLKT : .NET 8.0.7 (8.0.724.31311), X64 RyuJIT AVX2
Job-OAQRLB : .NET 9.0.0 (9.0.24.32707), X64 RyuJIT AVX2
OutlierMode=DontRemove MemoryRandomization=True
Method | Runtime | Count | Mean | Error | StdDev |
---|---|---|---|---|---|
LoopOptimizations_nint | .NET 8.0 | 100 | 35.00 ns | 0.576 ns | 0.538 ns |
LoopOptimizations_nint_Unsafe | .NET 8.0 | 100 | 27.53 ns | 0.329 ns | 0.308 ns |
LoopOptimizations_nint | .NET 9.0 | 100 | 34.48 ns | 0.444 ns | 0.415 ns |
LoopOptimizations_nint_Unsafe | .NET 9.0 | 100 | 27.47 ns | 0.342 ns | 0.320 ns |
LoopOptimizations_nint | .NET 8.0 | 1000 | 234.83 ns | 3.291 ns | 3.078 ns |
LoopOptimizations_nint_Unsafe | .NET 8.0 | 1000 | 219.10 ns | 3.507 ns | 3.280 ns |
LoopOptimizations_nint | .NET 9.0 | 1000 | 232.04 ns | 1.633 ns | 1.528 ns |
LoopOptimizations_nint_Unsafe | .NET 9.0 | 1000 | 217.27 ns | 0.859 ns | 0.803 ns |
LoopOptimizations_nint | .NET 8.0 | 10000 | 2,223.50 ns | 43.215 ns | 44.379 ns |
LoopOptimizations_nint_Unsafe | .NET 8.0 | 10000 | 2,205.57 ns | 12.639 ns | 11.823 ns |
LoopOptimizations_nint | .NET 9.0 | 10000 | 2,227.77 ns | 21.852 ns | 20.440 ns |
LoopOptimizations_nint_Unsafe | .NET 9.0 | 10000 | 2,196.13 ns | 6.441 ns | 6.025 ns |
It's faster.
[Benchmark]
public void LoopOptimizations_nint()
{
int[] arr = _arr;
Sum_nint(arr);
}
[Benchmark]
public void LoopOptimizations_nint_Unsafe()
{
int[] arr = _arr;
Sum_nint_Unsafe(arr);
}
private static int Sum_nint(int[] arr)
{
int sum = 0;
for (nint i = 0; i < arr.Length; i++)
{
sum += arr[i];
}
return sum;
}
private static int Sum_nint_Unsafe(int[] arr)
{
int sum = 0;
for (nint i = 0; i < arr.Length; i++)
{
sum += Unsafe.Add(ref MemoryMarshal.GetArrayDataReference(arr), i);
}
return sum;
}
Thank you for measuring! Those are some interesting results. Here's another couple of variants if you don't mind:
private static int Sum_nint_unsafe_unwidened(int[] arr)
{
int sum = 0;
for (nint i = 0; i < arr.Length; i = (nint)((nuint)((uint)i + 1)))
{
sum += Unsafe.Add(ref MemoryMarshal.GetArrayDataReference(arr), i);
}
return sum;
}
private static int Sum_nint_unsafe_unwidened_unwidened(int[] arr)
{
int sum = 0;
for (nint i = 0; (uint)i < (uint)arr.Length; i = (nint)((nuint)((uint)i + 1)))
{
sum += Unsafe.Add(ref MemoryMarshal.GetArrayDataReference(arr), i);
}
return sum;
}
private static int Sum_nint_unsafe_unwidened_widened(int[] arr)
{
int sum = 0;
for (nint i = 0; (uint)i < (uint)arr.Length; i++)
{
sum += Unsafe.Add(ref MemoryMarshal.GetArrayDataReference(arr), i);
}
return sum;
}
Here are some properties of the codegen of each version:
Version | i widened | i++ widened | i < arr.Length widened | Bounds check | Relative loop offset |
---|---|---|---|---|---|
int | Yes | No | No | No | 0x11 |
nint | Yes | Yes | Yes | Yes | 0x20 (aligned by JIT) |
nint_unsafe | Yes | Yes | Yes | No | 0xD |
nint_unsafe_unwidened | Yes | No | Yes | No | 0xD |
nint_unsafe_unwidened_unwidened | Yes | No | No | No | 0xD |
nint_unsafe_unwidened_widened | Yes | Yes | No | No | 0xD |
I am trying to figure out whether your CPU is benefiting from the specific alignment of the loop (starting at 0xD) or whether it is benefiting from the widened i++/compare in the loop. IV widening does not widen the i++ or compare operations of the loop, so if this is what is benefiting your CPU, then we should consider having the JIT widen the operations when possible, even though this is a size increase in the codegen.
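For anyone puzzled by the i = (nint)((nuint)((uint)i + 1)) expression in the "unwidened" variants: it forces the increment to happen at 32 bits and then zero-extends the result back to native width. A small sketch of the two increments in isolation (the class and method names are illustrative only):

```csharp
public static class IncrementSketch
{
    // 32-bit increment: truncate to uint, add 1 with 32-bit wraparound,
    // then zero-extend back to native width.
    public static nint NarrowIncrement(nint i) => unchecked((nint)((nuint)((uint)i + 1)));

    // Native-width increment, as produced by a plain i++ on an nint.
    public static nint WideIncrement(nint i) => i + 1;
}
```

For every value a valid array index can take, the two produce the same result. In a 64-bit process they only diverge at the 32-bit wraparound point, where the narrow version wraps to 0 and the wide one keeps counting.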
@jakobbotsch No problem, this is the latest test result.
// Summary
BenchmarkDotNet v0.13.12, Windows 11 (10.0.22631.3737/23H2/2023Update/SunValley3)
12th Gen Intel Core i5-12500, 1 CPU, 12 logical and 6 physical cores
.NET SDK 9.0.100-preview.6.24328.19
[Host] : .NET 9.0.0 (9.0.24.32707), X64 RyuJIT AVX2
Job-DLCMFL : .NET 8.0.7 (8.0.724.31311), X64 RyuJIT AVX2
Job-EKWPPY : .NET 9.0.0 (9.0.24.32707), X64 RyuJIT AVX2
OutlierMode=DontRemove MemoryRandomization=True
Method | Runtime | Count | Mean | Error | StdDev |
---|---|---|---|---|---|
LoopOptimizations_int | .NET 8.0 | 100 | 36.94 ns | 0.392 ns | 0.366 ns |
LoopOptimizations_nint | .NET 8.0 | 100 | 35.23 ns | 0.124 ns | 0.116 ns |
LoopOptimizations_nint_unsafe | .NET 8.0 | 100 | 28.00 ns | 0.148 ns | 0.138 ns |
LoopOptimizations_nint_unsafe_unwidened | .NET 8.0 | 100 | 38.26 ns | 0.166 ns | 0.156 ns |
LoopOptimizations_nint_unsafe_unwidened_unwidened | .NET 8.0 | 100 | 38.44 ns | 0.271 ns | 0.254 ns |
LoopOptimizations_int | .NET 9.0 | 100 | 37.45 ns | 0.306 ns | 0.286 ns |
LoopOptimizations_nint | .NET 9.0 | 100 | 35.03 ns | 0.216 ns | 0.202 ns |
LoopOptimizations_nint_unsafe | .NET 9.0 | 100 | 28.23 ns | 0.258 ns | 0.242 ns |
LoopOptimizations_nint_unsafe_unwidened | .NET 9.0 | 100 | 38.26 ns | 0.224 ns | 0.209 ns |
LoopOptimizations_nint_unsafe_unwidened_unwidened | .NET 9.0 | 100 | 38.29 ns | 0.178 ns | 0.167 ns |
LoopOptimizations_int | .NET 8.0 | 1000 | 289.70 ns | 1.718 ns | 1.607 ns |
LoopOptimizations_nint | .NET 8.0 | 1000 | 239.52 ns | 2.086 ns | 1.951 ns |
LoopOptimizations_nint_unsafe | .NET 8.0 | 1000 | 222.10 ns | 0.598 ns | 0.560 ns |
LoopOptimizations_nint_unsafe_unwidened | .NET 8.0 | 1000 | 311.14 ns | 2.234 ns | 2.090 ns |
LoopOptimizations_nint_unsafe_unwidened_unwidened | .NET 8.0 | 1000 | 310.26 ns | 0.891 ns | 0.833 ns |
LoopOptimizations_int | .NET 9.0 | 1000 | 311.64 ns | 1.064 ns | 0.995 ns |
LoopOptimizations_nint | .NET 9.0 | 1000 | 252.78 ns | 4.829 ns | 4.517 ns |
LoopOptimizations_nint_unsafe | .NET 9.0 | 1000 | 223.81 ns | 4.492 ns | 4.411 ns |
LoopOptimizations_nint_unsafe_unwidened | .NET 9.0 | 1000 | 320.45 ns | 5.811 ns | 5.436 ns |
LoopOptimizations_nint_unsafe_unwidened_unwidened | .NET 9.0 | 1000 | 311.03 ns | 1.275 ns | 1.193 ns |
LoopOptimizations_int | .NET 8.0 | 10000 | 2,834.42 ns | 14.483 ns | 13.547 ns |
LoopOptimizations_nint | .NET 8.0 | 10000 | 2,310.24 ns | 30.116 ns | 28.170 ns |
LoopOptimizations_nint_unsafe | .NET 8.0 | 10000 | 2,244.80 ns | 5.507 ns | 5.152 ns |
LoopOptimizations_nint_unsafe_unwidened | .NET 8.0 | 10000 | 3,047.49 ns | 9.025 ns | 8.442 ns |
LoopOptimizations_nint_unsafe_unwidened_unwidened | .NET 8.0 | 10000 | 3,044.17 ns | 7.777 ns | 7.275 ns |
LoopOptimizations_int | .NET 9.0 | 10000 | 3,044.72 ns | 10.149 ns | 9.493 ns |
LoopOptimizations_nint | .NET 9.0 | 10000 | 2,307.03 ns | 11.474 ns | 10.732 ns |
LoopOptimizations_nint_unsafe | .NET 9.0 | 10000 | 2,250.03 ns | 11.906 ns | 11.137 ns |
LoopOptimizations_nint_unsafe_unwidened | .NET 9.0 | 10000 | 3,046.68 ns | 9.379 ns | 8.773 ns |
LoopOptimizations_nint_unsafe_unwidened_unwidened | .NET 9.0 | 10000 | 3,050.47 ns | 8.364 ns | 7.823 ns |
[Benchmark]
public void LoopOptimizations_int()
{
int[] arr = _arr;
Sum_int(arr);
}
[Benchmark]
public void LoopOptimizations_nint()
{
int[] arr = _arr;
Sum_nint(arr);
}
[Benchmark]
public void LoopOptimizations_nint_unsafe()
{
int[] arr = _arr;
Sum_nint_unsafe(arr);
}
[Benchmark]
public void LoopOptimizations_nint_unsafe_unwidened()
{
int[] arr = _arr;
Sum_nint_unsafe_unwidened(arr);
}
[Benchmark]
public void LoopOptimizations_nint_unsafe_unwidened_unwidened()
{
int[] arr = _arr;
Sum_nint_unsafe_unwidened_unwidened(arr);
}
private static int Sum_int(int[] arr)
{
int sum = 0;
for (int i = 0; i < arr.Length; i++)
{
sum += arr[i];
}
return sum;
}
private static int Sum_nint(int[] arr)
{
int sum = 0;
for (nint i = 0; i < arr.Length; i++)
{
sum += arr[i];
}
return sum;
}
private static int Sum_nint_unsafe(int[] arr)
{
int sum = 0;
for (nint i = 0; i < arr.Length; i++)
{
sum += Unsafe.Add(ref MemoryMarshal.GetArrayDataReference(arr), i);
}
return sum;
}
private static int Sum_nint_unsafe_unwidened(int[] arr)
{
int sum = 0;
for (nint i = 0; i < arr.Length; i = (nint)((nuint)((uint)i + 1)))
{
sum += Unsafe.Add(ref MemoryMarshal.GetArrayDataReference(arr), i);
}
return sum;
}
private static int Sum_nint_unsafe_unwidened_unwidened(int[] arr)
{
int sum = 0;
for (nint i = 0; (uint)i < (uint)arr.Length; i = (nint)((nuint)((uint)i + 1)))
{
sum += Unsafe.Add(ref MemoryMarshal.GetArrayDataReference(arr), i);
}
return sum;
}
Interesting -- it seems your CPU really benefits from having the i++ and compare widened as well. Actually, I am not sure whether it is just the i++ or also the compare that should be widened -- do you mind measuring the following variant too? This one only widens the i++. Hopefully the last one this time :-)
private static int Sum_nint_unsafe_unwidened_widened(int[] arr)
{
int sum = 0;
for (nint i = 0; (uint)i < (uint)arr.Length; i++)
{
sum += Unsafe.Add(ref MemoryMarshal.GetArrayDataReference(arr), i);
}
return sum;
}
@jakobbotsch
I hope the following content helps you.
BenchmarkDotNet v0.13.12, Windows 11 (10.0.22631.3880/23H2/2023Update/SunValley3)
12th Gen Intel Core i5-12500, 1 CPU, 12 logical and 6 physical cores
.NET SDK 9.0.100-preview.6.24328.19
[Host] : .NET 9.0.0 (9.0.24.32707), X64 RyuJIT AVX2
Job-HGUYGD : .NET 8.0.7 (8.0.724.31311), X64 RyuJIT AVX2
Job-YOTCVU : .NET 9.0.0 (9.0.24.32707), X64 RyuJIT AVX2
OutlierMode=DontRemove MemoryRandomization=True
Method | Runtime | Count | Mean | Error | StdDev |
---|---|---|---|---|---|
ForLoop_int | .NET 8.0 | 100 | 36.08 ns | 0.562 ns | 0.526 ns |
ForLoop_nint | .NET 8.0 | 100 | 34.68 ns | 0.687 ns | 0.735 ns |
ForLoop_nint_unsafe | .NET 8.0 | 100 | 27.53 ns | 0.260 ns | 0.243 ns |
ForLoop_nint_unsafe_unwidened | .NET 8.0 | 100 | 37.74 ns | 0.445 ns | 0.417 ns |
ForLoop_nint_unsafe_unwidened_unwidened | .NET 8.0 | 100 | 38.11 ns | 0.790 ns | 0.776 ns |
ForLoop_nint_unsafe_unwidened_widened | .NET 8.0 | 100 | 27.69 ns | 0.297 ns | 0.278 ns |
ForLoop_int | .NET 9.0 | 100 | 37.92 ns | 0.787 ns | 0.967 ns |
ForLoop_nint | .NET 9.0 | 100 | 34.83 ns | 0.526 ns | 0.492 ns |
ForLoop_nint_unsafe | .NET 9.0 | 100 | 27.86 ns | 0.463 ns | 0.433 ns |
ForLoop_nint_unsafe_unwidened | .NET 9.0 | 100 | 37.47 ns | 0.370 ns | 0.346 ns |
ForLoop_nint_unsafe_unwidened_unwidened | .NET 9.0 | 100 | 38.08 ns | 0.505 ns | 0.472 ns |
ForLoop_nint_unsafe_unwidened_widened | .NET 9.0 | 100 | 27.49 ns | 0.253 ns | 0.236 ns |
ForLoop_int | .NET 8.0 | 1000 | 284.28 ns | 2.108 ns | 1.972 ns |
ForLoop_nint | .NET 8.0 | 1000 | 234.87 ns | 3.662 ns | 3.426 ns |
ForLoop_nint_unsafe | .NET 8.0 | 1000 | 220.50 ns | 3.959 ns | 3.703 ns |
ForLoop_nint_unsafe_unwidened | .NET 8.0 | 1000 | 306.49 ns | 0.977 ns | 0.914 ns |
ForLoop_nint_unsafe_unwidened_unwidened | .NET 8.0 | 1000 | 308.39 ns | 3.871 ns | 3.621 ns |
ForLoop_nint_unsafe_unwidened_widened | .NET 8.0 | 1000 | 218.25 ns | 0.508 ns | 0.476 ns |
ForLoop_int | .NET 9.0 | 1000 | 307.51 ns | 1.491 ns | 1.394 ns |
ForLoop_nint | .NET 9.0 | 1000 | 232.28 ns | 0.539 ns | 0.504 ns |
ForLoop_nint_unsafe | .NET 9.0 | 1000 | 219.25 ns | 1.951 ns | 1.825 ns |
ForLoop_nint_unsafe_unwidened | .NET 9.0 | 1000 | 309.28 ns | 3.497 ns | 3.271 ns |
ForLoop_nint_unsafe_unwidened_unwidened | .NET 9.0 | 1000 | 308.69 ns | 5.287 ns | 4.946 ns |
ForLoop_nint_unsafe_unwidened_widened | .NET 9.0 | 1000 | 220.08 ns | 1.661 ns | 1.554 ns |
ForLoop_int | .NET 8.0 | 10000 | 2,781.48 ns | 19.513 ns | 18.252 ns |
ForLoop_nint | .NET 8.0 | 10000 | 2,236.87 ns | 22.613 ns | 21.152 ns |
ForLoop_nint_unsafe | .NET 8.0 | 10000 | 2,294.69 ns | 45.756 ns | 48.959 ns |
ForLoop_nint_unsafe_unwidened | .NET 8.0 | 10000 | 3,014.57 ns | 40.097 ns | 37.507 ns |
ForLoop_nint_unsafe_unwidened_unwidened | .NET 8.0 | 10000 | 3,030.63 ns | 44.037 ns | 41.193 ns |
ForLoop_nint_unsafe_unwidened_widened | .NET 8.0 | 10000 | 2,216.69 ns | 34.966 ns | 32.707 ns |
ForLoop_int | .NET 9.0 | 10000 | 3,000.21 ns | 10.061 ns | 9.411 ns |
ForLoop_nint | .NET 9.0 | 10000 | 2,227.37 ns | 12.335 ns | 11.538 ns |
ForLoop_nint_unsafe | .NET 9.0 | 10000 | 2,214.29 ns | 25.932 ns | 24.257 ns |
ForLoop_nint_unsafe_unwidened | .NET 9.0 | 10000 | 3,002.29 ns | 16.117 ns | 15.076 ns |
ForLoop_nint_unsafe_unwidened_unwidened | .NET 9.0 | 10000 | 3,024.86 ns | 40.873 ns | 38.233 ns |
ForLoop_nint_unsafe_unwidened_widened | .NET 9.0 | 10000 | 2,212.96 ns | 29.958 ns | 28.023 ns |
Thanks for running those measurements. I still am unable to reproduce the results, even on my own Intel CPU:
BenchmarkDotNet v0.13.12, Windows 11 (10.0.22635.3858)
Intel Core i9-10885H CPU 2.40GHz, 1 CPU, 16 logical and 8 physical cores
.NET SDK 9.0.100-preview.6.24328.19
[Host] : .NET 8.0.7 (8.0.724.31311), X64 RyuJIT AVX2
Job-JKDKMJ : .NET 9.0.0 (9.0.24.32707), X64 RyuJIT AVX2
Runtime=.NET 9.0
Method | Count | Mean | Error | StdDev | Ratio |
---|---|---|---|---|---|
LoopOptimizations_int | 10000 | 2.766 us | 0.0204 us | 0.0160 us | 1.00 |
LoopOptimizations_nint | 10000 | 2.459 us | 0.0356 us | 0.0333 us | 0.89 |
LoopOptimizations_nint_unsafe | 10000 | 2.730 us | 0.0350 us | 0.0327 us | 0.99 |
LoopOptimizations_nint_unsafe_unwidened | 10000 | 2.732 us | 0.0328 us | 0.0307 us | 0.99 |
LoopOptimizations_nint_unsafe_unwidened_unwidened | 10000 | 2.754 us | 0.0187 us | 0.0166 us | 1.00 |
LoopOptimizations_nint_unsafe_unwidened_widened | 10000 | 2.739 us | 0.0105 us | 0.0087 us | 0.99 |
LoopOptimizations_nint being faster is #104665.
I still suspect that there is some microarchitectural artifact here... Going to see if I can run the benchmarks on some more CPUs.
@EgorBot -arm64 -intel -amd -commit 42b2b19e883f06af5771b5d85b26af263c62e781 --disasm
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args);
public class Benchmark
{
private int[] _arr = null!;
[Params(10000)]
public int Count { get; set; }
[GlobalSetup]
public void GlobalSetup()
{
_arr = new int[Count];
}
[Benchmark(Baseline = true)]
public void LoopOptimizations_int()
{
int[] arr = _arr;
Sum_int(arr);
}
[Benchmark]
public void LoopOptimizations_nint()
{
int[] arr = _arr;
Sum_nint(arr);
}
[Benchmark]
public void LoopOptimizations_nint_unsafe()
{
int[] arr = _arr;
Sum_nint_unsafe(arr);
}
[Benchmark]
public void LoopOptimizations_nint_unsafe_unwidened()
{
int[] arr = _arr;
Sum_nint_unsafe_unwidened(arr);
}
[Benchmark]
public void LoopOptimizations_nint_unsafe_unwidened_unwidened()
{
int[] arr = _arr;
Sum_nint_unsafe_unwidened_unwidened(arr);
}
[Benchmark]
public void LoopOptimizations_nint_unsafe_unwidened_widened()
{
int[] arr = _arr;
Sum_nint_unsafe_unwidened_widened(arr);
}
private static int Sum_int(int[] arr)
{
int sum = 0;
for (int i = 0; i < arr.Length; i++)
{
sum += arr[i];
}
return sum;
}
private static int Sum_nint(int[] arr)
{
int sum = 0;
for (nint i = 0; i < arr.Length; i++)
{
sum += arr[i];
}
return sum;
}
private static int Sum_nint_unsafe(int[] arr)
{
int sum = 0;
for (nint i = 0; i < arr.Length; i++)
{
sum += Unsafe.Add(ref MemoryMarshal.GetArrayDataReference(arr), i);
}
return sum;
}
private static int Sum_nint_unsafe_unwidened(int[] arr)
{
int sum = 0;
for (nint i = 0; i < arr.Length; i = (nint)((nuint)((uint)i + 1)))
{
sum += Unsafe.Add(ref MemoryMarshal.GetArrayDataReference(arr), i);
}
return sum;
}
private static int Sum_nint_unsafe_unwidened_unwidened(int[] arr)
{
int sum = 0;
for (nint i = 0; (uint)i < (uint)arr.Length; i = (nint)((nuint)((uint)i + 1)))
{
sum += Unsafe.Add(ref MemoryMarshal.GetArrayDataReference(arr), i);
}
return sum;
}
private static int Sum_nint_unsafe_unwidened_widened(int[] arr)
{
int sum = 0;
for (nint i = 0; (uint)i < (uint)arr.Length; i++)
{
sum += Unsafe.Add(ref MemoryMarshal.GetArrayDataReference(arr), i);
}
return sum;
}
}
@EgorBot -arm64 -intel -amd -commit 42b2b19e883f06af5771b5d85b26af263c62e781 vs c09ec6552f11b74a2e825cb63cb7c45f5552d3f2 --disasm
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args);
public class Benchmark
{
private int[] _arr = null!;
[Params(10000)]
public int Count { get; set; }
[GlobalSetup]
public void GlobalSetup()
{
_arr = new int[Count];
}
[Benchmark(Baseline = true)]
public void LoopOptimizations_int()
{
int[] arr = _arr;
Sum_int(arr);
}
[Benchmark]
public void LoopOptimizations_nint()
{
int[] arr = _arr;
Sum_nint(arr);
}
[Benchmark]
public void LoopOptimizations_nint_unsafe()
{
int[] arr = _arr;
Sum_nint_unsafe(arr);
}
[Benchmark]
public void LoopOptimizations_nint_unsafe_unwidened()
{
int[] arr = _arr;
Sum_nint_unsafe_unwidened(arr);
}
[Benchmark]
public void LoopOptimizations_nint_unsafe_unwidened_unwidened()
{
int[] arr = _arr;
Sum_nint_unsafe_unwidened_unwidened(arr);
}
[Benchmark]
public void LoopOptimizations_nint_unsafe_unwidened_widened()
{
int[] arr = _arr;
Sum_nint_unsafe_unwidened_widened(arr);
}
private static int Sum_int(int[] arr)
{
int sum = 0;
for (int i = 0; i < arr.Length; i++)
{
sum += arr[i];
}
return sum;
}
private static int Sum_nint(int[] arr)
{
int sum = 0;
for (nint i = 0; i < arr.Length; i++)
{
sum += arr[i];
}
return sum;
}
private static int Sum_nint_unsafe(int[] arr)
{
int sum = 0;
for (nint i = 0; i < arr.Length; i++)
{
sum += Unsafe.Add(ref MemoryMarshal.GetArrayDataReference(arr), i);
}
return sum;
}
private static int Sum_nint_unsafe_unwidened(int[] arr)
{
int sum = 0;
for (nint i = 0; i < arr.Length; i = (nint)((nuint)((uint)i + 1)))
{
sum += Unsafe.Add(ref MemoryMarshal.GetArrayDataReference(arr), i);
}
return sum;
}
private static int Sum_nint_unsafe_unwidened_unwidened(int[] arr)
{
int sum = 0;
for (nint i = 0; (uint)i < (uint)arr.Length; i = (nint)((nuint)((uint)i + 1)))
{
sum += Unsafe.Add(ref MemoryMarshal.GetArrayDataReference(arr), i);
}
return sum;
}
private static int Sum_nint_unsafe_unwidened_widened(int[] arr)
{
int sum = 0;
for (nint i = 0; (uint)i < (uint)arr.Length; i++)
{
sum += Unsafe.Add(ref MemoryMarshal.GetArrayDataReference(arr), i);
}
return sum;
}
}
❌ Not more than two CPU architectures at once, please.
@EgorBot -intel -amd -commit 42b2b19e883f06af5771b5d85b26af263c62e781 vs c09ec6552f11b74a2e825cb63cb7c45f5552d3f2 --disasm
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args);
public class Benchmark
{
private int[] _arr = null!;
[Params(10000)]
public int Count { get; set; }
[GlobalSetup]
public void GlobalSetup()
{
_arr = new int[Count];
}
[Benchmark(Baseline = true)]
public void LoopOptimizations_int()
{
int[] arr = _arr;
Sum_int(arr);
}
[Benchmark]
public void LoopOptimizations_nint()
{
int[] arr = _arr;
Sum_nint(arr);
}
[Benchmark]
public void LoopOptimizations_nint_unsafe()
{
int[] arr = _arr;
Sum_nint_unsafe(arr);
}
[Benchmark]
public void LoopOptimizations_nint_unsafe_unwidened()
{
int[] arr = _arr;
Sum_nint_unsafe_unwidened(arr);
}
[Benchmark]
public void LoopOptimizations_nint_unsafe_unwidened_unwidened()
{
int[] arr = _arr;
Sum_nint_unsafe_unwidened_unwidened(arr);
}
[Benchmark]
public void LoopOptimizations_nint_unsafe_unwidened_widened()
{
int[] arr = _arr;
Sum_nint_unsafe_unwidened_widened(arr);
}
private static int Sum_int(int[] arr)
{
int sum = 0;
for (int i = 0; i < arr.Length; i++)
{
sum += arr[i];
}
return sum;
}
private static int Sum_nint(int[] arr)
{
int sum = 0;
for (nint i = 0; i < arr.Length; i++)
{
sum += arr[i];
}
return sum;
}
private static int Sum_nint_unsafe(int[] arr)
{
int sum = 0;
for (nint i = 0; i < arr.Length; i++)
{
sum += Unsafe.Add(ref MemoryMarshal.GetArrayDataReference(arr), i);
}
return sum;
}
private static int Sum_nint_unsafe_unwidened(int[] arr)
{
int sum = 0;
for (nint i = 0; i < arr.Length; i = (nint)((nuint)((uint)i + 1)))
{
sum += Unsafe.Add(ref MemoryMarshal.GetArrayDataReference(arr), i);
}
return sum;
}
private static int Sum_nint_unsafe_unwidened_unwidened(int[] arr)
{
int sum = 0;
for (nint i = 0; (uint)i < (uint)arr.Length; i = (nint)((nuint)((uint)i + 1)))
{
sum += Unsafe.Add(ref MemoryMarshal.GetArrayDataReference(arr), i);
}
return sum;
}
private static int Sum_nint_unsafe_unwidened_widened(int[] arr)
{
int sum = 0;
for (nint i = 0; (uint)i < (uint)arr.Length; i++)
{
sum += Unsafe.Add(ref MemoryMarshal.GetArrayDataReference(arr), i);
}
return sum;
}
}
@EgorBot -arm64 -commit 42b2b19e883f06af5771b5d85b26af263c62e781 vs c09ec6552f11b74a2e825cb63cb7c45f5552d3f2 --disasm
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args);
public class Benchmark
{
private int[] _arr = null!;
[Params(10000)]
public int Count { get; set; }
[GlobalSetup]
public void GlobalSetup()
{
_arr = new int[Count];
}
[Benchmark(Baseline = true)]
public void LoopOptimizations_int()
{
int[] arr = _arr;
Sum_int(arr);
}
[Benchmark]
public void LoopOptimizations_nint()
{
int[] arr = _arr;
Sum_nint(arr);
}
[Benchmark]
public void LoopOptimizations_nint_unsafe()
{
int[] arr = _arr;
Sum_nint_unsafe(arr);
}
[Benchmark]
public void LoopOptimizations_nint_unsafe_unwidened()
{
int[] arr = _arr;
Sum_nint_unsafe_unwidened(arr);
}
[Benchmark]
public void LoopOptimizations_nint_unsafe_unwidened_unwidened()
{
int[] arr = _arr;
Sum_nint_unsafe_unwidened_unwidened(arr);
}
[Benchmark]
public void LoopOptimizations_nint_unsafe_unwidened_widened()
{
int[] arr = _arr;
Sum_nint_unsafe_unwidened_widened(arr);
}
private static int Sum_int(int[] arr)
{
int sum = 0;
for (int i = 0; i < arr.Length; i++)
{
sum += arr[i];
}
return sum;
}
private static int Sum_nint(int[] arr)
{
int sum = 0;
for (nint i = 0; i < arr.Length; i++)
{
sum += arr[i];
}
return sum;
}
private static int Sum_nint_unsafe(int[] arr)
{
int sum = 0;
for (nint i = 0; i < arr.Length; i++)
{
sum += Unsafe.Add(ref MemoryMarshal.GetArrayDataReference(arr), i);
}
return sum;
}
private static int Sum_nint_unsafe_unwidened(int[] arr)
{
int sum = 0;
for (nint i = 0; i < arr.Length; i = (nint)((nuint)((uint)i + 1)))
{
sum += Unsafe.Add(ref MemoryMarshal.GetArrayDataReference(arr), i);
}
return sum;
}
private static int Sum_nint_unsafe_unwidened_unwidened(int[] arr)
{
int sum = 0;
for (nint i = 0; (uint)i < (uint)arr.Length; i = (nint)((nuint)((uint)i + 1)))
{
sum += Unsafe.Add(ref MemoryMarshal.GetArrayDataReference(arr), i);
}
return sum;
}
private static int Sum_nint_unsafe_unwidened_widened(int[] arr)
{
int sum = 0;
for (nint i = 0; (uint)i < (uint)arr.Length; i++)
{
sum += Unsafe.Add(ref MemoryMarshal.GetArrayDataReference(arr), i);
}
return sum;
}
}
@jakobbotsch
I don't know how else I can help you from my side. I tried using another computer with a 13600KF CPU, and these are my results.
BenchmarkDotNet v0.13.12, Windows 11 (10.0.22631.3880/23H2/2023Update/SunValley3)
13th Gen Intel Core i5-13600KF, 1 CPU, 20 logical and 14 physical cores
.NET SDK 9.0.100-preview.6.24328.19
[Host] : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2
Job-YKBOTF : .NET 9.0.0 (9.0.24.32707), X64 RyuJIT AVX2
OutlierMode=DontRemove Runtime=.NET 9.0 MemoryRandomization=True
Method | Count | Mean | Error | StdDev | Median | Ratio | RatioSD |
---|---|---|---|---|---|---|---|
LoopOptimizations_int | 10000 | 2.724 us | 0.0181 us | 0.0169 us | 2.720 us | 1.00 | 0.00 |
LoopOptimizations_nint | 10000 | 3.188 us | 0.4768 us | 1.4057 us | 2.005 us | 1.01 | 0.48 |
LoopOptimizations_nint_unsafe | 10000 | 1.983 us | 0.0074 us | 0.0070 us | 1.983 us | 0.73 | 0.00 |
LoopOptimizations_nint_unsafe_unwidened | 10000 | 2.691 us | 0.0032 us | 0.0030 us | 2.689 us | 0.99 | 0.01 |
LoopOptimizations_nint_unsafe_unwidened_unwidened | 10000 | 2.689 us | 0.0015 us | 0.0014 us | 2.689 us | 0.99 | 0.01 |
LoopOptimizations_nint_unsafe_unwidened_widened | 10000 | 1.974 us | 0.0019 us | 0.0018 us | 1.973 us | 0.72 | 0.00 |
I tried using another computer with a 13600KF CPU, and these are my results.
A kind reminder to be aware of the asymmetric P/E cores. Keep your benchmark window in the foreground or set the affinity.
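If setting the affinity by hand is inconvenient, it can also be done from code in a standalone test app. The following is a rough sketch, assuming Windows or Linux and assuming that the first N logical CPUs are the P-cores, which is machine-specific and should be verified. The class and method names are made up for illustration:

```csharp
using System;
using System.Diagnostics;

public static class AffinitySketch
{
    // Builds a bitmask with the lowest `count` bits set, e.g. 6 -> 0x3F (logical CPUs 0-5).
    public static long MaskForFirstCpus(int count)
    {
        if (count < 1 || count > 62) throw new ArgumentOutOfRangeException(nameof(count));
        return (1L << count) - 1;
    }

    // Restricts the current process to the first `count` logical CPUs.
    // Returns false on platforms where Process.ProcessorAffinity is not supported.
    // The OS may reject a mask that names CPUs the machine does not have.
    public static bool TryPinToFirstCpus(int count)
    {
        if (OperatingSystem.IsWindows() || OperatingSystem.IsLinux())
        {
            using Process self = Process.GetCurrentProcess();
            self.ProcessorAffinity = (IntPtr)MaskForFirstCpus(count);
            return true;
        }
        return false;
    }
}
```

Calling AffinitySketch.TryPinToFirstCpus(6) at the start of Main would restrict the process to logical CPUs 0 through 5. Whether those are P-cores depends on how the OS enumerates the cores.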
Sorry, I set the affinity to CPU 0-5 and kept the window in the foreground. I ran the test again.
13600KF
BenchmarkDotNet v0.13.12, Windows 11 (10.0.22631.3880/23H2/2023Update/SunValley3)
13th Gen Intel Core i5-13600KF, 1 CPU, 20 logical and 14 physical cores
.NET SDK 9.0.100-preview.6.24328.19
[Host] : .NET 9.0.0 (9.0.24.32707), X64 RyuJIT AVX2
Job-YUEOQM : .NET 9.0.0 (9.0.24.32707), X64 RyuJIT AVX2
OutlierMode=DontRemove Runtime=.NET 9.0 MemoryRandomization=True
Method | Count | Mean | Error | StdDev | Ratio |
---|---|---|---|---|---|
LoopOptimizations_int | 10000 | 2.704 us | 0.0139 us | 0.0130 us | 1.00 |
LoopOptimizations_nint | 10000 | 1.990 us | 0.0036 us | 0.0034 us | 0.74 |
LoopOptimizations_nint_unsafe | 10000 | 1.976 us | 0.0026 us | 0.0024 us | 0.73 |
LoopOptimizations_nint_unsafe_unwidened | 10000 | 2.691 us | 0.0037 us | 0.0035 us | 1.00 |
LoopOptimizations_nint_unsafe_unwidened_unwidened | 10000 | 2.693 us | 0.0046 us | 0.0043 us | 1.00 |
LoopOptimizations_nint_unsafe_unwidened_widened | 10000 | 1.978 us | 0.0039 us | 0.0036 us | 0.73 |
12500 (Only put window in foreground)
BenchmarkDotNet v0.13.12, Windows 11 (10.0.22631.3880/23H2/2023Update/SunValley3)
12th Gen Intel Core i5-12500, 1 CPU, 12 logical and 6 physical cores
.NET SDK 9.0.100-preview.6.24328.19
[Host] : .NET 9.0.0 (9.0.24.32707), X64 RyuJIT AVX2
Job-PVQIZI : .NET 9.0.0 (9.0.24.32707), X64 RyuJIT AVX2
OutlierMode=DontRemove Runtime=.NET 9.0 MemoryRandomization=True
Method | Count | Mean | Error | StdDev | Ratio |
---|---|---|---|---|---|
LoopOptimizations_int | 10000 | 3.071 us | 0.0138 us | 0.0129 us | 1.00 |
LoopOptimizations_nint | 10000 | 2.328 us | 0.0283 us | 0.0265 us | 0.76 |
LoopOptimizations_nint_unsafe | 10000 | 2.252 us | 0.0171 us | 0.0160 us | 0.73 |
LoopOptimizations_nint_unsafe_unwidened | 10000 | 3.050 us | 0.0064 us | 0.0059 us | 0.99 |
LoopOptimizations_nint_unsafe_unwidened_unwidened | 10000 | 3.045 us | 0.0101 us | 0.0095 us | 0.99 |
LoopOptimizations_nint_unsafe_unwidened_widened | 10000 | 2.251 us | 0.0101 us | 0.0095 us | 0.73 |
@SystematicChaos012 Thanks a lot for helping to investigate this.
I'm curious, does the difference reproduce for you with a simple standalone app? Make sure to set the environment variable DOTNET_TieredCompilation=0 before running this.
using System;
using System.Diagnostics;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
class Program
{
static void Main(string[] args)
{
int[] arr = new int[10000];
Stopwatch timer = new Stopwatch();
for (int i = 0; i < 10; i++)
{
timer.Restart();
for (int j = 0; j < 100000; j++)
{
Sum_nint_unsafe_unwidened_widened(arr);
}
Console.WriteLine("Sum_nint_unsafe_unwidened_widened: {0:F2} us per invoc", timer.Elapsed.TotalMilliseconds * 1000 / 100000);
}
for (int i = 0; i < 10; i++)
{
timer.Restart();
for (int j = 0; j < 100000; j++)
{
Sum_int(arr);
}
Console.WriteLine("Sum_int: {0:F2} us per invoc", timer.Elapsed.TotalMilliseconds * 1000 / 100000);
}
}
[MethodImpl(MethodImplOptions.NoInlining)]
private static int Sum_int(int[] arr)
{
int sum = 0;
for (int i = 0; i < arr.Length; i++)
{
sum += arr[i];
}
return sum;
}
[MethodImpl(MethodImplOptions.NoInlining)]
private static int Sum_nint_unsafe_unwidened_widened(int[] arr)
{
int sum = 0;
for (nint i = 0; (uint)i < (uint)arr.Length; i++)
{
sum += Unsafe.Add(ref MemoryMarshal.GetArrayDataReference(arr), i);
}
return sum;
}
}
If it does, and you wouldn't mind, it would be very helpful if you could compare the VTune microarchitectural profile traces of the different versions. The difference between those two traces should help us figure out definitively whether the CPU is benefiting from the widening itself, or whether there are other effects (like, say, just the code size differences).
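Since the standalone measurement is only meaningful with tiering off, a small guard at the top of Main can catch a forgotten environment variable. This is a sketch with made-up names; it only inspects the DOTNET_TieredCompilation environment variable and does not detect other ways of configuring tiering (for example, project or runtime configuration settings):

```csharp
using System;

public static class TieringCheck
{
    public const string VariableName = "DOTNET_TieredCompilation";

    // Treats only an explicit "0" as "tiered compilation disabled".
    public static bool IsDisabled(string? value) => value is not null && value.Trim() == "0";

    // Prints a warning to stderr if the variable is missing or not "0".
    public static void WarnIfTieringEnabled()
    {
        string? value = Environment.GetEnvironmentVariable(VariableName);
        if (!IsDisabled(value))
        {
            Console.Error.WriteLine(
                $"Warning: {VariableName} is not set to 0; timings may include unoptimized tier-0 code.");
        }
    }
}
```

Calling TieringCheck.WarnIfTieringEnabled() before the timing loops is enough; it does not change any runtime behavior.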
@jakobbotsch I have packaged the results. This is my first time using VTune, so I hope I haven't made any mistakes.
Thanks a lot for those results. I will try to take a look at them soon. However, I think the original loop perf should be fixed in preview 7 by virtue of strength reduction. Feel free to try that out once preview 7 is available.
Description
I found a performance regression through benchmark testing of the Loop Optimizations: IV Widening sample.
// Summary
BenchmarkDotNet v0.13.12, Windows 11 (10.0.22631.3737/23H2/2023Update/SunValley3)
12th Gen Intel Core i5-12500, 1 CPU, 12 logical and 6 physical cores
.NET SDK 9.0.100-preview.6.24328.19
[Host] : .NET 9.0.0 (9.0.24.32707), X64 RyuJIT AVX2
Job-SMYUOB : .NET 8.0.7 (8.0.724.31311), X64 RyuJIT AVX2
Job-VNQOKB : .NET 9.0.0 (9.0.24.32707), X64 RyuJIT AVX2
OutlierMode=DontRemove MemoryRandomization=True
Reproduction Steps
Expected behavior
.NET 9 should be faster or at least as fast as .NET 8.
Actual behavior
.NET 9 is slower than .NET 8.