Closed performanceautofiler[bot] closed 3 years ago
Tagging subscribers to this area: @GrabYourPitchForks, @dotnet/area-system-memory See info in area-owners.md if you want to be subscribed.
Author: | performanceautofiler[bot] |
---|---|
Assignees: | - |
Labels: | `area-System.Memory`, `untriaged` |
Milestone: | - |
The ASM diff is:
diff --git "a/.\\old_disasm.txt" "b/.\\new_disasm.txt"
index c28988d..1bf670f 100644
--- "a/.\\old_disasm.txt"
+++ "b/.\\new_disasm.txt"
@@ -8,62 +8,64 @@
; Final local variable assignments
;
; V00 arg0 [V00,T05] ( 3, 6 ) byref -> rcx single-def
-; V01 arg1 [V01,T07] ( 3, 3 ) int -> rdx single-def
-; V02 arg2 [V02,T06] ( 3, 6 ) ubyte -> [rsp+18H] do-not-enreg[F] ld-addr-op single-def
-; V03 loc0 [V03,T04] ( 5, 11.50) int -> rax
+; V01 arg1 [V01,T06] ( 3, 3 ) int -> rdx single-def
+; V02 arg2 [V02,T07] ( 3, 3 ) ubyte -> r8 ld-addr-op single-def
+; V03 loc0 [V03,T04] ( 5, 11.50) int -> r9
; V04 loc1 [V04,T02] ( 5, 12 ) int -> rdx
; V05 loc2 [V05,T01] ( 5, 12.50) int -> r8
-; V06 loc3 [V06,T03] ( 3, 12 ) int -> r10
+; V06 loc3 [V06,T03] ( 3, 12 ) int -> r11
;# V07 OutArgs [V07 ] ( 1, 1 ) lclBlk ( 0) [rsp+00H] "OutgoingArgSpace"
-; V08 tmp1 [V08,T00] ( 2, 16 ) ubyte -> r9 "Inlining Arg"
+; V08 tmp1 [V08,T00] ( 2, 16 ) ubyte -> r10 "Inlining Arg"
+; V09 cse0 [V09,T08] ( 2, 5 ) int -> rax "CSE - aggressive"
;
; Lcl frame size = 0
G_M10589_IG01: ;; offset=0000H
- 4489442418 mov dword ptr [rsp+18H], r8d
- ;; bbWeight=1 PerfScore 1.00
-G_M10589_IG02: ;; offset=0005H
- 33C0 xor eax, eax
+ ;; bbWeight=1 PerfScore 0.00
+G_M10589_IG02: ;; offset=0000H
+ 4533C9 xor r9d, r9d
FFCA dec edx
85D2 test edx, edx
- 7C2D jl SHORT G_M10589_IG07
+ 7C2F jl SHORT G_M10589_IG07
+ 410FB6C0 movzx rax, r8b
align [0 bytes]
- ;; bbWeight=1 PerfScore 1.75
+ ;; bbWeight=1 PerfScore 2.00
G_M10589_IG03: ;; offset=000DH
- 448D0402 lea r8d, [rdx+rax]
+ 468D040A lea r8d, [rdx+r9]
41D1E8 shr r8d, 1
- 4D63C8 movsxd r9, r8d
- 460FB60C09 movzx r9, byte ptr [rcx+r9]
- 440FB6542418 movzx r10, byte ptr [rsp+18H]
- 452BD1 sub r10d, r9d
- 7416 je SHORT G_M10589_IG09
- 4585D2 test r10d, r10d
+ 4D63D0 movsxd r10, r8d
+ 460FB61411 movzx r10, byte ptr [rcx+r10]
+ 448BD8 mov r11d, eax
+ 452BDA sub r11d, r10d
+ 741A je SHORT G_M10589_IG09
+ 4585DB test r11d, r11d
7E06 jle SHORT G_M10589_IG05
- ;; bbWeight=4 PerfScore 27.00
-G_M10589_IG04: ;; offset=002CH
- 418D4001 lea eax, [r8+1]
+ ;; bbWeight=4 PerfScore 24.00
+G_M10589_IG04: ;; offset=0029H
+ 458D4801 lea r9d, [r8+1]
EB04 jmp SHORT G_M10589_IG06
;; bbWeight=2 PerfScore 5.00
-G_M10589_IG05: ;; offset=0032H
+G_M10589_IG05: ;; offset=002FH
418D50FF lea edx, [r8-1]
;; bbWeight=2 PerfScore 1.00
-G_M10589_IG06: ;; offset=0036H
- 3BC2 cmp eax, edx
- 7ED3 jle SHORT G_M10589_IG03
+G_M10589_IG06: ;; offset=0033H
+ 443BCA cmp r9d, edx
+ 7ED5 jle SHORT G_M10589_IG03
;; bbWeight=4 PerfScore 5.00
-G_M10589_IG07: ;; offset=003AH
+G_M10589_IG07: ;; offset=0038H
+ 418BC1 mov eax, r9d
F7D0 not eax
- ;; bbWeight=0.50 PerfScore 0.13
-G_M10589_IG08: ;; offset=003CH
+ ;; bbWeight=0.50 PerfScore 0.25
+G_M10589_IG08: ;; offset=003DH
C3 ret
;; bbWeight=0.50 PerfScore 0.50
-G_M10589_IG09: ;; offset=003DH
+G_M10589_IG09: ;; offset=003EH
418BC0 mov eax, r8d
;; bbWeight=0.50 PerfScore 0.13
-G_M10589_IG10: ;; offset=0040H
+G_M10589_IG10: ;; offset=0041H
C3 ret
;; bbWeight=0.50 PerfScore 0.50
-; Total bytes of code 65, prolog size 5, PerfScore 48.50, instruction count 24, allocated bytes for code 65 (MethodHash=56dcd6a2) for method System.SpanHelpers:BinarySearch(byref,int,ubyte):int
+; Total bytes of code 66, prolog size 0, PerfScore 44.98, instruction count 25, allocated bytes for code 66 (MethodHash=56dcd6a2) for method System.SpanHelpers:BinarySearch(byref,int,ubyte):int
; ============================================================
We enregister the byte arg now and the inner loop looks almost equal. There is a reg-reg mov
instead of a memory load but I don't see why this would regress performance. I'll take a closer look tomorrow, this disasm is with R2R disabled, maybe the R2R code is different.
Oops, I am looking at Windows which indeed saw improvements. Will check the Ubuntu diff tomorrow.
The codegen changes for Unix look similar, at least with alt-jit:
diff --git "a/.\\old_disasm.txt" "b/.\\new_disasm.txt"
index 7dc58c6..c22ddca 100644
--- "a/.\\old_disasm.txt"
+++ "b/.\\new_disasm.txt"
@@ -1,77 +1,76 @@
; Assembly listing for method System.SpanHelpers:BinarySearch(byref,int,ubyte):int
; Emitting BLENDED_CODE for X64 CPU with AVX - Unix
; optimized code
; rbp based frame
; fully interruptible
; No PGO data
; 0 inlinees with PGO data; 2 single block inlinees; 0 inlinees without PGO data
; invoked as altjit
; Final local variable assignments
;
; V00 arg0 [V00,T05] ( 3, 6 ) byref -> rdi single-def
-; V01 arg1 [V01,T07] ( 3, 3 ) int -> rsi single-def
-; V02 arg2 [V02,T06] ( 3, 6 ) ubyte -> [rbp-04H] do-not-enreg[F] ld-addr-op single-def
-; V03 loc0 [V03,T04] ( 5, 11.50) int -> rax
+; V01 arg1 [V01,T06] ( 3, 3 ) int -> rsi single-def
+; V02 arg2 [V02,T07] ( 3, 3 ) ubyte -> rdx ld-addr-op single-def
+; V03 loc0 [V03,T04] ( 5, 11.50) int -> rcx
; V04 loc1 [V04,T02] ( 5, 12 ) int -> rsi
; V05 loc2 [V05,T01] ( 5, 12.50) int -> rdx
-; V06 loc3 [V06,T03] ( 3, 12 ) int -> r8
+; V06 loc3 [V06,T03] ( 3, 12 ) int -> r9
;# V07 OutArgs [V07 ] ( 1, 1 ) lclBlk ( 0) [rsp+00H] "OutgoingArgSpace"
-; V08 tmp1 [V08,T00] ( 2, 16 ) ubyte -> rcx "Inlining Arg"
+; V08 tmp1 [V08,T00] ( 2, 16 ) ubyte -> r8 "Inlining Arg"
+; V09 cse0 [V09,T08] ( 2, 5 ) int -> rax "CSE - aggressive"
;
-; Lcl frame size = 16
+; Lcl frame size = 0
G_M10589_IG01: ;; offset=0000H
55 push rbp
- 4883EC10 sub rsp, 16
- 488D6C2410 lea rbp, [rsp+10H]
- 8955FC mov dword ptr [rbp-04H], edx
- ;; bbWeight=1 PerfScore 2.75
-G_M10589_IG02: ;; offset=000DH
- 33C0 xor eax, eax
+ 488BEC mov rbp, rsp
+ ;; bbWeight=1 PerfScore 1.25
+G_M10589_IG02: ;; offset=0004H
+ 33C9 xor ecx, ecx
FFCE dec esi
85F6 test esi, esi
- 7C27 jl SHORT G_M10589_IG07
+ 7C29 jl SHORT G_M10589_IG07
+ 0FB6C2 movzx rax, dl
align [0 bytes]
- ;; bbWeight=1 PerfScore 1.75
-G_M10589_IG03: ;; offset=0015H
- 8D1406 lea edx, [rsi+rax]
+ ;; bbWeight=1 PerfScore 2.00
+G_M10589_IG03: ;; offset=000FH
+ 8D140E lea edx, [rsi+rcx]
D1EA shr edx, 1
- 4863CA movsxd rcx, edx
- 0FB60C0F movzx rcx, byte ptr [rdi+rcx]
- 440FB645FC movzx r8, byte ptr [rbp-04H]
- 442BC1 sub r8d, ecx
- 7419 je SHORT G_M10589_IG09
- 4585C0 test r8d, r8d
+ 4C63C2 movsxd r8, edx
+ 460FB60407 movzx r8, byte ptr [rdi+r8]
+ 448BC8 mov r9d, eax
+ 452BC8 sub r9d, r8d
+ 7417 je SHORT G_M10589_IG09
+ 4585C9 test r9d, r9d
7E05 jle SHORT G_M10589_IG05
- ;; bbWeight=4 PerfScore 27.00
-G_M10589_IG04: ;; offset=0030H
- 8D4201 lea eax, [rdx+1]
+ ;; bbWeight=4 PerfScore 24.00
+G_M10589_IG04: ;; offset=0029H
+ 8D4A01 lea ecx, [rdx+1]
EB03 jmp SHORT G_M10589_IG06
;; bbWeight=2 PerfScore 5.00
-G_M10589_IG05: ;; offset=0035H
+G_M10589_IG05: ;; offset=002EH
8D72FF lea esi, [rdx-1]
;; bbWeight=2 PerfScore 1.00
-G_M10589_IG06: ;; offset=0038H
- 3BC6 cmp eax, esi
- 7ED9 jle SHORT G_M10589_IG03
+G_M10589_IG06: ;; offset=0031H
+ 3BCE cmp ecx, esi
+ 7EDA jle SHORT G_M10589_IG03
;; bbWeight=4 PerfScore 5.00
-G_M10589_IG07: ;; offset=003CH
+G_M10589_IG07: ;; offset=0035H
+ 8BC1 mov eax, ecx
F7D0 not eax
- ;; bbWeight=0.50 PerfScore 0.13
-G_M10589_IG08: ;; offset=003EH
- 4883C410 add rsp, 16
+ ;; bbWeight=0.50 PerfScore 0.25
+G_M10589_IG08: ;; offset=0039H
5D pop rbp
C3 ret
- ;; bbWeight=0.50 PerfScore 0.88
-G_M10589_IG09: ;; offset=0044H
+ ;; bbWeight=0.50 PerfScore 0.75
+G_M10589_IG09: ;; offset=003BH
8BC2 mov eax, edx
;; bbWeight=0.50 PerfScore 0.13
-G_M10589_IG10: ;; offset=0046H
- 4883C410 add rsp, 16
+G_M10589_IG10: ;; offset=003DH
5D pop rbp
C3 ret
- ;; bbWeight=0.50 PerfScore 0.88
+ ;; bbWeight=0.50 PerfScore 0.75
-; Total bytes of code 76, prolog size 13, PerfScore 52.10, instruction count 31, allocated bytes for code 76 (MethodHash=56dcd6a2) for method System.SpanHelpers:BinarySearch(byref,int,ubyte):int
+; Total bytes of code 63, prolog size 4, PerfScore 46.43, instruction count 29, allocated bytes for code 63 (MethodHash=56dcd6a2) for method System.SpanHelpers:BinarySearch(byref,int,ubyte):int
; ============================================================
I don't seem to be able to reproduce this:
BenchmarkDotNet=v0.13.0.1559-nightly, OS=ubuntu 18.04
Intel Core i7-4770K CPU 3.50GHz (Haswell), 1 CPU, 8 logical and 4 physical cores
.NET SDK=6.0.100-rc.1.21378.1
[Host] : .NET 6.0.0 (6.0.21.37624), X64 RyuJIT
Job-UQAAIQ : .NET 6.0.0 (42.42.42.42424), X64 RyuJIT
Job-XZCLYE : .NET 6.0.0 (42.42.42.42424), X64 RyuJIT
PowerPlanMode=00000000-0000-0000-0000-000000000000 Arguments=/p:DebugType=portable IterationTime=250.0000 ms
MaxIterationCount=20 MinIterationCount=15 WarmupCount=1
Method | Job | Toolchain | Size | Mean | Error | StdDev | Median | Min | Max | Ratio | Gen 0 | Gen 1 | Gen 2 | Allocated | Code Size |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
BinarySearch | Job-UQAAIQ | /new/corerun | 512 | 10.34 ns | 0.068 ns | 0.063 ns | 10.31 ns | 10.27 ns | 10.45 ns | 0.86 | - | - | - | - | 163 B |
BinarySearch | Job-XZCLYE | /old/corerun | 512 | 12.06 ns | 0.083 ns | 0.073 ns | 12.07 ns | 11.94 ns | 12.16 ns | 1.00 | - | - | - | - | 176 B |
new
is 9d9b465f0be92a0d7584c4b8ad6b873be7f0dbc9 while old
is the same but with a03a5c8ec376e90a3721fb77504841a5d1b43e13 reverted.
I am invoking:
python3 ./scripts/benchmarks_ci.py -f net6.0 --corerun /home/jbn/dev/dotnet/issues/56402/old/corerun /home/jbn/dev/dotnet/issues/56402/new/corerun --filter 'System.Memory.Span<Byte>*BinarySearch'
@DrewScoggins does that look like the right way to reproduce the issue?
cc @kunalspathak
I also tried downloading the baseline and compare from the issue and I get the same improvement:
BenchmarkDotNet=v0.13.0.1559-nightly, OS=ubuntu 18.04
Intel Core i7-4770K CPU 3.50GHz (Haswell), 1 CPU, 8 logical and 4 physical cores
.NET SDK=6.0.100-rc.1.21378.1
[Host] : .NET 6.0.0 (6.0.21.37624), X64 RyuJIT
Job-HALSTK : .NET 6.0.0 (6.0.21.37213), X64 RyuJIT
Job-NWLTBS : .NET 6.0.0 (6.0.21.37303), X64 RyuJIT
PowerPlanMode=00000000-0000-0000-0000-000000000000 Arguments=/p:DebugType=portable IterationTime=250.0000 ms
MaxIterationCount=20 MinIterationCount=15 WarmupCount=1
Method | Job | Toolchain | Size | Mean | Error | StdDev | Median | Min | Max | Ratio | RatioSD | Code Size | Gen 0 | Gen 1 | Gen 2 | Allocated |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
BinarySearch | Job-HALSTK | /baseline/corerun | 512 | 11.454 ns | 0.2494 ns | 0.2333 ns | 11.478 ns | 11.187 ns | 11.91 ns | 1.00 | 0.00 | 119 B | - | - | - | - |
BinarySearch | Job-NWLTBS | /compare/corerun | 512 | 9.946 ns | 0.1749 ns | 0.1636 ns | 9.973 ns | 9.600 ns | 10.13 ns | 0.87 | 0.02 | 106 B | - | - | - | - |
@jakobbotsch I usually invoke benchmarks like this:
cd src\benchmarks\micro
dotnet run -c Release -f net6.0 -- --coreRun /path/to/release/corerun.exe /path/to/release/corerun.exe --filter "*benchName*"
but I guess it's the same as you do.
I also pass --iterationTime 500
sometimes.
Make sure that you do not have any COMPlus_
variables like Tier=0
or ReadyToRun=0
set. I also use -d
switch to print the disassembly by the harness.
I did little investigation and it seems to be jcc erratum might be affecting the performance. If you see all test history of windows, it is unstable, so it could be that the improvement that we saw in https://github.com/dotnet/perf-autofiling-issues/issues/28 is just another low of the instability on windows after your change.
Here is the diff on windows/x64:
On Linux, the benchmark seems pretty consistent with less variation if you see the all test history
Here is the diff on linux/x64:
If you notice, that both linux and windows has instruction sub
that crosses the alignment boundary leading to JCC erratum and it could be that it is more evident in Linux than on Windows.
Thanks for the additional look @kunalspathak.
Make sure that you do not have any COMPlus_ variables like Tier=0 or ReadyToRun=0 set. I also use -d switch to print the disassembly by the harness.
I did check the disassembly too, but I didn't know about the jcc erratum. That erratum does not affect my CPU which is quite old (Haswell), so that probably explains why I see improvements.
Have we thought about adding a mitigation for this erratum that can be turned on if the user wants to? E.g. for these automatic reports that could be quite useful.
Have we thought about adding a mitigation for this erratum that can be turned on if the user wants to? E.g. for these automatic reports that could be quite useful.
Yes, we want to add it and there was discussion frequently with @BruceForstall . I will see if I can add something behind a switch so we can try and validate it for benchmarks.
I tried turning memoryRandomization on to see if I could reproduce the instability. I did --iterationTime 5000 but I got pretty much identical results.
Span<Byte>.BinarySearch: Job-ILKRNU(PowerPlanMode=00000000-0000-0000-0000-000000000000, Arguments=/p:DebugType=portable, Toolchain=CoreRun, IterationTime=5.0000 s, MaxIterationCount=20, MinIterationCount=15, WarmupCount=1) [Size=512]
Runtime = .NET 6.0.0 (42.42.42.42424), X64 RyuJIT; GC = Concurrent Workstation
Mean = 10.755 ns, StdErr = 0.016 ns (0.14%), N = 14, StdDev = 0.058 ns
Min = 10.685 ns, Q1 = 10.708 ns, Median = 10.745 ns, Q3 = 10.782 ns, Max = 10.882 ns
IQR = 0.074 ns, LowerFence = 10.597 ns, UpperFence = 10.892 ns
ConfidenceInterval = [10.689 ns; 10.821 ns] (CI 99.9%), Margin = 0.066 ns (0.61% of Mean)
Skewness = 0.64, Kurtosis = 2.27, MValue = 2
-------------------- Histogram --------------------
[10.654 ns ; 10.913 ns) | @@@@@@@@@@@@@@
---------------------------------------------------
// * Summary *
BenchmarkDotNet=v0.13.0.1559-nightly, OS=ubuntu 18.04
Intel Core i7-4770K CPU 3.50GHz (Haswell), 1 CPU, 8 logical and 4 physical cores
.NET SDK=6.0.100-preview.5.21302.13
[Host] : .NET 6.0.0 (6.0.21.30105), X64 RyuJIT
Job-ILKRNU : .NET 6.0.0 (42.42.42.42424), X64 RyuJIT
PowerPlanMode=00000000-0000-0000-0000-000000000000 Arguments=/p:DebugType=portable Toolchain=CoreRun
IterationTime=5.0000 s MaxIterationCount=20 MinIterationCount=15
WarmupCount=1
Method | Size | Mean | Error | StdDev | Median | Min | Max | Code Size | Gen 0 | Gen 1 | Gen 2 | Allocated |
---|---|---|---|---|---|---|---|---|---|---|---|---|
BinarySearch | 512 | 10.75 ns | 0.066 ns | 0.058 ns | 10.74 ns | 10.69 ns | 10.88 ns | 163 B | - | - | - | - |
// * Detailed results *
Span<Byte>.BinarySearch: Job-KFWQEJ(PowerPlanMode=00000000-0000-0000-0000-000000000000, Arguments=/p:DebugType=portable, Toolchain=CoreRun, IterationTime=5.0000 s, MaxIterationCount=20, MemoryRandomization=True, MinIterationCount=15, WarmupCount=1) [Size=512]
Runtime = .NET 6.0.0 (42.42.42.42424), X64 RyuJIT; GC = Concurrent Workstation
Mean = 10.694 ns, StdErr = 0.011 ns (0.10%), N = 15, StdDev = 0.041 ns
Min = 10.655 ns, Q1 = 10.665 ns, Median = 10.677 ns, Q3 = 10.716 ns, Max = 10.790 ns
IQR = 0.051 ns, LowerFence = 10.589 ns, UpperFence = 10.792 ns
ConfidenceInterval = [10.651 ns; 10.738 ns] (CI 99.9%), Margin = 0.044 ns (0.41% of Mean)
Skewness = 0.96, Kurtosis = 2.63, MValue = 2
-------------------- Histogram --------------------
[10.633 ns ; 10.811 ns) | @@@@@@@@@@@@@@@
---------------------------------------------------
// * Summary *
BenchmarkDotNet=v0.13.0.1559-nightly, OS=ubuntu 18.04
Intel Core i7-4770K CPU 3.50GHz (Haswell), 1 CPU, 8 logical and 4 physical cores
.NET SDK=6.0.100-preview.5.21302.13
[Host] : .NET 6.0.0 (6.0.21.30105), X64 RyuJIT
Job-KFWQEJ : .NET 6.0.0 (42.42.42.42424), X64 RyuJIT
PowerPlanMode=00000000-0000-0000-0000-000000000000 Arguments=/p:DebugType=portable Toolchain=CoreRun
IterationTime=5.0000 s MaxIterationCount=20 MemoryRandomization=True
MinIterationCount=15 WarmupCount=1
Method | Size | Mean | Error | StdDev | Median | Min | Max | Code Size | Gen 0 | Gen 1 | Gen 2 | Allocated |
---|---|---|---|---|---|---|---|---|---|---|---|---|
BinarySearch | 512 | 10.69 ns | 0.044 ns | 0.041 ns | 10.68 ns | 10.65 ns | 10.79 ns | 163 B | - | - | - | - |
FYI @adamsitnik - This benchmark does allocate and access array, so I am surprised that memory randomization doesn't show anything. Is there something that we are missing to randomize?
We are also seeing improvements from this change on Arm64. https://github.com/dotnet/perf-autofiling-issues/issues/76
This benchmark does allocate and access array, so I am surprised that memory randomization doesn't show anything. Is there something that we are missing to randomize?
Most likely we need more iterations (the default is 20). Sth like:
--minIterationCount 100 --maxIterationCount 101
With that I get:
Span<Byte>.BinarySearch: Job-HSGRQU(PowerPlanMode=00000000-0000-0000-0000-000000000000, Arguments=/p:DebugType=portable, Toolchain=CoreRun, IterationTime=250.0000 ms, MaxIterationCount=101, MinIterationCount=100, WarmupCount=1) [Size=512]
Runtime = .NET 6.0.0 (42.42.42.42424), X64 RyuJIT; GC = Concurrent Workstation
Mean = 10.212 ns, StdErr = 0.007 ns (0.07%), N = 84, StdDev = 0.065 ns
Min = 10.113 ns, Q1 = 10.158 ns, Median = 10.209 ns, Q3 = 10.248 ns, Max = 10.405 ns
IQR = 0.090 ns, LowerFence = 10.023 ns, UpperFence = 10.382 ns
ConfidenceInterval = [10.188 ns; 10.236 ns] (CI 99.9%), Margin = 0.024 ns (0.24% of Mean)
Skewness = 0.73, Kurtosis = 3.19, MValue = 2
-------------------- Histogram --------------------
[10.093 ns ; 10.424 ns) | @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
---------------------------------------------------
// * Summary *
BenchmarkDotNet=v0.13.0.1559-nightly, OS=ubuntu 18.04
Intel Core i7-4770K CPU 3.50GHz (Haswell), 1 CPU, 8 logical and 4 physical cores
.NET SDK=6.0.100-preview.5.21302.13
[Host] : .NET 6.0.0 (6.0.21.30105), X64 RyuJIT
Job-HSGRQU : .NET 6.0.0 (42.42.42.42424), X64 RyuJIT
PowerPlanMode=00000000-0000-0000-0000-000000000000 Arguments=/p:DebugType=portable Toolchain=CoreRun
IterationTime=250.0000 ms MaxIterationCount=101 MinIterationCount=100
WarmupCount=1
Method | Size | Mean | Error | StdDev | Median | Min | Max | Gen 0 | Gen 1 | Gen 2 | Allocated | Code Size |
---|---|---|---|---|---|---|---|---|---|---|---|---|
BinarySearch | 512 | 10.21 ns | 0.024 ns | 0.065 ns | 10.21 ns | 10.11 ns | 10.40 ns | - | - | - | - | 163 B |
// * Detailed results *
Span<Byte>.BinarySearch: Job-JQMXIT(PowerPlanMode=00000000-0000-0000-0000-000000000000, Arguments=/p:DebugType=portable, Toolchain=CoreRun, IterationTime=250.0000 ms, MaxIterationCount=101, MemoryRandomization=True, MinIterationCount=100, WarmupCount=1) [Size=512]
Runtime = .NET 6.0.0 (42.42.42.42424), X64 RyuJIT; GC = Concurrent Workstation
Mean = 10.314 ns, StdErr = 0.028 ns (0.27%), N = 100, StdDev = 0.276 ns
Min = 10.118 ns, Q1 = 10.213 ns, Median = 10.269 ns, Q3 = 10.330 ns, Max = 12.166 ns
IQR = 0.117 ns, LowerFence = 10.037 ns, UpperFence = 10.506 ns
ConfidenceInterval = [10.220 ns; 10.407 ns] (CI 99.9%), Margin = 0.094 ns (0.91% of Mean)
Skewness = 5.69, Kurtosis = 37.64, MValue = 2
-------------------- Histogram --------------------
[10.040 ns ; 10.333 ns) | @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
[10.333 ns ; 10.692 ns) | @@@@@@@@@@@@@@@@@@@@@
[10.692 ns ; 11.004 ns) |
[11.004 ns ; 11.317 ns) |
[11.317 ns ; 11.629 ns) |
[11.629 ns ; 12.244 ns) | @@
---------------------------------------------------
// * Summary *
BenchmarkDotNet=v0.13.0.1559-nightly, OS=ubuntu 18.04
Intel Core i7-4770K CPU 3.50GHz (Haswell), 1 CPU, 8 logical and 4 physical cores
.NET SDK=6.0.100-preview.5.21302.13
[Host] : .NET 6.0.0 (6.0.21.30105), X64 RyuJIT
Job-JQMXIT : .NET 6.0.0 (42.42.42.42424), X64 RyuJIT
PowerPlanMode=00000000-0000-0000-0000-000000000000 Arguments=/p:DebugType=portable Toolchain=CoreRun
IterationTime=250.0000 ms MaxIterationCount=101 MemoryRandomization=True
MinIterationCount=100 WarmupCount=1
Method | Size | Mean | Error | StdDev | Median | Min | Max | Gen 0 | Gen 1 | Gen 2 | Allocated | Code Size |
---|---|---|---|---|---|---|---|---|---|---|---|---|
BinarySearch | 512 | 10.31 ns | 0.094 ns | 0.276 ns | 10.27 ns | 10.12 ns | 12.17 ns | - | - | - | - | 163 B |
It definitely looks like there is some effect. I ran some more times and every time the histograms resembled the above: when memory randomization is on there are multiple buckets, while with memory randomization off there is just one bucket.
It definitely looks like there is some effect. I ran some more times and every time the histograms resembled the above: when memory randomization is on there are multiple buckets, while with memory randomization off there is just one bucket.
So it seems to be a memory alignment issue. Since it's an improvement on Windows x64 an Linux ARM64 and it's hard to repro on Linux (in the results provided above by @jakobbotsch it's 1 out of 100 iterations), I think that we can just close this issue.
Make sure that you do not have any COMPlus_ variables like Tier=0 or ReadyToRun=0 set.
FWIW this is what the benchmarks_ci.py
script does OOTB
Run Information
Regressions in System.Memory.Span<Byte>
Historical Data in Reporting System
Repro