dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
15.42k stars 4.76k forks source link

[Perf] Regressions in System.Memory.Span<Byte>.BinarySearch #56402

Closed performanceautofiler[bot] closed 3 years ago

performanceautofiler[bot] commented 3 years ago

Run Information

Architecture x64
OS ubuntu 18.04
Baseline d148c34bdfbd49e3d820eaa4dcbc47832d52d0d2
Compare bd3fb963e701743eb6847b898eda8571ead97155
Diff Diff

Regressions in System.Memory.Span<Byte>

Benchmark Baseline Test Test/Base Test Quality Edge Detector Baseline IR Compare IR IR Ratio Baseline ETL Compare ETL
[BinarySearch - Duration of single invocation](<https://pvscmdupload.blob.core.windows.net/reports/allTestHistory/refs/heads/main_x64_ubuntu 18.04/System.Memory.Span(Byte).BinarySearch(Size%3a%20512).html>) 9.28 ns 11.80 ns 1.27 0.03 True

graph Historical Data in Reporting System

Repro

git clone https://github.com/dotnet/performance.git
python3 .\performance\scripts\benchmarks_ci.py -f netcoreapp5.0 --filter 'System.Memory.Span&lt;Byte&gt;*'
### Payloads [Baseline]() [Compare]() ### Histogram #### System.Memory.Span<Byte>.BinarySearch(Size: 512) ```log ``` ### Docs [Profiling workflow for dotnet/runtime repository](https://github.com/dotnet/performance/blob/master/docs/profiling-workflow-dotnet-runtime.md) [Benchmarking workflow for dotnet/runtime repository](https://github.com/dotnet/performance/blob/master/docs/benchmarking-workflow-dotnet-runtime.md)
ghost commented 3 years ago

Tagging subscribers to this area: @GrabYourPitchForks, @dotnet/area-system-memory See info in area-owners.md if you want to be subscribed.

Issue Details
### Run Information Architecture | x64 -- | -- OS | ubuntu 18.04 Baseline | [d148c34bdfbd49e3d820eaa4dcbc47832d52d0d2](https://github.com/dotnet/runtime/commit/d148c34bdfbd49e3d820eaa4dcbc47832d52d0d2) Compare | [bd3fb963e701743eb6847b898eda8571ead97155](https://github.com/dotnet/runtime/commit/bd3fb963e701743eb6847b898eda8571ead97155) Diff | [Diff](https://github.com/dotnet/runtime/compare/d148c34bdfbd49e3d820eaa4dcbc47832d52d0d2...bd3fb963e701743eb6847b898eda8571ead97155) ### Regressions in System.Memory.Span<Byte> Benchmark | Baseline | Test | Test/Base | Test Quality | Edge Detector | Baseline IR | Compare IR | IR Ratio | Baseline ETL | Compare ETL -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- [BinarySearch - Duration of single invocation]() | 9.28 ns | 11.80 ns | 1.27 | 0.03 | True | | | ![graph]() [Historical Data in Reporting System]() ### Repro ```cmd git clone https://github.com/dotnet/performance.git python3 .\performance\scripts\benchmarks_ci.py -f netcoreapp5.0 --filter 'System.Memory.Span<Byte>*' ```
### Payloads [Baseline]() [Compare]() ### Histogram #### System.Memory.Span<Byte>.BinarySearch(Size: 512) ```log ``` ### Docs [Profiling workflow for dotnet/runtime repository](https://github.com/dotnet/performance/blob/master/docs/profiling-workflow-dotnet-runtime.md) [Benchmarking workflow for dotnet/runtime repository](https://github.com/dotnet/performance/blob/master/docs/benchmarking-workflow-dotnet-runtime.md)
Author: performanceautofiler[bot]
Assignees: -
Labels: `area-System.Memory`, `untriaged`
Milestone: -
jakobbotsch commented 3 years ago

The ASM diff is:

diff --git "a/.\\old_disasm.txt" "b/.\\new_disasm.txt"
index c28988d..1bf670f 100644
--- "a/.\\old_disasm.txt"
+++ "b/.\\new_disasm.txt"
@@ -8,62 +8,64 @@
 ; Final local variable assignments
 ;
 ;  V00 arg0         [V00,T05] (  3,  6   )   byref  ->  rcx         single-def
-;  V01 arg1         [V01,T07] (  3,  3   )     int  ->  rdx         single-def
-;  V02 arg2         [V02,T06] (  3,  6   )   ubyte  ->  [rsp+18H]   do-not-enreg[F] ld-addr-op single-def
-;  V03 loc0         [V03,T04] (  5, 11.50)     int  ->  rax        
+;  V01 arg1         [V01,T06] (  3,  3   )     int  ->  rdx         single-def
+;  V02 arg2         [V02,T07] (  3,  3   )   ubyte  ->   r8         ld-addr-op single-def
+;  V03 loc0         [V03,T04] (  5, 11.50)     int  ->   r9        
 ;  V04 loc1         [V04,T02] (  5, 12   )     int  ->  rdx        
 ;  V05 loc2         [V05,T01] (  5, 12.50)     int  ->   r8        
-;  V06 loc3         [V06,T03] (  3, 12   )     int  ->  r10        
+;  V06 loc3         [V06,T03] (  3, 12   )     int  ->  r11        
 ;# V07 OutArgs      [V07    ] (  1,  1   )  lclBlk ( 0) [rsp+00H]   "OutgoingArgSpace"
-;  V08 tmp1         [V08,T00] (  2, 16   )   ubyte  ->   r9         "Inlining Arg"
+;  V08 tmp1         [V08,T00] (  2, 16   )   ubyte  ->  r10         "Inlining Arg"
+;  V09 cse0         [V09,T08] (  2,  5   )     int  ->  rax         "CSE - aggressive"
 ;
 ; Lcl frame size = 0

 G_M10589_IG01:              ;; offset=0000H
-       4489442418           mov      dword ptr [rsp+18H], r8d
-                       ;; bbWeight=1    PerfScore 1.00
-G_M10589_IG02:              ;; offset=0005H
-       33C0                 xor      eax, eax
+                       ;; bbWeight=1    PerfScore 0.00
+G_M10589_IG02:              ;; offset=0000H
+       4533C9               xor      r9d, r9d
        FFCA                 dec      edx
        85D2                 test     edx, edx
-       7C2D                 jl       SHORT G_M10589_IG07
+       7C2F                 jl       SHORT G_M10589_IG07
+       410FB6C0             movzx    rax, r8b
                             align    [0 bytes]
-                       ;; bbWeight=1    PerfScore 1.75
+                       ;; bbWeight=1    PerfScore 2.00
 G_M10589_IG03:              ;; offset=000DH
-       448D0402             lea      r8d, [rdx+rax]
+       468D040A             lea      r8d, [rdx+r9]
        41D1E8               shr      r8d, 1
-       4D63C8               movsxd   r9, r8d
-       460FB60C09           movzx    r9, byte  ptr [rcx+r9]
-       440FB6542418         movzx    r10, byte  ptr [rsp+18H]
-       452BD1               sub      r10d, r9d
-       7416                 je       SHORT G_M10589_IG09
-       4585D2               test     r10d, r10d
+       4D63D0               movsxd   r10, r8d
+       460FB61411           movzx    r10, byte  ptr [rcx+r10]
+       448BD8               mov      r11d, eax
+       452BDA               sub      r11d, r10d
+       741A                 je       SHORT G_M10589_IG09
+       4585DB               test     r11d, r11d
        7E06                 jle      SHORT G_M10589_IG05
-                       ;; bbWeight=4    PerfScore 27.00
-G_M10589_IG04:              ;; offset=002CH
-       418D4001             lea      eax, [r8+1]
+                       ;; bbWeight=4    PerfScore 24.00
+G_M10589_IG04:              ;; offset=0029H
+       458D4801             lea      r9d, [r8+1]
        EB04                 jmp      SHORT G_M10589_IG06
                        ;; bbWeight=2    PerfScore 5.00
-G_M10589_IG05:              ;; offset=0032H
+G_M10589_IG05:              ;; offset=002FH
        418D50FF             lea      edx, [r8-1]
                        ;; bbWeight=2    PerfScore 1.00
-G_M10589_IG06:              ;; offset=0036H
-       3BC2                 cmp      eax, edx
-       7ED3                 jle      SHORT G_M10589_IG03
+G_M10589_IG06:              ;; offset=0033H
+       443BCA               cmp      r9d, edx
+       7ED5                 jle      SHORT G_M10589_IG03
                        ;; bbWeight=4    PerfScore 5.00
-G_M10589_IG07:              ;; offset=003AH
+G_M10589_IG07:              ;; offset=0038H
+       418BC1               mov      eax, r9d
        F7D0                 not      eax
-                       ;; bbWeight=0.50 PerfScore 0.13
-G_M10589_IG08:              ;; offset=003CH
+                       ;; bbWeight=0.50 PerfScore 0.25
+G_M10589_IG08:              ;; offset=003DH
        C3                   ret      
                        ;; bbWeight=0.50 PerfScore 0.50
-G_M10589_IG09:              ;; offset=003DH
+G_M10589_IG09:              ;; offset=003EH
        418BC0               mov      eax, r8d
                        ;; bbWeight=0.50 PerfScore 0.13
-G_M10589_IG10:              ;; offset=0040H
+G_M10589_IG10:              ;; offset=0041H
        C3                   ret      
                        ;; bbWeight=0.50 PerfScore 0.50

-; Total bytes of code 65, prolog size 5, PerfScore 48.50, instruction count 24, allocated bytes for code 65 (MethodHash=56dcd6a2) for method System.SpanHelpers:BinarySearch(byref,int,ubyte):int
+; Total bytes of code 66, prolog size 0, PerfScore 44.98, instruction count 25, allocated bytes for code 66 (MethodHash=56dcd6a2) for method System.SpanHelpers:BinarySearch(byref,int,ubyte):int
 ; ============================================================

We enregister the byte arg now and the inner loop looks almost equal. There is a reg-reg mov instead of a memory load but I don't see why this would regress performance. I'll take a closer look tomorrow, this disasm is with R2R disabled, maybe the R2R code is different.

jakobbotsch commented 3 years ago

Oops, I am looking at Windows which indeed saw improvements. Will check the Ubuntu diff tomorrow.

jakobbotsch commented 3 years ago

The codegen changes for Unix look similar, at least with alt-jit:

diff --git "a/.\\old_disasm.txt" "b/.\\new_disasm.txt"
index 7dc58c6..c22ddca 100644
--- "a/.\\old_disasm.txt"
+++ "b/.\\new_disasm.txt"
@@ -1,77 +1,76 @@
 ; Assembly listing for method System.SpanHelpers:BinarySearch(byref,int,ubyte):int
 ; Emitting BLENDED_CODE for X64 CPU with AVX - Unix
 ; optimized code
 ; rbp based frame
 ; fully interruptible
 ; No PGO data
 ; 0 inlinees with PGO data; 2 single block inlinees; 0 inlinees without PGO data
 ; invoked as altjit
 ; Final local variable assignments
 ;
 ;  V00 arg0         [V00,T05] (  3,  6   )   byref  ->  rdi         single-def
-;  V01 arg1         [V01,T07] (  3,  3   )     int  ->  rsi         single-def
-;  V02 arg2         [V02,T06] (  3,  6   )   ubyte  ->  [rbp-04H]   do-not-enreg[F] ld-addr-op single-def
-;  V03 loc0         [V03,T04] (  5, 11.50)     int  ->  rax        
+;  V01 arg1         [V01,T06] (  3,  3   )     int  ->  rsi         single-def
+;  V02 arg2         [V02,T07] (  3,  3   )   ubyte  ->  rdx         ld-addr-op single-def
+;  V03 loc0         [V03,T04] (  5, 11.50)     int  ->  rcx        
 ;  V04 loc1         [V04,T02] (  5, 12   )     int  ->  rsi        
 ;  V05 loc2         [V05,T01] (  5, 12.50)     int  ->  rdx        
-;  V06 loc3         [V06,T03] (  3, 12   )     int  ->   r8        
+;  V06 loc3         [V06,T03] (  3, 12   )     int  ->   r9        
 ;# V07 OutArgs      [V07    ] (  1,  1   )  lclBlk ( 0) [rsp+00H]   "OutgoingArgSpace"
-;  V08 tmp1         [V08,T00] (  2, 16   )   ubyte  ->  rcx         "Inlining Arg"
+;  V08 tmp1         [V08,T00] (  2, 16   )   ubyte  ->   r8         "Inlining Arg"
+;  V09 cse0         [V09,T08] (  2,  5   )     int  ->  rax         "CSE - aggressive"
 ;
-; Lcl frame size = 16
+; Lcl frame size = 0

 G_M10589_IG01:              ;; offset=0000H
        55                   push     rbp
-       4883EC10             sub      rsp, 16
-       488D6C2410           lea      rbp, [rsp+10H]
-       8955FC               mov      dword ptr [rbp-04H], edx
-                       ;; bbWeight=1    PerfScore 2.75
-G_M10589_IG02:              ;; offset=000DH
-       33C0                 xor      eax, eax
+       488BEC               mov      rbp, rsp
+                       ;; bbWeight=1    PerfScore 1.25
+G_M10589_IG02:              ;; offset=0004H
+       33C9                 xor      ecx, ecx
        FFCE                 dec      esi
        85F6                 test     esi, esi
-       7C27                 jl       SHORT G_M10589_IG07
+       7C29                 jl       SHORT G_M10589_IG07
+       0FB6C2               movzx    rax, dl
                             align    [0 bytes]
-                       ;; bbWeight=1    PerfScore 1.75
-G_M10589_IG03:              ;; offset=0015H
-       8D1406               lea      edx, [rsi+rax]
+                       ;; bbWeight=1    PerfScore 2.00
+G_M10589_IG03:              ;; offset=000FH
+       8D140E               lea      edx, [rsi+rcx]
        D1EA                 shr      edx, 1
-       4863CA               movsxd   rcx, edx
-       0FB60C0F             movzx    rcx, byte  ptr [rdi+rcx]
-       440FB645FC           movzx    r8, byte  ptr [rbp-04H]
-       442BC1               sub      r8d, ecx
-       7419                 je       SHORT G_M10589_IG09
-       4585C0               test     r8d, r8d
+       4C63C2               movsxd   r8, edx
+       460FB60407           movzx    r8, byte  ptr [rdi+r8]
+       448BC8               mov      r9d, eax
+       452BC8               sub      r9d, r8d
+       7417                 je       SHORT G_M10589_IG09
+       4585C9               test     r9d, r9d
        7E05                 jle      SHORT G_M10589_IG05
-                       ;; bbWeight=4    PerfScore 27.00
-G_M10589_IG04:              ;; offset=0030H
-       8D4201               lea      eax, [rdx+1]
+                       ;; bbWeight=4    PerfScore 24.00
+G_M10589_IG04:              ;; offset=0029H
+       8D4A01               lea      ecx, [rdx+1]
        EB03                 jmp      SHORT G_M10589_IG06
                        ;; bbWeight=2    PerfScore 5.00
-G_M10589_IG05:              ;; offset=0035H
+G_M10589_IG05:              ;; offset=002EH
        8D72FF               lea      esi, [rdx-1]
                        ;; bbWeight=2    PerfScore 1.00
-G_M10589_IG06:              ;; offset=0038H
-       3BC6                 cmp      eax, esi
-       7ED9                 jle      SHORT G_M10589_IG03
+G_M10589_IG06:              ;; offset=0031H
+       3BCE                 cmp      ecx, esi
+       7EDA                 jle      SHORT G_M10589_IG03
                        ;; bbWeight=4    PerfScore 5.00
-G_M10589_IG07:              ;; offset=003CH
+G_M10589_IG07:              ;; offset=0035H
+       8BC1                 mov      eax, ecx
        F7D0                 not      eax
-                       ;; bbWeight=0.50 PerfScore 0.13
-G_M10589_IG08:              ;; offset=003EH
-       4883C410             add      rsp, 16
+                       ;; bbWeight=0.50 PerfScore 0.25
+G_M10589_IG08:              ;; offset=0039H
        5D                   pop      rbp
        C3                   ret      
-                       ;; bbWeight=0.50 PerfScore 0.88
-G_M10589_IG09:              ;; offset=0044H
+                       ;; bbWeight=0.50 PerfScore 0.75
+G_M10589_IG09:              ;; offset=003BH
        8BC2                 mov      eax, edx
                        ;; bbWeight=0.50 PerfScore 0.13
-G_M10589_IG10:              ;; offset=0046H
-       4883C410             add      rsp, 16
+G_M10589_IG10:              ;; offset=003DH
        5D                   pop      rbp
        C3                   ret      
-                       ;; bbWeight=0.50 PerfScore 0.88
+                       ;; bbWeight=0.50 PerfScore 0.75

-; Total bytes of code 76, prolog size 13, PerfScore 52.10, instruction count 31, allocated bytes for code 76 (MethodHash=56dcd6a2) for method System.SpanHelpers:BinarySearch(byref,int,ubyte):int
+; Total bytes of code 63, prolog size 4, PerfScore 46.43, instruction count 29, allocated bytes for code 63 (MethodHash=56dcd6a2) for method System.SpanHelpers:BinarySearch(byref,int,ubyte):int
 ; ============================================================
jakobbotsch commented 3 years ago

I don't seem to be able to reproduce this:


BenchmarkDotNet=v0.13.0.1559-nightly, OS=ubuntu 18.04
Intel Core i7-4770K CPU 3.50GHz (Haswell), 1 CPU, 8 logical and 4 physical cores
.NET SDK=6.0.100-rc.1.21378.1
  [Host]     : .NET 6.0.0 (6.0.21.37624), X64 RyuJIT
  Job-UQAAIQ : .NET 6.0.0 (42.42.42.42424), X64 RyuJIT
  Job-XZCLYE : .NET 6.0.0 (42.42.42.42424), X64 RyuJIT

PowerPlanMode=00000000-0000-0000-0000-000000000000  Arguments=/p:DebugType=portable  IterationTime=250.0000 ms  
MaxIterationCount=20  MinIterationCount=15  WarmupCount=1  
Method Job Toolchain Size Mean Error StdDev Median Min Max Ratio Gen 0 Gen 1 Gen 2 Allocated Code Size
BinarySearch Job-UQAAIQ /new/corerun 512 10.34 ns 0.068 ns 0.063 ns 10.31 ns 10.27 ns 10.45 ns 0.86 - - - - 163 B
BinarySearch Job-XZCLYE /old/corerun 512 12.06 ns 0.083 ns 0.073 ns 12.07 ns 11.94 ns 12.16 ns 1.00 - - - - 176 B

new is 9d9b465f0be92a0d7584c4b8ad6b873be7f0dbc9 while old is the same but with a03a5c8ec376e90a3721fb77504841a5d1b43e13 reverted.

I am invoking:

python3 ./scripts/benchmarks_ci.py -f net6.0 --corerun /home/jbn/dev/dotnet/issues/56402/old/corerun /home/jbn/dev/dotnet/issues/56402/new/corerun --filter 'System.Memory.Span<Byte>*BinarySearch'

@DrewScoggins does that look like the right way to reproduce the issue?

cc @kunalspathak

jakobbotsch commented 3 years ago

I also tried downloading the baseline and compare from the issue and I get the same improvement:


BenchmarkDotNet=v0.13.0.1559-nightly, OS=ubuntu 18.04
Intel Core i7-4770K CPU 3.50GHz (Haswell), 1 CPU, 8 logical and 4 physical cores
.NET SDK=6.0.100-rc.1.21378.1
  [Host]     : .NET 6.0.0 (6.0.21.37624), X64 RyuJIT
  Job-HALSTK : .NET 6.0.0 (6.0.21.37213), X64 RyuJIT
  Job-NWLTBS : .NET 6.0.0 (6.0.21.37303), X64 RyuJIT

PowerPlanMode=00000000-0000-0000-0000-000000000000  Arguments=/p:DebugType=portable  IterationTime=250.0000 ms  
MaxIterationCount=20  MinIterationCount=15  WarmupCount=1  
Method Job Toolchain Size Mean Error StdDev Median Min Max Ratio RatioSD Code Size Gen 0 Gen 1 Gen 2 Allocated
BinarySearch Job-HALSTK /baseline/corerun 512 11.454 ns 0.2494 ns 0.2333 ns 11.478 ns 11.187 ns 11.91 ns 1.00 0.00 119 B - - - -
BinarySearch Job-NWLTBS /compare/corerun 512 9.946 ns 0.1749 ns 0.1636 ns 9.973 ns 9.600 ns 10.13 ns 0.87 0.02 106 B - - - -
EgorBo commented 3 years ago

@jakobbotsch I usually invoke benchmarks like this:

cd src\benchmarks\micro
dotnet run -c Release -f net6.0 -- --coreRun /path/to/release/corerun.exe /path/to/release/corerun.exe --filter "*benchName*"

but I guess it's the same as you do. I also pass --iterationTime 500 sometimes.

kunalspathak commented 3 years ago

Make sure that you do not have any COMPlus_ variables like Tier=0 or ReadyToRun=0 set. I also use -d switch to print the disassembly by the harness.

I did little investigation and it seems to be jcc erratum might be affecting the performance. If you see all test history of windows, it is unstable, so it could be that the improvement that we saw in https://github.com/dotnet/perf-autofiling-issues/issues/28 is just another low of the instability on windows after your change.

image

Here is the diff on windows/x64:

image

On Linux, the benchmark seems pretty consistent with less variation if you see the all test history

image

Here is the diff on linux/x64:

image

If you notice, that both linux and windows has instruction sub that crosses the alignment boundary leading to JCC erratum and it could be that it is more evident in Linux than on Windows.

jakobbotsch commented 3 years ago

Thanks for the additional look @kunalspathak.

Make sure that you do not have any COMPlus_ variables like Tier=0 or ReadyToRun=0 set. I also use -d switch to print the disassembly by the harness.

I did check the disassembly too, but I didn't know about the jcc erratum. That erratum does not affect my CPU which is quite old (Haswell), so that probably explains why I see improvements.

Have we thought about adding a mitigation for this erratum that can be turned on if the user wants to? E.g. for these automatic reports that could be quite useful.

kunalspathak commented 3 years ago

Have we thought about adding a mitigation for this erratum that can be turned on if the user wants to? E.g. for these automatic reports that could be quite useful.

Yes, we want to add it and there was discussion frequently with @BruceForstall . I will see if I can add something behind a switch so we can try and validate it for benchmarks.

jakobbotsch commented 3 years ago

I tried turning memoryRandomization on to see if I could reproduce the instability. I did --iterationTime 5000 but I got pretty much identical results.

Without memory randomization

Span<Byte>.BinarySearch: Job-ILKRNU(PowerPlanMode=00000000-0000-0000-0000-000000000000, Arguments=/p:DebugType=portable, Toolchain=CoreRun, IterationTime=5.0000 s, MaxIterationCount=20, MinIterationCount=15, WarmupCount=1) [Size=512]
Runtime = .NET 6.0.0 (42.42.42.42424), X64 RyuJIT; GC = Concurrent Workstation
Mean = 10.755 ns, StdErr = 0.016 ns (0.14%), N = 14, StdDev = 0.058 ns
Min = 10.685 ns, Q1 = 10.708 ns, Median = 10.745 ns, Q3 = 10.782 ns, Max = 10.882 ns
IQR = 0.074 ns, LowerFence = 10.597 ns, UpperFence = 10.892 ns
ConfidenceInterval = [10.689 ns; 10.821 ns] (CI 99.9%), Margin = 0.066 ns (0.61% of Mean)
Skewness = 0.64, Kurtosis = 2.27, MValue = 2
-------------------- Histogram --------------------
[10.654 ns ; 10.913 ns) | @@@@@@@@@@@@@@
---------------------------------------------------

// * Summary *

BenchmarkDotNet=v0.13.0.1559-nightly, OS=ubuntu 18.04
Intel Core i7-4770K CPU 3.50GHz (Haswell), 1 CPU, 8 logical and 4 physical cores
.NET SDK=6.0.100-preview.5.21302.13
  [Host]     : .NET 6.0.0 (6.0.21.30105), X64 RyuJIT
  Job-ILKRNU : .NET 6.0.0 (42.42.42.42424), X64 RyuJIT

PowerPlanMode=00000000-0000-0000-0000-000000000000  Arguments=/p:DebugType=portable  Toolchain=CoreRun
IterationTime=5.0000 s  MaxIterationCount=20  MinIterationCount=15
WarmupCount=1
Method Size Mean Error StdDev Median Min Max Code Size Gen 0 Gen 1 Gen 2 Allocated
BinarySearch 512 10.75 ns 0.066 ns 0.058 ns 10.74 ns 10.69 ns 10.88 ns 163 B - - - -

With memory randomization

// * Detailed results *
Span<Byte>.BinarySearch: Job-KFWQEJ(PowerPlanMode=00000000-0000-0000-0000-000000000000, Arguments=/p:DebugType=portable, Toolchain=CoreRun, IterationTime=5.0000 s, MaxIterationCount=20, MemoryRandomization=True, MinIterationCount=15, WarmupCount=1) [Size=512]
Runtime = .NET 6.0.0 (42.42.42.42424), X64 RyuJIT; GC = Concurrent Workstation
Mean = 10.694 ns, StdErr = 0.011 ns (0.10%), N = 15, StdDev = 0.041 ns
Min = 10.655 ns, Q1 = 10.665 ns, Median = 10.677 ns, Q3 = 10.716 ns, Max = 10.790 ns
IQR = 0.051 ns, LowerFence = 10.589 ns, UpperFence = 10.792 ns
ConfidenceInterval = [10.651 ns; 10.738 ns] (CI 99.9%), Margin = 0.044 ns (0.41% of Mean)
Skewness = 0.96, Kurtosis = 2.63, MValue = 2
-------------------- Histogram --------------------
[10.633 ns ; 10.811 ns) | @@@@@@@@@@@@@@@
---------------------------------------------------

// * Summary *

BenchmarkDotNet=v0.13.0.1559-nightly, OS=ubuntu 18.04
Intel Core i7-4770K CPU 3.50GHz (Haswell), 1 CPU, 8 logical and 4 physical cores
.NET SDK=6.0.100-preview.5.21302.13
  [Host]     : .NET 6.0.0 (6.0.21.30105), X64 RyuJIT
  Job-KFWQEJ : .NET 6.0.0 (42.42.42.42424), X64 RyuJIT

PowerPlanMode=00000000-0000-0000-0000-000000000000  Arguments=/p:DebugType=portable  Toolchain=CoreRun
IterationTime=5.0000 s  MaxIterationCount=20  MemoryRandomization=True
MinIterationCount=15  WarmupCount=1
Method Size Mean Error StdDev Median Min Max Code Size Gen 0 Gen 1 Gen 2 Allocated
BinarySearch 512 10.69 ns 0.044 ns 0.041 ns 10.68 ns 10.65 ns 10.79 ns 163 B - - - -
kunalspathak commented 3 years ago

FYI @adamsitnik - This benchmark does allocate and access array, so I am surprised that memory randomization doesn't show anything. Is there something that we are missing to randomize?

DrewScoggins commented 3 years ago

We are also seeing improvements from this change on Arm64. https://github.com/dotnet/perf-autofiling-issues/issues/76

adamsitnik commented 3 years ago

This benchmark does allocate and access array, so I am surprised that memory randomization doesn't show anything. Is there something that we are missing to randomize?

Most likely we need more iterations (the default is 20). Sth like:

--minIterationCount 100 --maxIterationCount 101
jakobbotsch commented 3 years ago

With that I get:

Without memory randomization

Span<Byte>.BinarySearch: Job-HSGRQU(PowerPlanMode=00000000-0000-0000-0000-000000000000, Arguments=/p:DebugType=portable, Toolchain=CoreRun, IterationTime=250.0000 ms, MaxIterationCount=101, MinIterationCount=100, WarmupCount=1) [Size=512]
Runtime = .NET 6.0.0 (42.42.42.42424), X64 RyuJIT; GC = Concurrent Workstation
Mean = 10.212 ns, StdErr = 0.007 ns (0.07%), N = 84, StdDev = 0.065 ns
Min = 10.113 ns, Q1 = 10.158 ns, Median = 10.209 ns, Q3 = 10.248 ns, Max = 10.405 ns
IQR = 0.090 ns, LowerFence = 10.023 ns, UpperFence = 10.382 ns
ConfidenceInterval = [10.188 ns; 10.236 ns] (CI 99.9%), Margin = 0.024 ns (0.24% of Mean)
Skewness = 0.73, Kurtosis = 3.19, MValue = 2
-------------------- Histogram --------------------
[10.093 ns ; 10.424 ns) | @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
---------------------------------------------------

// * Summary *

BenchmarkDotNet=v0.13.0.1559-nightly, OS=ubuntu 18.04
Intel Core i7-4770K CPU 3.50GHz (Haswell), 1 CPU, 8 logical and 4 physical cores
.NET SDK=6.0.100-preview.5.21302.13
  [Host]     : .NET 6.0.0 (6.0.21.30105), X64 RyuJIT
  Job-HSGRQU : .NET 6.0.0 (42.42.42.42424), X64 RyuJIT

PowerPlanMode=00000000-0000-0000-0000-000000000000  Arguments=/p:DebugType=portable  Toolchain=CoreRun
IterationTime=250.0000 ms  MaxIterationCount=101  MinIterationCount=100
WarmupCount=1
Method Size Mean Error StdDev Median Min Max Gen 0 Gen 1 Gen 2 Allocated Code Size
BinarySearch 512 10.21 ns 0.024 ns 0.065 ns 10.21 ns 10.11 ns 10.40 ns - - - - 163 B

With memory randomization

// * Detailed results *
Span<Byte>.BinarySearch: Job-JQMXIT(PowerPlanMode=00000000-0000-0000-0000-000000000000, Arguments=/p:DebugType=portable, Toolchain=CoreRun, IterationTime=250.0000 ms, MaxIterationCount=101, MemoryRandomization=True, MinIterationCount=100, WarmupCount=1) [Size=512]
Runtime = .NET 6.0.0 (42.42.42.42424), X64 RyuJIT; GC = Concurrent Workstation
Mean = 10.314 ns, StdErr = 0.028 ns (0.27%), N = 100, StdDev = 0.276 ns
Min = 10.118 ns, Q1 = 10.213 ns, Median = 10.269 ns, Q3 = 10.330 ns, Max = 12.166 ns
IQR = 0.117 ns, LowerFence = 10.037 ns, UpperFence = 10.506 ns
ConfidenceInterval = [10.220 ns; 10.407 ns] (CI 99.9%), Margin = 0.094 ns (0.91% of Mean)
Skewness = 5.69, Kurtosis = 37.64, MValue = 2
-------------------- Histogram --------------------
[10.040 ns ; 10.333 ns) | @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
[10.333 ns ; 10.692 ns) | @@@@@@@@@@@@@@@@@@@@@
[10.692 ns ; 11.004 ns) |
[11.004 ns ; 11.317 ns) |
[11.317 ns ; 11.629 ns) |
[11.629 ns ; 12.244 ns) | @@
---------------------------------------------------

// * Summary *

BenchmarkDotNet=v0.13.0.1559-nightly, OS=ubuntu 18.04
Intel Core i7-4770K CPU 3.50GHz (Haswell), 1 CPU, 8 logical and 4 physical cores
.NET SDK=6.0.100-preview.5.21302.13
  [Host]     : .NET 6.0.0 (6.0.21.30105), X64 RyuJIT
  Job-JQMXIT : .NET 6.0.0 (42.42.42.42424), X64 RyuJIT

PowerPlanMode=00000000-0000-0000-0000-000000000000  Arguments=/p:DebugType=portable  Toolchain=CoreRun
IterationTime=250.0000 ms  MaxIterationCount=101  MemoryRandomization=True
MinIterationCount=100  WarmupCount=1
Method Size Mean Error StdDev Median Min Max Gen 0 Gen 1 Gen 2 Allocated Code Size
BinarySearch 512 10.31 ns 0.094 ns 0.276 ns 10.27 ns 10.12 ns 12.17 ns - - - - 163 B

It definitely looks like there is some effect. I ran some more times and every time the histograms resembled the above: when memory randomization is on there are multiple buckets, while with memory randomization off there is just one bucket.

adamsitnik commented 3 years ago

It definitely looks like there is some effect. I ran some more times and every time the histograms resembled the above: when memory randomization is on there are multiple buckets, while with memory randomization off there is just one bucket.

So it seems to be a memory alignment issue. Since it's an improvement on Windows x64 an Linux ARM64 and it's hard to repro on Linux (in the results provided above by @jakobbotsch it's 1 out of 100 iterations), I think that we can just close this issue.

adamsitnik commented 3 years ago

Make sure that you do not have any COMPlus_ variables like Tier=0 or ReadyToRun=0 set.

FWIW this is what the benchmarks_ci.py script does OOTB