performanceautofiler[bot] commented 3 years ago

Run Information

Architecture	x64
OS	ubuntu 18.04
Baseline	d148c34bdfbd49e3d820eaa4dcbc47832d52d0d2
Compare	bd3fb963e701743eb6847b898eda8571ead97155
Diff	Diff

Regressions in System.Memory.Span<Byte>

Benchmark	Baseline	Test	Test/Base	Test Quality	Edge Detector	Baseline IR	Compare IR	IR Ratio	Baseline ETL	Compare ETL
[BinarySearch - Duration of single invocation](<https://pvscmdupload.blob.core.windows.net/reports/allTestHistory/refs/heads/main_x64_ubuntu 18.04/System.Memory.Span(Byte).BinarySearch(Size%3a%20512).html>)	9.28 ns	11.80 ns	1.27	0.03	True

graph Historical Data in Reporting System

Repro

git clone https://github.com/dotnet/performance.git
python3 .\performance\scripts\benchmarks_ci.py -f netcoreapp5.0 --filter 'System.Memory.Span&lt;Byte&gt;*'

### Payloads [Baseline]() [Compare]() ### Histogram #### System.Memory.Span<Byte>.BinarySearch(Size: 512) ```log ``` ### Docs [Profiling workflow for dotnet/runtime repository](https://github.com/dotnet/performance/blob/master/docs/profiling-workflow-dotnet-runtime.md) [Benchmarking workflow for dotnet/runtime repository](https://github.com/dotnet/performance/blob/master/docs/benchmarking-workflow-dotnet-runtime.md)

ghost commented 3 years ago

Tagging subscribers to this area: @GrabYourPitchForks, @dotnet/area-system-memory See info in area-owners.md if you want to be subscribed.

Issue Details

### Run Information Architecture | x64 -- | -- OS | ubuntu 18.04 Baseline | [d148c34bdfbd49e3d820eaa4dcbc47832d52d0d2](https://github.com/dotnet/runtime/commit/d148c34bdfbd49e3d820eaa4dcbc47832d52d0d2) Compare | [bd3fb963e701743eb6847b898eda8571ead97155](https://github.com/dotnet/runtime/commit/bd3fb963e701743eb6847b898eda8571ead97155) Diff | [Diff](https://github.com/dotnet/runtime/compare/d148c34bdfbd49e3d820eaa4dcbc47832d52d0d2...bd3fb963e701743eb6847b898eda8571ead97155) ### Regressions in System.Memory.Span<Byte> Benchmark | Baseline | Test | Test/Base | Test Quality | Edge Detector | Baseline IR | Compare IR | IR Ratio | Baseline ETL | Compare ETL -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- [BinarySearch - Duration of single invocation]() | 9.28 ns | 11.80 ns | 1.27 | 0.03 | True | | | ![graph]() [Historical Data in Reporting System]() ### Repro ```cmd git clone https://github.com/dotnet/performance.git python3 .\performance\scripts\benchmarks_ci.py -f netcoreapp5.0 --filter 'System.Memory.Span<Byte>*' ```

### Payloads [Baseline]() [Compare]() ### Histogram #### System.Memory.Span<Byte>.BinarySearch(Size: 512) ```log ``` ### Docs [Profiling workflow for dotnet/runtime repository](https://github.com/dotnet/performance/blob/master/docs/profiling-workflow-dotnet-runtime.md) [Benchmarking workflow for dotnet/runtime repository](https://github.com/dotnet/performance/blob/master/docs/benchmarking-workflow-dotnet-runtime.md)

Author:	performanceautofiler[bot]
Assignees:	-
Labels:	`area-System.Memory`, `untriaged`
Milestone:	-

jakobbotsch commented 3 years ago

The ASM diff is:

diff --git "a/.\\old_disasm.txt" "b/.\\new_disasm.txt"
index c28988d..1bf670f 100644
--- "a/.\\old_disasm.txt"
+++ "b/.\\new_disasm.txt"
@@ -8,62 +8,64 @@
 ; Final local variable assignments
 ;
 ;  V00 arg0         [V00,T05] (  3,  6   )   byref  ->  rcx         single-def
-;  V01 arg1         [V01,T07] (  3,  3   )     int  ->  rdx         single-def
-;  V02 arg2         [V02,T06] (  3,  6   )   ubyte  ->  [rsp+18H]   do-not-enreg[F] ld-addr-op single-def
-;  V03 loc0         [V03,T04] (  5, 11.50)     int  ->  rax        
+;  V01 arg1         [V01,T06] (  3,  3   )     int  ->  rdx         single-def
+;  V02 arg2         [V02,T07] (  3,  3   )   ubyte  ->   r8         ld-addr-op single-def
+;  V03 loc0         [V03,T04] (  5, 11.50)     int  ->   r9        
 ;  V04 loc1         [V04,T02] (  5, 12   )     int  ->  rdx        
 ;  V05 loc2         [V05,T01] (  5, 12.50)     int  ->   r8        
-;  V06 loc3         [V06,T03] (  3, 12   )     int  ->  r10        
+;  V06 loc3         [V06,T03] (  3, 12   )     int  ->  r11        
 ;# V07 OutArgs      [V07    ] (  1,  1   )  lclBlk ( 0) [rsp+00H]   "OutgoingArgSpace"
-;  V08 tmp1         [V08,T00] (  2, 16   )   ubyte  ->   r9         "Inlining Arg"
+;  V08 tmp1         [V08,T00] (  2, 16   )   ubyte  ->  r10         "Inlining Arg"
+;  V09 cse0         [V09,T08] (  2,  5   )     int  ->  rax         "CSE - aggressive"
 ;
 ; Lcl frame size = 0

 G_M10589_IG01:              ;; offset=0000H
-       4489442418           mov      dword ptr [rsp+18H], r8d
-                       ;; bbWeight=1    PerfScore 1.00
-G_M10589_IG02:              ;; offset=0005H
-       33C0                 xor      eax, eax
+                       ;; bbWeight=1    PerfScore 0.00
+G_M10589_IG02:              ;; offset=0000H
+       4533C9               xor      r9d, r9d
        FFCA                 dec      edx
        85D2                 test     edx, edx
-       7C2D                 jl       SHORT G_M10589_IG07
+       7C2F                 jl       SHORT G_M10589_IG07
+       410FB6C0             movzx    rax, r8b
                             align    [0 bytes]
-                       ;; bbWeight=1    PerfScore 1.75
+                       ;; bbWeight=1    PerfScore 2.00
 G_M10589_IG03:              ;; offset=000DH
-       448D0402             lea      r8d, [rdx+rax]
+       468D040A             lea      r8d, [rdx+r9]
        41D1E8               shr      r8d, 1
-       4D63C8               movsxd   r9, r8d
-       460FB60C09           movzx    r9, byte  ptr [rcx+r9]
-       440FB6542418         movzx    r10, byte  ptr [rsp+18H]
-       452BD1               sub      r10d, r9d
-       7416                 je       SHORT G_M10589_IG09
-       4585D2               test     r10d, r10d
+       4D63D0               movsxd   r10, r8d
+       460FB61411           movzx    r10, byte  ptr [rcx+r10]
+       448BD8               mov      r11d, eax
+       452BDA               sub      r11d, r10d
+       741A                 je       SHORT G_M10589_IG09
+       4585DB               test     r11d, r11d
        7E06                 jle      SHORT G_M10589_IG05
-                       ;; bbWeight=4    PerfScore 27.00
-G_M10589_IG04:              ;; offset=002CH
-       418D4001             lea      eax, [r8+1]
+                       ;; bbWeight=4    PerfScore 24.00
+G_M10589_IG04:              ;; offset=0029H
+       458D4801             lea      r9d, [r8+1]
        EB04                 jmp      SHORT G_M10589_IG06
                        ;; bbWeight=2    PerfScore 5.00
-G_M10589_IG05:              ;; offset=0032H
+G_M10589_IG05:              ;; offset=002FH
        418D50FF             lea      edx, [r8-1]
                        ;; bbWeight=2    PerfScore 1.00
-G_M10589_IG06:              ;; offset=0036H
-       3BC2                 cmp      eax, edx
-       7ED3                 jle      SHORT G_M10589_IG03
+G_M10589_IG06:              ;; offset=0033H
+       443BCA               cmp      r9d, edx
+       7ED5                 jle      SHORT G_M10589_IG03
                        ;; bbWeight=4    PerfScore 5.00
-G_M10589_IG07:              ;; offset=003AH
+G_M10589_IG07:              ;; offset=0038H
+       418BC1               mov      eax, r9d
        F7D0                 not      eax
-                       ;; bbWeight=0.50 PerfScore 0.13
-G_M10589_IG08:              ;; offset=003CH
+                       ;; bbWeight=0.50 PerfScore 0.25
+G_M10589_IG08:              ;; offset=003DH
        C3                   ret      
                        ;; bbWeight=0.50 PerfScore 0.50
-G_M10589_IG09:              ;; offset=003DH
+G_M10589_IG09:              ;; offset=003EH
        418BC0               mov      eax, r8d
                        ;; bbWeight=0.50 PerfScore 0.13
-G_M10589_IG10:              ;; offset=0040H
+G_M10589_IG10:              ;; offset=0041H
        C3                   ret      
                        ;; bbWeight=0.50 PerfScore 0.50

-; Total bytes of code 65, prolog size 5, PerfScore 48.50, instruction count 24, allocated bytes for code 65 (MethodHash=56dcd6a2) for method System.SpanHelpers:BinarySearch(byref,int,ubyte):int
+; Total bytes of code 66, prolog size 0, PerfScore 44.98, instruction count 25, allocated bytes for code 66 (MethodHash=56dcd6a2) for method System.SpanHelpers:BinarySearch(byref,int,ubyte):int
 ; ============================================================

We enregister the byte arg now and the inner loop looks almost equal. There is a reg-reg mov instead of a memory load but I don't see why this would regress performance. I'll take a closer look tomorrow, this disasm is with R2R disabled, maybe the R2R code is different.

jakobbotsch commented 3 years ago

Oops, I am looking at Windows which indeed saw improvements. Will check the Ubuntu diff tomorrow.

jakobbotsch commented 3 years ago

The codegen changes for Unix look similar, at least with alt-jit:

diff --git "a/.\\old_disasm.txt" "b/.\\new_disasm.txt"
index 7dc58c6..c22ddca 100644
--- "a/.\\old_disasm.txt"
+++ "b/.\\new_disasm.txt"
@@ -1,77 +1,76 @@
 ; Assembly listing for method System.SpanHelpers:BinarySearch(byref,int,ubyte):int
 ; Emitting BLENDED_CODE for X64 CPU with AVX - Unix
 ; optimized code
 ; rbp based frame
 ; fully interruptible
 ; No PGO data
 ; 0 inlinees with PGO data; 2 single block inlinees; 0 inlinees without PGO data
 ; invoked as altjit
 ; Final local variable assignments
 ;
 ;  V00 arg0         [V00,T05] (  3,  6   )   byref  ->  rdi         single-def
-;  V01 arg1         [V01,T07] (  3,  3   )     int  ->  rsi         single-def
-;  V02 arg2         [V02,T06] (  3,  6   )   ubyte  ->  [rbp-04H]   do-not-enreg[F] ld-addr-op single-def
-;  V03 loc0         [V03,T04] (  5, 11.50)     int  ->  rax        
+;  V01 arg1         [V01,T06] (  3,  3   )     int  ->  rsi         single-def
+;  V02 arg2         [V02,T07] (  3,  3   )   ubyte  ->  rdx         ld-addr-op single-def
+;  V03 loc0         [V03,T04] (  5, 11.50)     int  ->  rcx        
 ;  V04 loc1         [V04,T02] (  5, 12   )     int  ->  rsi        
 ;  V05 loc2         [V05,T01] (  5, 12.50)     int  ->  rdx        
-;  V06 loc3         [V06,T03] (  3, 12   )     int  ->   r8        
+;  V06 loc3         [V06,T03] (  3, 12   )     int  ->   r9        
 ;# V07 OutArgs      [V07    ] (  1,  1   )  lclBlk ( 0) [rsp+00H]   "OutgoingArgSpace"
-;  V08 tmp1         [V08,T00] (  2, 16   )   ubyte  ->  rcx         "Inlining Arg"
+;  V08 tmp1         [V08,T00] (  2, 16   )   ubyte  ->   r8         "Inlining Arg"
+;  V09 cse0         [V09,T08] (  2,  5   )     int  ->  rax         "CSE - aggressive"
 ;
-; Lcl frame size = 16
+; Lcl frame size = 0

 G_M10589_IG01:              ;; offset=0000H
        55                   push     rbp
-       4883EC10             sub      rsp, 16
-       488D6C2410           lea      rbp, [rsp+10H]
-       8955FC               mov      dword ptr [rbp-04H], edx
-                       ;; bbWeight=1    PerfScore 2.75
-G_M10589_IG02:              ;; offset=000DH
-       33C0                 xor      eax, eax
+       488BEC               mov      rbp, rsp
+                       ;; bbWeight=1    PerfScore 1.25
+G_M10589_IG02:              ;; offset=0004H
+       33C9                 xor      ecx, ecx
        FFCE                 dec      esi
        85F6                 test     esi, esi
-       7C27                 jl       SHORT G_M10589_IG07
+       7C29                 jl       SHORT G_M10589_IG07
+       0FB6C2               movzx    rax, dl
                             align    [0 bytes]
-                       ;; bbWeight=1    PerfScore 1.75
-G_M10589_IG03:              ;; offset=0015H
-       8D1406               lea      edx, [rsi+rax]
+                       ;; bbWeight=1    PerfScore 2.00
+G_M10589_IG03:              ;; offset=000FH
+       8D140E               lea      edx, [rsi+rcx]
        D1EA                 shr      edx, 1
-       4863CA               movsxd   rcx, edx
-       0FB60C0F             movzx    rcx, byte  ptr [rdi+rcx]
-       440FB645FC           movzx    r8, byte  ptr [rbp-04H]
-       442BC1               sub      r8d, ecx
-       7419                 je       SHORT G_M10589_IG09
-       4585C0               test     r8d, r8d
+       4C63C2               movsxd   r8, edx
+       460FB60407           movzx    r8, byte  ptr [rdi+r8]
+       448BC8               mov      r9d, eax
+       452BC8               sub      r9d, r8d
+       7417                 je       SHORT G_M10589_IG09
+       4585C9               test     r9d, r9d
        7E05                 jle      SHORT G_M10589_IG05
-                       ;; bbWeight=4    PerfScore 27.00
-G_M10589_IG04:              ;; offset=0030H
-       8D4201               lea      eax, [rdx+1]
+                       ;; bbWeight=4    PerfScore 24.00
+G_M10589_IG04:              ;; offset=0029H
+       8D4A01               lea      ecx, [rdx+1]
        EB03                 jmp      SHORT G_M10589_IG06
                        ;; bbWeight=2    PerfScore 5.00
-G_M10589_IG05:              ;; offset=0035H
+G_M10589_IG05:              ;; offset=002EH
        8D72FF               lea      esi, [rdx-1]
                        ;; bbWeight=2    PerfScore 1.00
-G_M10589_IG06:              ;; offset=0038H
-       3BC6                 cmp      eax, esi
-       7ED9                 jle      SHORT G_M10589_IG03
+G_M10589_IG06:              ;; offset=0031H
+       3BCE                 cmp      ecx, esi
+       7EDA                 jle      SHORT G_M10589_IG03
                        ;; bbWeight=4    PerfScore 5.00
-G_M10589_IG07:              ;; offset=003CH
+G_M10589_IG07:              ;; offset=0035H
+       8BC1                 mov      eax, ecx
        F7D0                 not      eax
-                       ;; bbWeight=0.50 PerfScore 0.13
-G_M10589_IG08:              ;; offset=003EH
-       4883C410             add      rsp, 16
+                       ;; bbWeight=0.50 PerfScore 0.25
+G_M10589_IG08:              ;; offset=0039H
        5D                   pop      rbp
        C3                   ret      
-                       ;; bbWeight=0.50 PerfScore 0.88
-G_M10589_IG09:              ;; offset=0044H
+                       ;; bbWeight=0.50 PerfScore 0.75
+G_M10589_IG09:              ;; offset=003BH
        8BC2                 mov      eax, edx
                        ;; bbWeight=0.50 PerfScore 0.13
-G_M10589_IG10:              ;; offset=0046H
-       4883C410             add      rsp, 16
+G_M10589_IG10:              ;; offset=003DH
        5D                   pop      rbp
        C3                   ret      
-                       ;; bbWeight=0.50 PerfScore 0.88
+                       ;; bbWeight=0.50 PerfScore 0.75

-; Total bytes of code 76, prolog size 13, PerfScore 52.10, instruction count 31, allocated bytes for code 76 (MethodHash=56dcd6a2) for method System.SpanHelpers:BinarySearch(byref,int,ubyte):int
+; Total bytes of code 63, prolog size 4, PerfScore 46.43, instruction count 29, allocated bytes for code 63 (MethodHash=56dcd6a2) for method System.SpanHelpers:BinarySearch(byref,int,ubyte):int
 ; ============================================================

jakobbotsch commented 3 years ago

I don't seem to be able to reproduce this:


BenchmarkDotNet=v0.13.0.1559-nightly, OS=ubuntu 18.04
Intel Core i7-4770K CPU 3.50GHz (Haswell), 1 CPU, 8 logical and 4 physical cores
.NET SDK=6.0.100-rc.1.21378.1
  [Host]     : .NET 6.0.0 (6.0.21.37624), X64 RyuJIT
  Job-UQAAIQ : .NET 6.0.0 (42.42.42.42424), X64 RyuJIT
  Job-XZCLYE : .NET 6.0.0 (42.42.42.42424), X64 RyuJIT

PowerPlanMode=00000000-0000-0000-0000-000000000000  Arguments=/p:DebugType=portable  IterationTime=250.0000 ms  
MaxIterationCount=20  MinIterationCount=15  WarmupCount=1

Method	Job	Toolchain	Size	Mean	Error	StdDev	Median	Min	Max	Ratio	Gen 0	Gen 1	Gen 2	Allocated	Code Size
BinarySearch	Job-UQAAIQ	/new/corerun	512	10.34 ns	0.068 ns	0.063 ns	10.31 ns	10.27 ns	10.45 ns	0.86	-	-	-	-	163 B
BinarySearch	Job-XZCLYE	/old/corerun	512	12.06 ns	0.083 ns	0.073 ns	12.07 ns	11.94 ns	12.16 ns	1.00	-	-	-	-	176 B

new is 9d9b465f0be92a0d7584c4b8ad6b873be7f0dbc9 while old is the same but with a03a5c8ec376e90a3721fb77504841a5d1b43e13 reverted.

I am invoking:

python3 ./scripts/benchmarks_ci.py -f net6.0 --corerun /home/jbn/dev/dotnet/issues/56402/old/corerun /home/jbn/dev/dotnet/issues/56402/new/corerun --filter 'System.Memory.Span<Byte>*BinarySearch'

@DrewScoggins does that look like the right way to reproduce the issue?

cc @kunalspathak

jakobbotsch commented 3 years ago

I also tried downloading the baseline and compare from the issue and I get the same improvement:


BenchmarkDotNet=v0.13.0.1559-nightly, OS=ubuntu 18.04
Intel Core i7-4770K CPU 3.50GHz (Haswell), 1 CPU, 8 logical and 4 physical cores
.NET SDK=6.0.100-rc.1.21378.1
  [Host]     : .NET 6.0.0 (6.0.21.37624), X64 RyuJIT
  Job-HALSTK : .NET 6.0.0 (6.0.21.37213), X64 RyuJIT
  Job-NWLTBS : .NET 6.0.0 (6.0.21.37303), X64 RyuJIT

PowerPlanMode=00000000-0000-0000-0000-000000000000  Arguments=/p:DebugType=portable  IterationTime=250.0000 ms  
MaxIterationCount=20  MinIterationCount=15  WarmupCount=1

Method	Job	Toolchain	Size	Mean	Error	StdDev	Median	Min	Max	Ratio	RatioSD	Code Size	Gen 0	Gen 1	Gen 2	Allocated
BinarySearch	Job-HALSTK	/baseline/corerun	512	11.454 ns	0.2494 ns	0.2333 ns	11.478 ns	11.187 ns	11.91 ns	1.00	0.00	119 B	-	-	-	-
BinarySearch	Job-NWLTBS	/compare/corerun	512	9.946 ns	0.1749 ns	0.1636 ns	9.973 ns	9.600 ns	10.13 ns	0.87	0.02	106 B	-	-	-	-

EgorBo commented 3 years ago

@jakobbotsch I usually invoke benchmarks like this:

cd src\benchmarks\micro
dotnet run -c Release -f net6.0 -- --coreRun /path/to/release/corerun.exe /path/to/release/corerun.exe --filter "*benchName*"

but I guess it's the same as you do. I also pass --iterationTime 500 sometimes.

kunalspathak commented 3 years ago

Make sure that you do not have any COMPlus_ variables like Tier=0 or ReadyToRun=0 set. I also use -d switch to print the disassembly by the harness.

I did little investigation and it seems to be jcc erratum might be affecting the performance. If you see all test history of windows, it is unstable, so it could be that the improvement that we saw in https://github.com/dotnet/perf-autofiling-issues/issues/28 is just another low of the instability on windows after your change.

Here is the diff on windows/x64:

On Linux, the benchmark seems pretty consistent with less variation if you see the all test history

Here is the diff on linux/x64:

If you notice, that both linux and windows has instruction sub that crosses the alignment boundary leading to JCC erratum and it could be that it is more evident in Linux than on Windows.

jakobbotsch commented 3 years ago

Thanks for the additional look @kunalspathak.

Make sure that you do not have any COMPlus_ variables like Tier=0 or ReadyToRun=0 set. I also use -d switch to print the disassembly by the harness.

I did check the disassembly too, but I didn't know about the jcc erratum. That erratum does not affect my CPU which is quite old (Haswell), so that probably explains why I see improvements.

Have we thought about adding a mitigation for this erratum that can be turned on if the user wants to? E.g. for these automatic reports that could be quite useful.

kunalspathak commented 3 years ago

Have we thought about adding a mitigation for this erratum that can be turned on if the user wants to? E.g. for these automatic reports that could be quite useful.

Yes, we want to add it and there was discussion frequently with @BruceForstall . I will see if I can add something behind a switch so we can try and validate it for benchmarks.

jakobbotsch commented 3 years ago

I tried turning memoryRandomization on to see if I could reproduce the instability. I did --iterationTime 5000 but I got pretty much identical results.

Without memory randomization

Span<Byte>.BinarySearch: Job-ILKRNU(PowerPlanMode=00000000-0000-0000-0000-000000000000, Arguments=/p:DebugType=portable, Toolchain=CoreRun, IterationTime=5.0000 s, MaxIterationCount=20, MinIterationCount=15, WarmupCount=1) [Size=512]
Runtime = .NET 6.0.0 (42.42.42.42424), X64 RyuJIT; GC = Concurrent Workstation
Mean = 10.755 ns, StdErr = 0.016 ns (0.14%), N = 14, StdDev = 0.058 ns
Min = 10.685 ns, Q1 = 10.708 ns, Median = 10.745 ns, Q3 = 10.782 ns, Max = 10.882 ns
IQR = 0.074 ns, LowerFence = 10.597 ns, UpperFence = 10.892 ns
ConfidenceInterval = [10.689 ns; 10.821 ns] (CI 99.9%), Margin = 0.066 ns (0.61% of Mean)
Skewness = 0.64, Kurtosis = 2.27, MValue = 2
-------------------- Histogram --------------------
[10.654 ns ; 10.913 ns) | @@@@@@@@@@@@@@
---------------------------------------------------

// * Summary *

BenchmarkDotNet=v0.13.0.1559-nightly, OS=ubuntu 18.04
Intel Core i7-4770K CPU 3.50GHz (Haswell), 1 CPU, 8 logical and 4 physical cores
.NET SDK=6.0.100-preview.5.21302.13
  [Host]     : .NET 6.0.0 (6.0.21.30105), X64 RyuJIT
  Job-ILKRNU : .NET 6.0.0 (42.42.42.42424), X64 RyuJIT

PowerPlanMode=00000000-0000-0000-0000-000000000000  Arguments=/p:DebugType=portable  Toolchain=CoreRun
IterationTime=5.0000 s  MaxIterationCount=20  MinIterationCount=15
WarmupCount=1

Method	Size	Mean	Error	StdDev	Median	Min	Max	Code Size	Gen 0	Gen 1	Gen 2	Allocated
BinarySearch	512	10.75 ns	0.066 ns	0.058 ns	10.74 ns	10.69 ns	10.88 ns	163 B	-	-	-	-

With memory randomization

// * Detailed results *
Span<Byte>.BinarySearch: Job-KFWQEJ(PowerPlanMode=00000000-0000-0000-0000-000000000000, Arguments=/p:DebugType=portable, Toolchain=CoreRun, IterationTime=5.0000 s, MaxIterationCount=20, MemoryRandomization=True, MinIterationCount=15, WarmupCount=1) [Size=512]
Runtime = .NET 6.0.0 (42.42.42.42424), X64 RyuJIT; GC = Concurrent Workstation
Mean = 10.694 ns, StdErr = 0.011 ns (0.10%), N = 15, StdDev = 0.041 ns
Min = 10.655 ns, Q1 = 10.665 ns, Median = 10.677 ns, Q3 = 10.716 ns, Max = 10.790 ns
IQR = 0.051 ns, LowerFence = 10.589 ns, UpperFence = 10.792 ns
ConfidenceInterval = [10.651 ns; 10.738 ns] (CI 99.9%), Margin = 0.044 ns (0.41% of Mean)
Skewness = 0.96, Kurtosis = 2.63, MValue = 2
-------------------- Histogram --------------------
[10.633 ns ; 10.811 ns) | @@@@@@@@@@@@@@@
---------------------------------------------------

// * Summary *

BenchmarkDotNet=v0.13.0.1559-nightly, OS=ubuntu 18.04
Intel Core i7-4770K CPU 3.50GHz (Haswell), 1 CPU, 8 logical and 4 physical cores
.NET SDK=6.0.100-preview.5.21302.13
  [Host]     : .NET 6.0.0 (6.0.21.30105), X64 RyuJIT
  Job-KFWQEJ : .NET 6.0.0 (42.42.42.42424), X64 RyuJIT

PowerPlanMode=00000000-0000-0000-0000-000000000000  Arguments=/p:DebugType=portable  Toolchain=CoreRun
IterationTime=5.0000 s  MaxIterationCount=20  MemoryRandomization=True
MinIterationCount=15  WarmupCount=1

Method	Size	Mean	Error	StdDev	Median	Min	Max	Code Size	Gen 0	Gen 1	Gen 2	Allocated
BinarySearch	512	10.69 ns	0.044 ns	0.041 ns	10.68 ns	10.65 ns	10.79 ns	163 B	-	-	-	-

kunalspathak commented 3 years ago

FYI @adamsitnik - This benchmark does allocate and access array, so I am surprised that memory randomization doesn't show anything. Is there something that we are missing to randomize?

DrewScoggins commented 3 years ago

We are also seeing improvements from this change on Arm64. https://github.com/dotnet/perf-autofiling-issues/issues/76

adamsitnik commented 3 years ago

This benchmark does allocate and access array, so I am surprised that memory randomization doesn't show anything. Is there something that we are missing to randomize?

Most likely we need more iterations (the default is 20). Sth like:

--minIterationCount 100 --maxIterationCount 101

jakobbotsch commented 3 years ago

With that I get: