dotnet/runtime

Investigate microbenchmarks that regress with PGO enabled #84264

Closed (AndyAyersMS closed this 1 year ago)

AndyAyersMS commented 1 year ago

Look into cases where microbenchmarks are slower with PGO enabled.

https://github.com/dotnet/runtime/issues/74873#issuecomment-1492803245 has an initial list to check.


Issues uncovered

Fixes

ghost commented 1 year ago

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch, @kunalspathak See info in area-owners.md if you want to be subscribed.

Issue Details
Author: AndyAyersMS
Assignees: AndyAyersMS
Labels: `area-CodeGen-coreclr`
Milestone: -
AndyAyersMS commented 1 year ago

PerfLabTests.CastingPerf.CheckArrayIsInterfaceNo

Default [link]

newplot - 2023-04-03T121123 088

PGO [link]

newplot - 2023-04-03T121343 443

Improvements in both on 12/4 were from #79060. Regression in PGO on 1/13 was from #80481.

Looks like quite a few tests were impacted by #80481. So I'll just focus on that for now.

EgorBo commented 1 year ago

@AndyAyersMS I think that CastHelpers should not be instrumented since they were already carefully optimized for the general case by @VSadov, e.g. https://github.com/dotnet/runtime/blob/e6226e675a63b5d9e41f43492a13556fd4053b25/src/coreclr/System.Private.CoreLib/src/System/Runtime/CompilerServices/CastHelpers.cs#L277-L322

EgorBo commented 1 year ago

PS: we can probably propose a new API [MethodImpl(MethodImplOptions.NoInstrumenting)]
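For illustration, a rough sketch of how that opt-out could sit next to the existing AggressiveOptimization attribute (MethodImplOptions.NoInstrumenting is hypothetical and does not exist today; the method names, signatures, and bodies below are placeholders, not the real CastHelpers code):

using System;
using System.Runtime.CompilerServices;

internal static class CastHelpersSketch
{
    // Today's escape hatch: AggressiveOptimization skips tiering and instrumentation
    // entirely, at the cost of never collecting dynamic PGO data for the method.
    [MethodImpl(MethodImplOptions.AggressiveOptimization)]
    private static object IsInstanceOfInterfaceToday(IntPtr toTypeHnd, object obj) => obj;

    // The proposal: keep tiering, but opt the method out of instrumentation only.
    // Shown commented out because MethodImplOptions.NoInstrumenting is not a real flag.
    // [MethodImpl(MethodImplOptions.NoInstrumenting)]
    private static object IsInstanceOfInterfaceProposed(IntPtr toTypeHnd, object obj) => obj;
}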

VSadov commented 1 year ago

Yes we are certain that these methods will be hot and want the best possible codegen right away.

The cast helpers are also initialized in a special way - https://github.com/dotnet/runtime/blob/4cbe6f99d23e04c56a89251d49de1b0f14000427/src/coreclr/vm/ecall.cpp#L99

I do not remember all the details, but there were reasons for that.
Maybe it is possible to disable instrumentation for things initialized that way. (I do not have preferences, just want to mention that there could be another way to detect and special-case these.)

EgorBo commented 1 year ago

> Yes we are certain that these methods will be hot and want the best possible codegen right away.
>
> The cast helpers are also initialized in a special way -
>
> https://github.com/dotnet/runtime/blob/4cbe6f99d23e04c56a89251d49de1b0f14000427/src/coreclr/vm/ecall.cpp#L99
>
> I do not remember all the details, but there were reasons for that. Maybe it is possible to disable instrumentation for things initialized that way. (I do not have preferences, just want to mention that there could be another way to detect and special-case these.)

The ones you listed have the AggressiveOptimization attribute, am I right? In that case we won't instrument them, but there are other (less important) members in CastHelpers which don't have AggressiveOptimization and so are eligible for instrumentation.

VSadov commented 1 year ago

All the cast helpers are initialized in that method. If you have other uses and reasons to introduce MethodImplOptions.NoInstrumenting, then this part is unimportant.

It would only be interesting if you somehow do not want MethodImplOptions.NoInstrumenting just for a few special methods and need another way to identify them.

AndyAyersMS commented 1 year ago

The issue caused by https://github.com/dotnet/runtime/pull/80481 is with benchmark methods like PerfLabTests.CastingPerf:CheckArrayIsInterfaceNo that trigger OSR by looping a lot internally.

For such methods we never see any Tier0 + instr method return, so with sparse profiles enabled, Tier1 profile reconstruction thinks the entry block is run rarely, and this messes up optimizations; e.g. in this case we fail to hoist/CSE the cast out of the loop.
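Roughly the shape involved (an illustrative sketch, not the actual PerfLabTests source):

public interface IFoo { }

public static class CastBenchShape
{
    static readonly object s_array = new int[16]; // int[] does not implement IFoo
    public static bool Result;

    public static void CheckArrayIsInterfaceNo()
    {
        object o = s_array;
        // One long internal loop: the instrumented Tier0 code never returns from here
        // (we escape to OSR mid-loop), so the entry block records almost no hits.
        for (int i = 0; i < 2_000_000; i++)
        {
            // Loop-invariant cast check; with a believable profile it can be hoisted/CSE'd.
            Result = o is IFoo;
        }
    }
}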

Trees after Profile incorporation

-----------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds           weight       IBC  lp [IL range]     [jump]      [EH region]         [flags]
-----------------------------------------------------------------------------------------------------------------------------------------
BB01 [0000]  1                             0          0    [000..006)-> BB03 (always)                     rare IBC 
BB02 [0001]  1       BB03                 5630k 5630382    [006..018)                                     bwd bwd-target IBC 
BB03 [0002]  2       BB01,BB02            5630k 5630382    [018..020)-> BB02 ( cond )                     bwd bwd-src IBC 
BB04 [0003]  1       BB03                  0          0    [020..022)        (return)                     rare IBC 
-----------------------------------------------------------------------------------------------------------------------------------------

The fix I'm proposing is to run synthesis repair in such cases. It addresses the above case; looking now to see what else it does.

AndyAyersMS commented 1 year ago

Also noticed #84319 while reviewing the dumps for CheckArrayIsInterfaceNo.

AndyAyersMS commented 1 year ago

After #84312 newplot - 2023-04-06T130742 053

Updated table of worst-case results (just 1 day of data, so possibly noisy; e.g. PerfLabTests.CastingPerf.CheckArrayIsInterfaceNo is bimodal but hasn't shown fast results for PGO yet).

[now showing the "fast time"]

newplot (7)

Investigated cases in bold, notes linked -- skipping tests in italics that look similar to ones already analyzed.

TestName Pgo/NoPgo Note
System.Text.Json.Tests.Perf_Get.GetByte [notes] 0.378284924 budget #85531
System.Text.Json.Tests.Perf_Get.GetUInt16 0.39703163 dup
System.Text.Json.Tests.Perf_Get.GetInt16 0.415271098 dup
System.Text.Json.Tests.Perf_Get.GetSByte 0.432559076 dup
System.Text.Json.Tests.Perf_Get.GetUInt32 0.485751995 dup
System.Text.Json.Tests.Perf_Get.GetInt32 0.496368517 dup
PerfLabTests.CastingPerf.CheckArrayIsInterfaceYes 0.500523372 dup
PerfLabTests.CastingPerf.CheckArrayIsInterfaceNo [notes] improvements 0.500571243 fixed https://github.com/dotnet/runtime/pull/84312
System.Collections.IterateFor<String>.ImmutableArray(Size: 512) [notes] 0.521820448 suspect noise
Exceptions.Handling.MultipleNestedTryCatch_FirstCatches(kind: Hardware) [notes] 0.587834042 noise
Exceptions.Handling.ThrowAndCatchWhenFinally(kind: Software) 0.629162936 dup
System.Collections.Sort<Int32>.Array(Size: 512) [notes] 0.691174178 layout
Exceptions.Handling.ThrowAndCatch(kind: ReflectionSoftware) 0.712891743 dup
System.Text.Json.Serialization.Tests.WriteJson<ImmutableDictionary<String, String>>.SerializeToWriter(Mode: SourceGen) [notes] 0.714508317 budget #85531
System.Text.Json.Serialization.Tests.WriteJson<ImmutableDictionary<String, String>>.SerializeToUtf8Bytes(Mode: SourceGen) 0.716100799 dup
System.Collections.ContainsKeyTrue<Int32, Int32>.ConcurrentDictionary(Size: 512) [notes] 0.716735601 deferred
System.Text.Json.Serialization.Tests.WriteJson<ImmutableDictionary<String, String>>.SerializeToString(Mode: SourceGen) 0.72480144 dup
System.Text.Json.Serialization.Tests.WriteJson<ImmutableDictionary<String, String>>.SerializeToStream(Mode: SourceGen) 0.728309251 dup
System.Text.Json.Serialization.Tests.WriteJson<ImmutableDictionary<String, String>>.SerializeObjectProperty(Mode: SourceGen) 0.732178172 dup
Exceptions.Handling.ThrowAndCatch_ManyCatchBlocks(kind: ReflectionSoftware) 0.734050863 dup
Benchstone.BenchI.MulMatrix.Test [notes] 0.746493091 repair https://github.com/dotnet/runtime/pull/84741
LinqBenchmarks.Count00ForX [notes] 0.75390536 #84517
Benchstone.MDBenchI.MDLogicArray.Test [notes] 0.755000226 repair
Span.Sorting.QuickSortSpan(Size: 512) [notes] 0.755771373 layout
System.IO.Tests.Perf_Path.GetDirectoryName [notes] 0.762018237 unclear
System.Collections.CtorFromCollection<Int32>.SortedSet(Size: 512) [notes] 0.784138758 noise
MicroBenchmarks.Serializers.XmlToStream<MyEventsListerViewModel>.XmlSerializer [notes] 0.785627854 noise
System.Text.Json.Tests.Perf_Segment.ReadMultiSegmentSequence(segmentSize: 4096, TestCase: Json4KB) [notes] 0.788119346 budget #85531
System.Text.Json.Tests.Perf_Get.GetUInt64 0.791121821 dup
System.Memory.Span<Byte>.SequenceEqual(Size: 33) [notes] 0.791880016 intrinsic https://github.com/dotnet/runtime/pull/85130
System.Globalization.Tests.StringEquality.Compare_Same_Upper(Count: 1024, Options: (en-US, IgnoreNonSpace)) [notes] 0.798755609 noise
Benchstone.MDBenchF.MDInvMt.Test [notes] 0.80080933 Fixed https://github.com/dotnet/runtime/pull/84741
PerfLabTests.CastingPerf.ObjObjIsFoo [notes] 0.806031322 noise
System.Memory.Span<Byte>.StartsWith(Size: 512) [notes] 0.806253739 noise
Exceptions.Handling.TryAndFinallyDeep(kind: Software) 0.81026551 dup
System.Collections.ContainsFalse<String>.ImmutableArray(Size: 512) [notes] 0.812478704 noise
System.Memory.Span<Byte>.SequenceEqual(Size: 512) 0.821172947 dup
Benchstone.BenchF.InvMt.Test [notes] 0.822482883 fixed https://github.com/dotnet/runtime/pull/84741
ByteMark.BenchLUDecomp [notes] 0.822880993 blend #85171
PerfLabTests.CastingPerf2.CastingPerf.ScalarValueTypeObj [notes] 0.82354731 fixed
System.Memory.Span<Byte>.SequenceCompareToDifferent(Size: 4) [notes] 0.834542277 LSRA block order
PerfLabTests.LowLevelPerf.GenericGenericMethod [notes] 0.836391889 noise
System.Text.Json.Document.Tests.Perf_DocumentParse.Parse(IsDataIndented: False, TestRandomAccess: True, TestCase: Json400KB) [notes] 0.836471477 unclear
System.Perf_Convert.FromBase64String [notes] 0.839094252 maintenance #85265
System.Text.RegularExpressions.Tests.Perf_Regex_Cache.IsMatch(total: 400000, unique: 1, cacheSize: 15) [notes] 0.841496573 noise
Benchstone.BenchI.CSieve.Test [link] 0.842229476 Fixed https://github.com/dotnet/runtime/pull/84741
System.Collections.IndexerSet<Int32>.Dictionary(Size: 512) [notes] 0.842651326 noise ?
System.Text.Json.Serialization.Tests.WriteJson<ImmutableSortedDictionary<String, String>>.SerializeToWriter(Mode: SourceGen) 0.84525902 dup
System.Tests.Perf_Char.Char_ToLowerInvariant(input: "Hello World!") [notes] 0.846315396 noise
System.Text.Json.Tests.Perf_Get.GetInt64 0.847824295 dup

AndyAyersMS commented 1 year ago

System.Text.Json.Tests.Perf_Get.GetByte

Default [link]

newplot - 2023-04-10T082229 376

51.97%   2.616E+07   Tier-1   [System.Private.CoreLib]Utf8Parser.TryParseByteD(value class System.ReadOnlySpan`1<unsigned int8>,unsigned int8&,int32&)
33.04%   1.663E+07   Tier-1   [MicroBenchmarks]Perf_Get.GetByte()
09.30%   4.68E+06    Tier-1   [System.Private.CoreLib]Utf8Parser.TryParse(value class System.ReadOnlySpan`1<unsigned int8>,unsigned int8&,int32&,wchar)
02.22%   1.12E+06    Tier-1   [System.Text.Json]Utf8JsonReader..ctor(value class System.ReadOnlySpan`1<unsigned int8>,bool,value class System.Text.Json.JsonReaderState)

PGO [link]

newplot - 2023-04-10T082207 409

45.71%   2.297E+07   Tier-1   [System.Text.Json]Utf8JsonReader.TryGetByte(unsigned int8&)
31.92%   1.604E+07   Tier-1   [MicroBenchmarks]Perf_Get.GetByte()
19.36%   9.73E+06    Tier-1   [System.Private.CoreLib]Utf8Parser.TryParseByteD(value class System.ReadOnlySpan`1<unsigned int8>,unsigned int8&,int32&)
00.84%   4.2E+05     Tier-1   [System.Text.Json]Utf8JsonReader..ctor(value class System.ReadOnlySpan`1<unsigned int8>,bool,value class System.Text.Json.JsonReaderState)
00.66%   3.3E+05     Tier-1   [System.Text.Json]Utf8JsonReader.TryGetNumber(value class System.ReadOnlySpan`1<unsigned int8>,int32&)

PGO is unable to inline TryGetByte into Perf_Get.GetByte because the inliner runs out of budget doing a lot of earlier inlines that are less profitable. If I increase the budget to 12x (currently it's 10x) then PGO and non-PGO performance matches.

    [INL00 IL=0003 TR=000894 060003D1] [FAILED: call site: inline exceeds budget] System.Text.Json.Utf8JsonReader:TryGetByte(byref):bool:this

I also noticed Perf_Get.GetByte was jitted twice at instrumented tier0 (#84517)

 827: JIT compiled System.Text.Json.Tests.Perf_Get:GetByte():ubyte:this [Instrumented Tier0, IL size=62, code size=332, hash=0xb3d5b836]
 907: JIT compiled System.Text.Json.Tests.Perf_Get:GetByte():ubyte:this [Instrumented Tier0, IL size=62, code size=299, hash=0xb3d5b836]
 939: JIT compiled System.Text.Json.Tests.Perf_Get:GetByte():ubyte:this [Tier1 with Dynamic PGO, IL size=62, code size=1802, hash=0xb3d5b836]
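For context, the benchmark body is roughly this shape (a simplification, not the exact Perf_Get source); the question the budget settles is whether that TryGetByte call gets inlined:

using System.Text.Json;

public class PerfGetShape
{
    // Assumed payload; the real benchmark parses small numeric JSON values.
    private static readonly byte[] s_json = System.Text.Encoding.UTF8.GetBytes("123");

    public byte GetByte()
    {
        var reader = new Utf8JsonReader(s_json);
        reader.Read();
        // Per the inline log above, this is the call that fails with "inline exceeds budget".
        reader.TryGetByte(out byte value);
        return value;
    }
}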
AndyAyersMS commented 1 year ago

System.Collections.IterateFor.ImmutableArray(Size: 512)

Default [link]

newplot - 2023-04-10T101203 658

PGO [link]

newplot - 2023-04-10T101002 710


This one seems sporadically bimodal. Good results show PGO and default are at par.

My local runs have PGO faster than default. Will defer investigation for a while to see if PGO runs ever go back to the "fast" level.


Latest runs show PGO perf has indeed come back down newplot (18)

AndyAyersMS commented 1 year ago

Exceptions.Handling.MultipleNestedTryCatch_FirstCatches(kind: Hardware)

Default [link]

newplot

PGO [link]

newplot (1)


Also seems bimodal, and there's no real difference between non-PGO and PGO.

AndyAyersMS commented 1 year ago

System.Collections.Sort.Array(Size: 512)

Default [link]

newplot (2)

PGO [link]

newplot (3)


Both fluctuate, but PGO has worse behavior.

By default this test doesn't iterate enough and is mainly measuring R2R codegen. Sadly both PerfView and my tool are unable to resolve the R2R method. Not sure why.

51.77%   3.07E+06    native   System.Private.CoreLib.dll
27.66%   1.64E+06    Tier-1   [System.Private.CoreLib]System.Collections.Generic.GenericArraySortHelper`1[System.Int32].PickPivotAndPartition(value class System.Span`1<!0>)
14.84%   8.8E+05     Tier-1   [System.Private.CoreLib]System.Collections.Generic.GenericArraySortHelper`1[System.Int32].InsertionSort(value class System.Span`1<!0>)
02.53%   1.5E+05     Tier-1   [System.Private.CoreLib]System.Collections.Generic.GenericArraySortHelper`1[System.Int32].IntroSort(value class System.Span`1<!0>,int32)

If I bump the iteration count up to 200 there is a more consistent small regression with PGO:

Method Job Toolchain Size Mean Error StdDev Median Min Max Ratio RatioSD Allocated Alloc Ratio
Array Job-YGWDQU Default 512 4.594 us 0.1309 us 0.5542 us 4.455 us 4.120 us 8.563 us 1.00 0.00 - NA
Array Job-AEEVAG PGO 512 5.128 us 0.1241 us 0.5255 us 5.201 us 2.996 us 7.711 us 1.13 0.16 - NA

And it looks like it is related to the codegen in PickPivotAndPartition.

Perf vs BDN iteration number for PGO: by default the lab does 15 iterations, and IIRC we report the median, so you can see it might toggle back and forth between fast and slow results like we are seeing above, depending on exactly when the switchover happens.

image
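A tiny illustration of that median effect (the numbers are made up, only the shape matters): with 15 iterations, the reported median flips between the two levels depending on which iteration the faster code shows up in.

using System;
using System.Linq;

class MedianToggle
{
    // Assume the first `slowIters` iterations measure the slower code and the rest the faster code.
    static double ReportedMedian(int slowIters) =>
        Enumerable.Range(0, 15)
                  .Select(i => i < slowIters ? 5.1 : 4.6) // us, illustrative values
                  .OrderBy(x => x)
                  .ElementAt(7);                          // median of 15 samples

    static void Main()
    {
        Console.WriteLine(ReportedMedian(5));  // 4.6 -> this run reports the "fast" level
        Console.WriteLine(ReportedMedian(10)); // 5.1 -> this run reports the "slow" level
    }
}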

Seems like PGO behavior changed as a result of https://github.com/dotnet/runtime/pull/80769.

Issue seems to be that fgReorderBlocks sees slightly different profile weights and leaves a colder block BB10 intermixed with hotter ones:

;; default (before)

BB10 [0092]  1       BB08                  0.31   33    [037..038)-> BB11 (always)                     i IBC 
BB07 [0056]  1       BB05                  0.31   33    [030..031)-> BB08 (always)                     i IBC 
BB04 [0020]  1       BB02                  0.31   33    [029..02A)-> BB05 (always)                     i IBC 
BB20 [0014]  2       BB11,BB18             1     108    [113..11D)-> BB22 ( cond )                     i IBC 
BB21 [0015]  1       BB20                  1     108    [11D..125)                                     i IBC 
BB22 [0016]  2       BB20,BB21             1     108    [125..137)        (return)                     i IBC 

;; default (after)

BB20 [0014]  2       BB11,BB18             1     108    [113..11D)-> BB22 ( cond )                     i IBC 
BB21 [0015]  1       BB20                  1     108    [11D..125)                                     i IBC 
BB22 [0016]  2       BB20,BB21             1     108    [125..137)        (return)                     i IBC 
BB10 [0092]  1       BB08                  0.31   33    [037..038)-> BB11 (always)                     i IBC 
BB07 [0056]  1       BB05                  0.31   33    [030..031)-> BB08 (always)                     i IBC 
BB04 [0020]  1       BB02                  0.31   33    [029..02A)-> BB05 (always)                     i IBC 

;; pgo (before)

BB10 [0092]  1       BB08                  0.33   19513    [037..038)-> BB11 (always)                     i IBC 
BB07 [0056]  1       BB05                  0.33   19513    [030..031)-> BB08 (always)                     i IBC 
BB04 [0020]  1       BB02                  0.33   19513    [029..02A)-> BB05 (always)                     i IBC 
BB20 [0014]  2       BB11,BB18             1      59808    [113..11D)-> BB22 ( cond )                     i IBC 
BB21 [0015]  1       BB20                  0.98   58720    [11D..125)                                     i IBC 
BB22 [0016]  2       BB20,BB21             1      59808    [125..137)        (return)                     i IBC 

;; pgo (after)

BB20 [0014]  2       BB11,BB18             1      59808    [113..11D)-> BB22 ( cond )                     i IBC 
BB23 [0193]  1       BB20                  0.98            [???..???)-> BB21 (always)                     internal 
BB10 [0092]  1       BB08                  0.33   19513    [037..038)-> BB11 (always)                     i IBC 
BB21 [0015]  1       BB23                  0.98   58720    [11D..125)                                     i IBC 
BB22 [0016]  2       BB20,BB21             1      59808    [125..137)        (return)                     i IBC 
BB07 [0056]  1       BB05                  0.33   19513    [030..031)-> BB08 (always)                     i IBC 
BB04 [0020]  1       BB02                  0.33   19513    [029..02A)-> BB05 (always)                     i IBC 

The default case is using static PGO; from what I can tell all the counts are proper so this is just a divergence in profile data leading to a change in code layout.

There doesn't seem to be an easy way to tweak this algorithm, so going to leave the analysis here and move on.

AndyAyersMS commented 1 year ago

System.Text.Json.Serialization.Tests.WriteJson<ImmutableDictionary<String, String>>.SerializeToWriter(Mode: SourceGen)

Default [link]

newplot (5)

PGO [link]

newplot (6)

Locally

Method Job Toolchain Mode Mean Error StdDev Median Min Max Ratio RatioSD Allocated Alloc Ratio
SerializeToWriter Job-GYSUOG Default SourceGen 9.702 us 0.4368 us 0.5030 us 9.678 us 9.033 us 11.18 us 1.00 0.00 - NA
SerializeToWriter Job-HDVYVV PGO SourceGen 13.188 us 0.4572 us 0.5265 us 13.011 us 12.641 us 14.62 us 1.36 0.07 - NA

Suspect this is another case where we run out of budget during inlining. The root level method that contains all the benchmark bits is MicroBenchmarks.Serializers.SystemTextJsonSourceGeneratedContext:ImmutableDictionaryStringStringSerializeHandler(System.Text.Json.Utf8JsonWriter,System.Collections.Immutable.ImmutableDictionary`2[System.String,System.String])

and the respective ETL profiles are:

;; default

02.56%   1.31E+06    ?        Unknown
20.78%   1.065E+07   Tier-1   [MicroBenchmarks]SystemTextJsonSourceGeneratedContext.ImmutableDictionaryStringStringSerializeHandler(class System.Text.Json.Utf8JsonWriter,class System.Collections.Immutable.ImmutableDictionary`2<class System.String,class System.String>)
12.69%   6.5E+06     Tier-1   [System.Text.Json]Utf8JsonWriter.WriteStringMinimized(value class System.ReadOnlySpan`1<wchar>,value class System.ReadOnlySpan`1<wchar>)
12.57%   6.44E+06    Tier-1   [System.Text.Encodings.Web]OptimizedInboxTextEncoder.GetIndexOfFirstCharToEncodeSsse3(wchar*,unsigned int)
08.39%   4.3E+06     Tier-1   [System.Private.CoreLib]Utf8.FromUtf16(value class System.ReadOnlySpan`1<wchar>,value class System.Span`1<unsigned int8>,int32&,int32&,bool,bool)
07.59%   3.89E+06    Tier-1   [System.Private.CoreLib]Ascii.NarrowUtf16ToAscii(wchar*,unsigned int8*,unsigned int)
05.35%   2.74E+06    Tier-1   [System.Private.CoreLib]Utf8Utility.TranscodeToUtf8(wchar*,int32,unsigned int8*,int32,wchar*&,unsigned int8*&)
05.21%   2.67E+06    Tier-1   [System.Collections.Immutable]System.Collections.Immutable.SortedInt32KeyNode`1+Enumerator[System.Collections.Immutable.ImmutableDictionary`2+HashBucket[System.__Canon,System.__Canon]].PushLeft(class System.Collections.Immutable.SortedInt32KeyNode`1<!0>)
04.37%   2.24E+06    Tier-1   [System.Text.Encodings.Web]OptimizedInboxTextEncoder.GetIndexOfFirstCharToEncode(value class System.ReadOnlySpan`1<wchar>)
04.25%   2.18E+06    Tier-1   [System.Collections.Immutable]System.Collections.Immutable.ImmutableDictionary`2+HashBucket+Enumerator[System.__Canon,System.__Canon].MoveNext()
03.40%   1.74E+06    Tier-1   [System.Text.Encodings.Web]DefaultJavaScriptEncoder.FindFirstCharacterToEncode(wchar*,int32)
03.28%   1.68E+06    Tier-1   [System.Private.CoreLib]Ascii.NarrowUtf16ToAscii_Intrinsified(wchar*,unsigned int8*,unsigned int)
03.10%   1.59E+06    Tier-1   [System.Collections.Immutable]System.Collections.Immutable.SecurePooledObject`1[System.__Canon].Use(!!0&)
02.05%   1.05E+06    Tier-1   [System.Collections]System.Collections.Generic.Stack`1[System.Collections.Immutable.RefAsValueType`1[System.__Canon]].Pop()
02.03%   1.04E+06    Tier-1   [System.Collections.Immutable]System.Collections.Immutable.SecurePooledObject`1[System.__Canon].IsOwned(!!0&)

;; pgo

02.87%   2.07E+06    ?        Unknown
16.01%   1.154E+07   Tier-1   [System.Text.Json]Utf8JsonWriter.WriteString(class System.String,class System.String)
15.82%   1.14E+07    Tier-1   [System.Text.Encodings.Web]OptimizedInboxTextEncoder.GetIndexOfFirstCharToEncode(value class System.ReadOnlySpan`1<wchar>)
08.81%   6.35E+06    Tier-1   [System.Collections.Immutable]System.Collections.Immutable.SortedInt32KeyNode`1+Enumerator[System.Collections.Immutable.ImmutableDictionary`2+HashBucket[System.__Canon,System.__Canon]].MoveNext()
07.15%   5.15E+06    Tier-1   [System.Collections.Immutable]System.Collections.Immutable.ImmutableDictionary`2+Enumerator[System.__Canon,System.__Canon].MoveNext()
05.70%   4.11E+06    Tier-1   [System.Collections.Immutable]System.Collections.Immutable.SortedInt32KeyNode`1[System.Collections.Immutable.ImmutableDictionary`2+HashBucket[System.__Canon,System.__Canon]].get_Value()
05.61%   4.04E+06    Tier-1   [System.Private.CoreLib]Ascii.NarrowUtf16ToAscii(wchar*,unsigned int8*,unsigned int)
05.54%   3.99E+06    Tier-1   [System.Private.CoreLib]Utf8.FromUtf16(value class System.ReadOnlySpan`1<wchar>,value class System.Span`1<unsigned int8>,int32&,int32&,bool,bool)
05.07%   3.65E+06    Tier-1   [System.Collections.Immutable]System.Collections.Immutable.SecurePooledObject`1[System.__Canon].IsOwned(!!0&)
04.88%   3.52E+06    Tier-1   [System.Collections.Immutable]System.Collections.Immutable.ImmutableDictionary`2+HashBucket+Enumerator[System.__Canon,System.__Canon].MoveNext()
04.44%   3.2E+06     native   coreclr.dll
03.84%   2.77E+06    Tier-1   [System.Private.CoreLib]Utf8Utility.TranscodeToUtf8(wchar*,int32,unsigned int8*,int32,wchar*&,unsigned int8*&)
03.64%   2.62E+06    Tier-1   [System.Collections.Immutable]System.Collections.Immutable.SecurePooledObject`1[System.__Canon].Use(!!0&)
03.03%   2.18E+06    Tier-1   [System.Collections.Immutable]System.Collections.Immutable.SortedInt32KeyNode`1+Enumerator[System.Collections.Immutable.ImmutableDictionary`2+HashBucket[System.__Canon,System.__Canon]].get_Current()
02.44%   1.76E+06    Tier-1   [System.Private.CoreLib]Ascii.NarrowUtf16ToAscii_Intrinsified(wchar*,unsigned int8*,unsigned int)
02.01%   1.45E+06    Tier-1   [MicroBenchmarks]SystemTextJsonSourceGeneratedContext.ImmutableDictionaryStringStringSerializeHandler(class System.Text.Json.Utf8JsonWriter,class System.Collections.Immutable.ImmutableDictionary`2<class System.String,class System.String>)
00.97%   7E+05       Tier-1   [System.Collections]System.Collections.Generic.Stack`1[System.Collections.Immutable.RefAsValueType`1[System.__Canon]].Pop()
00.93%   6.7E+05     Tier-1   [System.Collections.Immutable]System.Collections.Immutable.SortedInt32KeyNode`1+Enumerator[System.Collections.Immutable.ImmutableDictionary`2+HashBucket[System.__Canon,System.__Canon]].System.Collections.Immutable.ISecurePooledObjectUser.get_PoolUserId()

These are tricky to match up because of drastic inlining differences, but if we just look at top level call sites in the root method, we see

;; default

Inlines into 06001DF8 [via ExtendedDefaultPolicy] MicroBenchmarks.Serializers.SystemTextJsonSourceGeneratedContext:ImmutableDictionaryStringStringSerializeHandler(System.Text.Json.Utf8JsonWriter,System.Collections.Immutable.ImmutableDictionary`2[System.String,System.String]):
  [INL00 IL=0004 TR=000045 0600069F] [FAILED: call site: unprofitable inline] System.Text.Json.Utf8JsonWriter:WriteNullValue():this
  [INL01 IL=0011 TR=000005 06000564] [INLINED: callee: below ALWAYS_INLINE size] System.Text.Json.Utf8JsonWriter:WriteStartObject():this
  [INL02 IL=0017 TR=000007 0600027A] [INLINED: callee: below ALWAYS_INLINE size] System.Collections.Immutable.ImmutableDictionary`2[System.__Canon,System.__Canon]:GetEnumerator():System.Collections.Immutable.ImmutableDictionary`2+Enumerator[System.__Canon,System.__Canon]:this
  [INL23 IL=0027 TR=000019 060002F7] [INLINED: call site: profitable inline] System.Collections.Immutable.ImmutableDictionary`2+Enumerator[System.__Canon,System.__Canon]:get_Current():System.Collections.Generic.KeyValuePair`2[System.__Canon,System.__Canon]:this
  [INL33 IL=0036 TR=000026 06007DD0] [INLINED: callee: below ALWAYS_INLINE size] System.Collections.Generic.KeyValuePair`2[System.__Canon,System.__Canon]:get_Key():System.__Canon:this
  [INL34 IL=0043 TR=000030 06007DD1] [INLINED: callee: below ALWAYS_INLINE size] System.Collections.Generic.KeyValuePair`2[System.__Canon,System.__Canon]:get_Value():System.__Canon:this
  [INL35 IL=0048 TR=000036 06000635] [INLINED: call site: profitable inline] System.Text.Json.Utf8JsonWriter:WriteString(System.String,System.String):this
  [INL81 IL=0055 TR=000012 060002F9] [INLINED: call site: profitable inline] System.Collections.Immutable.ImmutableDictionary`2+Enumerator[System.__Canon,System.__Canon]:MoveNext():bool:this
  [INL101 IL=0072 TR=000041 060002FB] [INLINED: call site: profitable inline] System.Collections.Immutable.ImmutableDictionary`2+Enumerator[System.__Canon,System.__Canon]:Dispose():this
  [INL109 IL=0079 TR=000038 0600057A] [INLINED: callee: below ALWAYS_INLINE size] System.Text.Json.Utf8JsonWriter:WriteEndObject():this

Budget: initialTime=315, finalTime=3147, initialBudget=3150, currentBudget=3150
Budget: discretionary inline caused a force inline
Budget: initialSize=2069, finalSize=18734

;; pgo

Inlines into 06001DF8 [via ExtendedDefaultPolicy] MicroBenchmarks.Serializers.SystemTextJsonSourceGeneratedContext:ImmutableDictionaryStringStringSerializeHandler(System.Text.Json.Utf8JsonWriter,System.Collections.Immutable.ImmutableDictionary`2[System.String,System.String]):
  [INL00 IL=0004 TR=000045 0600069F] [FAILED: call site: unprofitable inline] System.Text.Json.Utf8JsonWriter:WriteNullValue():this
  [INL01 IL=0011 TR=000005 06000564] [INLINED: callee: below ALWAYS_INLINE size] System.Text.Json.Utf8JsonWriter:WriteStartObject():this
  [INL56 IL=0017 TR=000007 0600027A] [INLINED: callee: below ALWAYS_INLINE size] System.Collections.Immutable.ImmutableDictionary`2[System.__Canon,System.__Canon]:GetEnumerator():System.Collections.Immutable.ImmutableDictionary`2+Enumerator[System.__Canon,System.__Canon]:this
  [INL82 IL=0027 TR=000019 060002F7] [INLINED: call site: profitable inline] System.Collections.Immutable.ImmutableDictionary`2+Enumerator[System.__Canon,System.__Canon]:get_Current():System.Collections.Generic.KeyValuePair`2[System.__Canon,System.__Canon]:this
  [INL87 IL=0036 TR=000026 06007DD0] [INLINED: callee: below ALWAYS_INLINE size] System.Collections.Generic.KeyValuePair`2[System.__Canon,System.__Canon]:get_Key():System.__Canon:this
  [INL88 IL=0043 TR=000030 06007DD1] [INLINED: callee: below ALWAYS_INLINE size] System.Collections.Generic.KeyValuePair`2[System.__Canon,System.__Canon]:get_Value():System.__Canon:this
  [INL00 IL=0048 TR=000036 06000635] [FAILED: call site: inline exceeds budget] System.Text.Json.Utf8JsonWriter:WriteString(System.String,System.String):this
  [INL00 IL=0055 TR=000012 060002F9] [FAILED: call site: inline exceeds budget] System.Collections.Immutable.ImmutableDictionary`2+Enumerator[System.__Canon,System.__Canon]:MoveNext():bool:this
  [INL00 IL=0072 TR=000041 060002FB] [FAILED: call site: inline exceeds budget] System.Collections.Immutable.ImmutableDictionary`2+Enumerator[System.__Canon,System.__Canon]:Dispose():this
  [INL89 IL=0079 TR=000038 0600057A] [INLINED: callee: below ALWAYS_INLINE size] System.Text.Json.Utf8JsonWriter:WriteEndObject():this

Budget: initialTime=315, finalTime=3137, initialBudget=3150, currentBudget=3150
Budget: discretionary inline caused a force inline
Budget: initialSize=2069, finalSize=16857

So with PGO we run out of budget for a couple of top-level methods that feature prominently in the profile.

As before, let me try upping the budget to see what that does... I am not suggesting this as a fix, but it might tempt us to look at some kind of priority-first inlining process where we try to make sure we're maximizing the effectiveness of the budget.

Looks like a budget value of 18 is sufficient to regain perf. That's quite a jump from where we are now. The non-PGO version is 2811 bytes; the PGO version with budget=18 is 4664 bytes.
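To make the priority-first idea concrete, here is a rough sketch of ordering candidates by profile weight before spending the budget; this is purely illustrative C# and bears no resemblance to the JIT's actual (C++) inliner data structures:

using System.Collections.Generic;
using System.Linq;

record InlineCandidate(string Callee, double ProfileWeight, int SizeCost);

static class PriorityFirstInlining
{
    // Spend a fixed size budget on the hottest call sites first, instead of walking
    // call sites in IL order and letting early, colder inlines exhaust the budget.
    public static List<string> Choose(IEnumerable<InlineCandidate> candidates, int budget)
    {
        var chosen = new List<string>();
        foreach (var c in candidates.OrderByDescending(c => c.ProfileWeight))
        {
            if (c.SizeCost <= budget)
            {
                budget -= c.SizeCost;
                chosen.Add(c.Callee);
            }
        }
        return chosen;
    }
}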

AndyAyersMS commented 1 year ago

System.Collections.ContainsKeyTrue<Int32, Int32>.ConcurrentDictionary(Size: 512)

Default [link]

newplot (8)

PGO [link]

newplot (9)


The ContainsKey tests are notoriously noisy, but it looks like there were some regressions in PGO that now make it run slower on average.

The first one, on Feb 9, was likely a result of https://github.com/dotnet/runtime/pull/81557; not clear yet why this helped non-PGO but hurt PGO.

The other jumps are harder to pin down; the one on March 26 is seemingly because of https://github.com/dotnet/runtime/pull/83484, which we have seen implicated in other changes, but it is hard to see a connection.

Locally I see PGO as frequently faster or at par.

Method Job Toolchain Size Mean Error StdDev Median Min Max Ratio RatioSD Allocated Alloc Ratio
ConcurrentDictionary Job-WSQNSB \base-rel\corerun.exe 512 2.478 us 0.1069 us 0.1231 us 2.451 us 2.312 us 2.703 us 1.00 0.00 - NA
ConcurrentDictionary Job-KIYTHA \diff-rel\corerun.exe 512 2.223 us 0.0630 us 0.0725 us 2.218 us 2.123 us 2.383 us 0.90 0.04 - NA

Profiling shows all time is in these two methods

87.65%   5.464E+07   Tier-1   [System.Collections.Concurrent]System.Collections.Concurrent.ConcurrentDictionary`2[System.Int32,System.Int32].TryGetValue(!0,!1&)
11.41%   7.11E+06    Tier-1   [MicroBenchmarks]System.Collections.ContainsKeyTrue`2[System.Int32,System.Int32].ConcurrentDictionary()

Going to defer further investigation on this one.


Latest runs show PGO behaving better: newplot - 2023-04-19T120759 506

AndyAyersMS commented 1 year ago

Benchstone.BenchI.MulMatrix.Test

Default [link]

newplot (10)

PGO [link]

newplot (11)



All the time in this benchmark is in one method:

;; default

99.16%   4.365E+07   Tier-1   [MicroBenchmarks]MulMatrix.Inner(int32[][],int32[][],int32[][])

;; pgo

99.41%   7.545E+07   Tier-1   [MicroBenchmarks]MulMatrix.Inner(int32[][],int32[][],int32[][])

With PGO we actually jit this method 7 times (#84517 tracks fixing some of this). Only the last one matters for benchmark perf; all the others are no longer active when BDN is measuring (though likely some are active during the warmup/pilot stages and might perturb BDN's choice of iteration strategy; however, measuring with --apples --iterationCount 20 still shows a regression, so any BDN strategy variance is a second-order effect here).

 832: JIT compiled Benchstone.BenchI.MulMatrix:Inner(int[][],int[][],int[][]) [Instrumented Tier0, IL size=435, code size=2943, hash=0x482bb336]
 833: JIT compiled Benchstone.BenchI.MulMatrix:Inner(int[][],int[][],int[][]) [Tier1-OSR @0x60 with Dynamic PGO, IL size=435, code size=1323, hash=0x482bb336]
 834: JIT compiled Benchstone.BenchI.MulMatrix:Inner(int[][],int[][],int[][]) [Tier1-OSR @0x2a with Dynamic PGO, IL size=435, code size=1529, hash=0x482bb336]
 835: JIT compiled Benchstone.BenchI.MulMatrix:Inner(int[][],int[][],int[][]) [Instrumented Tier0, IL size=435, code size=2943, hash=0x482bb336]
 836: JIT compiled Benchstone.BenchI.MulMatrix:Inner(int[][],int[][],int[][]) [Tier1-OSR @0x60 with Dynamic PGO, IL size=435, code size=1323, hash=0x482bb336]
 837: JIT compiled Benchstone.BenchI.MulMatrix:Inner(int[][],int[][],int[][]) [Tier1-OSR @0x2a with Dynamic PGO, IL size=435, code size=1547, hash=0x482bb336]
 838: JIT compiled Benchstone.BenchI.MulMatrix:Inner(int[][],int[][],int[][]) [Tier1 with Dynamic PGO, IL size=435, code size=1194, hash=0x482bb336]

The dynamic profile only reflects the first loop or two of the method, because those loops are hot enough that we escape to OSR before they finish iterating. Thus at Tier1 we see an incomplete profile where many blocks are zero that we know must be executed. This is likely the root cause of the regression from https://github.com/dotnet/runtime/pull/70941; before then we would instrument the OSR method and so have a more representative picture of profile data.

-----------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds           weight       IBC  lp [IL range]     [jump]      [EH region]         [flags]
-----------------------------------------------------------------------------------------------------------------------------------------
BB01 [0000]  1                           100        100    [000..004)-> BB06 (always)                     IBC 
BB02 [0001]  1       BB06                  100k  100000    [004..008)-> BB04 (always)                     bwd bwd-target IBC 
BB03 [0002]  1       BB04                 8035k 8034927    [008..02A)                                     bwd bwd-target IBC 
BB04 [0003]  2       BB02,BB03            8135k 8134927    [02A..02F)-> BB03 ( cond )                     bwd bwd-src IBC 
BB05 [0004]  1       BB04                  100k  100000    [02F..033)                                     bwd IBC 
BB06 [0005]  2       BB01,BB05             100k  100000    [033..038)-> BB02 ( cond )                     bwd bwd-src IBC 
BB07 [0006]  1       BB06                  0          0    [038..03C)-> BB15 (always)                     rare IBC 
BB08 [0007]  1       BB15                  0          0    [03C..040)-> BB13 (always)                     rare bwd bwd-target IBC 
BB09 [0008]  1       BB13                  0          0    [040..044)-> BB11 (always)                     rare bwd bwd-target IBC 
BB10 [0009]  1       BB11                  0          0    [044..060)                                     rare bwd bwd-target IBC 
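The shape producing a profile like that is roughly this (an illustrative sketch, not the actual MulMatrix source): only the first loop nest runs under instrumentation before we escape to OSR, so all the later blocks read as zero even though they clearly execute.

public static class PartialProfileShape
{
    public static long Sink;

    public static void Inner(int n)
    {
        long acc = 0;

        // Hot enough that the instrumented Tier0 code escapes to OSR from inside this
        // nest, so only these blocks accumulate real counts.
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                acc += (long)i * j;

        // Never reached while instrumentation is active; the profile reports zero here
        // even though this loop obviously runs once tiering settles.
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                acc -= i + j;

        Sink = acc;
    }
}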

This looks like a case that should have been handled by https://github.com/dotnet/runtime/pull/84312 -- because the instrumented methods never return, profile incorporation thinks the entry block weight is zero and schedules repairs; these run but fail to populate the remainder of the method with interesting counts.

Solver: entry block weight is zero
BB01 has zero weight , setting outgoing likelihoods heuristically
...
Incorporated count data had inconsistencies; repairing profile...
Enter blocks: BB01 
...
ccp backedge BB32 (0.5) -> BB33 likelihood 1
For loop at BB33 cyclic weight is 0.5 cyclic probability is 2
...
ccp backedge BB05 (1) -> BB06 likelihood 1
For loop at BB06 cyclic weight is 0.999 cyclic probability is 1000 [capped]
...
Computing block weights
cbw: BB01 :: 100
cbw (loop head): BB06 :: 100000
cbw: BB02 :: 100000
cbw (loop head): BB04 :: 8134927
cbw: BB03 :: 8034927
cbw: BB05 :: 100000
cbw: BB07 :: 0
cbw (loop head): BB15 :: 0
cbw: BB08 :: 0
cbw (loop head): BB13 :: 0
cbw: BB09 :: 0
...

The problem seems to be that we still use the edge likelihoods laid down by profile incorporation, and so treat loops whose exits were never observed as if they truly never exit, despite seeing that they have high $C_p$ values (which we cap). The capping decision does not get fed back into an edge likelihood adjustment.
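For reference, the capped values in the log above are consistent with the usual relation between cyclic weight and cyclic probability, $C_p = 1 / (1 - w_{cyclic})$: a cyclic weight of 0.5 gives $C_p = 2$, and 0.999 gives $C_p = 1000$, which is where the cap kicks in.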

Some ideas on how to resolve this:

If we go back and adjust we may also run into complications from cases like

while (true)
{
      if (p) throw new Exception();
      ...
}

where the loop is postdominated by a throw; synthesis repair will want to make the throw have nonzero weight. Ultimately, we will have to permit throws to have nonzero weights.


Testing some enhancements to profile repair that look promising on this case, and don't seem too disruptive overall.


Update 4/16: official charts are slow to update but I can pull the data myself now. The PGO regression is largely mitigated:

image

And now the "official chart"

newplot (27)

AndyAyersMS commented 1 year ago

LinqBenchmarks.Count00ForX

Default [link]

Hard to tell from this but default is at about 122 with error bars extending +/- 10. newplot (12)

PGO [link] -- note the y-axis scale differs from above.

PGO is pretty consistently at 163.

newplot (13)

At first blush I can't repro this locally. My machine is somewhat faster than the lab, so I assume it's that the default case is running slowly for me locally.

Method Job Toolchain Mean Error StdDev Median Min Max Ratio RatioSD Allocated Alloc Ratio
Count00ForX Job-SGUCJG Default 131.5 ms 26.31 ms 30.30 ms 153.0 ms 94.10 ms 162.7 ms 1.00 0.00 200 B 1.00
Count00ForX Job-KTJAFK PGO 122.8 ms 1.85 ms 1.54 ms 122.8 ms 119.94 ms 126.3 ms 0.83 0.13 200 B 1.00

However looking at the distribution of measurements

;; default
-------------------- Histogram --------------------
[ 79.447 ms ; 111.213 ms) | @@@@@@@@
[111.213 ms ; 142.932 ms) | @
[142.932 ms ; 177.358 ms) | @@@@@@@@@@@
---------------------------------------------------

;; pgo
-------------------- Histogram --------------------
[119.082 ms ; 123.442 ms) | @@@@@@@@@
[123.442 ms ; 127.124 ms) | @@@@
---------------------------------------------------

it looks like the default case is just getting unlucky and hitting its slow variant more often for me, and there's a set of runs that are consistently faster than PGO. So will keep digging.


It looks like both modes end up measuring OSR code, but default switches to Tier1 and PGO never makes it there, in part because we redo Tier0/OSR for the main method (#84319, fix is close)

;; default

65.49%   3.249E+07   OSR      [MicroBenchmarks]LinqBenchmarks.Count00ForX()
33.72%   1.673E+07   Tier-1   [MicroBenchmarks]LinqBenchmarks.Count00ForX()

009 4980.124 -- 5274.105 : 293.981
010 5275.485 -- 5576.441 : 300.955
011 5577.827 -- 5767.517 : 189.690
012 5768.957 -- 5953.575 : 184.618

;; pgo

78.87%   2.952E+07   OSR      [MicroBenchmarks]LinqBenchmarks.Count00ForX()
19.77%   7.4E+06     OSR      [MicroBenchmarks]LinqBenchmarks.Count00ForX()

(all iterations same time)

If you crank up the iteration count then PGO looks quite a bit better. Here we are with 100 iterations:

Method Job Toolchain Mean Error StdDev Median Min Max Ratio RatioSD Allocated Alloc Ratio
Count00ForX Job-AVMULQ Default 102.08 ms 7.204 ms 21.24 ms 93.39 ms 88.49 ms 161.3 ms 1.00 0.00 200 B 1.00
Count00ForX Job-DJMRVT PGO 71.47 ms 11.541 ms 34.03 ms 50.62 ms 45.31 ms 130.5 ms 0.69 0.29 200 B 1.00

Looking at the benchmark sources

https://github.com/dotnet/performance/blob/d7dac8a7ca12a28d099192f8a901cf8e30361384/src/benchmarks/micro/runtime/Linq/Linq.cs#L291-L308

this is a test where we manually set the iteration count; as a result each invocation of the benchmark method takes a long time, and so BDN decides not to call it very often per iteration (just twice):

// AfterActualRun
WorkloadResult   1: 2 op, 246812800.00 ns, 123.4064 ms/op
WorkloadResult   2: 2 op, 245786100.00 ns, 122.8931 ms/op

So to reach the 30 call threshold requires some number of iterations (looks like we do 10 or so warming up).

Once we fix #84517 I expect this benchmark to improve some, but we might also consider removing the internal iterations and letting BDN pick the counts for us.
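For reference, the benchmark is roughly this shape (a simplification, not the actual Linq.cs source); the manual internal iteration count is what makes each invocation so long:

using System.Linq;

public class LinqCountShape
{
    // Illustrative values; the real test bakes its own iteration count into the method.
    private const int InternalIterations = 1_000_000;
    private readonly int[] _data = Enumerable.Range(0, 100).ToArray();

    public bool Count00ForX()
    {
        int count = 0;
        for (int i = 0; i < InternalIterations; i++)
            count += _data.Count(x => x % 2 == 0);
        return count > 0;
    }
}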

AndyAyersMS commented 1 year ago

Benchstone.MDBenchI.MDLogicArray.Test

Default [link]

newplot (14)

PGO [link]

newplot (15)

Recent regression is likely from: https://github.com/dotnet/runtime/pull/84312


This test also suffers from repeated tiering up (https://github.com/dotnet/runtime/issues/84517)

;; pgo
 830: JIT compiled Benchstone.MDBenchI.MDLogicArray:Test():bool:this [Instrumented Tier0, IL size=63, code size=258, hash=0x5774c8bd]
 831: JIT compiled Benchstone.MDBenchI.MDLogicArray:Inner(byref):bool [Instrumented Tier0, IL size=139, code size=557, hash=0x420d3983]
 834: JIT compiled Benchstone.MDBenchI.MDLogicArray:Inner(byref):bool [Tier1-OSR @0x39 with Dynamic PGO, IL size=139, code size=234, hash=0x420d3983]
 835: JIT compiled Benchstone.MDBenchI.MDLogicArray:Inner(byref):bool [Tier1-OSR @0x21 with Dynamic PGO, IL size=139, code size=326, hash=0x420d3983]
 836: JIT compiled Benchstone.MDBenchI.MDLogicArray:Inner(byref):bool [Instrumented Tier0, IL size=139, code size=557, hash=0x420d3983]
 837: JIT compiled Benchstone.MDBenchI.MDLogicArray:Inner(byref):bool [Tier1-OSR @0x39 with Dynamic PGO, IL size=139, code size=234, hash=0x420d3983]
 838: JIT compiled Benchstone.MDBenchI.MDLogicArray:Inner(byref):bool [Tier1-OSR @0x21 with Dynamic PGO, IL size=139, code size=326, hash=0x420d3983]
 839: JIT compiled Benchstone.MDBenchI.MDLogicArray:Inner(byref):bool [Tier1 with Dynamic PGO, IL size=139, code size=236, hash=0x420d3983]
 906: JIT compiled Benchstone.MDBenchI.MDLogicArray:Test():bool:this [Tier1-OSR @0x35 with Dynamic PGO, IL size=63, code size=353, hash=0x5774c8bd]

However not clear that is the root cause of perf issues. Profiles show

;; default

99.08%   4.95E+07    Tier-1   [MicroBenchmarks]MDLogicArray.Inner(value class Workarea&)

;; pgo

61.53%   3.985E+07   OSR      [MicroBenchmarks]MDLogicArray.Test()
37.67%   2.44E+07    Tier-1   [MicroBenchmarks]MDLogicArray.Inner(value class Workarea&)

(here Test calls Inner in a loop)

So with PGO that final OSR compile evidently inlines Inner into Test and the CQ of the computations in Inner suffers.

Suspect this is the same issue as in MulMatrix [above] where profile data is lacking because it only comes from partially executing Tier0 methods:

Inline @[000031] Finishing PHASE Profile incorporation
Trees after Profile incorporation

-----------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds           weight         IBC  lp [IL range]     [jump]      [EH region]         [flags]
-----------------------------------------------------------------------------------------------------------------------------------------
BB01 [0007]  1                           10046.     10046    [000..00B)-> BB06 (always)                     IBC 
BB02 [0008]  1       BB06                10046k  10046000    [00B..00F)-> BB04 (always)                     bwd bwd-target IBC 
BB03 [0009]  1       BB04                504375k 504374837    [00F..021)                                     bwd bwd-target IBC 
BB04 [0010]  2       BB02,BB03           514421k 514420837    [021..026)-> BB03 ( cond )                     bwd bwd-src IBC 
BB05 [0011]  1       BB04                10046k  10046000    [026..02A)                                     bwd IBC 
BB06 [0012]  2       BB01,BB05           10046k  10046000    [02A..02F)-> BB02 ( cond )                     bwd bwd-src IBC 
BB07 [0013]  1       BB06                  0            0    [02F..033)-> BB14 (always)                     rare IBC 
BB08 [0014]  1       BB14                  0            0    [033..037)-> BB12 (always)                     rare bwd bwd-target IBC 
BB09 [0015]  1       BB12                  0            0    [037..039)                                     rare bwd bwd-target IBC 
BB10 [0016]  2       BB09,BB10             0            0    [039..06C)-> BB10 ( cond )                     rare bwd bwd-target bwd-src IBC 
BB11 [0017]  1       BB10                  0            0    [06C..070)                                     rare bwd IBC 

So let's try with the fix from #84817:

Method Job Toolchain Mean Error StdDev Median Min Max Ratio RatioSD Allocated Alloc Ratio
MDLogicArray Job-INDUVT Default 320.6 ms 7.76 ms 8.93 ms 320.9 ms 308.0 ms 337.1 ms 1.00 0.00 10.59 KB 1.00
MDLogicArray Job-BHAQKC PGO 406.5 ms 4.25 ms 3.98 ms 404.8 ms 400.7 ms 413.9 ms 1.29 0.03 10.59 KB 1.00
MDLogicArray Job-DVPVHR With #84817 366.4 ms 5.87 ms 6.03 ms 365.1 ms 358.7 ms 379.0 ms 1.14 0.03 10.59 KB 1.00

Things look better but still not at par.

Will move on to another benchmark; may come back to this one.


Lab charts are finally getting updated, here's the latest

newplot - 2023-04-19T120337 911

AndyAyersMS commented 1 year ago

Span.Sorting.QuickSortSpan(Size: 512)

Default [link]

newplot (16)

Recent improvement looks like it was from https://github.com/dotnet/runtime/pull/84213

PGO [link]

newplot (17) Small regression in Jan looks like it came from https://github.com/dotnet/runtime/pull/80481


Per profiling these actually spend a substantial amount of time in Tier0 code (!)

;; default

55.66%   2.31E+06    Tier-0   [MicroBenchmarks]Sorting.TestQuickSortSpan(value class System.Span`1<int32>)
18.80%   7.8E+05     native   clrjit.dll
15.18%   6.3E+05     Tier-1   [MicroBenchmarks]Sorting.TestQuickSortSpan(value class System.Span`1<int32>)

;; pgo

37.39%   1.23E+06    Tier-1   [MicroBenchmarks]Sorting.TestQuickSortSpan(value class System.Span`1<int32>)
31.91%   1.05E+06    native   coreclr.dll
26.44%   8.7E+05     Tier-0   [MicroBenchmarks]Sorting.TestQuickSortSpan(value class System.Span`1<int32>)

and not surprisingly they show bimodal distribution of times

;; default
-------------------- Histogram --------------------
[ 5.363 us ; 18.080 us) | @@@@@@@@@@
[18.080 us ; 27.518 us) |
[27.518 us ; 42.777 us) | @@@@@
---------------------------------------------------

;; pgo
-------------------- Histogram --------------------
[ 4.574 us ; 23.682 us) | @@@@@@@@@@@
[23.682 us ; 38.190 us) |
[38.190 us ; 57.297 us) | @@@@
---------------------------------------------------

The benchmark forces invocationCount to be 1000 which is too small to allow tiering to kick in most of the time (note: I ran the above with --apples mode which looks like it does not honor the minimum warmup, so the results may exaggerate how much time is spent in Tier0).

https://github.com/dotnet/performance/blob/d7dac8a7ca12a28d099192f8a901cf8e30361384/src/benchmarks/micro/runtime/Span/Sorting.cs#L19

Adam fixed this a while back via https://github.com/dotnet/performance/pull/1536 but it looks like now we may need to warm up longer.

If I bump up the number of measured intervals I can force the result to be dominated by Tier1 code, so running with --iterationCount 200 we see:

Method Job Toolchain Size Mean Error StdDev Median Min Max Ratio RatioSD Allocated Alloc Ratio
QuickSortSpan Job-JETOCQ \base-rel\corerun.exe 512 12.40 us 1.114 us 4.717 us 10.99 us 8.485 us 37.93 us 1.00 0.00 - NA
QuickSortSpan Job-NKINRZ \diff-rel\corerun.exe 512 14.85 us 1.747 us 7.399 us 13.21 us 10.424 us 56.20 us 1.23 0.33 - NA

and here PGO is slower (especially looking at the median).

This test also sees the repeat tier0/OSR effects noted in https://github.com/dotnet/runtime/issues/84517

So let's look into what happens differently at Tier1....


In Tier1 the optimizations look comparable; it is mainly block layout that differs. PGO does some odd things like jumping to the end of the method and then back into the middle.

Not clear if this is easily fixable right now, so will leave things here and move on.

AndyAyersMS commented 1 year ago

System.IO.Tests.Perf_Path.GetDirectoryName

Default [link]

newplot (20)

PGO [link]

newplot (19)


Locally PGO is a bit faster for me on average (with comparable median)

Method Job Toolchain Mean Error StdDev Median Min Max Ratio RatioSD Gen0 Allocated Alloc Ratio
GetDirectoryName Job-EOGLTR Default 67.79 ns 7.242 ns 7.749 ns 64.68 ns 62.16 ns 87.76 ns 1.00 0.00 0.0140 88 B 1.00
GetDirectoryName Job-YRRSZS PGO 64.70 ns 0.959 ns 0.898 ns 64.69 ns 63.03 ns 66.36 ns 1.00 0.04 0.0138 88 B 1.00

Profile shows

;; default 

03.95%   1.45E+06    ?        Unknown
50.10%   1.84E+07    Tier-1   [System.Private.CoreLib]PathInternal.NormalizeDirectorySeparators(class System.String)
17.64%   6.48E+06    Tier-1   [System.Private.CoreLib]Path.GetDirectoryName(class System.String)
14.95%   5.49E+06    native   coreclr.dll
05.88%   2.16E+06    Tier-1   [System.Private.CoreLib]PathInternal.GetRootLength(value class System.ReadOnlySpan`1<wchar>)
02.91%   1.07E+06    Tier-1   [System.Private.CoreLib]Buffer.Memmove(unsigned int8&,unsigned int8&,unsigned int)

;; pgo

04.71%   1.77E+06    ?        Unknown
42.50%   1.597E+07   Tier-1   [System.Private.CoreLib]PathInternal.NormalizeDirectorySeparators(class System.String)
20.92%   7.86E+06    Tier-1   [System.Private.CoreLib]Path.GetDirectoryName(class System.String)
16.21%   6.09E+06    native   coreclr.dll
04.92%   1.85E+06    Tier-1   [System.Private.CoreLib]PathInternal.GetRootLength(value class System.ReadOnlySpan`1<wchar>)
03.06%   1.15E+06    Tier-1   [System.Private.CoreLib]Buffer.Memmove(unsigned int8&,unsigned int8&,unsigned int)
02.82%   1.06E+06    Tier-1   [System.Private.CoreLib]PathInternal.IsDevice(value class System.ReadOnlySpan`1<wchar>)

where the coreclr.dll native hits are in AllocateStringFastMP_InlineGetThread

Inner loop codegen for NormalizeDirectorySeparators is the same; the rest of the method is different (less inlining, quite different code layout). Similar for GetDirectoryName(System.String) (codegen for the two initial loops looks similar).

Don't see anything obvious here so will move on.

AndyAyersMS commented 1 year ago

System.Collections.CtorFromCollection.SortedSet(Size: 512)

Default [link]

newplot (22)

PGO [link]

newplot (21)


At first glance it looks like default and PGO are similar here. The regression on Jan 18 seems to be from https://github.com/dotnet/runtime/pull/80769 and was flagged by autofiling here https://github.com/dotnet/runtime/issues/84900 but marked as "look again".

Nominal profile

;; default

06.48%   2.85E+06    ?        Unknown
29.27%   1.288E+07   Tier-1   [System.Private.CoreLib]System.Collections.Generic.GenericArraySortHelper`1[System.Int32].PickPivotAndPartition(value class System.Span`1<!0>)
22.98%   1.011E+07   native   coreclr.dll
14.93%   6.57E+06    Tier-1   [System.Private.CoreLib]System.Collections.Generic.GenericArraySortHelper`1[System.Int32].InsertionSort(value class System.Span`1<!0>)
09.70%   4.27E+06    Tier-1   [System.Collections]System.Collections.Generic.SortedSet`1[System.Int32].ConstructRootFromSortedArray(!0[],int32,int32,class Node<!0>)
09.14%   4.02E+06    Tier-1   [System.Collections]System.Collections.Generic.SortedSet`1[System.Int32]..ctor(class System.Collections.Generic.IEnumerable`1<!0>,class System.Collections.Generic.IComparer`1<!0>)
02.84%   1.25E+06    Tier-1   [System.Private.CoreLib]System.Collections.Generic.GenericArraySortHelper`1[System.Int32].IntroSort(value class System.Span`1<!0>,int32)
01.57%   6.9E+05     native   ntoskrnl.exe
01.41%   6.2E+05     Tier-1   [System.Private.CoreLib]System.Collections.Generic.GenericComparer`1[System.Int32].Compare(!0,!0)

coreclr native code hits are in

JIT_TrialAllocSFastMP_InlineGetThread and memset

Will look into that regression as a separate work item.

AndyAyersMS commented 1 year ago

MicroBenchmarks.Serializers.XmlToStream.XmlSerializer

Default [link]

newplot (23)

PGO [link]

newplot (24)


Locally I see PGO as faster

Method Job Toolchain Mean Error StdDev Median Min Max Ratio RatioSD Gen0 Allocated Alloc Ratio
XmlSerializer Job-JNOFJH Default 346.2 us 7.37 us 8.19 us 344.3 us 334.1 us 365.1 us 1.00 0.00 10.7962 66.96 KB 1.00
XmlSerializer Job-EFUXJL PGO 310.6 us 5.51 us 5.16 us 311.5 us 299.6 us 320.4 us 0.89 0.02 9.9751 66.96 KB 1.00

Error bars on the PGO runs suggest perhaps mid-benchmark delivery of Tier1 code, but I don't see anything like that in the profiles:

;; pgo

02.14%   7.6E+05     ?        Unknown
27.32%   9.7E+06     Tier-1   [System.Private.Xml]XmlUtf8RawTextWriter.RawText(wchar*,wchar*)
08.00%   2.84E+06    Tier-1   [System.Private.Xml]XmlWellFormedWriter.WriteStartElement(class System.String,class System.String,class System.String)
07.18%   2.55E+06    Tier-1   [System.Private.Xml]XmlUtf8RawTextWriter.WriteRawWithCharChecking(wchar*,wchar*)
06.39%   2.27E+06    Tier-1   [System.Private.Xml]XsdDateTime.TryFormat(value class System.Span`1<wchar>,int32&)
06.36%   2.26E+06    Tier-1   [System.Private.Xml]XmlWellFormedWriter.WriteEndElement()
05.83%   2.07E+06    Tier-1   [Microsoft.GeneratedCode]XmlSerializationWriterMyEventsListerViewModel.Write3_MyEventsListerItem(class System.String,class System.String,class MicroBenchmarks.Serializers.MyEventsListerItem,bool,bool)
04.87%   1.73E+06    native   coreclr.dll
04.17%   1.48E+06    Tier-1   [System.Private.Xml]XmlUtf8RawTextWriter.WriteElementTextBlock(wchar*,wchar*)
03.38%   1.2E+06     Tier-1   [System.Private.Xml]XmlWellFormedWriter.CheckNCName(class System.String)
03.24%   1.15E+06    Tier-1   [System.Private.Xml]XmlWellFormedWriter.PushNamespaceImplicit(class System.String,class System.String)
02.87%   1.02E+06    Tier-1   [System.Private.Xml]XmlWellFormedWriter.AdvanceState(value class Token)
02.51%   8.9E+05     Tier-1   [System.Private.CoreLib]Hashtable.Remove(class System.Object)
02.03%   7.2E+05     Tier-1   [System.Private.Xml]XsdDateTime.GetYearMonthDay(int32&,int32&,int32&)
01.83%   6.5E+05     Tier-1   [System.Private.Xml]XmlWellFormedWriter.LookupPrefix(class System.String)
01.55%   5.5E+05     Tier-1   [System.Private.Xml]XmlWellFormedWriter.StartElementContent()
01.55%   5.5E+05     Tier-1   [System.Private.CoreLib]Hashtable.Insert(class System.Object,class System.Object,bool)
01.49%   5.3E+05     Tier-1   [System.Private.CoreLib]Hashtable.ContainsKey(class System.Object)
01.21%   4.3E+05     Tier-1   [System.Private.Xml]XmlWellFormedWriter.WriteRaw(class System.String)

Going to assume this one is noise and move on.

AndyAyersMS commented 1 year ago

System.Text.Json.Tests.Perf_Segment.ReadMultiSegmentSequence(segmentSize: 4096, TestCase: Json4KB)

Default [link]

newplot (25)

PGO [link]

newplot (26)

Regression in December was possibly https://github.com/dotnet/runtime/pull/78874 (or https://github.com/dotnet/runtime/pull/78929)

Recent "improvement" has no obvious cause, also (hard to tell from above) perf dipped and then jumped right back up.


Locally I get quite noisy results for PGO:

Method Job Toolchain TestCase segmentSize Mean Error StdDev Median Min Max Ratio RatioSD Allocated Alloc Ratio
ReadMultiSegmentSequence Job-AZGVVK Default Json4KB 4096 5.786 us 0.1012 us 0.0898 us 5.795 us 5.612 us 5.926 us 1.00 0.00 - NA
ReadMultiSegmentSequence Job-GHDQTH PGO Json4KB 4096 7.706 us 0.1137 us 0.0949 us 7.713 us 7.548 us 7.902 us 1.33 0.03 - NA
Method Job Toolchain TestCase segmentSize Mean Error StdDev Median Min Max Ratio RatioSD Allocated Alloc Ratio
ReadMultiSegmentSequence Job-YENBLQ Default Json4KB 4096 6.205 us 0.1507 us 0.1735 us 6.208 us 5.921 us 6.580 us 1.00 0.00 - NA
ReadMultiSegmentSequence Job-PQZKNM PGO Json4KB 4096 5.089 us 0.1780 us 0.2050 us 5.019 us 4.856 us 5.436 us 0.82 0.03 - NA

Iteration-to-iteration numbers look stable though, so it's not something timing-related within the benchmark run itself.

Profile (from run where PGO was slow)

;; default

25.97%   9.91E+06    Tier-1   [System.Text.Json]Utf8JsonReader.SkipWhiteSpace()
19.23%   7.34E+06    Tier-1   [System.Text.Json]Utf8JsonReader.ReadMultiSegment()
12.29%   4.69E+06    Tier-1   [System.Private.CoreLib]IndexOfAnyAsciiSearcher.IndexOfAnyVectorized(unsigned int8&,int32,value class System.Runtime.Intrinsics.Vector128`1<unsigned int8>)
10.61%   4.05E+06    Tier-1   [System.Text.Json]Utf8JsonReader.ConsumeStringMultiSegment()
07.49%   2.86E+06    Tier-1   [System.Text.Json]Utf8JsonReader.ConsumeNextTokenMultiSegment(unsigned int8)
04.59%   1.75E+06    Tier-1   [System.Text.Json]Utf8JsonReader.ConsumeValueMultiSegment(unsigned int8)
03.83%   1.46E+06    Tier-1   [System.Text.Json]Utf8JsonReader.TryGetNumberMultiSegment(value class System.ReadOnlySpan`1<unsigned int8>,int32&)
03.41%   1.3E+06     Tier-1   [System.Text.Json]Utf8JsonReader.ConsumePropertyNameMultiSegment()
02.96%   1.13E+06    Tier-1   [MicroBenchmarks]Perf_Segment.ReadMultiSegmentSequence(int32)
02.12%   8.1E+05     Tier-1   [System.Text.Json]Utf8JsonReader.ConsumeIntegerDigitsMultiSegment(value class System.ReadOnlySpan`1<unsigned int8>&,int32&)
01.07%   4.1E+05     Tier-1   [System.Text.Json]Utf8JsonReader.ConsumeNumberMultiSegment()
01.02%   3.9E+05     Tier-1   [System.Private.CoreLib]SpanHelpers.NonPackedIndexOfValueType(!!0&,!!0,int32)

;; pgo

25.46%   1.283E+07   Tier-1   [System.Text.Json]Utf8JsonReader.ConsumeNextTokenOrRollbackMultiSegment(unsigned int8)
13.71%   6.91E+06    Tier-1   [MicroBenchmarks]Perf_Segment.ReadMultiSegmentSequence(int32)
11.19%   5.64E+06    Tier-1   [System.Private.CoreLib]IndexOfAnyAsciiSearcher.IndexOfAnyVectorized(unsigned int8&,int32,value class System.Runtime.Intrinsics.Vector128`1<unsigned int8>)
09.68%   4.88E+06    Tier-1   [System.Text.Json]Utf8JsonReader.set_ValueSpan(value class System.ReadOnlySpan`1<unsigned int8>)
08.59%   4.33E+06    Tier-1   [System.Text.Json]Utf8JsonReader.SkipWhiteSpaceMultiSegment()
07.96%   4.01E+06    Tier-1   [System.Text.Json]Utf8JsonReader.ConsumeValueMultiSegment(unsigned int8)
05.32%   2.68E+06    Tier-1   [System.Text.Json]Utf8JsonReader.ConsumeNumberMultiSegment()
03.71%   1.87E+06    native   coreclr.dll
03.49%   1.76E+06    Tier-1   [System.Text.Json]Utf8JsonReader.HasMoreDataMultiSegment()
02.42%   1.22E+06    Tier-1   [System.Text.Json]Utf8JsonReader.set_ValueSequence(value class System.Buffers.ReadOnlySequence`1<unsigned int8>)
01.11%   5.6E+05     Tier-1   [System.Text.Json]Utf8JsonReader.get_BytesConsumed()

Per PerfView, the native code on the PGO side is JIT_ByRefWriteBarrier.

This one will take a while to sort out as the inlining is quite different. Let me see if I can catch a profile where PGO is fast.

;; pgo (fast)

32.80%   1.036E+07   Tier-1   [System.Text.Json]Utf8JsonReader.ConsumeNextTokenMultiSegment(unsigned int8)
29.69%   9.38E+06    Tier-1   [System.Text.Json]Utf8JsonReader.ReadMultiSegment()
10.95%   3.46E+06    Tier-1   [System.Private.CoreLib]IndexOfAnyAsciiSearcher.IndexOfAnyVectorized(unsigned int8&,int32,value class System.Runtime.Intrinsics.Vector128`1<unsigned int8>)
07.69%   2.43E+06    Tier-1   [System.Text.Json]Utf8JsonReader.ConsumeValueMultiSegment(unsigned int8)
06.81%   2.15E+06    Tier-1   [System.Text.Json]Utf8JsonReader.ConsumeNumberMultiSegment()
03.83%   1.21E+06    Tier-1   [MicroBenchmarks]Perf_Segment.ReadMultiSegmentSequence(int32)
01.52%   4.8E+05     Tier-1   [System.Text.Json]Utf8JsonReader.ConsumePropertyNameMultiSegment()

Looks quite different, so this is a puzzle -- how can we see this much variation run to run with PGO?

If we look at inclusive times via PerfView, we see

PGO FAST

image

PGO SLOW

image

So the initial take on this is that the codegen for the benchmark root method itself is the key -- in the slow case we seem to inline into this method quite a bit more, and (guessing for now) perhaps hit budget limits and so leave some smallish methods un-inlined...?
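For reference, the benchmark root method is shaped roughly like this (a sketch reconstructed from the inline trees below; the field name and setup are assumptions, not the actual dotnet/performance source):

using System.Buffers;
using System.Collections.Generic;
using System.Text.Json;

public class Perf_Segment_Sketch
{
    // pre-built multi-segment JSON payloads, keyed by segment size (assumed)
    private readonly Dictionary<int, ReadOnlySequence<byte>> _sequences = new();

    public void ReadMultiSegmentSequence(int segmentSize)
    {
        ReadOnlySequence<byte> sequence = _sequences[segmentSize];            // Dictionary`2:get_Item
        var json = new Utf8JsonReader(sequence, default(JsonReaderOptions));  // Utf8JsonReader:.ctor
        while (json.Read())                                                   // Utf8JsonReader:Read -> ReadMultiSegment
        {
            // all per-token work happens inside Read()
        }
    }
}

How much of FindValue and ReadMultiSegment gets inlined into this method is exactly what differs between the two inline trees below.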


Sure enough we blow our budget in PGO and leave a bunch of good candidates on the floor

;; Default

Inlines into 06001B06 [via ExtendedDefaultPolicy] System.Text.Json.Tests.Perf_Segment:ReadMultiSegmentSequence(int):this:
  [INL01 IL=0009 TR=000004 06007CE6] [INLINED: call site: profitable inline] System.Collections.Generic.Dictionary`2[int,System.Buffers.ReadOnlySequence`1[ubyte]]:get_Item(int):System.Buffers.ReadOnlySequence`1[ubyte]:this
    [INL00 IL=0002 TR=000023 06007CF3] [FAILED: callee: too many il bytes] System.Collections.Generic.Dictionary`2[int,System.Buffers.ReadOnlySequence`1[ubyte]]:FindValue(int):byref:this
    [INL00 IL=0024 TR=000031 06002114] [FAILED: callee: does not return] System.ThrowHelper:ThrowKeyNotFoundException[int](int)
  [INL02 IL=0023 TR=000014 06000385] [INLINED: callee: below ALWAYS_INLINE size] System.Text.Json.Utf8JsonReader:.ctor(System.Buffers.ReadOnlySequence`1[ubyte],System.Text.Json.JsonReaderOptions):this
    [INL03 IL=0004 TR=000060 06000332] [INLINED: call site: profitable inline] System.Text.Json.JsonReaderState:.ctor(System.Text.Json.JsonReaderOptions):this
    [INL00 IL=0009 TR=000062 06000384] [FAILED: callee: too many il bytes] System.Text.Json.Utf8JsonReader:.ctor(System.Buffers.ReadOnlySequence`1[ubyte],bool,System.Text.Json.JsonReaderState):this
  [INL04 IL=0030 TR=000016 0600034B] [INLINED: call site: profitable inline] System.Text.Json.Utf8JsonReader:Read():bool:this
    [INL00 IL=0009 TR=000149 0600035F] [FAILED: callee: noinline per IL/cached result] System.Text.Json.Utf8JsonReader:ReadSingleSegment():bool:this
    [INL00 IL=0017 TR=000115 06000386] [FAILED: callee: noinline per IL/cached result] System.Text.Json.Utf8JsonReader:ReadMultiSegment():bool:this
    [INL05 IL=0035 TR=000134 0600033F] [INLINED: callee: below ALWAYS_INLINE size] System.Text.Json.Utf8JsonReader:get_TokenType():ubyte:this
    [INL00 IL=0055 TR=000147 060000FE] [FAILED: callee: does not return] System.Text.Json.ThrowHelper:ThrowJsonReaderException(byref,int,ubyte,System.ReadOnlySpan`1[ubyte])
Budget: initialTime=174, finalTime=488, initialBudget=1740, currentBudget=1740
Budget: initialSize=997, finalSize=2678

;; PGO

Inlines into 06001B06 [via ExtendedDefaultPolicy] System.Text.Json.Tests.Perf_Segment:ReadMultiSegmentSequence(int):this:
  [INL01 IL=0009 TR=000004 06007CE6] [INLINED: call site: profitable inline] System.Collections.Generic.Dictionary`2[int,System.Buffers.ReadOnlySequence`1[ubyte]]:get_Item(int):System.Buffers.ReadOnlySequence`1[ubyte]:this
    [INL02 IL=0002 TR=000023 06007CF3] [INLINED: call site: profitable inline] System.Collections.Generic.Dictionary`2[int,System.Buffers.ReadOnlySequence`1[ubyte]]:FindValue(int):byref:this
      [INL03 IL=0066 TR=000156 06001468] [INLINED: callee: below ALWAYS_INLINE size] System.Int32:GetHashCode():int:this
      [INL04 IL=0074 TR=000162 06007D12] [INLINED: callee: aggressive inline attribute] System.Collections.Generic.Dictionary`2[int,System.Buffers.ReadOnlySequence`1[ubyte]]:GetBucket(uint):byref:this
        [INL05 IL=0018 TR=000247 06007AAB] [INLINED: callee: aggressive inline attribute] System.Collections.HashHelpers:FastMod(uint,uint,ulong):uint
      [INL06 IL=0129 TR=000208 06007C82] [INLINED: callee: below ALWAYS_INLINE size] System.Collections.Generic.EqualityComparer`1[int]:get_Default():System.Collections.Generic.EqualityComparer`1[int]
      [INL07 IL=0141 TR=000213 06007C91] [INLINED: callee: aggressive inline attribute DEVIRT] System.Collections.Generic.GenericEqualityComparer`1[int]:Equals(int,int):bool:this
        [INL08 IL=0025 TR=000292 06001467] [INLINED: callee: below ALWAYS_INLINE size] System.Int32:Equals(int):bool:this
      [INL00 IL=0177 TR=000091 06007DA7] [FAILED: call site: target not direct] System.Collections.Generic.IEqualityComparer`1[int]:GetHashCode(int):int:this
      [INL09 IL=0187 TR=000096 06007D12] [INLINED: callee: aggressive inline attribute] System.Collections.Generic.Dictionary`2[int,System.Buffers.ReadOnlySequence`1[ubyte]]:GetBucket(uint):byref:this
        [INL10 IL=0018 TR=000314 06007AAB] [INLINED: callee: aggressive inline attribute] System.Collections.HashHelpers:FastMod(uint,uint,ulong):uint
      [INL00 IL=0248 TR=000151 06007DA6] [FAILED: call site: target not direct] System.Collections.Generic.IEqualityComparer`1[int]:Equals(int,int):bool:this
      [INL00 IL=0277 TR=000142 0600213A] [FAILED: callee: does not return] System.ThrowHelper:ThrowInvalidOperationException_ConcurrentOperationsNotSupported()
    [INL00 IL=0024 TR=000032 06002114] [FAILED: callee: does not return] System.ThrowHelper:ThrowKeyNotFoundException[int](int)
  [INL11 IL=0023 TR=000014 06000385] [INLINED: callee: below ALWAYS_INLINE size] System.Text.Json.Utf8JsonReader:.ctor(System.Buffers.ReadOnlySequence`1[ubyte],System.Text.Json.JsonReaderOptions):this
    [INL12 IL=0004 TR=000350 06000332] [INLINED: call site: profitable inline] System.Text.Json.JsonReaderState:.ctor(System.Text.Json.JsonReaderOptions):this
    [INL00 IL=0009 TR=000352 06000384] [FAILED: callee: too many il bytes] System.Text.Json.Utf8JsonReader:.ctor(System.Buffers.ReadOnlySequence`1[ubyte],bool,System.Text.Json.JsonReaderState):this
  [INL13 IL=0030 TR=000016 0600034B] [INLINED: call site: profitable inline] System.Text.Json.Utf8JsonReader:Read():bool:this
    [INL00 IL=0009 TR=000440 0600035F] [FAILED: call site: unprofitable inline] System.Text.Json.Utf8JsonReader:ReadSingleSegment():bool:this
    [INL14 IL=0017 TR=000405 06000386] [INLINED: call site: profitable inline] System.Text.Json.Utf8JsonReader:ReadMultiSegment():bool:this
      [INL00 IL=0004 TR=000452 06000341] [FAILED: call site: inline exceeds budget] System.Text.Json.Utf8JsonReader:set_HasValueSequence(bool):this
      [INL00 IL=0011 TR=000455 06000343] [FAILED: call site: inline exceeds budget] System.Text.Json.Utf8JsonReader:set_ValueIsEscaped(bool):this
      [INL00 IL=0026 TR=000462 06000339] [FAILED: call site: inline exceeds budget] System.Text.Json.Utf8JsonReader:set_ValueSpan(System.ReadOnlySpan`1[ubyte]):this
      [INL00 IL=0041 TR=000469 06000346] [FAILED: call site: inline exceeds budget] System.Text.Json.Utf8JsonReader:set_ValueSequence(System.Buffers.ReadOnlySequence`1[ubyte]):this
      [INL00 IL=0047 TR=000471 06000388] [FAILED: call site: inline exceeds budget] System.Text.Json.Utf8JsonReader:HasMoreDataMultiSegment():bool:this
      [INL00 IL=0082 TR=000642 0600038C] [FAILED: call site: inline exceeds budget] System.Text.Json.Utf8JsonReader:SkipWhiteSpaceMultiSegment():this
      [INL00 IL=0088 TR=000644 06000388] [FAILED: call site: inline exceeds budget] System.Text.Json.Utf8JsonReader:HasMoreDataMultiSegment():bool:this
      [INL00 IL=0119 TR=000503 0600033A] [FAILED: call site: inline exceeds budget] System.Text.Json.Utf8JsonReader:get_BytesConsumed():long:this
      [INL00 IL=0124 TR=000505 0600033C] [FAILED: call site: inline exceeds budget] System.Text.Json.Utf8JsonReader:set_TokenStartIndex(long):this
      [INL00 IL=0147 TR=000637 0600039E] [FAILED: call site: inline exceeds budget] System.Text.Json.Utf8JsonReader:ConsumeNextTokenOrRollbackMultiSegment(ubyte):bool:this
      [INL00 IL=0176 TR=000634 0600035B] [FAILED: call site: inline exceeds budget] System.Text.Json.Utf8JsonReader:EndObject():this
      [INL00 IL=0204 TR=000632 060000FE] [FAILED: call site: inline exceeds budget] System.Text.Json.ThrowHelper:ThrowJsonReaderException(byref,int,ubyte,System.ReadOnlySpan`1[ubyte])
      [INL00 IL=0250 TR=000592 06000393] [FAILED: call site: inline exceeds budget] System.Text.Json.Utf8JsonReader:ConsumePropertyNameMultiSegment():bool:this
      [INL00 IL=0323 TR=000559 0600035D] [FAILED: call site: inline exceeds budget] System.Text.Json.Utf8JsonReader:EndArray():this
      [INL00 IL=0332 TR=000554 0600038D] [FAILED: call site: inline exceeds budget] System.Text.Json.Utf8JsonReader:ConsumeValueMultiSegment(ubyte):bool:this
      [INL00 IL=0351 TR=000544 0600038D] [FAILED: call site: inline exceeds budget] System.Text.Json.Utf8JsonReader:ConsumeValueMultiSegment(ubyte):bool:this
      [INL00 IL=0361 TR=000538 0600039E] [FAILED: call site: inline exceeds budget] System.Text.Json.Utf8JsonReader:ConsumeNextTokenOrRollbackMultiSegment(ubyte):bool:this
      [INL00 IL=0375 TR=000513 0600038B] [FAILED: call site: inline exceeds budget] System.Text.Json.Utf8JsonReader:ReadFirstTokenMultiSegment(ubyte):bool:this
    [INL00 IL=0035 TR=000425 0600033F] [FAILED: call site: inline exceeds budget] System.Text.Json.Utf8JsonReader:get_TokenType():ubyte:this
    [INL00 IL=0055 TR=000438 060000FE] [FAILED: call site: inline exceeds budget] System.Text.Json.ThrowHelper:ThrowJsonReaderException(byref,int,ubyte,System.ReadOnlySpan`1[ubyte])
Budget: initialTime=174, finalTime=1916, initialBudget=1740, currentBudget=1740
Budget: went over budget by 176
Budget: discretionary inline caused a force inline
Budget: initialSize=997, finalSize=9925

However, I have a profile of a "fast" PGO run and see what look like similar inlines, so perhaps the issue lies elsewhere? Need to dig in deeper.

AndyAyersMS commented 1 year ago

System.Memory.Span.SequenceEqual(Size: 33)

Default [link]

newplot (28)

PGO [link]

newplot (29)

The actual perf difference is quite small, but let's take a look anyway.

One odd thing is that if I look at the methods jitted with PGO:

 841: JIT compiled System.Memory.Span`1[ubyte]:SequenceEqual():bool:this [Tier0, IL size=28, code size=151, hash=0xbffeed85]
 877: JIT compiled System.SpanHelpers:SequenceEqual(byref,byref,ulong):bool [Instrumented Tier1 with Static PGO, IL size=612, code size=302, hash=0xacbc5054]
 878: JIT compiled System.Memory.Span`1[ubyte]:SequenceEqual():bool:this [Instrumented Tier0, IL size=28, code size=151, hash=0xbffeed85]
 881: JIT compiled System.MemoryExtensions:SequenceEqual[ubyte](System.Span`1[ubyte],System.ReadOnlySpan`1[ubyte]):bool [Instrumented Tier1, IL size=96, code size=42, hash=0x608a246a]
 882: JIT compiled System.SpanHelpers:SequenceEqual(byref,byref,ulong):bool [Tier1 with Static PGO, IL size=612, code size=302, hash=0xacbc5054]
 883: JIT compiled System.Memory.Span`1[ubyte]:SequenceEqual():bool:this [Tier1, IL size=28, code size=81, hash=0xbffeed85]
 886: JIT compiled System.MemoryExtensions:SequenceEqual[ubyte](System.Span`1[ubyte],System.ReadOnlySpan`1[ubyte]):bool [Tier1, IL size=96, code size=42, hash=0x608a246a]

It looks like the SpanHelpers version goes from Tier1-instr to Tier1 + static PGO; I would have expected the second one to have picked up the profile data from the first.

Profiling shows

;; default

05.27%   1.96E+06    ?        Unknown
53.87%   2.004E+07   Tier-1   [System.Private.CoreLib]SpanHelpers.SequenceEqual(unsigned int8&,unsigned int8&,unsigned int)
33.28%   1.238E+07   Tier-1   [MicroBenchmarks]System.Memory.Span`1[System.Byte].SequenceEqual()
06.64%   2.47E+06    Tier-1   [e8c3bb45-4e02-4021-8249-2c10d4bb2bd9]Runnable_0.WorkloadActionUnroll(int64)

;; pgo

05.36%   2.08E+06    ?        Unknown
52.90%   2.054E+07   Tier-1   [System.Private.CoreLib]SpanHelpers.SequenceEqual(unsigned int8&,unsigned int8&,unsigned int)
36.21%   1.406E+07   Tier-1   [MicroBenchmarks]System.Memory.Span`1[System.Byte].SequenceEqual()
04.53%   1.76E+06    Tier-1   [8aad0c82-fc06-4eb7-9445-88304dca34eb]Runnable_0.WorkloadActionUnroll(int64)

So perhaps the place to look is in System.Memory.Span.
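For context, the benchmark boils down to a call chain like this (a sketch; the real benchmark lives in dotnet/performance): Span<byte>.SequenceEqual is the MemoryExtensions extension method, which for byte forwards to the vectorized SpanHelpers.SequenceEqual seen in the profile above.

using System;

byte[] left = new byte[33];
byte[] right = new byte[33];

// Span<T>.SequenceEqual (MemoryExtensions) forwards to SpanHelpers.SequenceEqual for byte
bool equal = left.AsSpan().SequenceEqual(right);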


I think the small regression here is from these methods being [Intrinsic] and therefore not instrumented. Not sure we need to fix this; let's see if it turns up elsewhere.


Now fixed thanks to #85130:

newplot - 2023-04-22T075639 131

EgorBo commented 1 year ago

@AndyAyersMS

March 29 regression looks like #83945

Yes, it is possible; e.g. SequenceEqual has the [Intrinsic] attribute because it's unrolled, and now it's not instrumented. We can add exceptions for which intrinsics should still be instrumented.

AndyAyersMS commented 1 year ago

It looks like the SpanHelpers version goes from Tier1-instr to Tier1 + static pgo; I would have expected that second one to have picked up the profile data from the first.

Ah, perhaps this is from it being intrinsic, so the initial Tier1-instr is not actually instrumenting, and then the subsequent rejit produces the same code? In other words, another instance of https://github.com/dotnet/runtime/issues/84517, but this time all at Tier1.

AndyAyersMS commented 1 year ago

System.Globalization.Tests.StringEquality.Compare_Same_Upper(Count: 1024, Options: (en-US, IgnoreNonSpace))

Default [link]

newplot (30)

PGO [link]

newplot (31)


Looks like PGO / Default are equivalent, but if you compare small windows one may look faster than the other.

AndyAyersMS commented 1 year ago

Benchstone.MDBenchF.MDInvMt.Test

Default [link]

newplot (32)

PGO [link]

newplot (33)


PGO now faster, likely from https://github.com/dotnet/runtime/pull/84741.

AndyAyersMS commented 1 year ago

PerfLabTests.CastingPerf.ObjObjIsFoo

Default [link]

newplot (34)

PGO [link]

newplot (35)


perf looks comparable

AndyAyersMS commented 1 year ago

Benchstone.BenchF.InvMt.Test

Default [link]

newplot (36)

PGO [link]

newplot (37)

| Method | Job | Toolchain | Mean | Error | StdDev | Median | Min | Max | Ratio | Gen0 | Allocated | Alloc Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| InvMt | Job-BFDREG | Default | 2.947 ms | 0.0335 ms | 0.0279 ms | 2.937 ms | 2.925 ms | 3.020 ms | 1.00 | 10.4167 | 105.07 KB | 1.00 |
| InvMt | Job-RTAWRH | PGO | 2.477 ms | 0.0422 ms | 0.0395 ms | 2.476 ms | 2.412 ms | 2.547 ms | 0.84 | 8.9286 | 105.07 KB | 1.00 |

AndyAyersMS commented 1 year ago

ByteMark.BenchLUDecomp

Default [link]

newplot (38)

PGO [link]

newplot (39)

PGO regression looks like https://github.com/dotnet/runtime/pull/70941


| Method | Job | Toolchain | Mean | Error | StdDev | Median | Min | Max | Ratio | RatioSD | Gen0 | Gen1 | Gen2 | Allocated | Alloc Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BenchLUDecomp | Job-CVQMKZ | Default | 1.239 s | 0.0247 s | 0.0264 s | 1.233 s | 1.212 s | 1.298 s | 1.00 | 0.00 | 42000.0000 | 25000.0000 | 8000.0000 | 206.22 MB | 1.00 |
| BenchLUDecomp | Job-JXLRSB | PGO | 1.470 s | 0.0267 s | 0.0249 s | 1.464 s | 1.438 s | 1.515 s | 1.18 | 0.03 | 42000.0000 | 25000.0000 | 8000.0000 | 206.22 MB | 1.00 |

Note the very long times here: BDN is only running this benchmark once per iteration, which means we may not be seeing Tier1 codegen. Let's see what the profile says:

;; default

83.53%   1.487E+08   Tier-1   [MicroBenchmarks]LUDecomp.ludcmp(float64[][],int32,int32[],int32&)
07.26%   1.292E+07   native   coreclr.dll
03.85%   6.86E+06    native   ntoskrnl.exe
02.89%   5.14E+06    Tier-1   [MicroBenchmarks]LUDecomp.DoLUIteration(float64[][],float64[],float64[][][],float64[][],int32)
02.01%   3.58E+06    Tier-1   [MicroBenchmarks]LUDecomp.lubksb(float64[][],int32,int32[],float64[])

;; pgo
86.61%   1.872E+08   Tier-1   [MicroBenchmarks]LUDecomp.ludcmp(float64[][],int32,int32[],int32&)
06.00%   1.297E+07   native   coreclr.dll
03.56%   7.69E+06    native   ntoskrnl.exe
02.67%   5.77E+06    Tier-1   [MicroBenchmarks]LUDecomp.Run()

(coreclr code is JIT_NewArr1)

So much for that theory, we do reach tier1.

Maybe deficient repair? PGO only sees initial blocks with counts, and repair doesn't light up the rest...

-----------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds           weight        IBC  lp [IL range]     [jump]      [EH region]         [flags]
-----------------------------------------------------------------------------------------------------------------------------------------
BB01 [0000]  1                           100         100    [000..015)-> BB10 (always)                     IBC 
BB02 [0001]  1       BB10                  100k   100000    [015..024)-> BB06 (always)                     bwd bwd-target IBC 
BB03 [0002]  1       BB06                10666k 10665989    [024..032)-> BB05 ( cond )                     bwd bwd-target IBC 
BB04 [0003]  1       BB03                  658k   657978    [032..03E)                                     bwd IBC 
BB05 [0004]  2       BB03,BB04           10666k 10665989    [03E..044)                                     bwd IBC 
BB06 [0005]  2       BB02,BB05           10766k 10765989    [044..049)-> BB03 ( cond )                     bwd bwd-src IBC 
BB07 [0006]  1       BB06                  100k   100000    [049..055)-> BB09 ( cond )                     bwd IBC 
BB08 [0007]  1       BB07                100.00      100    [055..057)        (return)                     IBC 
BB09 [0008]  1       BB07                99900.    99900    [057..06D)                                     bwd IBC 
BB10 [0009]  2       BB01,BB09             100k   100000    [06D..071)-> BB02 ( cond )                     bwd bwd-src IBC 
BB11 [0010]  1       BB10                  0           0    [071..079)-> BB41 (always)                     rare IBC 
BB12 [0011]  1       BB41                  0           0    [079..07D)-> BB20 ( cond )                     rare bwd bwd-target IBC 
BB13 [0012]  1       BB12                  0           0    [07D..081)-> BB19 (always)                     rare bwd IBC 
BB14 [0013]  1       BB19                  0           0    [081..08B)-> BB18 ( cond )                     rare bwd bwd-target IBC 
BB15 [0014]  1       BB14                  0           0    [08B..090)-> BB17 (always)                     rare bwd IBC 
BB16 [0015]  1       BB17                  0           0    [090..0A7)                                     rare bwd bwd-target IBC 

Here we realize the BB02-BB10 loop looks infinite based on the OSR profile and try to adjust, but there are two exits (7->8 and 10->12) and we pick the first. So no profile makes it to the rest of the method. I wonder if a better strategy in these cases is to blend rather than repair, so that weights get distributed a bit more broadly.
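For context, the part of ludcmp in question is the scaling loop at the top of the routine, shaped roughly like this (a simplified sketch of the Numerical-Recipes-style code in the ByteMark benchmark; variable names are assumptions):

using System;

static class LuDecompSketch
{
    static int ScaleRows(double[][] a, double[] vv, int n)
    {
        for (int i = 0; i < n; i++)                 // the hot BB02-BB10 loop
        {
            double big = 0.0;
            for (int j = 0; j < n; j++)             // inner BB03-BB06 loop
            {
                double temp = Math.Abs(a[i][j]);
                if (temp > big) big = temp;         // BB04
            }
            if (big == 0.0) return 0;               // early exit: the 7->8 exit that repair picks
            vv[i] = 1.0 / big;                      // BB09
        }
        // the rest of ludcmp follows the other exit (10->12) and ends up with zero counts
        return 1;
    }
}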

And indeed, blend mode works beautifully here:

| Method | Job | Toolchain | Mean | Error | StdDev | Median | Min | Max | Ratio | RatioSD | Gen0 | Gen1 | Gen2 | Allocated | Alloc Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BenchLUDecomp | Job-DISSDZ | Default | 1.278 s | 0.0249 s | 0.0255 s | 1.279 s | 1.234 s | 1.315 s | 1.00 | 0.00 | 42000.0000 | 25000.0000 | 8000.0000 | 206.22 MB | 1.00 |
| BenchLUDecomp | Job-ZTHQNP | PGO+Blend | 1.090 s | 0.0187 s | 0.0175 s | 1.089 s | 1.063 s | 1.121 s | 0.85 | 0.02 | 42000.0000 | 25000.0000 | 8000.0000 | 206.22 MB | 1.00 |

Fixed by #85171:

newplot - 2023-04-24T073127 086

AndyAyersMS commented 1 year ago

Benchstone.BenchI.CSieve.Test

Default [link]

newplot (40)

PGO [link]

newplot (41)


Performance seems to be at parity now

AndyAyersMS commented 1 year ago

System.Memory.Span.StartsWith(Size: 512)

Default [link]

newplot (42)

PGO [link]

newplot (43)


PGO looks to be at parity now or perhaps a bit faster.

AndyAyersMS commented 1 year ago

System.Collections.ContainsFalse.ImmutableArray(Size: 512)

Default [link]

newplot (44)

PGO [link]

newplot (45)


PGO performance seems comparable?

AndyAyersMS commented 1 year ago

PerfLabTests.CastingPerf2.CastingPerf.ScalarValueTypeObj

Default [link]

newplot (46)

PGO [link]

newplot (48)

PGO seems comparable

AndyAyersMS commented 1 year ago

PerfLabTests.LowLevelPerf.GenericGenericMethod

Default [link]

newplot (49)

PGO [link]

newplot (50)


Looks to be tri-modal with PGO and Default showing similar levels.

AndyAyersMS commented 1 year ago

System.Perf_Convert.FromBase64String

Default [link]

newplot (51)

PGO [link]

newplot (52)


Can repro locally:

| Method | Job | Toolchain | Mean | Error | StdDev | Median | Min | Max | Ratio | RatioSD | Gen0 | Allocated | Alloc Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| FromBase64String | Job-ZPWSJG | Default | 64.07 ns | 1.903 ns | 2.191 ns | 63.74 ns | 61.38 ns | 69.18 ns | 1.00 | 0.00 | 0.0087 | 56 B | 1.00 |
| FromBase64String | Job-YVHGZZ | PGO | 68.80 ns | 0.851 ns | 0.796 ns | 68.93 ns | 67.31 ns | 70.35 ns | 1.08 | 0.03 | 0.0088 | 56 B | 1.00 |

;; default

02.88%   1.2E+06     ?        Unknown
37.87%   1.579E+07   Tier-1   [System.Private.CoreLib]Convert.TryDecodeFromUtf16(value class System.ReadOnlySpan`1<wchar>,value class System.Span`1<unsigned int8>,int32&,int32&)
23.84%   9.94E+06    Tier-1   [System.Private.CoreLib]Convert.FromBase64_ComputeResultLength(wchar*,int32)
18.04%   7.52E+06    native   coreclr.dll
07.75%   3.23E+06    Tier-1   [System.Private.CoreLib]Convert.TryFromBase64Chars(value class System.ReadOnlySpan`1<wchar>,value class System.Span`1<unsigned int8>,int32&)
04.41%   1.84E+06    Tier-1   [System.Private.CoreLib]Convert.FromBase64CharPtr(wchar*,int32)
02.30%   9.6E+05     Tier-1   [System.Private.CoreLib]Convert.FromBase64String(class System.String)

;; pgo

02.36%   1.08E+06    ?        Unknown
38.08%   1.742E+07   Tier-1   [System.Private.CoreLib]Convert.FromBase64CharPtr(wchar*,int32)
32.29%   1.477E+07   Tier-1   [System.Private.CoreLib]Convert.TryDecodeFromUtf16(value class System.ReadOnlySpan`1<wchar>,value class System.Span`1<unsigned int8>,int32&,int32&)
15.96%   7.3E+06     native   coreclr.dll
06.62%   3.03E+06    Tier-1   [System.Private.CoreLib]Convert.TryFromBase64Chars(value class System.ReadOnlySpan`1<wchar>,value class System.Span`1<unsigned int8>,int32&)
02.45%   1.12E+06    Tier-1   [System.Private.CoreLib]Convert.FromBase64String(class System.String)

Looks like with PGO, FromBase64CharPtr has now absorbed (inlined) FromBase64_ComputeResultLength.


Seems like this may be a case of the issue noted in #84319, where we don't get the profile updates done properly when we invert loops. Here we have a hot loop (BB02-BB06) with some cold blocks:

BB01 [0010]  1                           48888.   48888    [000..00D)-> BB06 (always)                     IBC 
BB02 [0011]  1       BB06                 1734k 1734078    [00D..01A)-> BB04 ( cond )                     bwd bwd-target IBC 
BB03 [0012]  1       BB02                  0          0    [01A..020)-> BB06 (always)                     rare bwd IBC 
BB04 [0013]  1       BB02                 1734k 1734078    [020..025)-> BB06 ( cond )                     bwd IBC 
BB05 [0014]  1       BB04                48545.   48545    [025..02D)                                     bwd IBC 
BB06 [0015]  4       BB01,BB03,BB04,BB05  1783k 1782966    [02D..031)-> BB02 ( cond )                     bwd bwd-src IBC 
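For context, the loop being inverted is the whitespace/padding scan in FromBase64_ComputeResultLength, shaped roughly like this (a simplified sketch, not the actual CoreLib source; the two rarely-taken paths correspond to the cold blocks above):

using System;

static class Base64LengthSketch
{
    static int ComputeResultLength(ReadOnlySpan<char> input)
    {
        int usefulLength = input.Length;
        int padding = 0;

        for (int i = 0; i < input.Length; i++)     // the hot BB02-BB06 loop
        {
            char c = input[i];
            if (c <= ' ')                          // cold: whitespace ("cmp 32 / jbe" below)
            {
                usefulLength--;
            }
            else if (c == '=')                     // cold: padding ("cmp 61 / je", the dec/inc block)
            {
                usefulLength--;
                padding++;
            }
        }

        return usefulLength / 4 * 3 - padding;     // padding handling simplified
    }
}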

Without PGO we end up compacting this (somehow), but with PGO we end up leaving one of the cold blocks inline:

;; default 

       align    [12 bytes for IG03]

G_M650_IG03:  ;; offset=0020H
       movzx    r9, word  ptr [rcx]
       add      rcx, 2
       cmp      r9d, 32
       jbe      SHORT G_M650_IG12
                        ;; size=14 bbWeight=348.23 PerfScore 1218.79
G_M650_IG04:  ;; offset=002EH
       cmp      r9d, 61
       je       SHORT G_M650_IG11
                        ;; size=6 bbWeight=347.83 PerfScore 434.79
G_M650_IG05:  ;; offset=0034H
       cmp      rcx, rax
       jb       SHORT G_M650_IG03

...

G_M650_IG11:  ;; offset=0069H  **COLD**
       dec      edx
       inc      r8d
       jmp      SHORT G_M650_IG05

;; pgo

       align    [8 bytes for IG06]

G_M36625_IG06:  ;; offset=0060H
       movzx    rax, word  ptr [rdx]
       add      rdx, 2
       cmp      eax, 32
       jbe      G_M36625_IG15
       cmp      eax, 61
       jne      SHORT G_M36625_IG08
                        ;; size=21 bbWeight=35.07 PerfScore 166.59
G_M36625_IG07:  ;; offset=0075H **COLD**
       dec      ecx
       inc      ebp
                        ;; size=4 bbWeight=0.98 PerfScore 0.49
G_M36625_IG08:  ;; offset=0079H
       cmp      rdx, rbx
       jb       SHORT G_M36625_IG06

When we invert the loop, the exit block weight ends up being too low:

BB09 [0011]  2       BB13,BB29            35.47 1734078    [029..02A)-> BB11 ( cond )                     i bwd bwd-target IBC 
BB10 [0012]  1       BB09                  0          0    [029..02A)-> BB13 (always)                     i rare bwd IBC 
BB11 [0013]  1       BB09                 35.47 1734078    [029..02A)-> BB13 ( cond )                     i bwd IBC 
BB12 [0014]  1       BB11                  0.99   48545    [029..02A)                                     i bwd IBC 
BB13 [0015]  3       BB10,BB11,BB12        1      48888    [029..02A)-> BB09 ( cond )                     i bwd bwd-src IBC 

Verified that the prospective fix from #84319 works:

| Method | Job | Toolchain | Mean | Error | StdDev | Median | Min | Max | Ratio | RatioSD | Gen0 | Allocated | Alloc Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| FromBase64String | Job-XFXANI | FixLoopInversion | 63.05 ns | 0.716 ns | 0.598 ns | 62.83 ns | 62.35 ns | 64.43 ns | 1.00 | 0.00 | 0.0088 | 56 B | 1.00 |
| FromBase64String | Job-ITZXRC | PGO | 70.10 ns | 1.172 ns | 1.096 ns | 70.16 ns | 68.79 ns | 72.12 ns | 1.11 | 0.02 | 0.0088 | 56 B | 1.00 |

Lab sees the fix now too:

newplot - 2023-05-18T180257 652

AndyAyersMS commented 1 year ago

System.Memory.Span<Byte>.SequenceCompareToDifferent(Size: 4)

Default [link]

newplot (53)

PGO [link]

newplot (54)

Feb 2023 regression seems to be https://github.com/dotnet/runtime/pull/82412


| Method | Job | Toolchain | Size | Mean | Error | StdDev | Median | Min | Max | Ratio | RatioSD | Allocated | Alloc Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SequenceCompareToDifferent | Job-HHXDGF | \base-rel\corerun.exe | 4 | 4.590 ns | 0.1019 ns | 0.1001 ns | 4.543 ns | 4.478 ns | 4.791 ns | 1.00 | 0.00 | - | NA |
| SequenceCompareToDifferent | Job-UIGKEB | \diff-rel\corerun.exe | 4 | 4.817 ns | 0.1166 ns | 0.1033 ns | 4.785 ns | 4.700 ns | 5.083 ns | 1.05 | 0.03 | - | NA |

;; default

03.21%   1.2E+06     ?        Unknown
36.40%   1.36E+07    Tier-1   [MicroBenchmarks]System.Memory.Span`1[System.Byte].SequenceCompareToDifferent()
35.95%   1.343E+07   Tier-1   [System.Private.CoreLib]SpanHelpers.SequenceCompareTo(unsigned int8&,int32,unsigned int8&,int32)
17.91%   6.69E+06    Tier-1   [System.Private.CoreLib]MemoryExtensions.SequenceCompareTo(value class System.Span`1<!!0>,value class System.ReadOnlySpan`1<!!0>)

;; pgo

38.51%   1.483E+07   Tier-1   [MicroBenchmarks]System.Memory.Span`1[System.Byte].SequenceCompareToDifferent()
35.63%   1.372E+07   Tier-1   [System.Private.CoreLib]SpanHelpers.SequenceCompareTo(unsigned int8&,int32,unsigned int8&,int32)
16.26%   6.26E+06    Tier-1   [System.Private.CoreLib]MemoryExtensions.SequenceCompareTo(value class System.Span`1<!!0>,value class System.ReadOnlySpan`1<!!0>)

Looks like slightly different PGO data (static vs. dynamic) for System.ReadOnlySpan`1[ubyte]:.ctor(ubyte[]):this leads to an LSRA block-order difference, which in turn leads LSRA to use two extra callee-saved registers in the PGO case.

;; default (static pgo)

-----------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds           weight   IBC  lp [IL range]     [jump]      [EH region]         [flags]
-----------------------------------------------------------------------------------------------------------------------------------------
BB01 [0008]  1                           100    100    [000..003)-> BB03 ( cond )                     IBC 
BB02 [0009]  1       BB01                  0.58   1    [003..00B)        (return)                     IBC 
BB03 [0010]  1       BB01                 99.42  99    [00B..021)        (return)                     IBC 
-----------------------------------------------------------------------------------------------------------------------------------------

Start LSRA Block Sequence: 
Current block: BB01
    Succ block: BB03, Criteria: weight, Worklist: [BB03 ]
    Succ block: BB02, Criteria: weight, Worklist: [BB03 BB02 ]
Current block: BB03
    Succ block: BB04, Criteria: weight, Worklist: [BB04 BB02 ]
Current block: BB04
    Succ block: BB06, Criteria: bbNum, Worklist: [BB06 BB02 ]
    Succ block: BB05, Criteria: bbNum, Worklist: [BB05 BB06 BB02 ]
Current block: BB05
    Succ block: BB07, Criteria: bbNum, Worklist: [BB06 BB07 BB02 ]
Current block: BB06
Current block: BB07
Current block: BB02
Final LSRA Block Sequence:
BB01 (  1   )
BB03 (  0.50)
BB04 (  1   )
BB05 (  0.00)
BB06 (  0.50)
BB07 (  1   )
BB02 (  0   )

;; pgo

-----------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds           weight   IBC  lp [IL range]     [jump]      [EH region]         [flags]
-----------------------------------------------------------------------------------------------------------------------------------------
BB01 [0008]  1                           100    100    [000..003)-> BB03 ( cond )                     IBC 
BB02 [0009]  1       BB01                  0      0    [003..00B)        (return)                     rare IBC 
BB03 [0010]  1       BB01                100    100    [00B..021)        (return)                     IBC 
-----------------------------------------------------------------------------------------------------------------------------------------

Start LSRA Block Sequence: 
Current block: BB01
    Succ block: BB03, Criteria: weight, Worklist: [BB03 ]
    Succ block: BB02, Criteria: weight, Worklist: [BB03 BB02 ]
Current block: BB03
    Succ block: BB04, Criteria: weight, Worklist: [BB04 BB02 ]
Current block: BB04
    Succ block: BB06, Criteria: bbNum, Worklist: [BB06 BB02 ]
    Succ block: BB05, Criteria: weight, Worklist: [BB06 BB02 BB05 ]
Current block: BB06
    Succ block: BB07, Criteria: bbNum, Worklist: [BB07 BB02 BB05 ]
Current block: BB07
Current block: BB02
Current block: BB05
Final LSRA Block Sequence:
BB01 (  1   )
BB03 (  0.50)
BB04 (  1   )
BB06 (  0.50)
BB07 (  1   )
BB02 (  0   )
BB05 (  0   )

and the prologs


;; default
       push     rdi
       push     rsi
       sub      rsp, 72

;; pgo
       push     rdi
       push     rsi
       push     rbp
       push     rbx
       sub      rsp, 72
AndyAyersMS commented 1 year ago

System.Text.Json.Document.Tests.Perf_DocumentParse.Parse(IsDataIndented: False, TestRandomAccess: True, TestCase: Json400KB)

Default [link]

newplot (55)

PGO [link]

newplot (56)

May 2022 (small) regression may have been https://github.com/dotnet/runtime/pull/68804
February 2023 regression may be https://github.com/dotnet/runtime/pull/77357 or https://github.com/dotnet/runtime/pull/81936
April 2023 regression possibly https://github.com/dotnet/runtime/pull/84427


Locally I can repro, but annoyingly, when I profile, PGO is faster.

| Method | Job | Toolchain | IsDataIndented | TestRandomAccess | TestCase | Mean | Error | StdDev | Median | Min | Max | Ratio | RatioSD | Gen0 | Allocated | Alloc Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Parse | Job-FHNYPZ | Default | False | True | Json400KB | 4.592 ms | 0.0531 ms | 0.0444 ms | 4.599 ms | 4.515 ms | 4.679 ms | 1.00 | 0.00 | 92.5926 | 597.51 KB | 1.00 |
| Parse | Job-LHXDIX | PGO | False | True | Json400KB | 5.507 ms | 0.9467 ms | 1.0522 ms | 4.760 ms | 4.490 ms | 6.996 ms | 1.29 | 0.22 | - | 597.59 KB | 1.00 |

;; default

65.72%   2.552E+07   Tier-1   [System.Text.Json]JsonDocument.GetArrayIndexElement(int32,int32)
07.37%   2.86E+06    FullOpt  [System.Text.Json]JsonDocument.TryGetNamedPropertyValue(int32,int32,value class System.ReadOnlySpan`1<unsigned int8>,value class System.Text.Json.JsonElement&)
03.37%   1.31E+06    native   coreclr.dll
01.85%   7.2E+05     Tier-1   [System.Text.Json]JsonDocument.Parse(value class System.ReadOnlySpan`1<unsigned int8>,value class System.Text.Json.JsonReaderOptions,value class MetadataDb&,value class StackRowStack&)
01.75%   6.8E+05     Tier-1   [System.Text.Json]Utf8JsonReader.ReadSingleSegment()
01.75%   6.8E+05     Tier-1   [System.Private.CoreLib]IndexOfAnyAsciiSearcher.IndexOfAnyVectorized(unsigned int8&,int32,value class System.Runtime.Intrinsics.Vector128`1<unsigned int8>)
01.39%   5.4E+05     Tier-1   [System.Text.Json]Utf8JsonReader.ConsumeString()

;; pgo

66.14%   2.531E+07   Tier-1   [System.Text.Json]JsonDocument.GetArrayIndexElement(int32,int32)
07.66%   2.93E+06    FullOpt  [System.Text.Json]JsonDocument.TryGetNamedPropertyValue(int32,int32,value class System.ReadOnlySpan`1<unsigned int8>,value class System.Text.Json.JsonElement&)
03.29%   1.26E+06    native   coreclr.dll
02.59%   9.9E+05     Tier-1   [System.Text.Json]JsonDocument.Parse(value class System.ReadOnlySpan`1<unsigned int8>,value class System.Text.Json.JsonReaderOptions,value class MetadataDb&,value class StackRowStack&)
01.86%   7.1E+05     Tier-1   [System.Text.Json]Utf8JsonReader.ConsumeNextToken(unsigned int8)
01.44%   5.5E+05     Tier-1   [System.Private.CoreLib]Ascii.WidenAsciiToUtf16(unsigned int8*,wchar*,unsigned int)
01.31%   5E+05       Tier-1   [System.Text.Json]Utf8JsonReader.ReadSingleSegment()
01.23%   4.7E+05     Tier-1   [System.Private.CoreLib]IndexOfAnyAsciiSearcher.IndexOfAnyVectorized(unsigned int8&,int32,value class System.Runtime.Intrinsics.Vector128`1<unsigned int8>)
01.12%   4.3E+05     FullOpt  [System.Text.Json]JsonDocument.TryGetNamedPropertyValue(int32,value class System.ReadOnlySpan`1<wchar>,value class System.Text.Json.JsonElement&)
01.07%   4.1E+05     Tier-1   [System.Text.Json]Utf8JsonReader.ConsumeValue(unsigned int8)
01.07%   4.1E+05     native   clrjit.dll

Note that the TryGetNamedPropertyValue methods bypass tiering: they have both a stackalloc and a loop, so they aren't eligible for OSR.
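To illustrate the pattern (a hypothetical example, not the JsonDocument source): a method that both stackallocs and contains a loop is not OSR-eligible, so under tiered compilation it is jitted fully optimized up front instead of starting at Tier0, which is why such methods show up as FullOpt in the profiles above.

using System;

static class OsrIneligibleSketch
{
    static int Sum(ReadOnlySpan<int> values)
    {
        Span<int> scratch = stackalloc int[16];    // stackalloc ...
        int sum = 0;
        for (int i = 0; i < values.Length; i++)    // ... plus a loop => no OSR, no tiering
        {
            scratch[i & 15] = values[i];
            sum += scratch[i & 15];
        }
        return sum;
    }
}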


Codegen for GetArrayIndexElement is identical with and without PGO, the next hottest method bypasses tiering, and I consistently get faster results with PGO when running BDN with -p ETW as admin, so I'm going to move on.

My only guess is that perhaps GetArrayIndexElement is running into JCC errata. I don't have PGO perf history for other machines so can't easily cross-check (say against arm64).


We now have some PGO data on both Linux x64 and Ampere arm64; for both of those, PGO is comparable or a bit faster.

image
AndyAyersMS commented 1 year ago

System.Text.RegularExpressions.Tests.Perf_Regex_Cache.IsMatch(total: 400000, unique: 1, cacheSize: 15)

Default [link]

newplot (57)

PGO [link]

newplot (58)


PGO seems consistently faster, so likely this was just a noisy comparison.

AndyAyersMS commented 1 year ago

System.Collections.IndexerSet<Int32>.Dictionary(Size: 512)

Default [link]

newplot (59)

PGO [link]

newplot (60)

Jan 2023 regression looks like https://github.com/dotnet/runtime/pull/80481


Locally I have PGO consistently faster:

| Method | Job | Toolchain | Size | Mean | Error | StdDev | Median | Min | Max | Ratio | RatioSD | Allocated | Alloc Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Dictionary | Job-OHYCZK | Default | 512 | 3.506 us | 0.1136 us | 0.1216 us | 3.491 us | 3.328 us | 3.792 us | 1.00 | 0.00 | - | NA |
| Dictionary | Job-NZIXNJ | PGO | 512 | 3.219 us | 0.0467 us | 0.0414 us | 3.214 us | 3.122 us | 3.287 us | 0.92 | 0.03 | - | NA |

Both default and PGO have since improved and are showing comparable results in the lab:

Default

newplot - 2023-05-18T181432 642

Improvement was one or more of:

PGO

newplot - 2023-05-18T181617 568

Improvement was likely from https://github.com/dotnet/runtime/pull/85805

AndyAyersMS commented 1 year ago

System.Tests.Perf_Char.Char_ToLowerInvariant(input: "Hello World!")

Note that in a Windows batch script you may need to use

--filter "System.Tests.Perf_Char.Char_ToLowerInvariant(input: ""Hello World!"")"

to get this properly passed to BenchmarkDotNet.

Default [link]

newplot (61)

November 2022 improvement was https://github.com/dotnet/runtime/pull/78593
April 2023 regression is possibly https://github.com/dotnet/runtime/pull/83624

PGO [link]

newplot (62)

November 2022 regression was https://github.com/dotnet/runtime/pull/78593
April 2023 regression is possibly https://github.com/dotnet/runtime/pull/83624


So, oddly enough, these two react in opposite directions to some of the same changes.

I see PGO as faster or at least comparable locally.

| Method | Job | Toolchain | input | Mean | Error | StdDev | Median | Min | Max | Ratio | Allocated | Alloc Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Char_ToLowerInvariant | Job-ZSHNYR | Default | Hello World! | 13.34 ns | 0.143 ns | 0.127 ns | 13.36 ns | 13.15 ns | 13.52 ns | 1.00 | - | NA |
| Char_ToLowerInvariant | Job-QOYHZR | PGO | Hello World! | 11.80 ns | 0.137 ns | 0.128 ns | 11.76 ns | 11.61 ns | 11.99 ns | 0.88 | - | NA |

AndyAyersMS commented 1 year ago

The above covers the initial screen of the 50 worst ratios (as of 3 weeks ago). I am going to add in a few other cases based on a refreshed and slightly broadened query.

Broadened: the results above excluded any test with a reported time of less than 2 ns. A lot of HW intrinsic/vector tests come in under this and report as slower with PGO. So let's look at a few.

System.Numerics.Tests.Perf_Vector3.DotBenchmark

Default [link]

newplot (63)

PGO [link]

newplot (64)

Seems comparable; the PGO result is just a hair greater than zero, but the default is zero, so the ratio is infinite.

AndyAyersMS commented 1 year ago

System.Numerics.Tests.Perf_Vector3.ReflectBenchmark

Default [link]

newplot (65)

PGO [link]

newplot (66)

Regression appears to be from https://github.com/dotnet/runtime/pull/83036 which seems odd.


This one is still a mystery. Default and PGO seem to have identical code for the benchmark methods, but with release bits PGO reports consistently worse performance. If I drop in a checked jit into the release PGO build, it speeds up and matches the default perf.

Suspecting something like the following -- the workload action unroll for release PGO ends up always calling the precode somehow, and this never gets updated/backpatched?


Looks like the issue was that the JIT was not fully initializing some of the SIMD12 vector constants used by this code, causing the high float lane to pick up random values that might be denormals, which cause huge stalls.
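For reference, Vector3 is a 12-byte (SIMD12) value that the JIT keeps in a 16-byte SIMD register, so a constant's unused fourth lane still exists at the hardware level; if it is left holding random bits it can be a denormal, and denormal arithmetic is very slow. A sketch of what the benchmark exercises (not the benchmark source):

using System.Numerics;

static class ReflectSketch
{
    // Vector3.Reflect(i, n) computes i - 2 * Dot(i, n) * n; the multiplies are where a
    // denormal in the unused fourth lane of a constant can stall.
    static Vector3 Reflect(Vector3 incident, Vector3 normal)
        => incident - 2f * Vector3.Dot(incident, normal) * normal;
}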


Fix is #85362

newplot - 2023-04-27T014436 260

AndyAyersMS commented 1 year ago

System.Buffers.Text.Tests.Base64EncodeDecodeInPlaceTests.Base64DecodeInPlace(NumberOfBytes: 200000000)

Default [link]

newplot - 2023-04-25T081437 916

January 2023 improvement is https://github.com/dotnet/runtime/pull/79346
February 2023 improvement is possibly https://github.com/dotnet/runtime/pull/79381

PGO [link]

newplot - 2023-04-25T081327 907

October 2022 regression is https://github.com/dotnet/runtime/pull/70941
January 2023 improvement looks like https://github.com/dotnet/runtime/pull/80481
February 2023 improvement possibly https://github.com/dotnet/runtime/pull/79381
April 2023 regression is from https://github.com/dotnet/runtime/pull/84427


This repros locally

| Method | Job | Toolchain | NumberOfBytes | Mean | Error | StdDev | Median | Min | Max | Ratio | RatioSD | Allocated | Alloc Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Base64DecodeInPlace | Job-BGTHBT | Default | 200000000 | 123.6 ms | 1.15 ms | 0.89 ms | 123.5 ms | 122.3 ms | 125.8 ms | 1.00 | 0.00 | 400 B | 1.00 |
| Base64DecodeInPlace | Job-DGNZZW | PGO | 200000000 | 261.5 ms | 47.18 ms | 50.48 ms | 278.5 ms | 123.0 ms | 287.5 ms | 2.04 | 0.49 | 400 B | 1.00 |

But when I enable profiling -p ETW it vanishes

| Method | Job | Toolchain | NumberOfBytes | Mean | Error | StdDev | Median | Min | Max | Ratio | RatioSD | Allocated | Alloc Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Base64DecodeInPlace | Job-ODJYCK | Default | 200000000 | 125.4 ms | 6.63 ms | 6.20 ms | 123.7 ms | 121.3 ms | 141.2 ms | 1.00 | 0.00 | 400 B | 1.00 |
| Base64DecodeInPlace | Job-NQEVUA | PGO | 200000000 | 124.7 ms | 4.63 ms | 4.33 ms | 123.4 ms | 122.0 ms | 139.8 ms | 1.00 | 0.06 | 400 B | 1.00 |

Seems like the "fast" run of this test executes R2R code and never tiers up:

image

(also looks like something is broken with R2R rundown as neither PerfView nor my tool can resolve which R2R method this is).

Think the issue is in BDN itself: https://github.com/dotnet/BenchmarkDotNet/issues/2296

So, why are we looking at R2R code here? (Potentially if I can catch a slow run in the profiler, it may show hits at Tier0, OSR, or Tier1...will keep trying).


Caught a slow run, and despite what I wrote above the "fast" runs look like they are indeed at Tier1.

;; default

86.93%   1.63E+07    Tier-1   [System.Private.CoreLib]Base64.DecodeFromUtf8InPlace(value class System.Span`1<unsigned int8>,int32&)
13.07%   2.45E+06    native   System.Private.CoreLib.dll

;; pgo

48.42%   2.669E+07   native   coreclr.dll
47.01%   2.591E+07   Tier-1 instr  [System.Private.CoreLib]Base64.DecodeFromUtf8InPlace(value class System.Span`1<unsigned int8>,int32&)
04.44%   2.45E+06    native   System.Private.CoreLib.dll
00.09%   5E+04       native   ntoskrnl.exe

The hot method is evidently R2R, so with PGO we go to Tier-1 instr, but it doesn't get called enough to ever finish tiering up. If I rev up the iterations...

WorkloadResult  29: 1 op, 283866900.00 ns, 283.8669 ms/op
WorkloadResult  30: 1 op, 291856400.00 ns, 291.8564 ms/op
WorkloadResult  31: 1 op, 286046600.00 ns, 286.0466 ms/op
WorkloadResult  32: 1 op, 287347400.00 ns, 287.3474 ms/op
WorkloadResult  33: 1 op, 121137000.00 ns, 121.1370 ms/op
WorkloadResult  34: 1 op, 121175600.00 ns, 121.1756 ms/op
WorkloadResult  35: 1 op, 122286200.00 ns, 122.2862 ms/op
WorkloadResult  36: 1 op, 123082000.00 ns, 123.0820 ms/op
WorkloadResult  37: 1 op, 122119700.00 ns, 122.1197 ms/op

As you can see, we need to get to iteration 33 to switch to Tier1 with PGO, but by default we only run 15 iterations. And once we switch, we get basically the same perf as without PGO.

BDN could probably do a better job here of warning you that if there are fewer than, say, 30 invocations of your benchmark method (perhaps 60 to be safe), you may not see the results you're expecting, and you either need to drop the per-invocation time or raise the invocation or iteration counts.
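For example, something like this forces enough measured iterations for tiering (including the extra instrumented tier used with PGO) to finish before the numbers are taken (a sketch using standard BenchmarkDotNet attributes; the counts are illustrative and the real benchmark in dotnet/performance is shaped differently):

using System;
using System.Buffers;
using System.Buffers.Text;
using BenchmarkDotNet.Attributes;

[MinIterationCount(40)]   // BDN's default heuristics can stop at ~15 iterations for long benchmarks
[MaxIterationCount(60)]
public class Base64DecodeInPlaceSketch
{
    private byte[] _raw, _utf8;

    [GlobalSetup]
    public void Setup()
    {
        _raw = new byte[150_000_000];
        new Random(42).NextBytes(_raw);
        _utf8 = new byte[Base64.GetMaxEncodedToUtf8Length(_raw.Length)];
    }

    [IterationSetup]
    public void FillBuffer() => Base64.EncodeToUtf8(_raw, _utf8, out _, out _);

    [Benchmark]
    public OperationStatus DecodeInPlace() => Base64.DecodeFromUtf8InPlace(_utf8, out _);
}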

AndyAyersMS commented 1 year ago

Benchmark.GetChildKeysTests.AddChainedConfigurationEmpty

Default [link]

newplot - 2023-04-25T082718 128

April 2023 improvement possibly https://github.com/dotnet/runtime/pull/84582

PGO [link]

newplot - 2023-04-25T082745 458

October 2022 improvement was https://github.com/dotnet/runtime/pull/70941
February 2023 improvement was https://github.com/dotnet/runtime/pull/81267
April 2023 regression was possibly https://github.com/dotnet/runtime/pull/84582


Comparable for me locally

| Method | Job | Toolchain | Mean | Error | StdDev | Median | Min | Max | Ratio | RatioSD | Allocated | Alloc Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AddChainedConfigurationEmpty | Job-SNZWKD | \base-rel\corerun.exe | 13.41 ms | 0.263 ms | 0.270 ms | 13.30 ms | 13.18 ms | 14.01 ms | 1.00 | 0.00 | 143.24 KB | 1.00 |
| AddChainedConfigurationEmpty | Job-AMBTRB | \diff-rel\corerun.exe | 13.87 ms | 1.199 ms | 1.380 ms | 13.10 ms | 12.58 ms | 16.12 ms | 1.05 | 0.10 | 143.32 KB | 1.00 |

AndyAyersMS commented 1 year ago

System.Buffers.Text.Tests.Utf8FormatterTests.FormatterUInt64(value: 0)

Default [link]

newplot - 2023-05-05T075647 419

Recent improvement was from https://github.com/dotnet/runtime/pull/85277

PGO [link]

newplot - 2023-05-05T075552 000

Regression was from something in this range: https://github.com/dotnet/runtime/compare/887c043eb94be364188e2b23a87efa214ea57f1e...f92b9ef636a4cc9599b33446e30e7a489591ca46 (74 commits)


From Azure Data Explorer, regression was after 2023-04-25T00:29:35Z and before 2023-04-25T03:46:39Z

image

Tue Apr 25 03:46:39 2023 f74138f056a Jeremy Koritzinsky Enable ComWrappers for NativeAOT on non-Windows (#85000)
Tue Apr 25 03:33:48 2023 ad5967230c8 Jeremy Koritzinsky Build Mono LLVM legs on the CBL-Mariner images, but run the AOTing steps on CentOS Stream 8 with binutils (#85273)
Tue Apr 25 02:51:18 2023 516aa8c9eab Stephen Toub Add Parallel.ForAsync (#84804)
Tue Apr 25 02:50:59 2023 759fabe4a25 Stephen Toub Fix perf regressions in Utf8Formatter for integers (#85277)
Tue Apr 25 01:14:10 2023 01e2455aa53 Michał Petryka Add CollectionsMarshal.SetCount(list, count) (#82146)
Tue Apr 25 00:49:02 2023 41f6e792f33 Aleksey Kliger (λgeek) [native-library] If a dllimport is specified with an absolute path, look for it first (#85255)
Tue Apr 25 00:29:35 2023 620b0db3a98 Stephen Toub Add parsing error information to composite format string parsing exceptions (#85106)

So looks like the change that lead to this was https://github.com/dotnet/runtime/pull/85277.


| Method | Job | Toolchain | value | Mean | Error | StdDev | Median | Min | Max | Ratio | RatioSD | Allocated | Alloc Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| FormatterUInt64 | Job-WWTDFM | PGO | 0 | 6.214 ns | 0.0098 ns | 0.0082 ns | 6.213 ns | 6.203 ns | 6.231 ns | 2.91 | 0.03 | - | NA |
| FormatterUInt64 | Job-QWMANF | Default | 0 | 2.137 ns | 0.0243 ns | 0.0227 ns | 2.148 ns | 2.082 ns | 2.162 ns | 1.00 | 0.00 | - | NA |

;; default

05.74%   2.18E+06    ?        Unknown
81.72%   3.103E+07   Tier-1   [MicroBenchmarks]Utf8FormatterTests.FormatterUInt64(unsigned int64)
11.72%   4.45E+06    Tier-1   [9c1446ac-187e-4ddf-b6bd-964603a8e2dd]Runnable_0.WorkloadActionUnroll(int64)

;; pgo

02.50%   1.97E+06    ?        Unknown
93.66%   7.369E+07   Tier-1   [MicroBenchmarks]Utf8FormatterTests.FormatterUInt64(unsigned int64)
03.23%   2.54E+06    Tier-1   [ea5814fd-a81b-4c82-bc90-c88fd94286d8]Runnable_0.WorkloadActionUnroll(int64)

Looks like all the time is spent in the benchmark method. Hopefully this will be easy to track down.


Seems like what is happening is that with PGO we see a lot of cold blocks, and so we don't inline some methods that we inline without PGO. One of those takes a struct by address that is also used in hot code, so we don't fully promote it; we dependently promote it. And then we make some poor choices in codegen that lead to what is likely a partial stall. Sampling shows a lot of time piling up in this little stretch of code:

G_M000_IG05:                ;; offset=0035H
       mov      word  ptr [rsp+48H], 0
       mov      eax, dword ptr [rsp+48H]
       or       al, byte  ptr [rsp+49H]
       jne      G_M000_IG14

For some reason we do a wide load right after a narrow store; my recollection is that the HW does not like this at all (store-to-load forwarding fails when the load is wider than the store). In the non-PGO case this code doesn't exist, because we know all the values are zero.
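Schematically, the problem is a classic store-to-load forwarding failure: a narrow store immediately followed by a wider load that overlaps it can't be forwarded, so the load waits for the store to drain. A C# illustration of the pattern (not the library source):

using System;
using System.Runtime.CompilerServices;

static class StoreForwardSketch
{
    static uint WideLoadAfterNarrowStore()
    {
        Span<byte> buffer = stackalloc byte[4];
        Unsafe.WriteUnaligned<ushort>(ref buffer[0], 0);   // narrow 2-byte store ("mov word ptr [rsp+48H], 0")
        return Unsafe.ReadUnaligned<uint>(ref buffer[0]);  // wide 4-byte load of the same bytes
    }
}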


Looks like we widen this load in fgMorphCastedBitwiseOp. If I fix that not to widen normalize-on-load locals, then we recover most (but not all) of the perf. I suspect the remainder is in the higher prolog/epilog costs we see with PGO.

| Method | Job | Toolchain | value | Mean | Error | StdDev | Median | Min | Max | Ratio | RatioSD | Allocated | Alloc Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| FormatterUInt64 | Job-FINTBR | PGO/Fixed | 0 | 2.516 ns | 0.0596 ns | 0.0558 ns | 2.506 ns | 2.455 ns | 2.631 ns | 1.16 | 0.04 | - | NA |
| FormatterUInt64 | Job-EZPSRN | Default | 0 | 2.159 ns | 0.0497 ns | 0.0441 ns | 2.149 ns | 2.097 ns | 2.254 ns | 1.00 | 0.00 | - | NA |

PGO has now improved, thanks to https://github.com/dotnet/runtime/pull/86491 fixing the store-forwarding stall.

newplot - 2023-05-21T075034 050

Default "regressed" because we enabled PGO by default. then improved with https://github.com/dotnet/runtime/pull/86491

newplot - 2023-05-21T075348 361

AndyAyersMS commented 1 year ago

Closing this; remaining regressions will be analyzed in https://github.com/dotnet/runtime/issues/87194.