dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
15.01k stars 4.67k forks source link

Performance regression: PerfLabTests.StackWalk.Walk #13136

Open adamsitnik opened 5 years ago

adamsitnik commented 5 years ago

The PerfLabTests.StackWalk.Walk is very unique, it uses GC.Collect(0) to test the performance of stack walking. So the regression can be related to stack walking or GC.

The benchmark was run using BenchmarkDotNet which by default runs the benchmarks using default GC settings for console app.

git clone https://github.com/dotnet/performance.git
cd performance
# if you don't have cli installed and want python script to download the latest cli for you
py .\scripts\benchmarks_ci.py -f netcoreapp2.2 netcoreapp3.0 --filter PerfLabTests.StackWalk.Walk
# if you do
dotnet run -p .\src\benchmarks\micro\MicroBenchmarks.csproj -c Release -f netcoreapp2.2 --runtimes netcoreapp2.2 netcoreapp3.0 --filter PerfLabTests.StackWalk.Walk

PerfLabTests.StackWalk.Walk

Conclusion Base Diff Base/Diff Modality Operating System Bit Processor Name Base Runtime Diff Runtime
Slower 16831730.00 22812525.00 0.74 Windows 10.0.18362 64bit Intel Xeon CPU E5-1650 v4 3.60GHz .NET Core 2.2.6 .NET Core 3.0.0-preview8-27919-09
Slower 19752015.38 25708110.00 0.77 Windows 10.0.18362 64bit Intel Core i7-5557U CPU 3.10GHz (Broadwell) .NET Core 2.2.6 .NET Core 3.0.0-preview8-27919-09
Slower 18955330.77 24115910.00 0.79 Windows 10.0.18362 64bit Intel Core i7-7700 CPU 3.60GHz (Kaby Lake) .NET Core 2.2.6 .NET Core 3.0.0-preview8-28373-12
Slower 22372690.91 28345400.00 0.79 Windows 10.0.18362 64bit AMD Ryzen 7 1800X .NET Core 2.2.6 .NET Core 3.0.0-preview8-28373-12
Slower 15101350.16 20760496.15 0.73 bimodal Windows 10.0.18362 64bit AMD Ryzen 9 3900X .NET Core 2.2.6 .NET Core 3.0.0-preview8-28373-12
Slower 20266117.58 26431496.20 0.77 bimodal ubuntu 18.04 64bit Intel Xeon CPU E5-1650 v4 3.60GHz .NET Core 2.2.6 .NET Core 3.0.0-preview8-27919-09
Slower 36022729.43 42040344.13 0.86 ubuntu 16.04 64bit Intel Xeon CPU E5-2673 v4 2.30GHz .NET Core 2.2.6 .NET Core 3.0.0-preview8-27919-01
Slower 34662521.43 44050641.67 0.79 bimodal ubuntu 18.04 64bit Intel Xeon CPU E5-2673 v4 2.30GHz .NET Core 2.2.6 .NET Core 3.0.0-preview8-27919-01
Slower 25824391.50 30038940.78 0.86 macOS Mojave 10.14.5 64bit Intel Core i7-5557U CPU 3.10GHz (Broadwell) .NET Core 2.2.6 .NET Core 3.0.0-preview8-27919-09
Slower 15319558.75 17965150.00 0.85 Windows 10.0.18362 32bit Intel Xeon CPU E5-1650 v4 3.60GHz .NET Core 2.2.6 .NET Core 3.0.0-preview8-27919-09
Slower 17350257.14 20233740.00 0.86 Windows 10.0.18362 32bit Intel Core i7-5557U CPU 3.10GHz (Broadwell) .NET Core 2.2.6 .NET Core 3.0.0-preview8-28373-17
Same 112218444.50 112609946.50 1.00 ubuntu 18.04 64bit ARMv8 Processor rev 1 (v8l) .NET Core 2.2.6 .NET Core 3.0.0-preview8-27919-09
Same 224202834.00 221191386.50 1.01 bimodal ubuntu 14.04 32bit ARMv7 Processor rev 3 (v7l) .NET Core 2.2.6 .NET Core 3.0.0-preview8-28373-17

/cc @danmosemsft @janvorli @Maoni0 @billwert @DrewScoggins

Maoni0 commented 5 years ago

@billwert @DrewScoggins will you be looking at this?

billwert commented 5 years ago

@Maoni0 not unless you need help. If you download the perf repo on Windows you can use the built in ETW profiling to give you easy profiles (-p ETW) on the command line.

Maoni0 commented 5 years ago

we dunno if this is caused by the GC or not. I was hoping someone would figure that out first.

billwert commented 5 years ago

@adamsitnik do you happen to have traces? If not I can take a quick look.

adamsitnik commented 5 years ago

I've run the benchmark with profiler enabled for latest 2.2 and 3.0 and filtered it to a single (last) iteration that was doing 10 benchmark invocations.

git clone https://github.com/dotnet/performance.git
py .\performance\scripts\benchmarks_ci.py -f netcoreapp2.2 --filter *StackWalk* --bdn-arguments "--profiler ETW --invocationCount 10 --warmupCount 3 --unrollFactor 1""
py .\performance\scripts\benchmarks_ci.py -f netcoreapp3.0 --filter *StackWalk* --bdn-arguments "--profiler ETW --invocationCount 10 --warmupCount 3 --unrollFactor 1"

Top 2.2 methods:

obraz

Top 3.0 methods:

obraz

Callers of the most time consuming method (ntdll!LdrpDispatchUserCallTarget):

obraz

The output of "Regression" PerfView feature:

### Overweight report for symbols common between both files Base (old) Time: | 238.7 -- | -- Test (new) Time: | 332.4 Delta: | 93.7 Delta %: | 39.3 In this report, overweight is ratio of actual growth compared to 39.3%. Interest level attempts to identify smaller methods which changed a lot. These are likely the most interesting frames to sart investigating An overweight of greater than 100% indicates the symbol grew in cost more than average. High overweights are a likely source of regressions and represent good places to investigate. Only symbols that have at least 2% impact are shown. Name | Base | Test | Delta | Responsibility % | Overweight % | Interest Level -- | -- | -- | -- | -- | -- | -- coreclr!ClrFlsSetValue | 1.0 | 20.0 | 19.0 | 20.27 | 4838.35 | 5 ntoskrnl!NtSetEvent | 3.0 | 22.0 | 19.0 | 20.27 | 1612.79 | 5 ntoskrnl!KeSetEvent | 2.0 | 19.0 | 17.0 | 18.14 | 2164.53 | 4 coreclr!ThreadSuspend::RestartEE | 2.0 | 15.0 | 13.0 | 13.87 | 1655.23 | 4 coreclr!McTemplateCoU0qqh | 2.0 | 13.0 | 11.0 | 11.74 | 1400.58 | 4 ntoskrnl!ExAcquireRundownProtectionCacheAwareEx | 1.0 | 6.0 | 5.0 | 5.34 | 1273.25 | 4 coreclr!??ThreadSuspend::SuspendEE | 2.0 | 12.0 | 10.0 | 10.67 | 1273.25 | 4 kernelbase!SetEvent | 4.0 | 22.0 | 18.0 | 19.21 | 1145.93 | 4 ntdll!NtSetEvent | 4.0 | 22.0 | 18.0 | 19.21 | 1145.93 | 4 coreclr!StackFrameIterator::ProcessIp | 4.3 | 15.0 | 10.7 | 11.40 | 629.82 | 4 ntdll!LdrpDispatchUserCallTarget | 6.0 | 20.0 | 14.0 | 14.94 | 594.18 | 4 coreclr!EECodeInfo::Init | 4.3 | 14.0 | 9.7 | 10.33 | 570.85 | 4 coreclr!WKS::gc_heap::do_post_gc | 5.0 | 15.0 | 10.0 | 10.67 | 509.30 | 4 coreclr!GcSlotDecoder::DecodeSlotTable | 5.0 | 13.0 | 8.0 | 8.54 | 407.44 | 4 coreclr!GcInfoDecoder::EnumerateLiveSlots | 15.0 | 28.0 | 13.0 | 13.87 | 220.70 | 4 coreclr!ThreadSuspend::SuspendRuntime | 6.0 | 11.0 | 5.0 | 5.34 | 212.21 | 4 coreclr!ThreadSuspend::SuspendEE | 8.0 | 14.0 | 6.0 | 6.40 | 190.99 | 4 coreclr!McTemplateCoU0xdqqqqhqq | 8.0 | 14.0 | 6.0 | 6.40 | 190.99 | 4 coreclr!WKS::gc_heap::garbage_collect | 206.7 | 243.3 | 36.7 | 39.11 | 45.17 | 4 coreclr!WKS::gc_heap::gc1 | 200.7 | 234.3 | 33.7 | 35.91 | 42.71 | 4 ntoskrnl!NtGetWriteWatch | 10.0 | 3.0 | -7.0 | -7.47 | -178.26 | 4 kernelbase!GetWriteWatch | 11.0 | 3.0 | -8.0 | -8.54 | -185.20 | 4 ntdll!NtGetWriteWatch | 11.0 | 3.0 | -8.0 | -8.54 | -185.20 | 4 coreclr!HndScanHandlesForGC | 0.3 | 2.3 | 2.0 | 2.10 | 1507.87 | 3 coreclr!TableScanHandles | 0.3 | 2.3 | 2.0 | 2.10 | 1507.87 | 3 ntoskrnl!NtFlushProcessWriteBuffers | 1.0 | 3.0 | 2.0 | 2.13 | 509.30 | 3 ntdll!NtQueryInformationThread | 2.0 | 5.0 | 3.0 | 3.20 | 381.98 | 3 kernelbase!GetThreadPriority | 2.0 | 5.0 | 3.0 | 3.20 | 381.98 | 3 coreclr!GCToCLREventSink::FireGCHeapStats_V1 | 3.0 | 7.0 | 4.0 | 4.27 | 339.53 | 3 coreclr!McTemplateCoU0xxxxxxxxxxqqqh | 3.0 | 7.0 | 4.0 | 4.27 | 339.53 | 3 ntdll!NtFlushProcessWriteBuffers | 2.0 | 4.0 | 2.0 | 2.13 | 254.65 | 3 ntoskrnl!MiMoveDirtyBitsToPfns | 2.0 | 4.0 | 2.0 | 2.13 | 254.65 | 3 ntoskrnl!KiSystemServiceUser | 7.0 | 11.1 | 4.1 | 4.34 | 147.84 | 3 ntoskrnl!KiPageFault | 4.0 | 6.0 | 2.0 | 2.13 | 127.33 | 3 coreclr!McGenEventWriteUM | 50.0 | 67.0 | 17.0 | 18.14 | 86.58 | 3 ntdll!EtwpEventWriteFull | 49.0 | 65.0 | 16.0 | 17.07 | 83.15 | 3 ntdll!NtTraceEvent | 46.0 | 61.0 | 15.0 | 16.01 | 83.04 | 3 ntdll!EtwEventWrite | 50.0 | 66.0 | 16.0 | 17.07 | 81.49 | 3 coreclr!EECodeManager::EnumGcRefs | 35.0 | 45.0 | 10.0 | 10.67 | 72.76 | 3 coreclr!GCInterface::Collect | 237.7 | 305.4 | 67.7 | 72.26 | 72.56 | 3 coreclr!WKS::GCHeap::GarbageCollect | 236.7 | 303.4 | 66.7 | 71.19 | 71.79 | 3 microbenchmarks!PerfLabTests.StackWalk.Y(int32) | 238.3 | 305.1 | 66.7 | 71.22 | 71.32 | 3 microbenchmarks!PerfLabTests.StackWalk.Z(int32) | 238.3 | 305.1 | 66.7 | 71.22 | 71.32 | 3 microbenchmarks!PerfLabTests.StackWalk.X(int32) | 238.3 | 305.1 | 66.7 | 71.22 | 71.32 | 3 microbenchmarks!PerfLabTests.StackWalk.S(int32) | 238.3 | 305.1 | 66.7 | 71.22 | 71.32 | 3 microbenchmarks!PerfLabTests.StackWalk.V(int32) | 238.3 | 305.1 | 66.7 | 71.22 | 71.32 | 3 microbenchmarks!PerfLabTests.StackWalk.Walk() | 238.3 | 305.1 | 66.7 | 71.22 | 71.32 | 3 microbenchmarks!PerfLabTests.StackWalk.A(int32) | 238.3 | 305.1 | 66.7 | 71.22 | 71.32 | 3 microbenchmarks!PerfLabTests.StackWalk.B(int32) | 238.3 | 305.1 | 66.7 | 71.22 | 71.32 | 3 microbenchmarks!PerfLabTests.StackWalk.C(int32) | 238.3 | 305.1 | 66.7 | 71.22 | 71.32 | 3 microbenchmarks!PerfLabTests.StackWalk.D(int32) | 238.3 | 305.1 | 66.7 | 71.22 | 71.32 | 3 microbenchmarks!PerfLabTests.StackWalk.E(int32) | 238.3 | 305.1 | 66.7 | 71.22 | 71.32 | 3 microbenchmarks!PerfLabTests.StackWalk.F(int32) | 238.3 | 305.1 | 66.7 | 71.22 | 71.32 | 3 microbenchmarks!PerfLabTests.StackWalk.G(int32) | 238.3 | 305.1 | 66.7 | 71.22 | 71.32 | 3 microbenchmarks!PerfLabTests.StackWalk.H(int32) | 238.3 | 305.1 | 66.7 | 71.22 | 71.32 | 3 microbenchmarks!PerfLabTests.StackWalk.W(int32) | 238.3 | 305.1 | 66.7 | 71.22 | 71.32 | 3 microbenchmarks!PerfLabTests.StackWalk.J(int32) | 238.3 | 305.1 | 66.7 | 71.22 | 71.32 | 3 microbenchmarks!PerfLabTests.StackWalk.I(int32) | 238.3 | 305.1 | 66.7 | 71.22 | 71.32 | 3 microbenchmarks!PerfLabTests.StackWalk.L(int32) | 238.3 | 305.1 | 66.7 | 71.22 | 71.32 | 3 microbenchmarks!PerfLabTests.StackWalk.M(int32) | 238.3 | 305.1 | 66.7 | 71.22 | 71.32 | 3 microbenchmarks!PerfLabTests.StackWalk.N(int32) | 238.3 | 305.1 | 66.7 | 71.22 | 71.32 | 3 microbenchmarks!PerfLabTests.StackWalk.O(int32) | 238.3 | 305.1 | 66.7 | 71.22 | 71.32 | 3 microbenchmarks!PerfLabTests.StackWalk.P(int32) | 238.3 | 305.1 | 66.7 | 71.22 | 71.32 | 3 microbenchmarks!PerfLabTests.StackWalk.Q(int32) | 238.3 | 305.1 | 66.7 | 71.22 | 71.32 | 3 microbenchmarks!PerfLabTests.StackWalk.R(int32) | 238.3 | 305.1 | 66.7 | 71.22 | 71.32 | 3 microbenchmarks!PerfLabTests.StackWalk.T(int32) | 238.3 | 305.1 | 66.7 | 71.22 | 71.32 | 3 microbenchmarks!PerfLabTests.StackWalk.U(int32) | 238.3 | 305.1 | 66.7 | 71.22 | 71.32 | 3 microbenchmarks!PerfLabTests.StackWalk.K(int32) | 238.3 | 305.1 | 66.7 | 71.22 | 71.32 | 3 dotnet!wmain | 238.7 | 305.4 | 66.7 | 71.19 | 71.19 | 3 dotnet!exe_start | 238.7 | 305.4 | 66.7 | 71.19 | 71.19 | 3 dotnet!__scrt_common_main_seh | 238.7 | 305.4 | 66.7 | 71.19 | 71.19 | 3 hostfxr!hostfxr_main_startupinfo | 238.7 | 305.4 | 66.7 | 71.19 | 71.19 | 3 hostfxr!fx_muxer_t::execute | 238.7 | 305.4 | 66.7 | 71.19 | 71.19 | 3 hostfxr!fx_muxer_t::handle_exec_host_command | 238.7 | 305.4 | 66.7 | 71.19 | 71.19 | 3 coreclr!coreclr_execute_assembly | 238.7 | 305.4 | 66.7 | 71.19 | 71.19 | 3 coreclr!RuntimeMethodHandle::InvokeMethod | 238.7 | 305.4 | 66.7 | 71.19 | 71.19 | 3 benchmarkdotnet!BenchmarkDotNet.Engines.EngineStage.RunIteration(value class BenchmarkDotNet.Engines.IterationMode,value class BenchmarkDotNet.Engines.IterationStage,int32,int64,int32) | 238.7 | 305.4 | 66.7 | 71.19 | 71.19 | 3 benchmarkdotnet!BenchmarkDotNet.Engines.EngineActualStage.RunAuto(int64,value class BenchmarkDotNet.Engines.IterationMode,int32) | 238.7 | 305.4 | 66.7 | 71.19 | 71.19 | 3 benchmarkdotnet!BenchmarkDotNet.Engines.EngineActualStage.Run(int64,value class BenchmarkDotNet.Engines.IterationMode,bool,int32,bool) | 238.7 | 305.4 | 66.7 | 71.19 | 71.19 | 3 benchmarkdotnet!BenchmarkDotNet.Engines.Engine.Run() | 238.7 | 305.4 | 66.7 | 71.19 | 71.19 | 3 coreclr!CallDescrWorkerReflectionWrapper | 238.7 | 305.4 | 66.7 | 71.19 | 71.19 | 3 hostfxr!execute_app | 238.7 | 305.4 | 66.7 | 71.19 | 71.19 | 3 coreclr!CallDescrWorkerInternal | 238.7 | 305.4 | 66.7 | 71.19 | 71.19 | 3 coreclr!MethodDescCallSite::CallTargetWorker | 238.7 | 305.4 | 66.7 | 71.19 | 71.19 | 3 system.private.corelib!System.Reflection.RuntimeMethodInfo.Invoke(System.Object, System.Reflection.BindingFlags, System.Reflection.Binder, System.Object[], System.Globalization.CultureInfo) | 238.7 | 305.4 | 66.7 | 71.19 | 71.19 | 3 hostfxr!`anonymous namespace'::read_config_and_execute | 238.7 | 305.4 | 66.7 | 71.19 | 71.19 | 3 hostpolicy!corehost_main | 238.7 | 305.4 | 66.7 | 71.19 | 71.19 | 3 coreclr!CorHost2::ExecuteAssembly | 238.7 | 305.4 | 66.7 | 71.19 | 71.19 | 3 coreclr!Assembly::ExecuteMainMethod | 238.7 | 305.4 | 66.7 | 71.19 | 71.19 | 3 coreclr!RunMain | 238.7 | 305.4 | 66.7 | 71.19 | 71.19 | 3 benchmarkdotnet!BenchmarkDotNet.Engines.Engine.RunIteration(value class BenchmarkDotNet.Engines.IterationData) | 238.7 | 305.4 | 66.7 | 71.19 | 71.19 | 3 coreclr!WKS::GCHeap::GarbageCollectGeneration | 235.7 | 299.4 | 63.7 | 67.99 | 68.85 | 3 ntoskrnl!EtwpWriteUserEvent | 31.0 | 38.0 | 7.0 | 7.47 | 57.50 | 3 coreclr!GcStackCrawlCallBack | 49.0 | 56.0 | 7.0 | 7.47 | 36.38 | 3 ntoskrnl!NtTraceEvent | 40.0 | 45.0 | 5.0 | 5.34 | 31.83 | 3 coreclr!Thread::StackWalkFramesEx | 124.3 | 139.0 | 14.7 | 15.67 | 30.07 | 3 coreclr!WKS::gc_heap::mark_phase | 173.7 | 168.3 | -5.3 | -5.70 | -7.84 | 3 coreclr!StackFrameIterator::ProcessCurrentFrame | 40.0 | 35.0 | -5.0 | -5.34 | -31.83 | 3 coreclr!GcInfoDecoder::GcInfoDecoder | 25.0 | 20.0 | -5.0 | -5.34 | -50.93 | 3 ntdll!NtClearEvent | 4.0 | 2.1 | -1.9 | -2.07 | -123.26 | 3 kernelbase!ResetEvent | 4.0 | 2.1 | -1.9 | -2.07 | -123.26 | 3 coreclr!WKS::gc_heap::do_pre_gc | 3.0 | 1.0 | -2.0 | -2.13 | -169.77 | 3 ROOT | 238.7 | 332.4 | 93.7 | 100.00 | 100.00 | 2 ntdll!RtlUserThreadStart | 238.7 | 332.4 | 93.7 | 100.00 | 100.00 | 2 kernel32!BaseThreadInitThunk | 238.7 | 332.4 | 93.7 | 100.00 | 100.00 | 2 coreclr!McTemplateCoU0h | 7.0 | 9.0 | 2.0 | 2.13 | 72.76 | 2 ntoskrnl!EtwpReserveTraceBuffer | 15.0 | 17.0 | 2.0 | 2.13 | 33.95 | 2 coreclr!StackFrameIterator::NextRaw | 58.3 | 55.0 | -3.3 | -3.54 | -14.49 | 2 coreclr!McTemplateCoU0qhqx | 19.0 | 17.0 | -2.0 | -2.13 | -26.81 | 2 coreclr!Thread::VirtualUnwindCallFrame | 14.0 | 12.0 | -2.0 | -2.13 | -36.38 | 2 coreclr!StackFrameIterator::CheckForSkippedFrames | 27.0 | 23.0 | -4.0 | -4.27 | -37.73 | 2 coreclr!EECodeManager::GetParamContextType | 10.0 | 8.0 | -2.0 | -2.13 | -50.93 | 2 ntoskrnl!ObReferenceObjectByHandle | 8.0 | 6.0 | -2.0 | -2.13 | -63.66 | 2 coreclr!EECodeManager::GetGSCookieAddr | 10.0 | 7.0 | -3.0 | -3.20 | -76.40 | 2 coreclr!WKS::gc_heap::mark_through_cards_for_segments | 6.0 | 4.0 | -2.0 | -2.13 | -84.88 | 2 coreclr!??ThreadSuspend::RestartEE | 6.0 | 4.0 | -2.0 | -2.13 | -84.88 | 2 coreclr!GcInfoDecoder::FindSafePoint | 6.0 | 4.0 | -2.0 | -2.13 | -84.88 | 2 ntdll!RtlVirtualUnwind | 10.0 | 6.0 | -4.0 | -4.27 | -101.86 | 2 coreclr!StackFrameIterator::Init | 5.0 | 3.0 | -2.0 | -2.13 | -101.86 | 2 coreclr!WKS::gc_heap::find_first_object | 5.0 | 3.0 | -2.0 | -2.13 | -101.86 | 2
jkotas commented 5 years ago

Most of this is due to tiered JITing. If you run this with COMPlus_TieredCompilation=0, most of the regression will disappear.

The benchmark does not run enough iterations to hit Tier 1. Tier 0 code is bigger. Bigger code has bigger GC and unwind info. Bigger GC and unwind info takes longer to crack.

adamsitnik commented 5 years ago

The benchmark does not run enough iterations to hit Tier 1.

I think that it does run enough iterations to hit Tier 1.

Tier 0 code is bigger. Bigger code has bigger GC and unwind info. Bigger GC and unwind info takes longer to crack.

Do you mean that the code that calls the benchmark (Program.Main etc) itself was called only once, have not been promoted to Tier 1 and hence has bigger GC and Unwind info and takes longer to crack?

Sth like:

Main() // Tier 0
HelperMethodA() // Tier 0
HelperMethodB() // Tier 0
HelperMethodC() // Tier 0
HelperMethodD() // Tier 0
Benchmark() // Tier 1

Even with a very simple hand-written benchmark there is a visible difference:

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Runtime;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

namespace StackWalker
{
    class Program
    {
        static void Main()
        {
            Console.WriteLine($"{RuntimeInformation.FrameworkDescription} {RuntimeInformation.OSArchitecture} {RuntimeInformation.OSDescription}");
            Console.WriteLine($"GCSettings.IsServerGC={GCSettings.IsServerGC} GCSettings.LatencyMode={GCSettings.LatencyMode}");
            Console.WriteLine($"COMPlus_TieredCompilation={Environment.GetEnvironmentVariable("COMPlus_TieredCompilation")}");
            Console.WriteLine();

            List<long> resultsMs = new List<long>();
            Stopwatch watch = Stopwatch.StartNew();
            StackWalk sut = new StackWalk();

            for (int i = 0; i <= 62; i++)
            {
                watch.Restart();
                for (int j = 0; j < 12; j++)
                    sut.Walk();
                watch.Stop();

                Console.Write($"{i.ToString("00")}-{watch.ElapsedMilliseconds} ");

                if (i != 0) // don't include JIT cost in the results
                    resultsMs.Add(watch.ElapsedMilliseconds);

                if (i % 31 == 0)
                    Console.WriteLine();
            }

            Console.WriteLine();
            Console.WriteLine($"Iterations 01-31: Avg: {resultsMs.Take(31).Average().ToString("#.00")}");
            Console.WriteLine($"Iterations 32-62: Avg: {resultsMs.Skip(31).Average().ToString("#.00")}");
        }
    }

    public class StackWalk
    {
        public static int InnerIterationCount = 1000;

        public void Walk() => A(5);
        [MethodImpl(MethodImplOptions.NoInlining)] private static int A(int a) => B(a + 5);
        [MethodImpl(MethodImplOptions.NoInlining)] private static int B(int a) => C(a + 5);
        [MethodImpl(MethodImplOptions.NoInlining)] private static int C(int a) => D(a + 5);
        [MethodImpl(MethodImplOptions.NoInlining)] private static int D(int a) => E(a + 5);
        [MethodImpl(MethodImplOptions.NoInlining)] private static int E(int a) => F(a + 5);
        [MethodImpl(MethodImplOptions.NoInlining)] private static int F(int a) => G(a + 5);
        [MethodImpl(MethodImplOptions.NoInlining)] private static int G(int a) => H(a + 5);
        [MethodImpl(MethodImplOptions.NoInlining)] private static int H(int a) => I(a + 5);
        [MethodImpl(MethodImplOptions.NoInlining)] private static int I(int a) => J(a + 5);
        [MethodImpl(MethodImplOptions.NoInlining)] private static int J(int a) => K(a + 5);
        [MethodImpl(MethodImplOptions.NoInlining)] private static int K(int a) => L(a + 5);
        [MethodImpl(MethodImplOptions.NoInlining)] private static int L(int a) => M(a + 5);
        [MethodImpl(MethodImplOptions.NoInlining)] private static int M(int a) => N(a + 5);
        [MethodImpl(MethodImplOptions.NoInlining)] private static int N(int a) => O(a + 5);
        [MethodImpl(MethodImplOptions.NoInlining)] private static int O(int a) => P(a + 5);
        [MethodImpl(MethodImplOptions.NoInlining)] private static int P(int a) => Q(a + 5);
        [MethodImpl(MethodImplOptions.NoInlining)] private static int Q(int a) => R(a + 5);
        [MethodImpl(MethodImplOptions.NoInlining)] private static int R(int a) => S(a + 5);
        [MethodImpl(MethodImplOptions.NoInlining)] private static int S(int a) => T(a + 5);
        [MethodImpl(MethodImplOptions.NoInlining)] private static int T(int a) => U(a + 5);
        [MethodImpl(MethodImplOptions.NoInlining)] private static int U(int a) => V(a + 5);
        [MethodImpl(MethodImplOptions.NoInlining)] private static int V(int a) => W(a + 5);
        [MethodImpl(MethodImplOptions.NoInlining)] private static int W(int a) => X(a + 5);
        [MethodImpl(MethodImplOptions.NoInlining)] private static int X(int a) => Y(a + 5);
        [MethodImpl(MethodImplOptions.NoInlining)] private static int Y(int a) => Z(a + 5);
        [MethodImpl(MethodImplOptions.NoInlining)] private static int Z(int a)
        {
            for (int i = 0; i < InnerIterationCount; i++)
                GC.Collect(0);

            return 55;
        }
    }
}

COMPlus_TieredCompilation=0

.NET Core 3.0.0-preview8-27919-09 X64 Microsoft Windows 10.0.18362
GCSettings.IsServerGC=False GCSettings.LatencyMode=Interactive
COMPlus_TieredCompilation=0

00-206
01-210 02-209 03-206 04-212 05-211 06-206 07-206 08-202 09-206 10-206 11-205 12-206 13-205 14-206 15-204 16-200 17-199 18-200 19-199 20-200 21-199 22-199 23-200 24-201 25-200 26-200 27-199 28-200 29-200 30-200 31-201
32-200 33-200 34-200 35-200 36-200 37-201 38-200 39-200 40-199 41-200 42-200 43-199 44-200 45-200 46-200 47-199 48-200 49-181 50-177 51-176 52-176 53-177 54-187 55-186 56-186 57-187 58-190 59-196 60-189 61-189 62-186

Iterations 01-31: Avg: 203.13
Iterations 32-62: Avg: 192.94

COMPlus_TieredCompilation=1

.NET Core 3.0.0-preview8-27919-09 X64 Microsoft Windows 10.0.18362
GCSettings.IsServerGC=False GCSettings.LatencyMode=Interactive
COMPlus_TieredCompilation=1

00-229
01-220 02-290 03-233 04-220 05-234 06-254 07-265 08-279 09-224 10-216 11-213 12-217 13-215 14-211 15-210 16-211 17-213 18-237 19-210 20-210 21-210 22-211 23-216 24-216 25-212 26-211 27-211 28-211 29-211 30-264 31-238
32-239 33-238 34-237 35-280 36-238 37-293 38-233 39-233 40-232 41-233 42-233 43-209 44-228 45-229 46-227 47-232 48-234 49-236 50-264 51-236 52-210 53-210 54-208 55-207 56-207 57-208 58-207 59-208 60-208 61-207 62-207

Iterations 01-31: Avg: 225.58
Iterations 32-62: Avg: 228.10

I want to understand if this is a well-known trade-off that comes with TieredJIT or something that we should improve.

/cc @kouvel

jkotas commented 5 years ago

I think that it does run enough iterations to hit Tier 1.

I should have been more precise: The test is keeping the runtime suspended pretty much the whole time. Tiered JIT that runs on background threads is making very little progress (on my machine at least) because of the background threads are suspended most of time as well.

Try adding Thread.Sleep after the Console.WriteLine to your handwritten benchmark to unblock the background threads:

...
                Console.Write($"{i.ToString("00")}-{watch.ElapsedMilliseconds} ");

                Thread.Sleep(10);
...

You should see completely different numbers once you do that.

kouvel commented 5 years ago

It looks like some of the gap is due to FlushProcessWriteBuffers() taking longer when the background JIT thread is active.

TC=0

Name                                    Inc %      Inc  Exc %   Exc Fold                                 When        First        Last
 coreclr!ThreadSuspend::SuspendRuntime    4.6      381    0.0     3    0     00000000000000000000000000000000       24.919   8,312.466
+ ntdll!ZwFlushProcessWriteBuffers        1.7      142    0.7    57    0     0000000o00000o00000o0o00o00000o0       28.923   8,287.276
+ kernelbase!GetThreadPriority            1.6      132    0.0     2    0     0oo0000000o00000000oo000000_0_o0       24.919   8,302.387
+ kernelbase!ResetEvent                   1.2      103    0.0     1    0     0o000o0000oo00000_00o0o00o_o00o0       50.043   8,312.466
+ coreclr!StressLog::LogOn                0.0        1    0.0     1    0     ________________o_______________    4,339.251   4,340.251

TC=1

Name                                    Inc %      Inc  Exc %   Exc Fold                                 When        First        Last
 coreclr!ThreadSuspend::SuspendRuntime    7.0      621    0.0     3    0     00000010000000000000000001100001       20.467   8,378.572
+ ntdll!ZwFlushProcessWriteBuffers        3.7      327    0.4    37    0     00000o000000o00o0000000000000000       40.425   8,378.572
+ kernelbase!GetThreadPriority            1.9      169    0.0     1    0     00000000o0000000000000o0o000o000       20.467   8,366.728
+ kernelbase!ResetEvent                   1.3      117    0.0     0    0     0000000o00oo000o0000000oo0000o0o       41.523   8,253.240
+ coreclr!Thread::SuspendThread           0.0        3    0.0     0    0     ______o___________________o__o__    1,584.920   7,604.332
+ coreclr!ThreadStore::GetAllThreadList   0.0        2    0.0     2    0     __________o________o____________    2,680.105   5,162.297

Even if the thread is just spinning and not jitting anything, there is a noticeable increase in time. Putting the background thread to sleep seems to eliminate most of the gap even when everything is tier 0. With the suspends it takes a while for the background JIT thread to become stably inactive.

Probably for now an easy workaround would be to increase warmup iterations for this test. A better workaround may be to force more aggressive tiering-up or for now to disable tiering for the test. Is there a way to send environment variables to out-of-proc runs to configure the runtime?