adamsitnik opened this issue 5 years ago
@billwert @DrewScoggins will you be looking at this?
@Maoni0 not unless you need help. If you download the perf repo on Windows, you can use the built-in ETW profiling (-p ETW on the command line) to get easy profiles.
We don't know yet whether this is caused by the GC or not. I was hoping someone would figure that out first.
@adamsitnik do you happen to have traces? If not I can take a quick look.
I've run the benchmark with the profiler enabled for the latest 2.2 and 3.0 builds and filtered the trace to a single (the last) iteration, which performed 10 benchmark invocations.
git clone https://github.com/dotnet/performance.git
py .\performance\scripts\benchmarks_ci.py -f netcoreapp2.2 --filter *StackWalk* --bdn-arguments "--profiler ETW --invocationCount 10 --warmupCount 3 --unrollFactor 1"
py .\performance\scripts\benchmarks_ci.py -f netcoreapp3.0 --filter *StackWalk* --bdn-arguments "--profiler ETW --invocationCount 10 --warmupCount 3 --unrollFactor 1"
Top 2.2 methods:
Top 3.0 methods:
Callers of the most time-consuming method (ntdll!LdrpDispatchUserCallTarget):
The output of PerfView's "Regression" feature:
Most of this is due to tiered JITing. If you run this with COMPlus_TieredCompilation=0, most of the regression will disappear.
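For anyone reproducing this: the variable has to be set in the environment of the benchmark process before it starts. A minimal POSIX-shell sketch (the Windows cmd equivalent is noted in the comment):

```shell
# Disable tiered compilation for any process launched from this shell.
# On Windows cmd the equivalent is:  set COMPlus_TieredCompilation=0
export COMPlus_TieredCompilation=0

# Child processes inherit the variable; verify what they will see:
printenv COMPlus_TieredCompilation
```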
The benchmark does not run enough iterations to hit Tier 1. Tier 0 code is bigger. Bigger code has bigger GC and unwind info. Bigger GC and unwind info takes longer to crack.
The benchmark does not run enough iterations to hit Tier 1.
I think that it does run enough iterations to hit Tier 1.
Tier 0 code is bigger. Bigger code has bigger GC and unwind info. Bigger GC and unwind info takes longer to crack.
Do you mean that the code that calls the benchmark (Program.Main etc.) itself was called only once, has not been promoted to Tier 1, and hence has bigger GC and unwind info that takes longer to crack?
Something like:
Main()                  // Tier 0
  HelperMethodA()       // Tier 0
    HelperMethodB()     // Tier 0
      HelperMethodC()   // Tier 0
        HelperMethodD() // Tier 0
          Benchmark()   // Tier 1
Even with a very simple hand-written benchmark there is a visible difference:
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Runtime;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
namespace StackWalker
{
class Program
{
static void Main()
{
Console.WriteLine($"{RuntimeInformation.FrameworkDescription} {RuntimeInformation.OSArchitecture} {RuntimeInformation.OSDescription}");
Console.WriteLine($"GCSettings.IsServerGC={GCSettings.IsServerGC} GCSettings.LatencyMode={GCSettings.LatencyMode}");
Console.WriteLine($"COMPlus_TieredCompilation={Environment.GetEnvironmentVariable("COMPlus_TieredCompilation")}");
Console.WriteLine();
List<long> resultsMs = new List<long>();
Stopwatch watch = Stopwatch.StartNew();
StackWalk sut = new StackWalk();
for (int i = 0; i <= 62; i++)
{
watch.Restart();
for (int j = 0; j < 12; j++)
sut.Walk();
watch.Stop();
Console.Write($"{i.ToString("00")}-{watch.ElapsedMilliseconds} ");
if (i != 0) // don't include JIT cost in the results
resultsMs.Add(watch.ElapsedMilliseconds);
if (i % 31 == 0)
Console.WriteLine();
}
Console.WriteLine();
Console.WriteLine($"Iterations 01-31: Avg: {resultsMs.Take(31).Average().ToString("#.00")}");
Console.WriteLine($"Iterations 32-62: Avg: {resultsMs.Skip(31).Average().ToString("#.00")}");
}
}
public class StackWalk
{
public static int InnerIterationCount = 1000;
public void Walk() => A(5);
[MethodImpl(MethodImplOptions.NoInlining)] private static int A(int a) => B(a + 5);
[MethodImpl(MethodImplOptions.NoInlining)] private static int B(int a) => C(a + 5);
[MethodImpl(MethodImplOptions.NoInlining)] private static int C(int a) => D(a + 5);
[MethodImpl(MethodImplOptions.NoInlining)] private static int D(int a) => E(a + 5);
[MethodImpl(MethodImplOptions.NoInlining)] private static int E(int a) => F(a + 5);
[MethodImpl(MethodImplOptions.NoInlining)] private static int F(int a) => G(a + 5);
[MethodImpl(MethodImplOptions.NoInlining)] private static int G(int a) => H(a + 5);
[MethodImpl(MethodImplOptions.NoInlining)] private static int H(int a) => I(a + 5);
[MethodImpl(MethodImplOptions.NoInlining)] private static int I(int a) => J(a + 5);
[MethodImpl(MethodImplOptions.NoInlining)] private static int J(int a) => K(a + 5);
[MethodImpl(MethodImplOptions.NoInlining)] private static int K(int a) => L(a + 5);
[MethodImpl(MethodImplOptions.NoInlining)] private static int L(int a) => M(a + 5);
[MethodImpl(MethodImplOptions.NoInlining)] private static int M(int a) => N(a + 5);
[MethodImpl(MethodImplOptions.NoInlining)] private static int N(int a) => O(a + 5);
[MethodImpl(MethodImplOptions.NoInlining)] private static int O(int a) => P(a + 5);
[MethodImpl(MethodImplOptions.NoInlining)] private static int P(int a) => Q(a + 5);
[MethodImpl(MethodImplOptions.NoInlining)] private static int Q(int a) => R(a + 5);
[MethodImpl(MethodImplOptions.NoInlining)] private static int R(int a) => S(a + 5);
[MethodImpl(MethodImplOptions.NoInlining)] private static int S(int a) => T(a + 5);
[MethodImpl(MethodImplOptions.NoInlining)] private static int T(int a) => U(a + 5);
[MethodImpl(MethodImplOptions.NoInlining)] private static int U(int a) => V(a + 5);
[MethodImpl(MethodImplOptions.NoInlining)] private static int V(int a) => W(a + 5);
[MethodImpl(MethodImplOptions.NoInlining)] private static int W(int a) => X(a + 5);
[MethodImpl(MethodImplOptions.NoInlining)] private static int X(int a) => Y(a + 5);
[MethodImpl(MethodImplOptions.NoInlining)] private static int Y(int a) => Z(a + 5);
[MethodImpl(MethodImplOptions.NoInlining)] private static int Z(int a)
{
for (int i = 0; i < InnerIterationCount; i++)
GC.Collect(0);
return 55;
}
}
}
.NET Core 3.0.0-preview8-27919-09 X64 Microsoft Windows 10.0.18362
GCSettings.IsServerGC=False GCSettings.LatencyMode=Interactive
COMPlus_TieredCompilation=0
00-206
01-210 02-209 03-206 04-212 05-211 06-206 07-206 08-202 09-206 10-206 11-205 12-206 13-205 14-206 15-204 16-200 17-199 18-200 19-199 20-200 21-199 22-199 23-200 24-201 25-200 26-200 27-199 28-200 29-200 30-200 31-201
32-200 33-200 34-200 35-200 36-200 37-201 38-200 39-200 40-199 41-200 42-200 43-199 44-200 45-200 46-200 47-199 48-200 49-181 50-177 51-176 52-176 53-177 54-187 55-186 56-186 57-187 58-190 59-196 60-189 61-189 62-186
Iterations 01-31: Avg: 203.13
Iterations 32-62: Avg: 192.94
.NET Core 3.0.0-preview8-27919-09 X64 Microsoft Windows 10.0.18362
GCSettings.IsServerGC=False GCSettings.LatencyMode=Interactive
COMPlus_TieredCompilation=1
00-229
01-220 02-290 03-233 04-220 05-234 06-254 07-265 08-279 09-224 10-216 11-213 12-217 13-215 14-211 15-210 16-211 17-213 18-237 19-210 20-210 21-210 22-211 23-216 24-216 25-212 26-211 27-211 28-211 29-211 30-264 31-238
32-239 33-238 34-237 35-280 36-238 37-293 38-233 39-233 40-232 41-233 42-233 43-209 44-228 45-229 46-227 47-232 48-234 49-236 50-264 51-236 52-210 53-210 54-208 55-207 56-207 57-208 58-207 59-208 60-208 61-207 62-207
Iterations 01-31: Avg: 225.58
Iterations 32-62: Avg: 228.10
I want to understand if this is a well-known trade-off that comes with TieredJIT or something that we should improve.
/cc @kouvel
I think that it does run enough iterations to hit Tier 1.
I should have been more precise: the test keeps the runtime suspended pretty much the whole time. The tiered JIT that runs on background threads makes very little progress (on my machine at least) because the background threads are suspended most of the time as well.
Try adding Thread.Sleep after the Console.WriteLine in your handwritten benchmark to unblock the background threads:
...
Console.Write($"{i.ToString("00")}-{watch.ElapsedMilliseconds} ");
Thread.Sleep(10); // requires using System.Threading;
...
You should see completely different numbers once you do that.
It looks like some of the gap is due to FlushProcessWriteBuffers() taking longer when the background JIT thread is active.
TC=0
Name Inc % Inc Exc % Exc Fold When First Last
coreclr!ThreadSuspend::SuspendRuntime 4.6 381 0.0 3 0 00000000000000000000000000000000 24.919 8,312.466
+ ntdll!ZwFlushProcessWriteBuffers 1.7 142 0.7 57 0 0000000o00000o00000o0o00o00000o0 28.923 8,287.276
+ kernelbase!GetThreadPriority 1.6 132 0.0 2 0 0oo0000000o00000000oo000000_0_o0 24.919 8,302.387
+ kernelbase!ResetEvent 1.2 103 0.0 1 0 0o000o0000oo00000_00o0o00o_o00o0 50.043 8,312.466
+ coreclr!StressLog::LogOn 0.0 1 0.0 1 0 ________________o_______________ 4,339.251 4,340.251
TC=1
Name Inc % Inc Exc % Exc Fold When First Last
coreclr!ThreadSuspend::SuspendRuntime 7.0 621 0.0 3 0 00000010000000000000000001100001 20.467 8,378.572
+ ntdll!ZwFlushProcessWriteBuffers 3.7 327 0.4 37 0 00000o000000o00o0000000000000000 40.425 8,378.572
+ kernelbase!GetThreadPriority 1.9 169 0.0 1 0 00000000o0000000000000o0o000o000 20.467 8,366.728
+ kernelbase!ResetEvent 1.3 117 0.0 0 0 0000000o00oo000o0000000oo0000o0o 41.523 8,253.240
+ coreclr!Thread::SuspendThread 0.0 3 0.0 0 0 ______o___________________o__o__ 1,584.920 7,604.332
+ coreclr!ThreadStore::GetAllThreadList 0.0 2 0.0 2 0 __________o________o____________ 2,680.105 5,162.297
Even if the thread is just spinning and not jitting anything, there is a noticeable increase in time. Putting the background thread to sleep seems to eliminate most of the gap even when everything is tier 0. With the suspends it takes a while for the background JIT thread to become stably inactive.
Probably the easiest workaround for now would be to increase the number of warmup iterations for this test. A better workaround may be to force more aggressive tiering-up or, for now, to disable tiering for the test. Is there a way to send environment variables to the out-of-proc runs to configure the runtime?
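Regarding sending environment variables to the out-of-proc runs: if I remember correctly, BenchmarkDotNet exposes an --envVars switch (key:value pairs) that forwards variables to the benchmarked process, so something along these lines might work (flag name and exact syntax would need to be double-checked):

```shell
# Hypothetical invocation: forward COMPlus_TieredCompilation=0 to the
# out-of-proc benchmark process via BenchmarkDotNet's --envVars switch.
py .\performance\scripts\benchmarks_ci.py -f netcoreapp3.0 --filter *StackWalk* --bdn-arguments "--envVars COMPlus_TieredCompilation:0"
```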
The PerfLabTests.StackWalk.Walk benchmark is very unusual: it uses GC.Collect(0) to test the performance of stack walking, so the regression can be related to stack walking or to the GC. The benchmark was run using BenchmarkDotNet, which by default runs benchmarks with the default GC settings for a console app.
/cc @danmosemsft @janvorli @Maoni0 @billwert @DrewScoggins