dotnet / runtime


Quality of native perf profiling on x64 #105690

Open EgorBo opened 1 month ago

EgorBo commented 1 month ago

I have built a toy bot that runs arbitrary benchmarks on Azure Linux VMs and reports the results back to the PRs/issues the bot was invoked from. It also runs the native Linux perf tool to collect traces/flamegraphs on demand. What I noticed from the beginning is that the quality of flamegraphs on x64 is often poor and somewhat random between runs, compared to exactly the same configuration on arm64.

Example: https://github.com/dotnet/runtime/pull/105593#issuecomment-2258072263

Here is how flamegraphs look between sequential runs on arm64 (an Ampere Altra):

If you open these graphs and switch between the browser tabs, you will notice almost no difference between "run 0" and "run 1", as expected.

On x64 (AMD Milan) the picture is different:

(presumably, with less aggressive/disabled inlining it's a lot worse)

The graphs are quite "random" between runs. We wonder if this could be caused by an x64-specific optimization that omits frame pointers for simple methods (on arm64 we currently always emit them, although we have issues on file to eventually change that: https://github.com/dotnet/runtime/issues/88823 and https://github.com/dotnet/runtime/issues/35274). Should we disable that optimization whenever PerfMap is enabled? (Or even introduce a new knob, since this is not specific to perf.)
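If frame-pointer omission is the suspect, one quick way to check the hypothesis is to inspect the prologue of a hot jitted method in the annotated disassembly. A sketch, reusing the `perf annotate` invocation from the collection script; `SomeHotMethod` and `perfjit.data` are placeholders for an actual symbol and the injected data file:

```shell
# Sketch: dump annotated asm for one hot jitted method. A method that
# keeps the frame pointer starts with "push rbp; mov rbp, rsp"; if that
# prologue is missing, perf's default frame-pointer unwinder cannot
# step through this frame. 'SomeHotMethod' is a placeholder symbol name.
perf annotate --stdio -i perfjit.data -M intel --symbol 'SomeHotMethod' | head -30
```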

cc @jkotas @dotnet/dotnet-diag

EgorBo commented 1 month ago

@SingleAccretion pointed me to these two places:

1) How the JIT decides when to set up a frame: link
2) The JIT ignores the VM's JIT_FLAG_FRAMED on x64: link

EgorBo commented 1 month ago

On a completely unrelated note, it's interesting that DateTime formatting is 2x slower than Guid and Int32 formatting; I wonder if there is some optimization potential here 🙂 cc @stephentoub (or is this expected because the rules for dates are a lot more complicated?) image

^ arm64

same sub-graph on x64:

image

stephentoub commented 1 month ago

On a completely unrelated note, it's interesting that DateTime formatting is 2x slower than Guid and Int32 formatting; I wonder if there is some optimization potential here 🙂 cc @stephentoub (or is this expected because the rules for dates are a lot more complicated?)

It's much more complicated; frankly, I'm pleased it's only 2x the cost of formatting an Int32. https://github.com/dotnet/runtime/blob/35b94da00d4a79de3252ba79b6d3942cc3716405/src/libraries/System.Private.CoreLib/src/System/Globalization/DateTimeFormat.cs#L1439-L1513 But if you have ideas for significantly improving it, that'd be great.

jkotas commented 1 month ago

We wonder if this could be caused by an x64-specific optimization to omit frame pointers for simple methods (on arm64 we currently always emit them

I do not see how this can explain randomness. I can believe that this optimization can cause the flamegraph to be less representative in some cases (similar effects as inlining or tail-call optimizations), but it should not be random.

EgorBo commented 1 month ago

We wonder if this could be caused by an x64-specific optimization to omit frame pointers for simple methods (on arm64 we currently always emit them

I do not see how this can explain randomness. I can believe that this optimization can cause the flamegraph to be less representative in some cases (similar effects as inlining or tail-call optimizations), but it should not be random.

Any ideas what it could be? We do rely on randomness for e.g. PGO, but I'd expect arm64 to be affected as well if that were the major factor.

Here is a diff example between two x64 runs:

image

I also wonder why we see OnTryWrite twice here; is it recursive?

EgorBo commented 1 month ago

A bit of context on how the perf data is collected: I ask BDN to run longer, wait 15 seconds, then attach perf to the process, collect samples for 8 seconds, and quit. Hence, it's expected to be in a steady state for such trivial benchmarks.

```shell
DOTNET_PerfMapEnabled=1 DOTNET_EnableWriteXorExecute=0 nohup \
    $DIR_WORK/core_root_$1/corerun $TFM_PATH/linux-$ARCH/publish/benchapp.dll \
    -i --filter "*" --noForcedGCs --noOverheadEvaluation --disableLogFile \
    --maxWarmupCount 8 --minIterationCount 150000 --maxIterationCount 200000 \
    -a perfarts 1> $DIR_LOGS/profiler.out 2> $DIR_LOGS/profiler.err &
sleep 15
perf record $PERF_RECORD_ARGS -k 1 -g -F 49999 -p $(pgrep corerun) sleep 8
perf inject --input perf.data --jit --output perfjit.data
perf report --input perfjit.data --no-children --percent-limit 5 --stdio > $DIR_BENCHAPP/BenchmarkDotNet.Artifacts/$1_functions.txt
perf annotate --stdio -i perfjit.data --percent-limit 5 -M intel > $DIR_BENCHAPP/BenchmarkDotNet.Artifacts/$1.asm
pkill corerun || true
```
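As an experiment, it might also be worth switching perf away from its default frame-pointer unwinder for the x64 runs. A hedged sketch of an alternative `perf record` invocation (same pid and frequency as the script above; whether DWARF unwinding actually works for the jitted frames depends on unwind info being available to perf, so treat this as something to try, not a known fix):

```shell
# Sketch: copy 16 KB of user stack per sample and unwind it with DWARF
# in post-processing instead of following the RBP chain; heavier, but
# tolerant of frames compiled without a frame pointer.
perf record --call-graph dwarf,16384 -k 1 -F 49999 -p "$(pgrep corerun)" sleep 8
perf inject --input perf.data --jit --output perfjit.data
```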
noahfalk commented 1 month ago

I don't have any immediate speculation on the cause. My initial thought is that while the architecture difference is a little surprising, this might be a "correlation != causation" scenario, and the culprit will turn out to be something quite different. A few suggestions that might help make progress:

Hope that helps!

adamsitnik commented 1 month ago

I am not sure if this is the same issue, but when I was working on adding --profiler perf support to BDN (which is just a wrapper for perfcollect), I noticed that on x64 the profiles were often detached, while on arm64 everything worked as expected.

By "detached" I mean: for a method A that calls B, which in turn calls C, the profile showed "A calling B" at the root level and "C" standing alone next to them (rather than C appearing under B).

Example:

image

I haven't tried it with .NET 9, so the issue might be gone by now.

IIRC @janvorli knew the reason behind this particular difference.

adamsitnik commented 1 month ago

@EgorBo just out of curiosity, what https://adamsitnik.com/PerfCollectProfiler/ is missing that pushed you to implementing your own solution?

EgorBo commented 1 month ago

@EgorBo just out of curiosity, what https://adamsitnik.com/PerfCollectProfiler/ is missing that pushed you to implementing your own solution?

@adamsitnik I tried that first, but decided I wanted more explicit control over raw perf, e.g. to test different events and frequencies and to get annotated asm code. For example, I am not sure it's possible to extract raw asm annotated with sample percentages for the top 5 hottest functions in a profile with your profiler.

Overall, there are two problems currently:

1) Some frames look "detached" on x64. Today I ran a quick benchmark for the write-barrier issue; just compare these two flamegraphs: x64 and arm64. For some reason, the write barrier is not "on top" of the void WB() function as expected (it is on arm64) - see the original C# code.

2) General noisy differences between runs. I think this is indeed caused by tiered compilation: after I enabled DOTNET_PerfMapShowOptimizationTiers=1, I noticed mismatches between tier names, so I bumped the timeout perf waits before collecting samples (to 30 seconds). Alternatively, users can just disable tiered compilation.
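For anyone reproducing this, a minimal sketch of the environment described above (the variable names are the documented .NET runtime knobs; disabling tiering trades startup time for stable codegen from the first sample):

```shell
# Option A: keep tiering but label each jitted symbol with its tier,
# so tier mismatches between runs become visible in the flamegraphs.
export DOTNET_PerfMapEnabled=1
export DOTNET_PerfMapShowOptimizationTiers=1

# Option B: remove tiering-induced noise entirely; every method goes
# straight to the optimizing JIT (no Tier0 -> Tier1 transitions).
export DOTNET_TieredCompilation=0
```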