EgorBo opened 1 month ago
On a completely unrelated note, it's interesting that DateTime formatting is 2x slower than Guid and Int32, wonder if there is some optimization potential here 🙂 cc @stephentoub (or is it expected because the rules for dates are a lot more complicated?)
^ arm64
same sub-graph on x64:
> On a completely unrelated note, it's interesting that DateTime formatting is 2x slower than Guid and Int32, wonder if there is some optimization potential here 🙂 cc @stephentoub (or is it expected because the rules for dates are a lot more complicated?)
It's much more complicated; frankly I'm pleased it's only 2x formatting an Int32. https://github.com/dotnet/runtime/blob/35b94da00d4a79de3252ba79b6d3942cc3716405/src/libraries/System.Private.CoreLib/src/System/Globalization/DateTimeFormat.cs#L1439-L1513 But if you have ideas for significantly improving it, that'd be great.
> We wonder if this could be caused by an x64-specific optimization to omit frame pointers for simple methods (on arm64 we currently always emit them)
I do not see how this can explain randomness. I can believe that this optimization can cause the flamegraph to be less representative in some cases (similar effects as inlining or tail-call optimizations), but it should not be random.
> We wonder if this could be caused by an x64-specific optimization to omit frame pointers for simple methods (on arm64 we currently always emit them)
> I do not see how this can explain randomness. I can believe that this optimization can cause the flamegraph to be less representative in some cases (similar effects as inlining or tail-call optimizations), but it should not be random.
Any ideas what that could be? We do rely on randomness for e.g. PGO, but I'd expect arm64 to be affected as well if that were the major factor.
Here is a diff example between two x64 runs:
I wonder why we see `OnTryWrite` twice here; is it recursive?
A bit of context on how the perf data is collected: I ask BDN to run longer, wait 15 seconds, attach `perf` to the process, collect samples for 8 seconds, and quit. Hence, a steady state is expected for such trivial benchmarks.
```shell
DOTNET_PerfMapEnabled=1 DOTNET_EnableWriteXorExecute=0 nohup \
  $DIR_WORK/core_root_$1/corerun $TFM_PATH/linux-$ARCH/publish/benchapp.dll \
  -i --filter "*" --noForcedGCs --noOverheadEvaluation --disableLogFile \
  --maxWarmupCount 8 --minIterationCount 150000 --maxIterationCount 200000 \
  -a perfarts 1> $DIR_LOGS/profiler.out 2> $DIR_LOGS/profiler.err &
sleep 15
perf record $PERF_RECORD_ARGS -k 1 -g -F 49999 -p $(pgrep corerun) sleep 8
perf inject --input perf.data --jit --output perfjit.data
perf report --input perfjit.data --no-children --percent-limit 5 --stdio > $DIR_BENCHAPP/BenchmarkDotNet.Artifacts/$1_functions.txt
perf annotate --stdio -i perfjit.data --percent-limit 5 -M intel > $DIR_BENCHAPP/BenchmarkDotNet.Artifacts/$1.asm
pkill corerun || true
```
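For eyeballing run-to-run differences, the same injected `perfjit.data` can also be rendered as an interactive flamegraph. A minimal sketch, assuming Brendan Gregg's FlameGraph scripts are cloned locally (the `./FlameGraph` path and output filename are illustrative):

```shell
# Collapse the injected perf data into folded stacks, then render an SVG.
# Assumes the FlameGraph repo (stackcollapse-perf.pl, flamegraph.pl) is
# available at ./FlameGraph; adjust paths as needed.
perf script --input perfjit.data \
  | ./FlameGraph/stackcollapse-perf.pl \
  | ./FlameGraph/flamegraph.pl --title "run $1" \
  > $DIR_BENCHAPP/BenchmarkDotNet.Artifacts/$1_flamegraph.svg
```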
I don't have any immediate speculation on what the cause is. My initial thought is that, while the architecture difference is a little surprising, this might be a "correlation != causation" scenario, and the culprit will turn out to be something quite different. A few suggestions that might help make progress:
Hope that helps!
I am not sure if this is the same issue, but when I was working on adding `--profiler perf` support to BDN (which is just a wrapper for perfcollect), I noticed that on x64 the profiles were often detached, while on arm64 everything worked as expected.
By "detached" I mean: for method A calling method B, and B calling method C, the profile showed "A calling B" at the root level and "C" just next to them (rather than B calling C).
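The "detached" shape is easy to see in folded-stack notation (one line per unique stack, frames separated by `;`, followed by a sample count). A sketch with made-up counts, not data from a real profile:

```shell
# Expected (arm64-like) profile: C is attributed under its caller chain A;B.
cat <<'EOF' > expected.folded
A;B;C 90
A;B 10
EOF
# Detached (x64-like) profile: C shows up at the root with no callers.
cat <<'EOF' > detached.folded
A;B 55
C 45
EOF
# In the detached profile, C appears as a root frame:
grep -c '^C ' detached.folded   # → 1
```

In a rendered flamegraph this shows up as C sitting next to A at the base instead of stacked on top of B.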
Example:
I've not tried it with .NET 9 so the issue might be gone.
IIRC @janvorli knew the reason behind this particular difference.
@EgorBo just out of curiosity, what https://adamsitnik.com/PerfCollectProfiler/ is missing that pushed you to implementing your own solution?
> @EgorBo just out of curiosity, what https://adamsitnik.com/PerfCollectProfiler/ is missing that pushed you to implementing your own solution?
@adamsitnik I tried that first, but decided I wanted more explicit control over raw `perf`, e.g. to test different events and frequencies, and to get annotated asm. For example, I am not sure it's possible with your profiler to extract raw asm annotated with sample percentages for the top 5 hottest functions in a profile.
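As a sketch of that "annotated asm for the hottest functions" step: `perf report` can list the symbols above an overhead threshold, and `perf annotate` accepts a specific symbol via `-s`. The output parsing below is fragile and illustrative only (`perf report --stdio` formatting varies between perf versions):

```shell
# List symbols over the 5% threshold, take the top 5, and annotate each one.
perf report --input perfjit.data --no-children --percent-limit 5 --stdio \
  | awk '/%/ { print $NF }' | head -5 \
  | while read -r sym; do
      perf annotate --stdio -i perfjit.data -M intel -s "$sym"
    done
```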
Overall, there are two problems currently:
1) Some frames look "detached" on x64. Today I ran a quick benchmark for the write-barrier issue and compared these two flamegraphs, x64 and arm64: for some reason, the write barrier is not "on top" of the `void WB()` function as expected (it is on arm64); see the original C# code.
2) General noisy differences between runs. I think these are indeed caused by tiered compilation: after I enabled `DOTNET_PerfMapShowOptimizationTiers=1`, I noticed mismatches between tier names, so I bumped the timeout `perf` has to wait before collecting the samples (to 30 seconds). Alternatively, users can just disable tiered compilation.
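With `DOTNET_PerfMapShowOptimizationTiers=1`, each symbol in the generated perf map carries a bracketed tier suffix, so tier mismatches between runs can be spotted by tallying suffixes. A sketch over a hand-written sample map; the addresses, method names, and tier names below are illustrative, not captured from a real run:

```shell
# Hand-written sample of a CoreCLR perf map with tier suffixes (illustrative).
cat <<'EOF' > perf-12345.map
00007F0000001000 40 Benchmarks:WB()[QuickJitted]
00007F0000002000 80 Benchmarks:WB()[OptimizedTier1]
00007F0000003000 60 Benchmarks:Setup()[OptimizedTier1]
EOF
# Extract the bracketed suffix of each symbol and count methods per tier.
awk '{ n = split($3, a, "\\["); sub(/\]$/, "", a[n]); print a[n] }' perf-12345.map \
  | sort | uniq -c
```

If the same benchmark method shows up under two tiers (as `WB()` does in this sample), samples were collected across a tier transition, which is exactly the kind of run-to-run noise described above.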
I have built a toy bot that runs arbitrary benchmarks on Azure Linux VMs and reports the results back to the PRs/issues the bot was invoked from. It also runs the native Linux `perf` tool to collect traces/flamegraphs on demand. What I noticed from the beginning is that the quality of flamegraphs on x64 is often poor and a bit random between runs, compared to exactly the same config on arm64. Example: https://github.com/dotnet/runtime/pull/105593#issuecomment-2258072263
So here is how the flamegraphs look between sequential runs on arm64 (Ampere Altra):
If you open these graphs and switch between tabs in your browser, you will notice almost no difference between "run 0" and "run 1", as expected.
On x64 (AMD Milan) the picture is a bit different:
(presumably, with less aggressive/disabled inlining it's a lot worse)
The graphs are a lot more "random" between runs. We wonder if this could be caused by an x64-specific optimization to omit frame pointers for simple methods (on arm64 we currently always emit them, although we have issues to eventually fix that: https://github.com/dotnet/runtime/issues/88823 and https://github.com/dotnet/runtime/issues/35274). Should we never use that optimization when PerfMap is enabled? (Or even introduce a new knob, since it's not only `perf`-specific.) cc @jkotas @dotnet/dotnet-diag
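One way to test the frame-pointer hypothesis without changing the JIT: collect the same profile with DWARF-based unwinding, which does not rely on frame pointers. If x64 stacks come out sane with DWARF but broken with the default frame-pointer walker, frame-pointer omission is a likely culprit. A sketch only; the stack-dump size and sample frequency are illustrative, DWARF unwinding produces much larger perf.data files, and it may still struggle with JIT-compiled code:

```shell
# Record with DWARF call graphs instead of frame-pointer walking, then
# inject jitdump info as before and compare the resulting stacks.
perf record --call-graph dwarf,16384 -F 999 -p $(pgrep corerun) sleep 8
perf inject --input perf.data --jit --output perfjit-dwarf.data
perf report --input perfjit-dwarf.data --no-children --stdio | head -40
```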