cincuranet opened this issue 1 year ago
I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.
Tagging subscribers to this area: @brzvlad. See info in area-owners.md if you want to be subscribed.
Author: | cincuranet |
---|---|
Assignees: | - |
Labels: | `tenet-performance`, `tenet-performance-benchmarks`, `untriaged`, `area-Codegen-Interpreter-mono` |
Milestone: | - |
Is this issue still relevant?
It is still very slow, but we have also increased the timeouts, so it finishes nowadays.
Recent run:
```
[2024/01/02 22:12:56][INFO] // Benchmark: Perf_Timer.SynchronousContention: Job-CLJGNG(PowerPlanMode=00000000-0000-0000-0000-000000000000, Toolchain=CoreRun, IterationTime=250.0000 ms, MaxIterationCount=20, MinIterationCount=15, WarmupCount=1)
[2024/01/02 22:12:56][INFO] // *** Execute ***
[2024/01/02 22:12:56][INFO] // Launch: 1 / 1
[2024/01/02 22:12:56][INFO] // Execute: /home/helixbot/work/A33B0927/p/dotnet-mono/shared/Microsoft.NETCore.App/9c7003c2-5923-4365-808a-511761d019fa/corerun Job-CLJGNG.dll --anonymousPipes 243 244 --benchmarkName System.Threading.Tests.Perf_Timer.SynchronousContention --job "PowerPlanMode=00000000-0000-0000-0000-000000000000, Toolchain=CoreRun, IterationTime=250.0000 ms, MaxIterationCount=20, MinIterationCount=15, WarmupCount=1" --benchmarkId 197 in /home/helixbot/work/A33B0927/w/AB87093D/e/performance/artifacts/bin/MicroBenchmarks/Release/net9.0/Job-CLJGNG/bin/Release/net9.0/publish
[2024/01/02 22:12:56][INFO] // Failed to set up high priority (Permission denied). In order to run benchmarks with high priority, make sure you have the right permissions.
[2024/01/02 22:12:56][INFO] // BeforeAnythingElse
[2024/01/02 22:12:56][INFO]
[2024/01/02 22:12:56][INFO] // Benchmark Process Environment Information:
[2024/01/02 22:12:56][INFO] // BenchmarkDotNet v0.13.11-nightly.20231126.107
[2024/01/02 22:12:56][INFO] // Runtime=.NET 9.0.0 (42.42.42.42424) using MonoVM, Arm64 AOT
[2024/01/02 22:12:56][INFO] // GC=Non-concurrent Workstation
[2024/01/02 22:12:56][INFO] // HardwareIntrinsics=
[2024/01/02 22:12:56][INFO] // Job: Job-DFDKRN(PowerPlanMode=00000000-0000-0000-0000-000000000000, IterationTime=250.0000 ms, MaxIterationCount=20, MinIterationCount=15, WarmupCount=1)
[2024/01/02 22:12:56][INFO]
[2024/01/02 22:12:56][INFO] OverheadJitting 1: 1 op, 34560.00 ns, 34.5600 us/op
[2024/01/02 22:20:16][INFO] WorkloadJitting 1: 1 op, 440022687380.00 ns, 440.0227 s/op
[2024/01/02 22:20:16][INFO]
[2024/01/02 22:27:31][INFO] WorkloadWarmup 1: 1 op, 435051974977.00 ns, 435.0520 s/op
[2024/01/02 22:27:31][INFO]
[2024/01/02 22:27:31][INFO] // BeforeActualRun
[2024/01/02 22:34:47][INFO] WorkloadActual 1: 1 op, 435324910483.00 ns, 435.3249 s/op
[2024/01/02 22:42:02][INFO] WorkloadActual 2: 1 op, 435911093850.00 ns, 435.9111 s/op
[2024/01/02 22:49:18][INFO] WorkloadActual 3: 1 op, 435123083406.00 ns, 435.1231 s/op
[2024/01/02 22:56:34][INFO] WorkloadActual 4: 1 op, 436278624259.00 ns, 436.2786 s/op
[2024/01/02 23:03:50][INFO] WorkloadActual 5: 1 op, 436325228561.00 ns, 436.3252 s/op
[2024/01/02 23:11:07][INFO] WorkloadActual 6: 1 op, 436417129879.00 ns, 436.4171 s/op
[2024/01/02 23:18:22][INFO] WorkloadActual 7: 1 op, 435634211369.00 ns, 435.6342 s/op
[2024/01/02 23:25:38][INFO] WorkloadActual 8: 1 op, 436053297849.00 ns, 436.0533 s/op
[2024/01/02 23:32:53][INFO] WorkloadActual 9: 1 op, 435081302030.00 ns, 435.0813 s/op
[2024/01/02 23:40:09][INFO] WorkloadActual 10: 1 op, 435165339859.00 ns, 435.1653 s/op
[2024/01/02 23:47:25][INFO] WorkloadActual 11: 1 op, 436471928449.00 ns, 436.4719 s/op
[2024/01/02 23:54:41][INFO] WorkloadActual 12: 1 op, 435597729415.00 ns, 435.5977 s/op
[2024/01/03 00:01:57][INFO] WorkloadActual 13: 1 op, 436273359150.00 ns, 436.2734 s/op
[2024/01/03 00:09:14][INFO] WorkloadActual 14: 1 op, 437006510767.00 ns, 437.0065 s/op
[2024/01/03 00:16:31][INFO] WorkloadActual 15: 1 op, 436689159219.00 ns, 436.6892 s/op
[2024/01/03 00:16:31][INFO]
[2024/01/03 00:16:31][INFO] // AfterActualRun
[2024/01/03 00:23:47][INFO] WorkloadResult 1: 1 op, 435324910483.00 ns, 435.3249 s/op
[2024/01/03 00:23:47][INFO] WorkloadResult 2: 1 op, 435911093850.00 ns, 435.9111 s/op
[2024/01/03 00:23:47][INFO] WorkloadResult 3: 1 op, 435123083406.00 ns, 435.1231 s/op
[2024/01/03 00:23:47][INFO] WorkloadResult 4: 1 op, 436278624259.00 ns, 436.2786 s/op
[2024/01/03 00:23:47][INFO] WorkloadResult 5: 1 op, 436325228561.00 ns, 436.3252 s/op
[2024/01/03 00:23:47][INFO] WorkloadResult 6: 1 op, 436417129879.00 ns, 436.4171 s/op
[2024/01/03 00:23:47][INFO] WorkloadResult 7: 1 op, 435634211369.00 ns, 435.6342 s/op
[2024/01/03 00:23:47][INFO] WorkloadResult 8: 1 op, 436053297849.00 ns, 436.0533 s/op
[2024/01/03 00:23:47][INFO] WorkloadResult 9: 1 op, 435081302030.00 ns, 435.0813 s/op
[2024/01/03 00:23:47][INFO] WorkloadResult 10: 1 op, 435165339859.00 ns, 435.1653 s/op
[2024/01/03 00:23:47][INFO] WorkloadResult 11: 1 op, 436471928449.00 ns, 436.4719 s/op
[2024/01/03 00:23:47][INFO] WorkloadResult 12: 1 op, 435597729415.00 ns, 435.5977 s/op
[2024/01/03 00:23:47][INFO] WorkloadResult 13: 1 op, 436273359150.00 ns, 436.2734 s/op
[2024/01/03 00:23:47][INFO] WorkloadResult 14: 1 op, 437006510767.00 ns, 437.0065 s/op
[2024/01/03 00:23:47][INFO] WorkloadResult 15: 1 op, 436689159219.00 ns, 436.6892 s/op
[2024/01/03 00:23:47][INFO] // GC: 3011 1 1 12160027808 1
[2024/01/03 00:23:47][INFO] // Threading: 80 18 1
[2024/01/03 00:23:47][INFO]
[2024/01/03 00:23:47][INFO] // AfterAll
[2024/01/03 00:23:47][INFO] // Benchmark Process 1134683 has exited with code 0.
[2024/01/03 00:23:47][INFO]
[2024/01/03 00:23:47][INFO] Mean = 435.957 s, StdErr = 0.157 s (0.04%), N = 15, StdDev = 0.609 s
[2024/01/03 00:23:47][INFO] Min = 435.081 s, Q1 = 435.461 s, Median = 436.053 s, Q3 = 436.371 s, Max = 437.007 s
[2024/01/03 00:23:47][INFO] IQR = 0.910 s, LowerFence = 434.097 s, UpperFence = 437.736 s
[2024/01/03 00:23:47][INFO] ConfidenceInterval = [435.306 s; 436.608 s] (CI 99.9%), Margin = 0.651 s (0.15% of Mean)
[2024/01/03 00:23:47][INFO] Skewness = -0.05, Kurtosis = 1.61, MValue = 2
```
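As a cross-check on the summary lines of this log, the sketch below recomputes them from the 15 WorkloadResult values with Python's standard `statistics` module. It assumes a sample standard deviation, quartiles by linear interpolation between order statistics (`method="inclusive"`), fences at 1.5 × IQR, and a two-sided 99.9% Student-t interval whose t-quantile for 14 degrees of freedom is hard-coded from a table. These are assumptions of this sketch, not confirmed BenchmarkDotNet internals.

```python
import re
import statistics

# The 15 WorkloadResult lines from the log above (timestamp prefixes dropped).
LOG = """\
WorkloadResult 1: 1 op, 435324910483.00 ns, 435.3249 s/op
WorkloadResult 2: 1 op, 435911093850.00 ns, 435.9111 s/op
WorkloadResult 3: 1 op, 435123083406.00 ns, 435.1231 s/op
WorkloadResult 4: 1 op, 436278624259.00 ns, 436.2786 s/op
WorkloadResult 5: 1 op, 436325228561.00 ns, 436.3252 s/op
WorkloadResult 6: 1 op, 436417129879.00 ns, 436.4171 s/op
WorkloadResult 7: 1 op, 435634211369.00 ns, 435.6342 s/op
WorkloadResult 8: 1 op, 436053297849.00 ns, 436.0533 s/op
WorkloadResult 9: 1 op, 435081302030.00 ns, 435.0813 s/op
WorkloadResult 10: 1 op, 435165339859.00 ns, 435.1653 s/op
WorkloadResult 11: 1 op, 436471928449.00 ns, 436.4719 s/op
WorkloadResult 12: 1 op, 435597729415.00 ns, 435.5977 s/op
WorkloadResult 13: 1 op, 436273359150.00 ns, 436.2734 s/op
WorkloadResult 14: 1 op, 437006510767.00 ns, 437.0065 s/op
WorkloadResult 15: 1 op, 436689159219.00 ns, 436.6892 s/op
"""

# Captures the operation count and total nanoseconds of one measurement line.
PATTERN = re.compile(r"WorkloadResult\s+\d+:\s+(\d+) op,\s+([\d.]+) ns")

# Two-sided 99.9% Student-t quantile for n - 1 = 14 degrees of freedom,
# taken from a t-table (assumption: the CI is a plain Student-t interval).
T_999_DF14 = 4.1405


def parse_seconds(log: str) -> list[float]:
    """Return the time per operation, in seconds, for each WorkloadResult line."""
    return [float(ns) / int(ops) / 1e9 for ops, ns in PATTERN.findall(log)]


def summarize(xs: list[float], t_quantile: float) -> dict[str, float]:
    n = len(xs)
    mean = statistics.mean(xs)
    stdev = statistics.stdev(xs)  # sample standard deviation (divides by n - 1)
    stderr = stdev / n ** 0.5
    # "inclusive" interpolates linearly between order statistics (R type 7).
    q1, median, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    iqr = q3 - q1
    margin = t_quantile * stderr
    return {
        "mean": mean, "stderr": stderr, "stdev": stdev,
        "q1": q1, "median": median, "q3": q3, "iqr": iqr,
        "lower_fence": q1 - 1.5 * iqr, "upper_fence": q3 + 1.5 * iqr,
        "ci_low": mean - margin, "ci_high": mean + margin, "margin": margin,
    }


stats = summarize(parse_seconds(LOG), T_999_DF14)
for name, value in stats.items():
    print(f"{name:>11} = {value:.3f} s")
```

Rounded to three decimals, the printed values match the log's Mean, StdErr, StdDev, Q1, Median, Q3, IQR, fences, confidence interval and Margin. That is consistent with, but does not prove, BenchmarkDotNet using the same estimators here. The very small spread (StdErr about 0.04% of the mean) suggests the roughly 436 s per operation is a stable cost rather than noise.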
Ok, thanks. I will try to reproduce it again.
@cincuranet Is this still relevant?
@sblom @LoopedBard3 @caaavik-msft @DrewScoggins PTAL
This still seems to be a scenario with much longer run times. Below is the 30-day average result for the AsynchronousContention test, sliced by run configuration and queue and sorted from slowest to fastest:

| RunConfiguration | Queue | Result (sec) |
|---|---|---|
| {"CompilationMode":"tiered","RunKind":"micro_mono","LLVM":"false","MonoInterpreter":"true","MonoAOT":"false"} | Ubuntu.2204.Arm64.Perf | 442.71767679941189 |
| {"CompilationMode":"Tiered","RunKind":"micro","PGOType":"nodynamicpgo"} | Windows.Server.Arm64.Perf | 331.45373260975771 |
| {"CompilationMode":"tiered","RunKind":"micro","R2RType":"nor2r"} | Ubuntu.2204.Arm64.Perf | 329.68414371355146 |
| {"CompilationMode":"tiered","RunKind":"micro","PGOType":"nodynamicpgo"} | Ubuntu.2204.Arm64.Perf | 329.42461407338243 |
| {"CompilationMode":"tiered","RunKind":"micro"} | Ubuntu.2204.Arm64.Perf | 329.14762180539196 |
| {"CompilationMode":"Tiered","RunKind":"micro","R2RType":"nor2r"} | Windows.Server.Arm64.Perf | 327.97486255683566 |
| {"CompilationMode":"Tiered","RunKind":"micro"} | Windows.Server.Arm64.Perf | 326.56740018672349 |
| {"CompilationMode":"Tiered","RunKind":"micro","ExperimentName":"jitoptrepeat"} | Windows.11.Amd64.Viper.Perf | 20.132713217645968 |
| {"CompilationMode":"Tiered","RunKind":"micro"} | Windows.11.Amd64.Viper.Perf | 19.218769350140207 |
| {"CompilationMode":"tiered","RunKind":"micro","ExperimentName":"jitoptrepeat"} | Ubuntu.2204.Amd64.Viper.Perf | 18.377636735868517 |
| {"CompilationMode":"tiered","RunKind":"micro"} | Ubuntu.2204.Amd64.Viper.Perf | 18.336005699945044 |
| {"CompilationMode":"tiered","RunKind":"micro_mono","LLVM":"false","MonoInterpreter":"true","MonoAOT":"false"} | Ubuntu.2204.Amd64.Tiger.Perf | 10.494778979082861 |
| {"CompilationMode":"tiered","RunKind":"micro_mono","LLVM":"false","MonoInterpreter":"false","MonoAOT":"false"} | Ubuntu.2204.Amd64.Tiger.Perf | 8.361800168675801 |
| {"CompilationMode":"Tiered","RunKind":"micro","ExperimentName":"rlcse"} | Windows.11.Amd64.Owl.Perf | 5.4697047239147043 |
| {"CompilationMode":"Tiered","RunKind":"micro"} | Windows.11.Amd64.Owl.Perf | 5.4363181450833817 |
| {"CompilationMode":"tiered","RunKind":"micro"} | Ubuntu.2204.Amd64.Owl.Perf | 4.7835471147354074 |
| {"CompilationMode":"tiered","RunKind":"micro","ExperimentName":"rlcse"} | Ubuntu.2204.Amd64.Owl.Perf | 4.664173323838698 |
| {"CompilationMode":"Tiered","RunKind":"micro","R2RType":"nor2r"} | Windows.11.Amd64.Tiger.Perf | 3.351061666715089 |
| {"CompilationMode":"Tiered","RunKind":"micro"} | Windows.11.Amd64.Tiger.Perf | 3.281750910996307 |
| {"CompilationMode":"Tiered","RunKind":"micro","PGOType":"nodynamicpgo"} | Windows.11.Amd64.Tiger.Perf | 3.1858667317420855 |
| {"CompilationMode":"Tiered","RunKind":"micro"} | Windows.11.Arm64.Surf.Perf | 2.0885703081562714 |
| {"CompilationMode":"tiered","RunKind":"micro","R2RType":"nor2r"} | Ubuntu.2204.Amd64.Tiger.Perf | 1.9943783126421664 |
| {"CompilationMode":"tiered","RunKind":"micro"} | Ubuntu.2204.Amd64.Tiger.Perf | 1.9633307239576763 |
| {"CompilationMode":"tiered","RunKind":"micro"} | alpine.amd64.tiger.perf | 1.9575610596830739 |
| {"CompilationMode":"tiered","RunKind":"micro","PGOType":"nodynamicpgo"} | Ubuntu.2204.Amd64.Tiger.Perf | 1.9388397557867425 |
| {"CompilationMode":"tiered","RunKind":"micro"} | Ubuntu.2204.Amd64 | 0.60148117751757435 |
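I think it is clear that Arm64 is still substantially slower than the other architectures, resulting in longer test run times: the Ubuntu.2204.Arm64.Perf and Windows.Server.Arm64.Perf queues average roughly 326 to 443 s, compared with roughly 0.6 to 20 s on the Amd64 queues.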
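To make the architecture comparison concrete, the per-queue averages from the AsynchronousContention table can be grouped by the architecture token in the Helix queue name. This is a minimal sketch: the grouping rule (a substring match on `arm64`/`amd64`) and the use of medians are choices made here, not part of the perf infrastructure.

```python
import statistics
from collections import defaultdict

# (queue, 30-day average in seconds) pairs copied from the table;
# the RunConfiguration column is omitted because only the queue is grouped on.
ROWS = [
    ("Ubuntu.2204.Arm64.Perf", 442.71767679941189),
    ("Windows.Server.Arm64.Perf", 331.45373260975771),
    ("Ubuntu.2204.Arm64.Perf", 329.68414371355146),
    ("Ubuntu.2204.Arm64.Perf", 329.42461407338243),
    ("Ubuntu.2204.Arm64.Perf", 329.14762180539196),
    ("Windows.Server.Arm64.Perf", 327.97486255683566),
    ("Windows.Server.Arm64.Perf", 326.56740018672349),
    ("Windows.11.Amd64.Viper.Perf", 20.132713217645968),
    ("Windows.11.Amd64.Viper.Perf", 19.218769350140207),
    ("Ubuntu.2204.Amd64.Viper.Perf", 18.377636735868517),
    ("Ubuntu.2204.Amd64.Viper.Perf", 18.336005699945044),
    ("Ubuntu.2204.Amd64.Tiger.Perf", 10.494778979082861),
    ("Ubuntu.2204.Amd64.Tiger.Perf", 8.361800168675801),
    ("Windows.11.Amd64.Owl.Perf", 5.4697047239147043),
    ("Windows.11.Amd64.Owl.Perf", 5.4363181450833817),
    ("Ubuntu.2204.Amd64.Owl.Perf", 4.7835471147354074),
    ("Ubuntu.2204.Amd64.Owl.Perf", 4.664173323838698),
    ("Windows.11.Amd64.Tiger.Perf", 3.351061666715089),
    ("Windows.11.Amd64.Tiger.Perf", 3.281750910996307),
    ("Windows.11.Amd64.Tiger.Perf", 3.1858667317420855),
    ("Windows.11.Arm64.Surf.Perf", 2.0885703081562714),
    ("Ubuntu.2204.Amd64.Tiger.Perf", 1.9943783126421664),
    ("Ubuntu.2204.Amd64.Tiger.Perf", 1.9633307239576763),
    ("alpine.amd64.tiger.perf", 1.9575610596830739),
    ("Ubuntu.2204.Amd64.Tiger.Perf", 1.9388397557867425),
    ("Ubuntu.2204.Amd64", 0.60148117751757435),
]


def architecture(queue: str) -> str:
    """Classify a Helix queue by the architecture token in its name."""
    name = queue.lower()
    for arch in ("arm64", "amd64"):
        if arch in name:
            return arch
    return "unknown"


by_arch = defaultdict(list)
for queue, seconds in ROWS:
    by_arch[architecture(queue)].append(seconds)

for arch in sorted(by_arch):
    values = by_arch[arch]
    print(f"{arch}: n={len(values)}, min={min(values):.2f} s, "
          f"median={statistics.median(values):.2f} s, max={max(values):.2f} s")

# Ratio of the typical Arm64 run time to the typical Amd64 run time.
ratio = statistics.median(by_arch["arm64"]) / statistics.median(by_arch["amd64"])
print(f"median arm64 / median amd64 = {ratio:.0f}x")
```

Medians keep single rows from dominating either group. Note that Windows.11.Arm64.Surf.Perf averages about 2.1 s, so not every Arm64 queue in the table follows the pattern; the slow results come from the Ubuntu.2204.Arm64.Perf and Windows.Server.Arm64.Perf queues.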
Benchmarks `Perf_Timer.AsynchronousContention` and `Perf_Timer.SynchronousContention` are running very slowly in the dotnet-runtime-perf-slow pipeline on arm64/Mono/Interpreter (Performance Linux arm64 release mono Interpreter micro_mono perfampere NoJS False). Trying to reproduce it on a similar machine (an Azure VM running Ubuntu 20.04 on arm64) was not successful. @BrzVlad and I were not able to build the runtime and run the benchmark on a Helix machine.
This is a recent run:
It is eventually killed because the pipeline runs out of time.
Below is one from September that finished, but the times are still around 167 s.
Similar behavior occurs for `Perf_Timer.SynchronousContention`.
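A rough lower bound helps explain why the pipeline runs out of time on this configuration. The sketch counts only the steps visible in the January 2024 log earlier in this thread (one WorkloadJitting step, WarmupCount=1, MinIterationCount=15) and assumes each step costs one invocation at the reported mean of about 436 s. It then compares that estimate with the elapsed time between the first and last timestamps of the same log. The constants are read from that log; the step-counting model is an assumption of this sketch.

```python
from datetime import datetime

# Values read from the January 2024 log in this thread.
PER_INVOCATION_S = 435.957   # reported Mean; each iteration ran exactly 1 op
JITTING_STEPS = 1            # one WorkloadJitting line
WARMUP_COUNT = 1             # WarmupCount=1 in the job settings
MIN_ITERATION_COUNT = 15     # MinIterationCount=15 in the job settings

LOG_TIME_FORMAT = "%Y/%m/%d %H:%M:%S"


def min_wall_clock_seconds(per_invocation: float) -> float:
    """Lower bound on one benchmark's run time if every step costs one invocation."""
    steps = JITTING_STEPS + WARMUP_COUNT + MIN_ITERATION_COUNT
    return steps * per_invocation


estimate = min_wall_clock_seconds(PER_INVOCATION_S)
start = datetime.strptime("2024/01/02 22:12:56", LOG_TIME_FORMAT)  # first log line
end = datetime.strptime("2024/01/03 00:23:47", LOG_TIME_FORMAT)    # last log line
observed = (end - start).total_seconds()

print(f"estimated lower bound: {estimate / 3600:.2f} h")
print(f"observed in the log:   {observed / 3600:.2f} h")
```

The estimate (about 2.06 h) falls short of the observed 2.18 h by roughly 440 s, which is close to the gap in the log between the last WorkloadActual line (00:16:31) and the WorkloadResult lines (00:23:47); the sketch does not model that step. Since the issue reports the same behavior for both benchmarks, a single pipeline run pays a cost of this order more than once.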