dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

[Performance] Regression in multiple scenarios, Linux, more significant on ARM64 #102832

Closed (sebastienros closed this issue 11 hours ago)

sebastienros commented 1 month ago

Windows is not impacted. NativeAOT is not impacted. The regression is visible on x86, but even more so on ARM64 with 28 cores; it is not visible when running with 80 cores on ARM64.

Minimal commits range: https://github.com/dotnet/runtime/compare/70375162014e...56d7e5d80ed6

Command lines to repro using Plaintext MVC on Ampere with 28 cores:

Latency ~ 0.9ms, RPS ~ 4.2M

crank --config https://raw.githubusercontent.com/aspnet/Benchmarks/main/scenarios/plaintext.benchmarks.yml --config https://raw.githubusercontent.com/aspnet/Benchmarks/main/build/azure.profile.yml --config https://raw.githubusercontent.com/aspnet/Benchmarks/main/scenarios/steadystate.profile.yml --config https://raw.githubusercontent.com/aspnet/Benchmarks/main/build/ci.profile.yml --scenario mvc --profile arm-lin-28-app --profile intel-load2-load --application.framework net9.0 --application.collectDependencies true --application.options.collectCounters true --load.options.reuseBuild true --application.aspNetCoreVersion 9.0.0-preview.5.24256.2 --application.sdkVersion 9.0.100-preview.5.24267.1 --application.runtimeVersion 9.0.0-preview.5.24259.7

Latency ~ 1.3ms, RPS ~ 3.9M

crank --config https://raw.githubusercontent.com/aspnet/Benchmarks/main/scenarios/plaintext.benchmarks.yml --config https://raw.githubusercontent.com/aspnet/Benchmarks/main/build/azure.profile.yml --config https://raw.githubusercontent.com/aspnet/Benchmarks/main/scenarios/steadystate.profile.yml --config https://raw.githubusercontent.com/aspnet/Benchmarks/main/build/ci.profile.yml --scenario mvc --profile arm-lin-28-app --profile intel-load2-load --application.framework net9.0 --application.collectDependencies true --application.options.collectCounters true --load.options.reuseBuild true --application.aspNetCoreVersion 9.0.0-preview.5.24256.2 --application.sdkVersion 9.0.100-preview.5.24267.1 --application.runtimeVersion 9.0.0-preview.5.24260.2
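
As a quick sanity check on the size of the regression, the relative change between the two runs can be computed from the approximate figures above. This assumes both latency figures are in milliseconds; `percent_change` is just a helper name for this example.

```python
def percent_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# Approximate figures from the two runs above (latency assumed to be in ms).
rps_change = percent_change(4.2e6, 3.9e6)
latency_change = percent_change(0.9, 1.3)

print(f"RPS: {rps_change:+.1f}%")          # RPS: -7.1%
print(f"Latency: {latency_change:+.1f}%")  # Latency: +44.4%
```

So the two runs differ by roughly 7% in throughput and roughly 44% in latency.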

jkotas commented 1 month ago

Minimal commits range: https://github.com/dotnet/runtime/compare/70375162014e...56d7e5d80ed6

https://github.com/dotnet/runtime/pull/101782 is the most likely cause of the regression. cc @VSadov

dotnet-policy-service[bot] commented 1 month ago

Tagging subscribers to this area: @mangod9
See info in area-owners.md if you want to be subscribed.

MichalStrehovsky commented 4 weeks ago

Interestingly, the perf infra now reports pretty much identical numbers between Native AOT and CoreCLR both in terms of latency and throughput. Either it's a coincidence, or this also found something interesting for native AOT (given the suspected PR brought something from native AOT to CoreCLR):

image

VSadov commented 3 weeks ago

It is also interesting that Windows was not impacted either way, and neither was Linux ARM with 80 cores. So basically, only Linux with 28 cores is affected.

VSadov commented 3 weeks ago

The regression is caused by my PR - that much is clear just from comparing before/after bits.

Which part actually causes it is not yet clear. Even completely disabling async suspension in the "before" bits (not sending interrupt signals at all) does not result in a regression. The test does not appear to be sensitive to that part.

Working with signals is the most obvious part where Windows and Linux differ. Most other changes in the PR would have similar effects on either OS.

VSadov commented 3 weeks ago

My PR turned a blocking wait with a timeout into a busy-wait. That allowed for much better worst-case latencies when asynchronous suspension is involved, especially on Windows, where we could wait for 16ms between hijacking retries when some thread was not cooperating, leading to long delays. We are not going back to that. In terms of the worst case, the change is a big improvement.
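
To make the trade-off concrete, here is a small deterministic model (not the runtime's code) of how long the suspending thread takes to notice that a target thread has become suspendable, depending on the gap between retries. The 16ms interval is the Windows figure mentioned above; the 200 microsecond ready time, the 1 microsecond spin interval, and the function name `notice_time_us` are made up for illustration.

```python
def notice_time_us(ready_at_us, retry_interval_us):
    """Model a suspender that checks at t = 0, interval, 2*interval, ...

    Returns (time of the first check at or after ready_at_us, checks made).
    All times are integers in microseconds.
    """
    if retry_interval_us <= 0:
        raise ValueError("retry_interval_us must be positive")
    # Number of full intervals that must elapse before a check can succeed
    # (ceiling division on integers).
    intervals = -(-ready_at_us // retry_interval_us)
    return intervals * retry_interval_us, intervals + 1


# Hypothetical target thread that becomes suspendable 200 us after the request.
blocking = notice_time_us(200, 16_000)  # wait with a 16 ms timeout between retries
spinning = notice_time_us(200, 1)       # busy-wait, modeled as one check per us

print(blocking)  # (16000, 2)
print(spinning)  # (200, 201)
```

In this model the blocking variant notices the thread 80 times later but makes 2 checks instead of 201, which is the latency versus CPU cost trade-off in a nutshell.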

TE benchmarks are somewhat unusual programs though. They do not benefit from improvements in async suspension, since they mostly suspend synchronously. With calls to the runtime, IO, native APIs, and lots of locking, the threads have plenty of opportunities to check whether we want to suspend, and can do that cooperatively even without hijacking.

On the other hand, when the app runs at close to 100% capacity, any spinning is harmful as it takes cycles away from the app. It is particularly harmful if we take cycles from the very threads that we are trying to suspend, as they cannot self-suspend while we are using their core.
We do not need to consume that many cycles when doing multi-microsecond waits though, at least not on Linux, where submillisecond sleep is available and reliable.
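
One way to keep the short waits without burning a core is to spin for a few checks and then fall back to very short sleeps. The sketch below shows that general pattern only; it is not the fix under test, and `wait_with_backoff`, the spin budget, and the 50 microsecond sleep are assumptions chosen for illustration.

```python
import time


def wait_with_backoff(is_done, spin_checks=100, sleep_s=50e-6, timeout_s=1.0):
    """Poll is_done(): spin for a few checks, then yield the core with short sleeps.

    Returns (done, spins, sleeps), where `done` is False if the timeout expired.
    """
    deadline = time.monotonic() + timeout_s
    spins = sleeps = 0
    while True:
        if is_done():
            return True, spins, sleeps
        if time.monotonic() >= deadline:
            return False, spins, sleeps
        if spins < spin_checks:
            spins += 1           # cheap re-check, keeps latency low for short waits
        else:
            time.sleep(sleep_s)  # give the core back to the threads being waited on
            sleeps += 1


# Hypothetical condition that becomes true on the 151st check.
calls = 0

def ready():
    global calls
    calls += 1
    return calls > 150

print(wait_with_backoff(ready))  # (True, 100, 50)
```

The actual sleep granularity depends on the OS and its timer configuration, so the 50 microsecond value here is a placeholder rather than a recommendation.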

I am testing a fix. It works for the scenario reported (tested on ARM64), but I want to do more testing on other scenarios and combinations - to be sure it works as expected in other cases.