[Performance] Regression in multiple scenarios, Linux, more significant on ARM64

sebastienros commented 5 months ago

Windows is not impacted NativeAOT is not impacted Visible on x86, but even more on ARM64 with 28 cores, not visible when running with 80 cores on ARM64.

Minimal commits range: https://github.com/dotnet/runtime/compare/70375162014e...56d7e5d80ed6

Command lines to repro using Plaintext MVC on Ampere with 28 cores:

Latency ~ 0.9, RPS ~ 4.2M

crank --config https://raw.githubusercontent.com/aspnet/Benchmarks/main/scenarios/plaintext.benchmarks.yml --config https://raw.githubusercontent.com/aspnet/Benchmarks/main/build/azure.profile.yml --config https://raw.githubusercontent.com/aspnet/Benchmarks/main/scenarios/steadystate.profile.yml --config https://raw.githubusercontent.com/aspnet/Benchmarks/main/build/ci.profile.yml --scenario mvc --profile arm-lin-28-app --profile intel-load2-load --application.framework net9.0 --application.collectDependencies true --application.options.collectCounters true --load.options.reuseBuild true --application.aspNetCoreVersion 9.0.0-preview.5.24256.2 --application.sdkVersion 9.0.100-preview.5.24267.1 --application.runtimeVersion 9.0.0-preview.5.24259.7

Latency ~ 1.3ms, RPS ~ 3.9M

crank --config https://raw.githubusercontent.com/aspnet/Benchmarks/main/scenarios/plaintext.benchmarks.yml --config https://raw.githubusercontent.com/aspnet/Benchmarks/main/build/azure.profile.yml --config https://raw.githubusercontent.com/aspnet/Benchmarks/main/scenarios/steadystate.profile.yml --config https://raw.githubusercontent.com/aspnet/Benchmarks/main/build/ci.profile.yml --scenario mvc --profile arm-lin-28-app --profile intel-load2-load --application.framework net9.0 --application.collectDependencies true --application.options.collectCounters true --load.options.reuseBuild true --application.aspNetCoreVersion 9.0.0-preview.5.24256.2 --application.sdkVersion 9.0.100-preview.5.24267.1 --application.runtimeVersion 9.0.0-preview.5.24260.2

jkotas commented 5 months ago

Minimal commits range: https://github.com/dotnet/runtime/compare/70375162014e...56d7e5d80ed6

https://github.com/dotnet/runtime/pull/101782 is the most likely cause of the regression. cc @VSadov

dotnet-policy-service[bot] commented 5 months ago

Tagging subscribers to this area: @mangod9 See info in area-owners.md if you want to be subscribed.

MichalStrehovsky commented 5 months ago

Interestingly, the perf infra now reports pretty much identical numbers between Native AOT and CoreCLR both in terms of latency and throughput. Either it's a coincidence, or this also found something interesting for native AOT (given the suspected PR brought something from native AOT to CoreCLR):

VSadov commented 5 months ago

Also interesting that Windows was not impacted either way. Also Linux ARM with 80 cores was not impacted. So basically, only Linux with 28 cores.

VSadov commented 5 months ago

The regression is caused by my PR - that comes just from comparing before/after bits.

What part actually causes it is not yet clear. Even completely disabling async suspension in "before" bits (not sending interrupt signals at all) does not result in a regression. The test appears to be not sensitive to that part.

Working with signals is the most obvious part where Windows and Linux differ. Most other changes in the PR would have similar effects on either OS.

VSadov commented 5 months ago

My PR turned blocking wait with timeout into busy-wait. That allowed for much better worst case latencies when asynchronous suspension is involved. Especially on windows where we could wait for 16ms between hijacking retries when some thread is not cooperating - leading to long delays. We are not going back to that. In terms of worst case the change is a big improvement.

TE benchmarks are somewhat unusual programs though. They do not benefit from improvements in async suspension, since they mostly suspend synchronously. Calls to runtime, IO, native APIs, lots of locking,... - the threads have a lot of opportunities to see if we want to suspend and can do that cooperatively even without hijacking.

On the other hand when the app runs at close to 100% capacity, any spinning is harmful as it takes away cycles form the app. In particular it is harmful if we take cycles from the actual threads that we try to suspend, as they cannot cooperatively self-suspend while we use their core.
We do not need to consume that many cycles though when doing multi-microsecond waits, not on Linux at least, as submillisecond sleep is available and reliable.

I am testing a fix. It works for the scenario reported (tested on ARM64), but I want to do more testing on other scenarios and combinations - to be sure it works as expected in other cases.

VSadov commented 4 months ago

I see a bump in Stage1 RPS correlating with the fix propagating to the runs. Also a drop in latency. So I assume this is fixed.

sebastienros commented 4 months ago

Thanks, the numbers you shared on the PR have more value to me but seeing them on the chart is nice too.

with the fix propagating to the runs

Are you sure the fix has propagated to the SDK already? Looking at the charts on intel-lin I don't see any change related to it, only from 6/20 which is when DATAs and a perf fix went in.

VSadov commented 4 months ago

I just picked the last two datapoints in Stage1 ampere-lin-28 and clicked on runtime-diff.

Maybe other configs did not produce datapoints with the change. I assume the runtime-diff is the way to see what went in between datapoints.

sebastienros commented 4 months ago

@VSadov that's the correct way.

Could the regression on intel-lin not be related then? Or not improved by your fix?

VSadov commented 4 months ago

Could the regression on intel-lin not be related then? Or not improved by your fix?

That is possible. Also possible that the other changes since the regression fixed the regression or negated the impact of my fix. Ampere-28 seems more sensitive to this, possibly because it runs closer to 100%.

However, when I pick the latest point in the Stage1 chart for intel-lin, my change is not in the list. Could be just sampling issue with the runs vs. commits.

The last included commit for lin-intel is just before mine. So maybe the fix is not yet in for that config.

sebastienros commented 4 months ago

The last included commit for lin-intel is just before mine. So maybe the fix is not yet in for that config.

That's reassuring for now, let's wait for a few more runs then.

dotnet / runtime

[Performance] Regression in multiple scenarios, Linux, more significant on ARM64 #102832