Closed sebastienros closed 11 hours ago
Minimal commits range: https://github.com/dotnet/runtime/compare/70375162014e...56d7e5d80ed6
https://github.com/dotnet/runtime/pull/101782 is the most likely cause of the regression. cc @VSadov
Tagging subscribers to this area: @mangod9 See info in area-owners.md if you want to be subscribed.
Interestingly, the perf infra now reports pretty much identical numbers between Native AOT and CoreCLR both in terms of latency and throughput. Either it's a coincidence, or this also found something interesting for native AOT (given the suspected PR brought something from native AOT to CoreCLR):
Also interesting that Windows was not impacted either way. Also Linux ARM with 80 cores was not impacted. So basically, only Linux with 28 cores.
The regression is caused by my PR - that comes just from comparing before/after bits.
What part actually causes it is not yet clear. Even completely disabling async suspension in "before" bits (not sending interrupt signals at all) does not result in a regression. The test appears to be not sensitive to that part.
Working with signals is the most obvious part where Windows and Linux differ. Most other changes in the PR would have similar effects on either OS.
My PR turned blocking wait with timeout into busy-wait. That allowed for much better worst case latencies when asynchronous suspension is involved. Especially on windows where we could wait for 16ms between hijacking retries when some thread is not cooperating - leading to long delays. We are not going back to that. In terms of worst case the change is a big improvement.
TE benchmarks are somewhat unusual programs though. They do not benefit from improvements in async suspension, since they mostly suspend synchronously. Calls to runtime, IO, native APIs, lots of locking,... - the threads have a lot of opportunities to see if we want to suspend and can do that cooperatively even without hijacking.
On the other hand when the app runs at close to 100% capacity, any spinning is harmful as it takes away cycles form the app. In particular it is harmful if we take cycles from the actual threads that we try to suspend, as they cannot self-suspend while we use their core.
We do not need to consume that many cycles though when doing multi-microsecond waits, not on Linux at least, as submillisecond sleep is available and reliable.
I am testing a fix. It works for the scenario reported (tested on ARM64), but I want to do more testing on other scenarios and combinations - to be sure it works as expected in other cases.
Windows is not impacted NativeAOT is not impacted Visible on x86, but even more on ARM64 with 28 cores, not visible when running with 80 cores on ARM64.
Minimal commits range: https://github.com/dotnet/runtime/compare/70375162014e...56d7e5d80ed6
Command lines to repro using Plaintext MVC on Ampere with 28 cores:
Latency ~ 0.9, RPS ~ 4.2M
Latency ~ 1.3ms, RPS ~ 3.9M