
yieldprocessornormalized appears to be incorrect in some scenarios #100242

Open tannergooding opened 6 months ago

tannergooding commented 6 months ago

This pertains to the logic in https://github.com/dotnet/runtime/blob/main/src/coreclr/vm/yieldprocessornormalized.cpp and where an initial discussion was raised here: https://github.com/dotnet/runtime/pull/99991#discussion_r1536151998

The general premise is that while the logic attempts to be robust and appears to produce correct results on Windows, for at least a subset of the machines tested the same logic does not return consistent results for the same hardware on Linux. Additionally, there are scenarios where the logic will be incorrect on big.LITTLE (or performance/efficiency core) micro-architectures, and results can become skewed by external factors such as processor downclocking (energy savings), processor boosting, or context switches caused by the operating system.
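To make the failure mode concrete, here is a minimal sketch of the general normalization idea (illustrative only, assuming x86 for the pause instruction; this is not the coreclr implementation): time a fixed number of pause instructions and derive a nanoseconds-per-pause figure, which then absorbs anything that perturbs the loop.

```cpp
#include <immintrin.h> // _mm_pause
#include <chrono>

// Illustrative sketch only (not the coreclr code): time a fixed number of
// pause instructions and derive nanoseconds per pause. Downclocking,
// boosting, a context switch, or migration to a core with a different pause
// latency during the loop is folded directly into the returned figure.
static double MeasureNsPerPause(int iterations = 1000)
{
    auto start = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < iterations; ++i)
        _mm_pause();
    auto end = std::chrono::high_resolution_clock::now();
    double elapsedNs = std::chrono::duration<double, std::nano>(end - start).count();
    return elapsedNs / iterations; // e.g. skewed upward if the thread was preempted mid-loop
}
```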

tannergooding commented 6 months ago

The simplest example is to test the results on a well-documented system, such as an AMD Zen4 chip, where pause is documented to take exactly 64 cycles. On Windows, the computed time is typically around 14ns (which is spot on for a processor with a 4.5GHz base clock). However, if you were to log the min/max times and the processor ID as seen over all calls, you'd see something similar to the following:

min: 12.76595744680851 8
max: 12.76595744680851 8
max: 13.924050632911392 8
min: 12.658227848101266 8
min: 11.702127659574469 8
max: 16.949152542372882 8
max: 24.390243902439025 30
max: 26.829268292682926 26
max: 81.69014084507042 8
max: 142.85714285714286 8
max: 182.33333333333334 0

This variance is caused by processor frequency changes as well as by OS context switches, but it also shows that the core being measured may not stay consistent, which means the result can negatively impact other cores.
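For a sense of scale, a quick back-of-the-envelope calculation shows how the same 64-cycle pause maps to very different nanosecond figures as the core frequency moves around (the frequencies below are illustrative example values, not measurements from the runs above):

```cpp
#include <cstdio>

// Illustrative arithmetic only: 64 cycles divided by the current core
// frequency. The frequencies are example values, not measured ones.
int main()
{
    const double pauseCycles = 64.0;
    const double freqsGhz[] = { 5.7, 4.5, 2.2, 0.8 }; // boost, base, power saving, deep downclock
    for (double ghz : freqsGhz)
        std::printf("%.1f GHz -> %.1f ns per pause\n", ghz, pauseCycles / ghz);
}
```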

On Linux, however, the same code returns drastically different measurements, even though pause is documented to take exactly 64 cycles:

min: 8.333333333333334 23
max: 8.333333333333334 23
min: 7.046979865771812 23
min: 6.709677419354839 23
min: 6.580645161290323 23
min: 6.538461538461538 23
max: 9.0625 23
max: 10.412371134020619 23
max: 40.796460176991154 23
max: 54.01639344262295 23
max: 102.53571428571429 23
max: 181.78571428571428 23
max: 182.14285714285714 23
max: 182.33333333333334 23
min: 6.5 23
min: 6.4743589743589745 23
min: 6.3354037267080745 23
min: 6.31875 23
min: 6.3125 23
min: 6.273291925465839 22

dotnet-policy-service[bot] commented 6 months ago

Tagging subscribers to this area: @mangod9 See info in area-owners.md if you want to be subscribed.

tannergooding commented 6 months ago

Similar discrepancies are seen on Arm64 for devices like the Volterra (Windows Dev Kit) or a Raspberry Pi.

However, for Arm64 in particular it is worth noting that we currently use yield. The architecture manuals specify that yield may be a nop on many systems, and in practice this can be seen in the measurements:

min: 0.4478280340349306 5
max: 0.4478280340349306 6
min: 0.42662116040955633 6
max: 0.5071461502996772 5
min: 0.41736227045075125 6
min: 0.3985651654045436 5
max: 2.345415778251599 6
max: 2.487562189054726 5
min: 0.3838771593090211 6
min: 0.37091988130563797 5
max: 36.47746243739566 6
max: 89.28571428571429 5
max: 173.2193732193732 6
max: 174.28571428571428 0
max: 182.33333333333334 6

If you were to replace the yield with a true nop, you'd actually see identical measurements.

Many other runtimes have correspondingly switched to emitting isb instead, because yield behaves as a nop on a very large amount of real hardware. isb has a latency of around 9-10ns, which is much closer to that of the x86/x64 pause instruction.
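As a sketch of what that could look like (hypothetical; the flag name and selection policy are illustrative, not an existing coreclr mechanism), the Arm64 spin-wait hint could fall back from yield to isb when yield measures like a nop:

```cpp
#if defined(__aarch64__)
// Hypothetical sketch: choose the Arm64 spin-wait hint once at startup,
// e.g. switch to isb if the measured cost of yield is indistinguishable
// from nop (as in the numbers above). The flag name is illustrative.
static bool g_useIsbForSpinWait;

static inline void SpinWaitHint()
{
    if (g_useIsbForSpinWait)
        asm volatile("isb" ::: "memory");   // ~9-10ns on much real hardware, per the discussion above
    else
        asm volatile("yield" ::: "memory"); // may execute as a nop on many implementations
}
#endif
```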

Additionally, Armv8 has an optional feature (now required in Armv8.7+) called Wait for Event and Wait for Event with Timeout. This functionality is intentionally designed for use with spin locks and comes with explicit, documented samples showing how to use it correctly. As such, it may be desirable to adjust Arm64 to use wfe if available, and otherwise to use isb if the underlying yield reports timings similar to nop.

This should result in better energy efficiency, fewer instructions executed, etc.
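A minimal sketch of the documented wfe spin-wait pattern is below (illustrative only; the function name is made up, the timed wfet variant is omitted, and a real lock would still need the exclusive-store acquire path). The waiter arms the exclusive monitor with a load-exclusive and sleeps in wfe until another core's store to that address generates an event.

```cpp
#include <cstdint>

// Illustrative sketch of the Arm-documented wait pattern: spin until the
// value at `addr` becomes zero, sleeping in wfe between checks instead of
// hammering the cache line. Not the coreclr implementation.
static inline void WaitUntilZero(volatile uint32_t* addr)
{
#if defined(__aarch64__)
    uint32_t value;
    asm volatile(
        "   sevl                  \n" // prime the event register so the first wfe falls through
        "1: wfe                   \n" // sleep until an event, e.g. the monitored address is written
        "   ldaxr   %w0, [%1]     \n" // load-exclusive arms the monitor for this address
        "   cbnz    %w0, 1b       \n" // still non-zero: go back to sleep
        : "=&r"(value)
        : "r"(addr)
        : "memory");
#else
    while (*addr != 0) { } // naive fallback for non-Arm64 builds
#endif
}
```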

tannergooding commented 6 months ago

x64 similarly has newer optional functionality in the form of tpause (timed pause), which is preferred over pause on newer hardware and allows explicitly opting for either better performance or better energy efficiency. It also has umwait and umonitor, which allow similarly efficient waiting for a monitored address to be written and can be used for efficient semaphores or other optimizations.
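A hedged sketch of what using tpause could look like (illustrative only; real code would detect waitpkg support via CPUID at runtime, and the location of __rdtsc varies by toolchain):

```cpp
#include <immintrin.h> // _tpause, _mm_pause; __rdtsc may need <x86intrin.h>/<intrin.h> on some toolchains
#include <cstdint>

// Illustrative sketch: wait for roughly `cycles` TSC cycles. With waitpkg,
// tpause lets the caller pick the lighter C0.1 state (faster wakeup) or the
// deeper C0.2 state (better energy efficiency). Real code would detect
// waitpkg via CPUID (leaf 7, ECX bit 5) rather than a compile-time macro.
static void ShortWait(uint64_t cycles, bool preferPowerSavings)
{
#if defined(__WAITPKG__)
    const unsigned control = preferPowerSavings ? 0u : 1u; // 0 = C0.2, 1 = C0.1
    _tpause(control, __rdtsc() + cycles); // may return early at the OS-imposed wait limit
#else
    for (uint64_t i = 0; i < cycles / 64; ++i)
        _mm_pause(); // fallback, assuming ~64 cycles per pause as discussed above
#endif
}
```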

kouvel commented 6 months ago

@tannergooding could you please share the test case that yields the measurements in https://github.com/dotnet/runtime/issues/100242#issuecomment-2018447975?

tannergooding commented 6 months ago

It's just modifying YieldProcessorNormalization::PerformMeasurement() to print out the data to the console. But the same general premise can be observed from the FireEtwYieldProcessorMeasurement event or by setting a breakpoint.

The processor ID can be queried via GetCurrentProcessorNumberEx() (Win32) or sched_getcpu (Linux).
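For reference, a small cross-platform helper along those lines might look like this (illustrative; the processor-group flattening is a simplification):

```cpp
#if defined(_WIN32)
#include <windows.h>
#else
#include <sched.h> // sched_getcpu; may require _GNU_SOURCE with some C toolchains
#endif

// Illustrative helper for logging which core a measurement ran on.
static int CurrentProcessorId()
{
#if defined(_WIN32)
    PROCESSOR_NUMBER pn;
    GetCurrentProcessorNumberEx(&pn); // fills in processor group + number within the group
    return pn.Group * 64 + pn.Number; // flatten, assuming at most 64 logical processors per group
#else
    return sched_getcpu();            // current CPU of the calling thread
#endif
}
```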

kouvel commented 6 months ago

> It's just modifying YieldProcessorNormalization::PerformMeasurement() to print out the data to the console. But the same general premise can be observed from the FireEtwYieldProcessorMeasurement event or by setting a breakpoint.
>
> The processor ID can be queried via GetCurrentProcessorNumberEx() (Win32) or sched_getcpu (Linux).

Some variance is to be expected in normal situations, but are you saying that this variance occurs even with nothing else happening on the machine?

tannergooding commented 6 months ago

Yes.

I'm consistently seeing Linux report numbers that are lower than Windows does (and therefore also lower than the hardware's documented time), and the variance can be quite substantial in some cases, depending largely on the previous power state of the CPU.

The Intel Software Optimization Manual documents that a core license change (which can occur with heavy AVX2/AVX-512 usage, for example) can take up to 500 microseconds, and that the core will then spend at least 2ms at the reduced frequency before changing back to the higher clock speed. It also documents that many operating systems use a time constant on the order of 10-100ms to detect processor workload demand and adjust the explicitly requested frequency.

These documented timings show that there can be a substantial window during which pause latency can be mismeasured for a given core. This also doesn't factor in the implications of heterogeneous architectures (where pause may have different latencies on different cores).
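Putting rough numbers on that window (illustrative arithmetic only, using the figures quoted above):

```cpp
// Illustrative arithmetic only, using the figures quoted above.
constexpr double licenseChangeMs = 0.5;  // up to 500us for the license transition itself
constexpr double reducedClockMs  = 2.0;  // at least 2ms before returning to the higher clock
constexpr double minSkewWindowMs = licenseChangeMs + reducedClockMs; // ~2.5ms per event
// An OS governor reacting on a 10-100ms time constant can hold the wrong
// frequency for far longer, and any measurement taken inside such a window is
// normalized against a clock the core is no longer running at.
```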

On Windows we do get eventual consistency, since the normalization is re-run semi-regularly for a thread as needed, but it's still not great and can get skewed if a context switch moves the thread to a different core.

tannergooding commented 6 months ago

I believe @EgorBo saw numbers similar to mine when he was looking at some Linux-related perf traces for Arm64.

EgorBo commented 6 months ago

> Many other runtimes have correspondingly switched to emitting isb instead, because yield behaves as a nop on a very large amount of real hardware. isb has a latency of around 9-10ns, which is much closer to that of the x86/x64 pause instruction.

It was discussed here: https://github.com/dotnet/runtime/pull/92611#issuecomment-1864766668