dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

Using UMONITOR, UMWAIT, TPAUSE in CLR and exposing in Intel specific hardware intrinsics #66873

Open simplejackcoder opened 2 years ago

simplejackcoder commented 2 years ago

Summary

x86-based hardware introduced the WAITPKG ISA extension in 2020. Its instructions (UMONITOR, UMWAIT, TPAUSE) can be used to implement lower-power, lower-latency spin loops.

API Suggestion

namespace System.Runtime.Intrinsics.X86;

[Intrinsic]
[CLSCompliant(false)]
public abstract class WaitPkg : X86Base
{
    public static new bool IsSupported { get; }

    // UMONITOR: void _umonitor(void *address);
    public static unsafe void SetUpUserLevelMonitor(void* address);

    // UMWAIT: uint8_t _umwait(uint32_t control, uint64_t counter);
    public static bool WaitForUserLevelMonitor(uint control, ulong counter);

    // TPAUSE: uint8_t _tpause(uint32_t control, uint64_t counter);
    public static bool TimedPause(uint control, ulong counter);

    [Intrinsic]
    public new abstract class X64 : X86Base.X64
    {
        internal X64() { }

        public static new bool IsSupported { get; }
    }
}
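
As a rough illustration of what `IsSupported` would have to report, the sketch below checks the WAITPKG feature bit, CPUID.(EAX=07H, ECX=0):ECX[5], from C. It is written in C rather than C# because the managed class above is only a proposal. The function name `waitpkg_is_supported` is hypothetical, and the code assumes a GCC- or Clang-compatible compiler that provides `<cpuid.h>`.

```c
#include <stdbool.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

/* Hypothetical helper: true when the CPU reports the WAITPKG feature,
 * i.e. CPUID.(EAX=07H, ECX=0):ECX[5] is set. */
bool waitpkg_is_supported(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;

    /* __get_cpuid_count returns 0 when leaf 7 is not available. */
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return false;

    return ((ecx >> 5) & 1u) != 0;
#else
    return false; /* not an x86 target, so the instructions cannot exist */
#endif
}
```

Callers would be expected to test this once and fall back to an ordinary pause loop when it returns false, in the same way managed code guards intrinsics behind `IsSupported`.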

Additional Considerations

There is a model specific register IA32_UMWAIT_CONTROL (MSR 0xE1) which provides additional information. However, model specific registers can only be read by ring 0 (the kernel) and as such this information is not available to user mode programs without the underlying OS exposing an explicit API. As such, this information is not surfaced to the end user.

This information is not strictly pertinent to the user either and would not normally influence their use of the APIs. For example, if IA32_UMWAIT_CONTROL[0] is 1, it simply means that a user call of TimedPause where control == 0 will be treated as control == 1:

| Bit Value | State Name | Wakeup Time | Power Savings | Other Benefits |
|---|---|---|---|---|
| bit[0] = 0 | C0.2 | Slower | Larger | Improves performance of the other SMT thread(s) on the same core |
| bit[0] = 1 | C0.1 | Faster | Smaller | N/A |
| bits[31:1] | N/A | N/A | N/A | Reserved |

Likewise, if the user-specified counter is larger than IA32_UMWAIT_CONTROL[31:2], then TimedPause returns true, indicating that the pause ended because the operating system time limit expired rather than because the specified counter was reached or exceeded (in which case it returns false). The same applies to WaitForUserLevelMonitor.
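
The return-value behaviour described above can be sketched with the C intrinsic `_tpause`, which the proposed `TimedPause` maps to. This is a minimal sketch that assumes a GCC- or Clang-compatible compiler. The names `pause_result`, `tpause_until`, and `timed_pause` are hypothetical, and the deadline is computed as the current TSC value plus a caller-supplied tick count.

```c
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#include <immintrin.h>
#include <x86intrin.h>
#endif

/* Values for bit[0] of the control operand, as in the table above. */
enum { TPAUSE_C02 = 0, TPAUSE_C01 = 1 };

/* Hypothetical result type for this sketch. */
typedef enum {
    PAUSE_UNSUPPORTED      = -1, /* WAITPKG not available on this CPU     */
    PAUSE_DEADLINE_REACHED =  0, /* _tpause returned 0 (managed: false)   */
    PAUSE_OS_LIMIT_EXPIRED =  1  /* _tpause returned 1 (managed: true)    */
} pause_result;

#if defined(__x86_64__) || defined(__i386__)
/* The target attribute enables WAITPKG for this one function, so the file
 * does not need to be compiled with -mwaitpkg. */
__attribute__((target("waitpkg")))
static pause_result tpause_until(uint32_t control, uint64_t tsc_ticks)
{
    uint64_t deadline = __rdtsc() + tsc_ticks; /* absolute TSC deadline */
    return _tpause(control, deadline) ? PAUSE_OS_LIMIT_EXPIRED
                                      : PAUSE_DEADLINE_REACHED;
}
#endif

/* Pauses for roughly tsc_ticks TSC ticks if the CPU supports WAITPKG. */
pause_result timed_pause(uint32_t control, uint64_t tsc_ticks)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx) && ((ecx >> 5) & 1u))
        return tpause_until(control, tsc_ticks);
#else
    (void)control;
    (void)tsc_ticks;
#endif
    return PAUSE_UNSUPPORTED;
}
```

A caller cannot tell from the result whether a request for C0.2 was downgraded to C0.1 by the operating system, which matches the point above that the MSR contents are not visible from user mode.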

ghost commented 2 years ago

Tagging subscribers to this area: @dotnet/area-system-runtime-intrinsics. See info in area-owners.md if you want to be subscribed.

Issue Details

Use cases:
* Expose in the hardware intrinsics library
* Threads
* Sleep
* GC write watch

Author: simplejackcoder
Assignees: -
Labels: `area-System.Runtime.Intrinsics`, `untriaged`
Milestone: -

tannergooding commented 2 years ago

At the very least, this proposal needs to be updated to follow the API Proposal outline, similar to https://github.com/dotnet/runtime/issues/66467

These instructions are available in user-mode and don't appear to have any oddities that would prevent their support in the JIT. waitpkg is a relatively new ISA that I believe is only supported in Tremont, Alder Lake, Sapphire Rapids at the moment and is currently Intel only.

It might be interesting to see if @stephentoub, @jkotas has anywhere this could be used in-box. Things like working with the GC would likely not be easy to support, and like pause/yield, these are likely difficult-to-use APIs. It might be better to see if the functionality could be implicitly used where possible, or if a more general set of "efficient/xplat" APIs covering this functionality is a "better idea".

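For reference, the C++ signatures for these are:

    void _umonitor(void *address);
    uint8_t _umwait(uint32_t control, uint64_t counter);
    uint8_t _tpause(uint32_t control, uint64_t counter);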

Rust provides similarly named APIs.

ghost commented 2 years ago

This issue has been marked needs-author-action since it may be missing important information. Please refer to our contribution guidelines for tips on how to report issues effectively.

jkotas commented 2 years ago

> It might be interesting to see if @stephentoub, @jkotas has anywhere this could be used in-box

It would be interesting to experiment with replacing the lock spin loops using these intrinsics. It should provide better overall performance, especially on machines with many cores.

The common locks are implemented in C/C++ in CoreCLR today, so we would need to reimplement them in C# first before the managed intrinsics can be used for those.
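
To make the spin-loop idea concrete, here is a minimal C sketch of a bounded wait on a flag that uses UMONITOR/UMWAIT when the CPU supports them and falls back to PAUSE otherwise. It is not how CoreCLR implements its locks. The names `cpu_has_waitpkg`, `wait_once_waitpkg`, and `spin_until_set` are hypothetical, the 10000-tick timeout is an arbitrary placeholder, and a GCC- or Clang-compatible compiler is assumed.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#include <immintrin.h>
#include <x86intrin.h>

/* CPUID.(EAX=07H, ECX=0):ECX[5] reports WAITPKG. */
static bool cpu_has_waitpkg(void)
{
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    return __get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)
        && ((ecx >> 5) & 1u) != 0;
}

/* One monitored wait: arm the monitor, re-check, then wait in C0.1. */
__attribute__((target("waitpkg")))
static void wait_once_waitpkg(atomic_int *flag, uint64_t tsc_ticks)
{
    _umonitor((void *)flag);
    /* Re-check after arming so a store that landed just before
     * UMONITOR is not missed. */
    if (atomic_load_explicit(flag, memory_order_acquire) == 0)
        (void)_umwait(1u /* C0.1 */, __rdtsc() + tsc_ticks);
}
#endif

/* Waits until *flag is non-zero or max_spins iterations have elapsed.
 * Returns true if the flag was observed set. */
bool spin_until_set(atomic_int *flag, unsigned max_spins)
{
#if defined(__x86_64__) || defined(__i386__)
    const bool use_waitpkg = cpu_has_waitpkg();
#endif
    for (unsigned i = 0; i < max_spins; ++i) {
        if (atomic_load_explicit(flag, memory_order_acquire) != 0)
            return true;
#if defined(__x86_64__) || defined(__i386__)
        if (use_waitpkg)
            wait_once_waitpkg(flag, 10000); /* placeholder timeout */
        else
            _mm_pause();                    /* classic spin-loop hint */
#endif
    }
    return atomic_load_explicit(flag, memory_order_acquire) != 0;
}
```

The flag is re-read after every wake-up because UMWAIT can return for reasons other than a write to the monitored address, such as the deadline or the operating system time limit expiring.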

ghost commented 2 years ago

This issue has been automatically marked no-recent-activity because it has not had any activity for 14 days. It will be closed if no further activity occurs within 14 more days. Any new comment (by anyone, not necessarily the author) will remove no-recent-activity.

deeprobin commented 2 years ago

Can somebody create an API-Shape for this Proposal?

MineCake147E commented 9 months ago

> Can somebody create an API-Shape for this Proposal?

I came up with something like this:

namespace System.Runtime.Intrinsics.X86;

[Intrinsic]
[CLSCompliant(false)]
public abstract class WaitPkg : X86Base
{
    public static new bool IsSupported { get; }
    public static unsafe void SetUpUserLevelMonitorAddress(void* address);
    public static byte WaitForUserLevelMonitor(uint control, ulong counter);
    public static byte TimedPause(uint control, ulong counter);
    [Intrinsic]
    public new abstract class X64 : X86Base.X64
    {
        internal X64() { }

        public static new bool IsSupported { get; }
    }
}

I hope it helps.

tannergooding commented 9 months ago

I've updated it loosely based on the above. I made a couple of tweaks and gave an explanation of why GetMaximumWaitTime and GetIsC02Supported can't be exposed.

bartonjs commented 8 months ago

Video

Looks good as proposed.

namespace System.Runtime.Intrinsics.X86;

[Intrinsic]
[CLSCompliant(false)]
public abstract class WaitPkg : X86Base
{
    public static new bool IsSupported { get; }

    // UMONITOR: void _umonitor(void *address);
    public static unsafe void SetUpUserLevelMonitor(void* address);

    // UMWAIT: uint8_t _umwait(uint32_t control, uint64_t counter);
    public static bool WaitForUserLevelMonitor(uint control, ulong counter);

    // TPAUSE: uint8_t _tpause(uint32_t control, uint64_t counter);
    public static bool TimedPause(uint control, ulong counter);

    [Intrinsic]
    public new abstract class X64 : X86Base.X64
    {
        internal X64() { }

        public static new bool IsSupported { get; }
    }
}