Open simplejackcoder opened 2 years ago
Tagging subscribers to this area: @dotnet/area-system-runtime-intrinsics See info in area-owners.md if you want to be subscribed.
Author: | simplejackcoder |
---|---|
Assignees: | - |
Labels: | `area-System.Runtime.Intrinsics`, `untriaged` |
Milestone: | - |
At the very least, this proposal needs to be updated to follow the API Proposal outline, similarly to https://github.com/dotnet/runtime/issues/66467
These instructions are available in user-mode and don't appear to have any oddities that would prevent their support in the JIT. `waitpkg` is a relatively new ISA that I believe is only supported in Tremont, Alder Lake, and Sapphire Rapids at the moment and is currently Intel only.
It might be interesting to see if @stephentoub or @jkotas have anywhere this could be used in-box. Things like working with the GC would likely not be easy to support and, like `pause`/`yield`, these are likely difficult to use APIs. It might be better to see if the functionality could be implicitly used where possible or if a more general set of "efficient/xplat" APIs covering this functionality is a "better idea".
For reference:
- `tpause` is `timed pause` and lets you basically wait for `n` cycles in a "power" or "efficiency" mode
- `umonitor` is `monitor address` and lets you set up a hardware trigger that occurs when a given address is written (the range is queried via `cpuid`)
- `umwait` is `monitor wait` and basically does `tpause` until the time stamp counter passes or the setup `umonitor` address is triggered

The C++ signatures for these are:
uint8_t _tpause(uint32_t control, uint64_t counter);
void _umonitor(void *address);
uint8_t _umwait(uint32_t control, uint64_t counter);
Rust provides similarly named APIs.
> It might be interesting to see if @stephentoub, @jkotas has anywhere this could be used in-box
It would be interesting to experiment with replacing the lock spin loops using these intrinsics. It should provide better overall performance, especially on machines with many cores.
The common locks are implemented in C/C++ in CoreCLR today, so we would need to reimplement them in C# first before the managed intrinsics can be used for those.
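As a rough illustration of the kind of spin loop being discussed, here is a minimal C++ sketch of a test-and-test-and-set lock whose inner wait is a pluggable hook. The names `SpinLock` and `CountingWait` are hypothetical and are not part of CoreCLR or of this proposal. The hook marks the spot where a `tpause`- or `umwait`-based wait could be substituted on hardware that supports `waitpkg`. The sketch does not execute those instructions itself, so it stays portable.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical wait hook: a stand-in for the point where a tpause/umwait-based
// wait could be plugged in. It only counts invocations, so it runs anywhere.
struct CountingWait {
    std::uint64_t calls = 0;
    void operator()() noexcept { ++calls; }
};

// Hypothetical test-and-test-and-set spin lock (illustrative only).
class SpinLock {
public:
    // Try once: a cheap relaxed read first, then the atomic exchange.
    bool try_lock() noexcept {
        return !locked_.load(std::memory_order_relaxed) &&
               !locked_.exchange(true, std::memory_order_acquire);
    }

    // Spin until acquired, calling wait() between probes while the lock is held.
    template <typename Wait>
    void lock(Wait& wait) noexcept {
        while (!try_lock()) {
            // Probe with loads only, so waiters do not keep writing the cache line.
            while (locked_.load(std::memory_order_relaxed)) {
                wait();
            }
        }
    }

    void unlock() noexcept { locked_.store(false, std::memory_order_release); }

private:
    std::atomic<bool> locked_{false};
};
```

In the uncontended case `lock` succeeds on the first `try_lock` and never calls the hook. Under contention the waiting cost is concentrated in the hook, which is the part a low-power wait instruction would replace.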
Can somebody create an API-Shape for this Proposal?
> Can somebody create an API-Shape for this Proposal?
I came up with something like this:
namespace System.Runtime.Intrinsics.X86;

[Intrinsic]
[CLSCompliant(false)]
public abstract class WaitPkg : X86Base
{
    public static new bool IsSupported { get; }

    public static unsafe void SetUpUserLevelMonitorAddress(void* address);
    public static byte WaitForUserLevelMonitor(uint control, ulong counter);
    public static byte TimedPause(uint control, ulong counter);

    [Intrinsic]
    public new abstract class X64 : X86Base.X64
    {
        internal X64() { }
        public static new bool IsSupported { get; }
    }
}
I hope it helps.
I've updated it loosely based on the above. Made a couple tweaks and gave an explanation of why `GetMaximumWaitTime` and `GetIsC02Supported` can't be exposed.
Looks good as proposed.
namespace System.Runtime.Intrinsics.X86;

[Intrinsic]
[CLSCompliant(false)]
public abstract class WaitPkg : X86Base
{
    public static new bool IsSupported { get; }

    // UMONITOR: void _umonitor(void *address);
    public static unsafe void SetUpUserLevelMonitor(void* address);

    // UMWAIT: uint8_t _umwait(uint32_t control, uint64_t counter);
    public static bool WaitForUserLevelMonitor(uint control, ulong counter);

    // TPAUSE: uint8_t _tpause(uint32_t control, uint64_t counter);
    public static bool TimedPause(uint control, ulong counter);

    [Intrinsic]
    public new abstract class X64 : X86Base.X64
    {
        internal X64() { }
        public static new bool IsSupported { get; }
    }
}
Summary
x86 based hardware introduced the `waitpkg` ISA back in 2020, which can be used to better facilitate low-power and low-latency spin-loops.
API Suggestion
Additional Considerations
There is a model specific register `IA32_UMWAIT_CONTROL` (MSR `0xE1`) which provides additional information. However, model specific registers can only be read by ring 0 (the kernel) and as such this information is not available to user mode programs without the underlying OS exposing an explicit API. As such, this information is not surfaced to the end user.

- `IA32_UMWAIT_CONTROL[31:2]`: Determines the maximum time in TSC-quanta that the processor can reside in either C0.1 or C0.2. A zero value indicates no maximum time. The maximum time value is a 32-bit value where the upper 30 bits come from this field and the lower two bits are zero.
- `IA32_UMWAIT_CONTROL[1]`: Reserved.
- `IA32_UMWAIT_CONTROL[0]`: C0.2 is not allowed by the OS. Value of “1” means all C0.2 requests revert to C0.1.

If `IA32_UMWAIT_CONTROL[0]` is `1`, it simply means that a user call of `TimedPause` where `control == 0` will be treated as `control == 1`.

Likewise, if the user specified `counter` is larger than `IA32_UMWAIT_CONTROL[31:2]` then `TimedPause` returns `true`, indicating that the pause ended due to expiration of the operating system time-limit rather than reaching/exceeding the specified `counter` (returns `false`). The same applies to `WaitForUserLevelMonitor`.
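
The interaction described above can be sketched as a small software model. This is only an illustration under stated assumptions: `model_timed_pause`, `WaitState`, and `WaitOutcome` are hypothetical names, the MSR value is passed in as a plain argument (user-mode code cannot actually read it), and the requested wait is treated as the distance from a supplied current TSC value to `counter`. The model does not execute `tpause`.

```cpp
#include <cstdint>

// Hypothetical model types; none of these exist in any real API.
enum class WaitState { C0_1, C0_2 };

struct WaitOutcome {
    WaitState state;         // power state the request resolves to
    std::uint64_t wake_tsc;  // TSC value at which the timed wait ends
    bool os_limit_hit;       // mirrors the proposed bool return value
};

// Models how a TimedPause(control, counter) request would resolve for a given
// raw IA32_UMWAIT_CONTROL value. The MSR value is an input here only because
// this is a model; it is not readable from user mode.
inline WaitOutcome model_timed_pause(std::uint32_t control, std::uint64_t counter,
                                     std::uint64_t now_tsc,
                                     std::uint32_t umwait_control) {
    // control bit 0: 0 requests C0.2, 1 requests C0.1.
    const bool wants_c02 = (control & 1u) == 0;
    // MSR bit 0 set: the OS does not allow C0.2, so the request reverts to C0.1.
    const bool c02_disallowed = (umwait_control & 1u) != 0;
    const WaitState state =
        (wants_c02 && !c02_disallowed) ? WaitState::C0_2 : WaitState::C0_1;

    // Maximum time: bits 31:2 of the MSR with the low two bits forced to zero.
    const std::uint64_t max_time = umwait_control & ~std::uint32_t{3};

    // Requested wait in TSC ticks (zero if the deadline has already passed).
    const std::uint64_t requested = counter > now_tsc ? counter - now_tsc : 0;

    // A zero maximum means "no limit"; otherwise the OS limit can end the wait early.
    if (max_time != 0 && requested > max_time) {
        return {state, now_tsc + max_time, true};
    }
    return {state, now_tsc + requested, false};
}
```

For example, with `umwait_control == 0x400` (a limit of 1024 ticks), a request to wait from TSC 1000 until 5000 resolves to a wake at 2024 with `os_limit_hit == true`. With `umwait_control == 1`, a `control == 0` request resolves to C0.1 with no time limit applied.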