dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

[API Proposal]: Asynchronous memory operation APIs for future hardware offloading and acceleration #97194

Open MineCake147E opened 10 months ago

MineCake147E commented 10 months ago

Background and motivation

memmove and memset have been among the most common operations from the earliest days of computing to today. They are included in most practical programming languages and have been optimized incrementally for a very long time. However, these two functions are reaching the limits of optimization, especially in terms of power consumption and CPU utilization. While simultaneous multithreading has slightly improved CPU utilization, it doesn't solve the fundamental problem underlying these two functions: while running memmove or memset, a CPU core only shuttles memory to and from CPU registers and performs no computation that actually transforms the data.

To solve this fundamental problem, Intel started including one or more accelerators called the Data Streaming Accelerator (DSA) in most of its Xeon CPUs, beginning with Sapphire Rapids. DSA can perform only a limited set of memory operations, including memmove and a restricted subset of memset (the pattern size in bits must be a power of 2, and at most 16 bytes (128 bits)). But DSA performs these operations asynchronously, independently of the CPU cores, and in most cases it is much faster than a CPU core doing the same work.
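For illustration only, here is a minimal sketch of how an implementation might test whether a fill request satisfies the DSA memset constraint described above; the helper name is hypothetical and not part of this proposal.

// Hypothetical helper (not part of this proposal): returns true when a fill
// pattern of `patternSizeInBytes` meets the DSA memset constraint described above.
static bool IsDsaFillPatternSupported(ulong patternSizeInBytes)
{
    // The pattern must be non-empty, a power of two in size,
    // and no larger than 16 bytes (128 bits).
    return patternSizeInBytes != 0
        && System.Numerics.BitOperations.IsPow2(patternSizeInBytes)
        && patternSizeInBytes <= 16;
}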

Theoretically, we can use DSA to perform memmove and memset asynchronously, by returning a ValueTask that completes when the DSA finishes processing, or returning ValueTask.CompletedTask if no DSA is available or the operation is performed synchronously for some reason. The same principle could apply if competitors, especially AMD and ARM, begin to incorporate similar hardware accelerators in their CPUs, SoCs, etc. in the future, which in my humble opinion is likely to happen. This is why I propose these APIs as cross-platform.

Although implementing actual hardware acceleration could be hard, the APIs below can, as a first step, be implemented trivially by assuming that no hardware accelerator is available. The only exception is Buffer.MemoryFillAsync, which needs a software implementation as well, but that can be written by slightly modifying the code behind Span<T>.Fill. Actual hardware acceleration support could then be added in the near future.
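As a rough sketch of that software-only first step, outside the BCL this could look like extension methods that perform the work synchronously and hand back a completed ValueTask; the class and method names below are illustrative, not the proposed surface.

// Illustrative software-only fallback (not the proposed implementation):
// runs synchronously and reports completion through a ValueTask, which is
// the behavior expected whenever no accelerator is available.
public static class MemoryAsyncFallback
{
    public static ValueTask CopyToAsyncFallback<T>(this ReadOnlyMemory<T> source, Memory<T> destination)
    {
        source.Span.CopyTo(destination.Span);
        return ValueTask.CompletedTask;
    }

    public static ValueTask FillAsyncFallback<T>(this Memory<T> memory, T value)
    {
        memory.Span.Fill(value);
        return ValueTask.CompletedTask;
    }
}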

As a side note, I have a PC with a Xeon w5-2455X with one DSA included, so I would be happy to help with debugging.

API Proposal

namespace System
{
    public readonly struct Memory<T> : IEquatable<Memory<T>>
    {
        public ValueTask CopyToAsync(Memory<T> destination);

        public ValueTask<bool> TryCopyToAsync(Memory<T> destination);

        public ValueTask FillAsync(T value);

        public ValueTask ClearAsync();
    }

    public readonly struct ReadOnlyMemory<T> : IEquatable<ReadOnlyMemory<T>>
    {
        public ValueTask CopyToAsync(Memory<T> destination);

        public ValueTask<bool> TryCopyToAsync(Memory<T> destination);
    }

    public static partial class Buffer
    {
        public static unsafe ValueTask MemoryCopyAsync(void* source, void* destination, ulong destinationSizeInBytes, ulong sourceBytesToCopy);
        public static unsafe ValueTask MemoryFillAsync(void* source, void* destination, ulong destinationSizeInBytes, ulong sourcePatternBytesToFill);
    }
}

API Usage

// `source` and `destination` refer to memory regions spanning several mebibytes
// Let DSA copy `source` to `destination` asynchronously if it is available
var task = source.CopyToAsync(destination);
// Other work can run while the copy is in flight
await task;
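Similarly, a hedged sketch of how the proposed FillAsync and ClearAsync might be used, assuming `buffer` is a large Memory<byte>:

// `buffer` is assumed to be a large Memory<byte>
// Request an asynchronous fill; this would come back as a completed task
// when no accelerator is available and the fill ran synchronously
var fillTask = buffer.FillAsync((byte)0xFF);
// Other work can run while the fill is in flight
await fillTask;
// Clearing works the same way
await buffer.ClearAsync();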

Alternative Designs

* There may be better names for each identifier.
* A synchronous variant of Buffer.MemoryFillAsync could also be included, as no such API exists today.
* Many other operations, such as comparison, could also be offloaded to DSA.

Risks

* There is no benefit at all for environments without any hardware accelerator available.
  * Or should these operations be offloaded to another CPU core? Is that worth doing?
* For Intel DSA, only a few CPUs can benefit from these features today.
  * This means that debugging these features is really hard today.
* Intel, operating system vendors like Microsoft, or third parties would need to provide drivers, either kernel mode or user mode, or both.
  * For example, the Intel DSA driver (IDXD) was introduced in Linux kernel version 5.6.
  * Will Microsoft ship one for Windows 11 Home someday?
* .NET should be able to detect DSA availability in order to utilize it.
* Some users may abuse the asynchronous variants without considering the benefits or drawbacks, regardless of hardware acceleration availability.
  * This can be mitigated by automatically switching to the synchronous path depending on hardware acceleration availability and the size of the data to process (see the sketch below).
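As a minimal sketch of that mitigation, an implementation might dispatch on data size and accelerator availability. The threshold and the TryQueueAcceleratorCopy helper below are hypothetical and not part of this proposal.

// Illustrative dispatch only; the threshold and TryQueueAcceleratorCopy are hypothetical.
static class CopyDispatchSketch
{
    // Arbitrary cutoff for illustration: only consider offloading copies of 1 MiB or more
    private const int DsaThresholdInBytes = 1 << 20;

    public static ValueTask CopyToAsyncSketch(ReadOnlyMemory<byte> source, Memory<byte> destination)
    {
        if (source.Length >= DsaThresholdInBytes
            && TryQueueAcceleratorCopy(source, destination, out ValueTask pending))
        {
            return pending; // completes when the accelerator finishes
        }

        // Small copy or no accelerator: fall back to the synchronous path
        source.Span.CopyTo(destination.Span);
        return ValueTask.CompletedTask;
    }

    // Hypothetical accelerator hook; this sketch always reports "unavailable".
    private static bool TryQueueAcceleratorCopy(ReadOnlyMemory<byte> source, Memory<byte> destination, out ValueTask pending)
    {
        pending = ValueTask.CompletedTask;
        return false;
    }
}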

ghost commented 10 months ago

Tagging subscribers to this area: @dotnet/area-system-memory See info in area-owners.md if you want to be subscribed.

Issue Details
Author: MineCake147E
Assignees: -
Labels: `api-suggestion`, `area-System.Memory`, `untriaged`, `needs-area-label`
Milestone: -
jkotas commented 10 months ago

Alternative Designs

These APIs can start as an independent 3rd party package.

MineCake147E commented 10 months ago

These APIs can start as an independent 3rd party package.

That would be a good first step as well. However, I came up with another use case for async memmove even without DSA: copying multiple large blocks of data concurrently while keeping the management of each task simple.

var t0 = src0.CopyToAsync(dst0);
var t1 = src1.CopyToAsync(dst1);
// ...
await t0;
await t1;
// ...

In this case, APIs like CopyToAsync might need an additional argument specifying that the copy should not be executed synchronously on the calling thread, even when no hardware accelerator is available.
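For illustration, such an argument could take a shape like the following; the enum name and members are mine and not part of the proposed API.

// Hypothetical shape for such an argument (names are illustrative only)
public enum MemoryOperationMode
{
    Auto,       // use an accelerator when available, otherwise run synchronously
    ForceAsync, // never run on the calling thread, even without an accelerator
}

public readonly struct Memory<T> : IEquatable<Memory<T>>
{
    public ValueTask CopyToAsync(Memory<T> destination, MemoryOperationMode mode);
}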