Open MineCake147E opened 10 months ago
Tagging subscribers to this area: @dotnet/area-system-memory
See info in area-owners.md if you want to be subscribed.
| Author: | MineCake147E |
|---|---|
| Assignees: | - |
| Labels: | `api-suggestion`, `area-System.Memory`, `untriaged`, `needs-area-label` |
| Milestone: | - |
Alternative Designs
These APIs can start as an independent 3rd party package.
That would be a good first step as well.
However, I came up with another use case of async `memmove` without DSA: copying multiple blocks of large data simultaneously while managing each task in a simple way.
var t0 = src0.CopyToAsync(dst0);
var t1 = src1.CopyToAsync(dst1);
// ...
await t0;
await t1;
// ...
In this case, APIs like `CopyToAsync` might need an additional argument that tells them not to run synchronously even when no hardware accelerator is available.
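To make the idea concrete, here is a minimal sketch of what such an argument could look like in a software-only fallback. Everything in it is an assumption for illustration: the class name `AsyncCopySketch`, the `forceAsynchronous` parameter, and the choice of the thread pool as the asynchronous path are hypothetical and are not part of the proposed API or of .NET.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical sketch: these names are illustrative and not part of .NET.
public static class AsyncCopySketch
{
    // Copies source into destination. With forceAsynchronous == false the copy
    // runs inline and an already completed ValueTask is returned. With
    // forceAsynchronous == true the copy is offloaded to the thread pool, which
    // stands in for a hardware accelerator that is assumed to be unavailable.
    public static ValueTask CopyToAsync<T>(
        this ReadOnlyMemory<T> source,
        Memory<T> destination,
        bool forceAsynchronous = false)
    {
        // Validate up front so that argument errors surface at the call site
        // instead of inside the returned task.
        if (source.Length > destination.Length)
        {
            throw new ArgumentException("Destination is too short.", nameof(destination));
        }

        if (!forceAsynchronous)
        {
            source.Span.CopyTo(destination.Span);
            return ValueTask.CompletedTask;
        }

        return new ValueTask(Task.Run(() => source.Span.CopyTo(destination.Span)));
    }
}
```

With this sketch, the snippet above would become `var t0 = src0.CopyToAsync(dst0, forceAsynchronous: true);`, provided `src0` is typed as `ReadOnlyMemory<T>` and `dst0` as `Memory<T>`. The source and destination buffers must not be touched by other code until the returned task completes.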
Background and motivation
`memmove` and `memset` have been among the most common operations from the earliest days of computing to today. They are included in most practical programming languages, except assembly, and have been optimized incrementally for a long time. However, these two functions are reaching their limits of optimization, especially in terms of power consumption and CPU utilization. While simultaneous multithreading has slightly improved CPU utilization, it doesn't solve the fundamental problem underlying these two functions: during operation, a CPU core running `memmove` or `memset` is only copying memory to and from CPU registers, and not performing any actual operations that alter the data.
To solve this fundamental problem, Intel has included one or more accelerators called Data Streaming Accelerator (DSA for short) in most of its Xeon CPUs since Sapphire Rapids. DSA can only perform a limited number of memory operations, including `memmove` and a limited subset of `memset` (the pattern size of `memset` in bits must be a power of 2, 16 bytes (128 bits) or fewer). But DSA performs these operations asynchronously, independent of the CPU. And in most cases, it's much faster than a CPU core doing the same thing.
Theoretically, we can use DSA to perform `memmove` and `memset` asynchronously, by either returning a `ValueTask` that waits until the DSA finishes processing, or returning `ValueTask.CompletedTask` if the DSA is not available or the task is performed synchronously for some reason. The same principle could be applied if competitors, especially AMD and ARM, begin to incorporate similar hardware accelerators in their CPUs, SoCs, etc. in the future. And in my humble opinion, it's likely to happen. This is why I propose these APIs as a cross-platform thing.
Although implementing the actual hardware acceleration support could be hard, the APIs below can trivially be implemented by just assuming no hardware accelerators are available, as a first step. The only exception is `Buffer.MemoryFillAsync`, which needs to be implemented in software as well, but it can be implemented by slightly modifying the code from `Span<T>.Fill`. The actual hardware acceleration support could be implemented in the near future.
As a side note, I have a PC with a Xeon w5-2455X with one DSA included, so I would be happy to help with debugging.
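As a rough illustration of the "assume no accelerator" first step, the sketch below fills memory synchronously and returns `ValueTask.CompletedTask`, and it encodes the pattern-size rule described above as a simple check. The class and method names (`MemoryFillSketch`, `FillAsync`, `IsAcceleratorFriendlyPatternSize`) are hypothetical and their signatures are assumptions; they are not the proposed `Buffer.MemoryFillAsync` API.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical sketch: names and signatures are assumptions, not the proposed API.
public static class MemoryFillSketch
{
    // Pattern-size rule as stated in the text: the pattern must be a power of
    // two in size and no larger than 16 bytes (128 bits).
    public static bool IsAcceleratorFriendlyPatternSize(int patternSizeInBytes)
    {
        return patternSizeInBytes > 0
            && patternSizeInBytes <= 16
            && (patternSizeInBytes & (patternSizeInBytes - 1)) == 0;
    }

    // Software-only fallback: fill synchronously with Span<T>.Fill and report
    // completion immediately, as if no hardware accelerator were present.
    public static ValueTask FillAsync<T>(Memory<T> destination, T value)
    {
        destination.Span.Fill(value);
        return ValueTask.CompletedTask;
    }
}
```

A real implementation could use such a check to decide whether a fill can be handed to an accelerator or must stay on the CPU; callers would simply `await MemoryFillSketch.FillAsync(buffer, value)` in either case.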
API Proposal
API Usage
Alternative Designs
`Buffer.MemoryFillAsync` could also be included, as there is no such thing currently.
Risks
Windows 11 Home someday?