dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/

Proposal: introduce IMemory interface to improve performance #32706

Open artelk opened 4 years ago

artelk commented 4 years ago

Motivation

Currently the Memory<T>.Span property is quite complex and it is slower than it could be (see the benchmarks below). It will probably become even slower with Utf8String support: https://github.com/dotnet/runtime/blob/fff8f5ab763f3c1eea95ab201bfde9bd4b6e62fb/src/libraries/System.Private.CoreLib/src/System/Memory.cs#L313 And for unmanaged memory exposed via MemoryManager<T> it is already significantly slower than for array-based memory.

Proposal

The proposal is to introduce IMemory and IReadOnlyMemory interfaces so you can have multiple memory types. Each type could be implemented and optimized for its own underlying memory provider separately. The interfaces could look like:

public interface IReadOnlyMemory<T, TMemory>
    where TMemory : struct, IReadOnlyMemory<T, TMemory>
{
    int Length { get; }
    bool IsEmpty { get; }
    ReadOnlySpan<T> Span { get; }
    MemoryHandle Pin();
    TMemory Slice(int start, int length);
    TMemory Slice(int start);
}

public interface IMemory<T, TMemory>: IReadOnlyMemory<T, TMemory>
    where TMemory : struct, IMemory<T, TMemory>
{
    new Span<T> Span { get; }
}

The second generic parameter, TMemory, is needed for the Slice methods and is assumed to be the same type as the memory type implementing the interface. These interfaces are meant to be implemented only by structs (not classes) and passed to generic methods constrained on the memory type; in that case the JIT compiles such a method separately for each memory struct type, which avoids boxing and also lets it inline the memory type's methods:

public class Stream
{
    public virtual ValueTask WriteAsync<TReadOnlyBuffer>(TReadOnlyBuffer buffer, CancellationToken cancellationToken = default)
        where TReadOnlyBuffer : struct, IReadOnlyMemory<byte, TReadOnlyBuffer>
    {
        return DoWriteAsync(buffer, cancellationToken);
    }

    private ValueTask DoWriteAsync<TReadOnlyBuffer>(TReadOnlyBuffer buffer, CancellationToken cancellationToken)
        where TReadOnlyBuffer : struct, IReadOnlyMemory<byte, TReadOnlyBuffer>
    {
        // ...
    }
}

public readonly struct CustomMemory<T> : IMemory<T, CustomMemory<T>>
{
    //... 
}

Stream stream = ...;
CustomMemory<byte> customMemory = ...;
stream.WriteAsync(customMemory);

Here customMemory is passed to WriteAsync as-is (without being converted to some read-only type), while the WriteAsync method can only access the Span property returning a ReadOnlySpan<byte>. For unification, the existing System.Memory<T> could implement IMemory<T, Memory<T>> and System.ReadOnlyMemory<T> could implement IReadOnlyMemory<T, ReadOnlyMemory<T>>, so you would also be able to pass them to methods like WriteAsync.
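For illustration, that unification might look roughly like this (hypothetical declarations only; the actual BCL sources are not reproduced here):

```cs
// Hypothetical sketch: the existing BCL types opting into the proposed interfaces.
// Their current members (Length, IsEmpty, Span, Pin, Slice) already match the required shape.
public readonly struct Memory<T> : IMemory<T, Memory<T>>
{
    // existing implementation unchanged
}

public readonly struct ReadOnlyMemory<T> : IReadOnlyMemory<T, ReadOnlyMemory<T>>
{
    // existing implementation unchanged
}
```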

A separate ReadOnlyMemory struct could also be useful, e.g. as a return type for some methods. It can be a wrapper over the RW memory:

public readonly struct ReadOnlyMemory<T, TMemory> : IReadOnlyMemory<T, ReadOnlyMemory<T, TMemory>>
    where TMemory : struct, IReadOnlyMemory<T, TMemory>
{
    private readonly TMemory _memory;

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public ReadOnlyMemory(TMemory memory) => _memory = memory;
    // ...

In this case the layout of ReadOnlyMemory<T, SomeMemory> would be identical to the layout of SomeMemory. Of course, you can also implement IReadOnlyMemory<T, TMemory> directly (e.g. if you don't plan to have a RW memory type for your memory provider).

There is a problem with implementing extension methods for such memory types. Example:

public static class MemoryExtensions
{
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static ReadOnlyMemory<T, TMemory> AsReadOnly<T, TMemory>(this TMemory memory)
        where TMemory : struct, IMemory<T, TMemory>
    {
        return memory;
    }
}

var m = customMemory.AsReadOnly();

Unfortunately the compiler won't be able to infer the T parameter from the TMemory parameter and will ask you to specify both explicitly. A possible way to help the compiler is to introduce a helper struct with two generic parameters:

public readonly struct Wrapper<TValue, T>
{
    public readonly TValue Value;

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public Wrapper(TValue value) => Value = value;

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static implicit operator Wrapper<TValue, T>(TValue value) => new Wrapper<TValue, T>(value);
}

And then add a property to the IReadOnlyMemory returning that:

public interface IReadOnlyMemory<T, TMemory>
    where TMemory : struct, IReadOnlyMemory<T, TMemory>
{
    // ...
    Wrapper<TMemory, T> It { get; }
}

public readonly struct CustomMemory<T> : IMemory<T, CustomMemory<T>>
{
    //... 
    public Wrapper<CustomMemory<T>, T> It
    {
        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        get => this;
    }
}

So the extension methods could be implemented on the Wrapper type level:

public static class MemoryExtensions
{
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static ReadOnlyMemory<T, TMemory> AsReadOnly<T, TMemory>(this Wrapper<TMemory, T> memory)
        where TMemory : struct, IMemory<T, TMemory>
    {
        return memory.Value;
    }

    // Trim, TrimStart, TrimEnd...
}

var m = customMemory.It.AsReadOnly();

For existing memory types, shortcuts could also be implemented so callers don't have to access the It property directly. They should reuse the implementations of the methods taking the Wrapper:

[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static ReadOnlyMemory<T, CustomMemory<T>> AsReadOnly<T>(this CustomMemory<T> memory)
{
    return memory.It.AsReadOnly();
}

// Trim, TrimStart, TrimEnd...

[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static ReadOnlyMemory<T, ArrayMemory<T>> AsReadOnly<T>(this ArrayMemory<T> memory)
{
    return memory.It.AsReadOnly();
}

// Trim, TrimStart, TrimEnd...

// now you can write like this:
var m = customMemory.AsReadOnly();

The source code and the benchmark results

IReadOnlyMemory and IMemory interfaces:

```cs
public interface IReadOnlyMemory<T, TMemory>
    where TMemory : struct, IReadOnlyMemory<T, TMemory>
{
    int Length { get; }
    bool IsEmpty { get; }
    ReadOnlySpan<T> Span { get; }
    MemoryHandle Pin();
    TMemory Slice(int start, int length);
    TMemory Slice(int start);
    Wrapper<TMemory, T> It { get; }
}

public interface IMemory<T, TMemory> : IReadOnlyMemory<T, TMemory>
    where TMemory : struct, IMemory<T, TMemory>
{
    new Span<T> Span { get; }
}
```
ReadOnlyMemory struct:

```cs
public readonly struct ReadOnlyMemory<T, TMemory> : IReadOnlyMemory<T, ReadOnlyMemory<T, TMemory>>
    where TMemory : struct, IReadOnlyMemory<T, TMemory>
{
    private readonly TMemory _memory;

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public ReadOnlyMemory(TMemory memory) => _memory = memory;

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static implicit operator ReadOnlyMemory<T, TMemory>(TMemory memory) => new ReadOnlyMemory<T, TMemory>(memory);

    public int Length
    {
        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        get => _memory.Length;
    }

    public bool IsEmpty
    {
        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        get => _memory.IsEmpty;
    }

    public ReadOnlySpan<T> Span
    {
        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        get => _memory.Span;
    }

    public Wrapper<ReadOnlyMemory<T, TMemory>, T> It
    {
        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        get => this;
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public MemoryHandle Pin() => _memory.Pin();

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public ReadOnlyMemory<T, TMemory> Slice(int start, int length) => _memory.Slice(start, length);

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public ReadOnlyMemory<T, TMemory> Slice(int start) => _memory.Slice(start);
}
```
ArrayMemory that is supposed to be part of the .NET API:

```cs
public readonly struct ArrayMemory<T> : IMemory<T, ArrayMemory<T>>
{
    private readonly T[] _array;
    private readonly int _start;

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public ArrayMemory(T[] array)
    {
        if (array == null) ThrowArgumentNullException();
        _array = array;
        _start = default;
        Length = _array.Length;
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public ArrayMemory(T[] array, int start)
    {
        if (array == null) ThrowArgumentNullException();
        if (start < 0 || start >= array.Length) ThrowArgumentOutOfRangeException();
        _array = array;
        _start = start;
        Length = _array.Length - start;
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public ArrayMemory(T[] array, int start, int length)
    {
        if (array == null) ThrowArgumentNullException();
        if (start < 0 || length < 0 || length > array.Length - start) ThrowArgumentOutOfRangeException();
        _array = array;
        _start = start;
        Length = length;
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static implicit operator ArrayMemory<T>(T[] array) => new ArrayMemory<T>(array);

    public int Length { get; }

    public bool IsEmpty
    {
        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        get => Length == 0;
    }

    public Span<T> Span
    {
        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        get => _array.AsSpan(_start, Length); // I assume there should be a separate memory type for the pre-pinned arrays so the _start isn't negative
    }

    ReadOnlySpan<T> IReadOnlyMemory<T, ArrayMemory<T>>.Span
    {
        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        get => Span;
    }

    public Wrapper<ArrayMemory<T>, T> It
    {
        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        get => this;
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public ArrayMemory<T> Slice(int start, int length) => new ArrayMemory<T>(_array, _start + start, length);

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public ArrayMemory<T> Slice(int start) => new ArrayMemory<T>(_array, _start + start);

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public unsafe MemoryHandle Pin()
    {
        GCHandle handle = GCHandle.Alloc(_array, GCHandleType.Pinned);
        void* pointer = Unsafe.Add<T>(Unsafe.AsPointer(ref _array[0]), _start);
        return new MemoryHandle(pointer, handle);
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    private static void ThrowArgumentNullException() => throw new ArgumentNullException();

    [MethodImpl(MethodImplOptions.NoInlining)]
    private static void ThrowArgumentOutOfRangeException() => throw new ArgumentOutOfRangeException();
}
```
CustomMemoryProvider class that can expose some unmanaged memory as System.Memory<T> via MemoryManager<T> as well as its own CustomMemory:

```cs
public sealed partial class CustomMemoryProvider<T> : MemoryManager<T>
{
    private IntPtr memory;

    public CustomMemoryProvider(int length)
    {
        this.memory = Marshal.AllocHGlobal(Marshal.SizeOf<T>() * length);
        this.Length = length;
    }

    public int Length { get; }

    public bool IsDisposed { get; private set; }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public override unsafe Span<T> GetSpan() => new Span<T>((void*)memory, Length);

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public unsafe CustomMemory GetCustomMemory() => new CustomMemory((void*)memory, Length);

    ~CustomMemoryProvider()
    {
        Dispose(false);
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public override unsafe MemoryHandle Pin(int elementIndex = 0)
    {
        if ((uint)elementIndex > Length)
            throw new ArgumentOutOfRangeException(nameof(elementIndex));
        return new MemoryHandle(Unsafe.Add<T>((void*)memory, elementIndex));
    }

    public override void Unpin() { }

    protected override void Dispose(bool disposing)
    {
        if (IsDisposed) return;
        Marshal.FreeHGlobal(memory);
        memory = IntPtr.Zero;
        IsDisposed = true;
    }

    protected override bool TryGetArray(out ArraySegment<T> arraySegment)
    {
        arraySegment = default;
        return false;
    }
}
```
The CustomMemory struct:

```cs
public sealed partial class CustomMemoryProvider<T>
{
    public readonly unsafe struct CustomMemory : IMemory<T, CustomMemory>
    {
        private readonly void* _memory;

        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        internal CustomMemory(void* memory, int length)
        {
            _memory = memory;
            Length = length;
        }

        public int Length { get; }

        public bool IsEmpty
        {
            [MethodImpl(MethodImplOptions.AggressiveInlining)]
            get => Length == 0;
        }

        public Span<T> Span
        {
            [MethodImpl(MethodImplOptions.AggressiveInlining)]
            get => new Span<T>(_memory, Length);
        }

        ReadOnlySpan<T> IReadOnlyMemory<T, CustomMemory>.Span
        {
            [MethodImpl(MethodImplOptions.AggressiveInlining)]
            get => new ReadOnlySpan<T>(_memory, Length);
        }

        public Wrapper<CustomMemory, T> It
        {
            [MethodImpl(MethodImplOptions.AggressiveInlining)]
            get => this;
        }

        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        public MemoryHandle Pin() => new MemoryHandle(_memory);

        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        public CustomMemory Slice(int start)
        {
            int length = Length - start;
            if (start < 0 || length < 0) ThrowArgumentOutOfRangeException();
            return new CustomMemory(Unsafe.Add<T>(_memory, start), length);
        }

        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        public CustomMemory Slice(int start, int length)
        {
            if (start < 0 || length < 0 || length > Length - start) ThrowArgumentOutOfRangeException();
            return new CustomMemory(Unsafe.Add<T>(_memory, start), length);
        }

        [MethodImpl(MethodImplOptions.NoInlining)]
        private static void ThrowArgumentOutOfRangeException() => throw new ArgumentOutOfRangeException();
    }
}
```
Benchmark tests code:

```cs
class Program
{
    static void Main(string[] args)
    {
        BenchmarkRunner.Run(typeof(Program).Assembly);
    }
}

public class Tests
{
    private CustomMemoryProvider<byte> _memoryProvider;
    private Memory<byte> _memoryFromArray;
    private Memory<byte> _memoryFromManagerMemory;
    private ArrayMemory<byte> _customArrayMemory;
    private CustomMemoryProvider<byte>.CustomMemory _customMemory;

    [GlobalSetup]
    public void Setup()
    {
        var array = new byte[1024];
        _memoryFromArray = array;
        _customArrayMemory = array;
        _memoryProvider = new CustomMemoryProvider<byte>(1024);
        _memoryFromManagerMemory = _memoryProvider.Memory;
        _customMemory = _memoryProvider.GetCustomMemory();
    }

    [Benchmark]
    public void MemoryFromArray() => Consume(_memoryFromArray);

    [Benchmark]
    public void MemoryFromManagerMemory() => Consume(_memoryFromManagerMemory);

    [Benchmark]
    public void CustomArrayMemory() => Consume(_customArrayMemory);

    [Benchmark]
    public void CustomMemory() => Consume(_customMemory);

    private static void Consume<TByteMemory>(TByteMemory memory)
        where TByteMemory : struct, IMemory<byte, TByteMemory>
    {
        memory.Slice(42).Span[42] = 42;
    }

    private static void Consume(Memory<byte> memory)
    {
        memory.Slice(42).Span[42] = 42;
    }
}
```

The benchmark results

```
BenchmarkDotNet=v0.12.0, OS=Windows 10.0.18362
Intel Core i7-6700HQ CPU 2.60GHz (Skylake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=3.1.101
  [Host]     : .NET Core 3.1.1 (CoreCLR 4.700.19.60701, CoreFX 4.700.19.60801), X64 RyuJIT
  DefaultJob : .NET Core 3.1.1 (CoreCLR 4.700.19.60701, CoreFX 4.700.19.60801), X64 RyuJIT
```

| Method                  | Mean     | Error     | StdDev    |
|-------------------------|---------:|----------:|----------:|
| MemoryFromArray         | 3.770 ns | 0.0769 ns | 0.0720 ns |
| MemoryFromManagerMemory | 4.987 ns | 0.0575 ns | 0.0510 ns |
| CustomArrayMemory       | 3.073 ns | 0.0885 ns | 0.0784 ns |
| CustomMemory            | 2.210 ns | 0.0522 ns | 0.0462 ns |

Open questions

- There are multiple Marshal methods that require Memory<T> and ReadOnlyMemory<T> to have the same fields and the same layout. It probably won't be possible to reproduce that for all types implementing IMemory and IReadOnlyMemory; only a ReadOnlyMemory<T, SomeMemory> reference can be reinterpreted as SomeMemory and vice versa. We could create a special wrapper for an IReadOnlyMemory that implements IMemory and whose Span getter just takes the ReadOnlySpan from the inner memory and reinterprets it as a Span (a rough sketch is shown below). I'm not sure whether that is acceptable.
- It might be difficult to add support for the new memory types to existing classes implemented on top of the current Memory<T>, especially if the Memory<T> is somehow explicitly exposed to the client. I've checked the Socket implementation. The [SocketAsyncEventArgs](https://github.com/dotnet/runtime/blob/fff8f5ab763f3c1eea95ab201bfde9bd4b6e62fb/src/libraries/System.Net.Sockets/src/System/Net/Sockets/SocketAsyncEventArgs.cs#L19) class has a public property MemoryBuffer returning the _buffer field of type Memory<byte>. Fortunately, SocketAsyncEventArgs itself is not exposed by the SendAsync(ReadOnlyMemory<byte>) and ReceiveAsync(Memory<byte>) methods. We won't be able to store the custom memory in that field; instead, every method that directly or indirectly accesses the buffer would have to become generic and take the buffer explicitly as a parameter. Some of those methods are public, so we would need to copy them into private generic versions. The methods invoked by the IOCP callback also shouldn't assume _buffer is initialized; they would probably need to use _singleBufferHandle to get the pointer. Alternatively, we could create a special memory type that stores the pointer and the length and save an instance of it to a field right after the original memory is pinned, so that methods like [LogBuffer](https://github.com/dotnet/runtime/blob/fff8f5ab763f3c1eea95ab201bfde9bd4b6e62fb/src/libraries/System.Net.Sockets/src/System/Net/Sockets/SocketAsyncEventArgs.Windows.cs#L1078) use that new field when both _bufferList and _buffer are null.
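A rough sketch of the wrapper idea from the first question, assuming MemoryMarshal.CreateSpan is an acceptable way to do the reinterpretation (the type name is made up for illustration):

```cs
// Hypothetical adapter: exposes an IReadOnlyMemory as an IMemory by reinterpreting its
// ReadOnlySpan<T> as a writable Span<T>. Whether that is safe is exactly the open question above.
public readonly struct AsWritableMemory<T, TMemory> : IMemory<T, AsWritableMemory<T, TMemory>>
    where TMemory : struct, IReadOnlyMemory<T, TMemory>
{
    private readonly TMemory _memory;

    public AsWritableMemory(TMemory memory) => _memory = memory;

    public int Length => _memory.Length;
    public bool IsEmpty => _memory.IsEmpty;
    public MemoryHandle Pin() => _memory.Pin();

    public Span<T> Span
    {
        get
        {
            // Reinterpret the read-only span as writable (System.Runtime.InteropServices.MemoryMarshal).
            ReadOnlySpan<T> span = _memory.Span;
            return MemoryMarshal.CreateSpan(ref MemoryMarshal.GetReference(span), span.Length);
        }
    }

    ReadOnlySpan<T> IReadOnlyMemory<T, AsWritableMemory<T, TMemory>>.Span => _memory.Span;

    public AsWritableMemory<T, TMemory> Slice(int start) => new AsWritableMemory<T, TMemory>(_memory.Slice(start));
    public AsWritableMemory<T, TMemory> Slice(int start, int length) => new AsWritableMemory<T, TMemory>(_memory.Slice(start, length));

    public Wrapper<AsWritableMemory<T, TMemory>, T> It => this;
}
```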
GrabYourPitchforks commented 4 years ago

Thanks for the proposal! Are there other benefits to this beyond improving the performance of the Memory<T>.Span property getter? It's likely that this proposal will significantly increase the number of receivers which require generic specialization. This would negatively impact size-on-disk and would increase the JIT's workload.

We do care about the performance of the Memory<T>.Span property getter. We continue to make perf improvements to this method. But the recommended way of using Memory<T> is to call the Span property getter as infrequently as possible. Generally, call it before entering a loop or other hot path, then within that hot code operate only on the Span<T> that was just pulled out. This means that it's unlikely that we would complicate the API shape of Memory<T> just to shave one or two nanoseconds off the Span property getter.
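As a minimal sketch of that recommended pattern (illustrative code, not taken from the docs):

```cs
// Call the Span getter once, outside the hot loop, then work on the Span<T> only.
static void FillBuffer(Memory<byte> memory)
{
    Span<byte> span = memory.Span;        // single (relatively expensive) getter call
    for (int i = 0; i < span.Length; i++)
        span[i] = (byte)i;                // hot path touches only the span
}
```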

Your concerns about Utf8String complicating the property getter are understandable. But our benchmarks show that there's not a noticeable perf impact from enabling this support. (Reminder: our benchmarks largely come from the master branch, where this ifdef has been enabled for ages.) The optimizations we've been able to perform since the initial 2.1 release of Memory<T> dominate any overhead introduced by adding support for that type.

artelk commented 4 years ago

Thank you for your reply!

> Thanks for the proposal! Are there other benefits to this beyond improving the performance of the Memory<T>.Span property getter? It's likely that this proposal will significantly increase the number of receivers which require generic specialization. This would negatively impact size-on-disk and would increase the JIT's workload.

Yes, it is mainly about improving the performance. I believe there would be only two RW memory types provided by the BCL itself: the existing Memory<T> and the new ArrayMemory<T>, and only two RO memory types: the existing ReadOnlyMemory<T> and the new ReadOnlyMemory<T, TMemory>. Would the impact be too big?

Developers who need to work with unmanaged memory will have two options: either implement a MemoryManager<T> and expose the unmanaged memory chunks as Memory<T>, or create a new memory type. I believe the latter option is simpler to implement (this is the second benefit) and would be faster at the same time.

The existing Memory<T> could easily be made compatible with methods taking a TMemory where TMemory : struct, IMemory<T, TMemory>, simply by adding : IMemory<T, Memory<T>> to the Memory<T> struct declaration.

I would also add operators for the implicit casts ArrayMemory<T> --> Memory<T>, ArrayMemory<T> --> ReadOnlyMemory<T> and ReadOnlyMemory<T, ArrayMemory<T>> --> ReadOnlyMemory<T>, so they could be passed to (not yet generalized) methods taking Memory<T> and ReadOnlyMemory<T> without any additional coding.
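A minimal sketch of one of those conversions, assuming the operator is declared inside ArrayMemory<T> so it can reach the private fields (handling of the default instance is omitted):

```cs
// Hypothetical implicit conversion so ArrayMemory<T> can flow into existing Memory<T>-based APIs.
public static implicit operator Memory<T>(ArrayMemory<T> memory)
    => new Memory<T>(memory._array, memory._start, memory.Length);
```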

> We do care about the performance of the Memory<T>.Span property getter. We continue to make perf improvements to this method. But the recommended way of using Memory<T> is to call the Span property getter as infrequently as possible. Generally, call it before entering a loop or other hot path, then within that hot code operate only on the Span<T> that was just pulled out. This means that it's unlikely that we would complicate the API shape of Memory<T> just to shave one or two nanoseconds off the Span property getter.

I believe that with the proposed approach this recommendation wouldn't be needed anymore. Sometimes it is hard to avoid calling the Span property often, and if the operations between subsequent calls also take only a few nanoseconds, the negative performance impact can be significant. Example: imagine a method that takes a Stream or Socket and produces an IAsyncEnumerator<SomeSmallStruct>. You use the new System.IO.Pipelines for performance reasons. SomeSmallStruct is quite small, so in most cases the First segment of the ReadOnlySequence<T> returns a ReadOnlyMemory<T> that is bigger than the struct, and the overhead of the ReadOnlySequence<T> itself is minimal. But you yield return every SomeSmallStruct produced, so you call the Span property very often, while the operations between subsequent calls also take just a few nanoseconds. (Off topic: by the way, it would be really cool to have such a method in the BCL, i.e. a generic method producing an IAsyncEnumerator<T> from a Stream.)
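A simplified, synchronous sketch of that per-item pattern (names are illustrative, and items straddling segment boundaries are deliberately not handled):

```cs
static int SumInt32s(ReadOnlySequence<byte> sequence)
{
    int sum = 0;
    while (sequence.Length >= sizeof(int))
    {
        // First is a ReadOnlyMemory<byte>, so its Span getter runs once per 4-byte item.
        ReadOnlySpan<byte> first = sequence.First.Span;
        if (first.Length < sizeof(int))
            throw new NotSupportedException("items straddling segments are omitted for brevity");
        sum += BinaryPrimitives.ReadInt32LittleEndian(first);
        sequence = sequence.Slice(sizeof(int));
    }
    return sum;
}
```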

Another example: in the current Utf8JsonWriter implementation the Span property is called on every WriteStartObject/WriteStartArray/WriteEndObject/WriteEndArray/WriteNullValue/WriteBooleanValue/... call.