dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
14.91k stars 4.63k forks source link

API Proposal: Add Span accessor for MemoryMapped files #37227

Open Suchiman opened 4 years ago

Suchiman commented 4 years ago

Background and Motivation

.NET had acccess to memory mapped files for a long time but using them either requires unsafe pointers, a BinaryReader|Writer like API or Stream. With the advent of Span one could access them more easily and pass them directly without intermediate copies / buffers to a larger set of APIs.

Proposed API

We would add

namespace System.IO
{
    public class MemoryMappedFile : IDisposable
    {
+        public MemoryMappedMemoryManager CreateMemoryManager();
+        public MemoryMappedMemoryManager CreateMemoryManager(long offset, int size);
+        public MemoryMappedMemoryManager CreateMemoryManager(long offset, int size, MemoryMappedFileAccess access);
    }

+    public class MemoryMappedMemoryManager : MemoryManager<byte>
+    {
+    }
}

Unlike most other Span APIs we allow passing long offset and int size in order to work with files larger than 2GB which we cannot directly slice into due to all Span related classes being int length based. If you need to work with files larger than 2GB, you need to call CreateMemoryManager with increasing offsets.

Usage Examples

using MemoryMappedFile file = MemoryMappedFile.CreateFromFile("test.txt");
using MemoryMappedMemoryManager manager = file.CreateMemoryManager();
Memory<byte> memory = manager.Memory;
Console.WriteLine(Encoding.UTF8.GetString(memory.Span));

Alternative Designs

We could also add a string.Create like API to the MemoryMappedViewAccessor where MemoryMappedViewAccessor manages the lifecycle and the design ensures that the Span does not outlive the MemoryMappedViewAccessor.

namespace System.IO
{
    public class MemoryMappedViewAccessor : IDisposable
    {
+        public void UseSpan<TState>(TState state, SpanAction<byte, TState> action);
+        public TResult UseSpan<TState, TResult>(TState state, SpanAction<byte, TState> action);
    }

which would be used like

using MemoryMappedFile file = MemoryMappedFile.CreateFromFile("test.txt");
using MemoryMappedViewAccessor accessor = file.CreateViewAccessor();
accessor.UseSpan((object)null, (span, state) =>
{
    Console.WriteLine(Encoding.UTF8.GetString(span));
});

Risks

Low risk as far as only adding APIs is concerned. Designs that allow the Span to outlive the memory mapped file could encounter an access violation if trying to use the span past that point.

GrabYourPitchforks commented 4 years ago

The Span could be used after disposal of the MemoryMappedFile(ViewAccessor) though, leading to an access violation when attempting to use the Span.

That's a substantial risk.

Edit: Consider your earlier example, which I've copied again below.

using MemoryMappedFile file = MemoryMappedFile.CreateFromFile("test.txt");
ReadOnlySpan<byte> span = file.CreateReadOnlySpan();
Console.WriteLine(Encoding.UTF8.GetString(span));

If you forget to put the using keyword in front of that, your application could AV. Just due to a single missing keyword. Nobody reviewing this code would have any indication that they're entering a world of manually managed memory. See also https://github.com/dotnet/runtime/issues/33768, which any UnmanagedMemoryAccessor which returns a Memory<T> or Span<T> instance would immediately run afoul of.

Suchiman commented 4 years ago

@GrabYourPitchforks An alternative design that ensures that the Span does not outlive the MemoryMappedFile would be similiar to string.Create, i've added that to Alternative Designs.

Now that you mention it though, MemoryManager<T> is the type i've had in mind for this design but couldn't remember.

Suchiman commented 4 years ago

@GrabYourPitchforks i've revised the proposal with MemoryManager and removed the ones that seem too dangerous. MemoryManager however does not alleviate the danger of an access violation i presume? I've only read about it in passing.

GrabYourPitchforks commented 4 years ago

Your "alternative design" callback-based proposal doesn't run the same risk of AVs as returning the span directly. The implementation of MemoryMappedFile can ensure that the span remains alive during the duration of the callback.

The MemoryManager part would still remain dangerous since it's a wrapper around unmanaged memory. At that point, I'm not opposed to forcing the caller to drop down to unsafe C# code, obtain the raw void*, and create the Memory<T> or Span<T> themselves. Once the caller writes the word "unsafe", it's on them to ensure proper memory management.

davidfowl commented 4 years ago

I really hope we don’t end up with this callback design everywhere. I understand we don’t have the means to track lifetime like this outside of the stack but I fear we’ll end up with a set of overloads like this.

mjsabby commented 4 years ago

@GrabYourPitchforks @VSadov @Maoni0 what if we had a Pinned Byte Array Heap backed by memory mapped files ...

sakno commented 4 years ago

@Suchiman , you can third-party implementation like this. The library is on NuGet.

My concern with memory-mapped file is different. I would like to iterate through mapped segments using ReadOnlySequence<byte> value type. To do that, ReadOnlySequenceSegment<T> should provide "activation" and "deactivation" methods:

public abstract class ReadOnlySequenceSegment<T>
{
  internal protected virtual void Activate();
  internal protected virtual void Deactivate();
}

Activate method can be called by ReadOnlySequence when the segment becomes active, i.e. selected as the current segment. With this two methods, it's possible to write lazy loading and unloading segments of memory-mapped file.

GrabYourPitchforks commented 4 years ago

@davidfowl I agree. One thing I'd like to see the runtime maintain is an arbitrary memory range to object mapping. For example:

0x00007000_10000000 .. 0x00007000_1FFFFFFF -> [object A]
0x00007000_20000000 .. 0x00007000_200007FF -> [object B]

When the GC is running, for any T& seen (not T* since they're not GC-tracked), any object associated with the range will be seen as "live" and won't be collected. This means that you can associate arbitrary memory ranges with explicit SafeHandles.

It doesn't solve all problems. For example, it could make GC more expensive because now it's yet another location that needs to be queried during GC. Additionally, it won't solve AVs resulting from use-after-free. We still don't want to end up with an API surface where this is possible:

var obj = GetSomeObject();
var span = obj.GetSpan();
obj.Dispose();
var value = span[0]; // AV?

In general, AVs should only be possible within an unsafe block or when using unsafe-equivalent code. For the above pattern, it might be fixed by deferring the dispose as shown below. This pattern would still rely on the memory range sidecar table mentioned earlier.

var obj = GetSomeObject();
var span = obj.GetSpan();
obj.Dispose(); // no-ops since GetSpan() was called and the Span was returned
var value = span[0]; // no AV
// we now just have to wait for obj's finalizer to kick in
mjsabby commented 4 years ago

@GrabYourPitchforks Frozen Objects do that by setting up GC segments with arbitrary ranges and set a flag that says the heap is read only.

I don't know if much of that could be repurposed. Also I don't know how these segment additions scale. What happens if you have 1000 segments?

GrabYourPitchforks commented 4 years ago

What happens if you have 1000 segments?

Definitely a good question for GC folks. @Maoni0, thoughts? I'm just pulling ideas out of thin air and haven't thought any of this through. :)

mjsabby commented 4 years ago

@GrabYourPitchforks assuming checking the X (1000 in the above comment) ranges in addition to stack scan that already occurs for by-ref types is prohibitive, one mitigation could be that this is only done for Gen2 GCs or something. This would of course be in addition to the basic pay for play check where we wouldn't even enter this code path if there are no active memory mapped files that were mapped using this API.

mjsabby commented 3 years ago

@richlander @jkotas I said I would tag you on the issue

This is a sorely missing capability only the runtime can solve. Provable safe memory mapped files very much like by-ref types or a similar mechanism to Frozen Objects of setting up new segments. Need assessment on how bad the perf impact if there are 1000s of such things.

jkotas commented 3 years ago

obj.Dispose(); // no-ops since GetSpan() was called and the Span was returned

The check to turn Dispose into no-op here would require suspending the runtime and doing stack walk on all threads. It would make the Dispose very expensive.

I am not convinced that this design works.

IS4Code commented 3 years ago

I think this overcrowding of callback APIs can be mitigated with some (arguably) clever use of await:

public class Spanwaitable<T> : INotifyCompletion, IDisposable
{
    readonly ThreadLocal<bool> inside = new ThreadLocal<bool>();

    public Spanwaitable<T> GetAwaiter()
    {
        return this;
    }

    public void OnCompleted(Action continuation)
    {
        // check not disposed

        inside.Value = true;
        try{
            continuation();
        }finally{
            inside.Value = false;
        }
    }

    public void Dispose()
    {
        if(!inside.Value) throw new InvalidOperationException();
        // also check if not inside on another thread
    }

    public bool IsCompleted => inside.Value;

    public Span<T> GetResult()
    {
        if(!inside.Value) throw new InvalidOperationException();
        return new T[10].AsSpan(); // produce the span here
    }
}

After all, C# already knows how to turn the code into a state machine to use for the continuation; the only downside is that you cannot create ref struct locals in async methods at the moment. I don't see a reason for that not to be possible forever however.

msedi commented 2 years ago

Just some thoughts on my current workings with memory mapped files, here's also some reference to some proposals on MMFs #59776. I'm here to find out better options to work with MMFs and just asking some critical questions and are of interest to me in general.

I'm not exactly sure what the problem in general is when the MMF got disposed. Isn't that the case with many C# "structures" that you might have a reference to an item that lives inside a PARENT structure and if the PARENT structure gets disposed you ultimately get a ObjectDisposedException when accessing the item?

It seems to me that this is a well accepted behavior, and MemoryMappedFiles are internally unmanaged objects that are mostly created once, you get a ViewAccessor/Stream (depending on if you map the full view or only parts) and if you close/dispose the accessor it is already an existing behavior that the SafeHandle is also disposed, right? So you already have this problem with the ViewAccessors and ViewStreams, if the MMF gets disposed the ViewAccessor/ViewStreams are also invalid.

In the end if the MemoryMappedFile may only be closed if there are no references anymore, the MMF needs to keep track of its created references (and maybe also needs a reference counting)?

I'm mosty using the CreateViewAccessor and write my own wrappers around the MemoryMappedViewAccessor and the PointerOffset and this works as expectected. Extension from the library would be neat, though.

Another thing I don't really understand why it needs GC management. The MMF is already managed by the (windows) memory manager, why should the GC keep track of objects he doesn't even know? I admit that I also already had some OOM problems because the GC and the memory manager do not "communicate" if it comes to MMF

I agree with @GrabYourPitchforks

If you forget to put the using keyword in front of that, your application could AV

that missing a using could cause problems, but I have brought this argument many times in some other discussions and I was always directed to use an analyzer (which I still don't agree) but this seems to a common standpoint from the C# language team.

Another problem with the Span is that it's 32bit in size only. So this only works for sections that fit into this size. If you are working with larger files this is going to get a problem.