dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

Custom allocators - (size, disposal, pools, etc). #4368

Closed ayende closed 3 years ago

ayende commented 9 years ago

One of the hardest things we have to handle when writing server applications or system software in .NET is that we don't have good control over memory. This ranges from the simple inability to ask "how big is this thing?" to controlling how much memory we'll use for certain operations.

In my case, working on RavenDB, there are a lot of operations that require extensive memory usage, over which we have little control. The user can specify any indexing function they want, and we'll have to respect that. When doing heavy indexing, that can cause several issues. In particular, it means that we generate a LOT of relatively short-term data, during which other operations also run. Because we are system software, we are doing a lot of I/O, which requires pinning memory.

The end result is that we may have memory with the following layout.

[useful objects] [ indexing garbage ] [pinned buffers doing i/o] [ indexing garbage] [ pinned buffers ]

That results in high memory fragmentation (in Gen0, mostly), which is hard to deal with.

It also means that when indexing is done, we have to clean up quite a lot of garbage, and because the indexing garbage is mixed with stuff that is in use right now, the memory either cannot be readily reclaimed or requires a big compaction.

It would be great if we had greater control over memory usage. Being able to define a heap and instruct the CLR to allocate objects from it would be wonderful. We wouldn't have request-processing memory intermixed with background-ops memory, and we would have a good way to free a lot of memory all at once.

One option would be to do something like this:

using (var heap = Heap.Create(HeapOptions.None,
    1024 * 1024,        // initial size
    512 * 1024 * 1024)) // max size
{
    using (heap.AllocateFromMe())
    {
        var sb = new StringBuilder();
        for (var i = 0; i < 100; i++)
            sb.AppendLine(i.ToString()); // AppendLine has no int overload
        Console.WriteLine(sb.ToString());
    }
}

This will ensure that all allocations inside the scope are allocated on the new heap. The GC isn't involved in collecting items from this heap at all, it is the responsibility of the user to take care of that, either by explicitly freeing objects (rare) or by disposing the heap.

Usage of references to the heap after it is destroyed won't be allowed.

Alternatively, because that might be too hard or complex, just having a way to do something like:

 heap.Allocate<MyItem>();

Would be great. Note that we can do the same right now by allocating native memory and using unsafe code to get a struct pointer back. This works, but very common types like arrays or strings cannot be allocated in this manner.
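
For illustration, a minimal sketch of that workaround, assuming a blittable struct (MyItem and the helper class here are hypothetical):

using System;
using System.Runtime.InteropServices;

struct MyItem { public int Id; public long Value; }

static unsafe class NativeAlloc
{
    public static MyItem* AllocateMyItem()
    {
        // Native memory is invisible to the GC and is not zeroed for us.
        MyItem* item = (MyItem*)Marshal.AllocHGlobal(sizeof(MyItem));
        *item = default(MyItem);
        return item;
    }

    public static void Free(MyItem* item)
    {
        Marshal.FreeHGlobal((IntPtr)item);
    }
}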

Having support for explicit usage like this would greatly alleviate the kind of gymnastics that we have to go through to manage memory.

whoisj commented 9 years ago

:+1: to more memory control for those who need it. I'd be happy with a way to manage memory very coarsely at the heap level. Not as useful as destructible or owned values, but it is something.

omariom commented 9 years ago

What if objects of the normal heap have references to objects in a user controlled heap? What will happen after the user heap is freed?

ayende commented 9 years ago

Crash? Null reference?

This isn't meant to be something you would do lightly, and you can take responsibility for that.

redknightlois commented 9 years ago

This was mentioned also by Miguel de Icaza a couple of years ago. http://tirania.org/blog/archive/2012/Apr-04.html

@omariom We already have a somewhat similar mechanism with weak references: they are nullified. This is certainly not for everyone, but it does have its uses in certain niches (even at the peril of introducing subtle bugs).

OtherCrashOverride commented 9 years ago

In C++, this is called "placement new".

In C#, the main solutions to this are object pools, IDisposable, and IEnumerable. Object pools prevent the allocation of new memory and so also avoid the fragmentation. IDisposable allows explicit control over the release of non-managed resources or pinned memory. Finally, IEnumerable allows you to allocate/deallocate on demand instead of all at once.

As an alternative, you can create a native library to handle resources from a non-managed heap rather than pinning. The library would then have a C# facade over it that wraps P/Invoke calls.
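
A rough sketch of the object-pool approach, for reference (sizes and names are illustrative, not from any particular library):

using System.Collections.Concurrent;

class BufferPool
{
    private readonly ConcurrentBag<byte[]> _buffers = new ConcurrentBag<byte[]>();
    private readonly int _bufferSize;

    public BufferPool(int bufferSize)
    {
        _bufferSize = bufferSize;
    }

    public byte[] Rent()
    {
        // Reuse a previously returned buffer when one is available,
        // so steady-state processing allocates nothing new.
        byte[] buffer;
        return _buffers.TryTake(out buffer) ? buffer : new byte[_bufferSize];
    }

    public void Return(byte[] buffer)
    {
        _buffers.Add(buffer);
    }
}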

ayende commented 9 years ago

Yes, I'm aware of all of those. And none of those really work for those cases.

Consider a highly simplified case of needing to read records from a CSV file and do some work on them. Each line we read becomes garbage very quickly. Now, assume that the file is large, processing time is long, and while some data can be discarded immediately (the line we just read), some data is kept for the duration of the entire file run.

Now, assume that you have to process many such files concurrently.

There is no way for us to control the memory usage. I would like to say "this process cannot take more than 1GB", and I would like to actually get an error if we exceed that, because this gives me more predictable behavior.

ayende commented 9 years ago

And while I can do this in native code, the work I'm doing is primarily managed stuff. The only "resource" that I want to manage is memory itself.

davidfowl commented 9 years ago

Sounds effectively like java nio's ByteBuffer http://docs.oracle.com/javase/7/docs/api/java/nio/ByteBuffer.html. We've also been looking into something like this for ASP.NET 5. The current plan is to allocate native memory and memcpy user bytes into it.

ayende commented 9 years ago

How are you going to make that work with Stream? It accepts byte[], not byte*.

For specific things, we can use direct methods (ReadFile, WriteFile) that can accept it. But for many things, that isn't an option.

davidfowl commented 9 years ago

Stream.Write will be backed by native memory. So when the user passes the managed byte[], it'll be copied into the native buffer.
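
Presumably the copy step would look something like this sketch (the type and its fields are hypothetical, not the actual ASP.NET code):

using System;
using System.Runtime.InteropServices;

class NativeBackedStream // : Stream — other members omitted
{
    private readonly IntPtr _nativeBuffer;
    private readonly int _capacity;
    private int _position;

    public NativeBackedStream(int capacity)
    {
        _capacity = capacity;
        _nativeBuffer = Marshal.AllocHGlobal(capacity);
    }

    public void Write(byte[] buffer, int offset, int count)
    {
        if (count > _capacity - _position)
            throw new NotSupportedException("native buffer full (sketch only)");
        // Marshal.Copy pins the managed array only for the duration of the copy.
        Marshal.Copy(buffer, offset, IntPtr.Add(_nativeBuffer, _position), count);
        _position += count;
    }
}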

OtherCrashOverride commented 9 years ago

@davidfowl How is that different from MemoryStream? Should we just expose the IntPtr (byte*) of the storage the class uses? Can it all be done with Marshal.AllocHGlobal in managed code?

https://msdn.microsoft.com/en-us/library/system.io.memorystream%28v=vs.110%29.aspx

OtherCrashOverride commented 9 years ago

Now, assume that the file is large, processing time is long, and while some data can be discarded immediately (the line we just read), some data is kept for the duration of the entire file run.

You may also want to consider MemoryMappedFile: https://msdn.microsoft.com/en-us/library/system.io.memorymappedfiles.memorymappedfile_methods%28v=vs.110%29.aspx

ayende commented 9 years ago

Imagine that I'm doing:

new GzipStream(new SslStream(new NetworkStream(socket))).Read(...)

What happens now? Also note that if you have a NativeMemoryStream or something like that, it would still need to allocate large byte arrays for the rest of the system.

We run into this frequently when using web sockets and long running requests.

ayende commented 9 years ago

@OtherCrashOverride

There is already UnmanagedMemoryStream ( https://msdn.microsoft.com/en-us/library/system.io.unmanagedmemorystream(v=vs.110).aspx )

That doesn't help when your destination is an I/O stream.

ayende commented 9 years ago

I intentionally gave a simple CSV file example, because it is easy. The real scenario is getting data from users and indexing it.

But even with a simple CSV file, I cannot allocate a string in the memory-mapped file. So I need to read a bunch of bytes, then create a new string, and I have no control over its size, where it is located, etc.

redknightlois commented 9 years ago

@OtherCrashOverride Having been on both sides of the fence, doing GPU computing and working inside a database engine with @ayende, I can confirm first-hand that they are two very different beasts (what works in one won't work in the other).

In GPU land you can be very liberal about pinning memory. With a bus throughput in excess of 5Gb/s, your pinning time will be measured in microseconds. In database and/or web server land that is not true; we are speaking of at least two orders of magnitude difference when you go to the extreme case, as pointed out with the network transfer. Mind you, even going to disk will be measured in high-milliseconds land for big buffers (which I have seen in the wild ;) ).

This is an example of what Gen0 looks like when such things happen. Since then we have been able to confirm what was a hypothesis at the time of writing: http://ayende.com/blog/170243/long-running-async-and-memory-fragmentation

This will become even worse with HTTP/2, where connections will be reused and facilities for long-lived connections are going to be common (not a hack like now). I couldn't find it for illustration purposes, but I have seen code in Katana aimed at promoting buffers to Gen2 by repeatedly executing GC.Collect() at startup, to at least force that memory to go up in generations. The problem is that you then have a fixed supply of memory: if you consume it all, you are done. So you have to reserve memory in advance that you will either not use, or you won't have enough of if the work pattern changes.
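
For reference, the startup trick presumably looks something like this (counts and sizes invented for illustration):

using System;
using System.Collections.Generic;

static class BufferWarmup
{
    // Keep the buffers rooted so they survive the collections.
    static readonly List<byte[]> Buffers = new List<byte[]>();

    public static void PromoteAtStartup()
    {
        for (int i = 0; i < 256; i++)
            Buffers.Add(new byte[4096]);

        // Two full collections walk the survivors Gen0 -> Gen1 -> Gen2,
        // so later pinning of these buffers no longer blocks Gen0 compaction.
        GC.Collect();
        GC.Collect();
    }
}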

redknightlois commented 9 years ago

@davidfowl Constantly copying from the managed array to the native buffer before doing what the stream has to do will introduce extra memory bus pressure. Wouldn't that introduce a perf regression when you are dealing with big streams/source arrays?

OtherCrashOverride commented 9 years ago

http://ayende.com/blog/170243/long-running-async-and-memory-fragmentation

That link and the comments were very helpful. From what has been stated so far, the issue is that the framework is doing the pinning during the async socket operation, not the user. The solution to this may be to modify the socket API in some way, or tune its internal operation, to prevent it from keeping the memory pinned.

(Is the System.Net.Sockets API even available in CoreCLR at this point? https://github.com/dotnet/corefx-progress/blob/master/src-diff/README.md)

Wouldn't that introduce a perf regression when you are dealing with big streams/source arrays?

I was wondering the same thing; however, the basis for measurement is going to be whether doing multiple copies with short pin times is faster than doing no copies with long pin times overall.

ayende commented 9 years ago

@OtherCrashOverride

I don't see a way for the framework to avoid pinning. Ideally, you are doing DMA by letting the hardware do the operations on a specific memory location.

Even if it isn't actually DMA, it behaves in much the same way: eventually you are down to some native function that takes a pointer, and you need that memory to remain fixed while the operation is running.

OtherCrashOverride commented 9 years ago

I don't see a way for the framework to avoid pinning.

I have some theories that would require testing and benchmarking. Principally, introducing an intermediary native buffer owned by the socket and doing a memcpy to the destination buffer when it fills. No matter how much network I/O you have, there are natural breaks in the data (MTU), and memcpy will be faster than wire speed in many cases. So there are lots of areas to explore to see what turns out to be performant and what is not. Of course, there actually needs to be a System.Net.Sockets before any experiments can be done.

ayende commented 9 years ago

That would be extremely costly. And it would only be relevant for sockets; we have a lot of other I/O work that might have this issue.

OtherCrashOverride commented 9 years ago

Another possibility (from the GPU world) is to 'double/triple buffer'. In this scheme, the socket would be filling one array (pinned for the duration of the operation) while a previously filled array (not pinned) is used by the app. Alternating the pinned and non-pinned arrays gives the GC the opportunity to move them. Since an array is a reference type, you are simply changing the reference used, not copying or moving any data.
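
A sketch of that idea, assuming a NetworkStream and a hypothetical process callback; while one read is pending (its array pinned for the duration), the previously filled array is free to be moved:

using System;
using System.Net.Sockets;
using System.Threading.Tasks;

static class DoubleBufferedReader
{
    public static async Task PumpAsync(NetworkStream stream, Action<byte[], int> process)
    {
        byte[][] buffers = { new byte[8192], new byte[8192] };
        int active = 0;
        Task<int> pending = stream.ReadAsync(buffers[active], 0, buffers[active].Length);

        while (true)
        {
            int read = await pending;
            if (read == 0)
                break;

            int filled = active;
            active = 1 - active;
            // Start filling the other buffer before handing the filled one
            // to the app; only the in-flight buffer is pinned at any moment.
            pending = stream.ReadAsync(buffers[active], 0, buffers[active].Length);
            process(buffers[filled], read);
        }
    }
}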

whoisj commented 9 years ago

Another possibility (from the GPU world) is to 'double/triple buffer'.

Multi-buffering is done to avoid the use of multi-processor locks and to assist with latency hiding. The basics are: consume enormous amounts of resources, because the user only cares about speed and fidelity.

I'm not wholly sure how it applies here.

OtherCrashOverride commented 9 years ago

I'm not wholly sure how it applies here.

Because, as mentioned, it allows the GC to relocate a buffer while the other is pinned and waiting to be filled. This solves the long-running pinning issue and applies not only to network sockets but to any type of I/O. Alternatively, the OP can wait for Microsoft to approve and implement the change that allows CoreCLR to allocate managed objects from an arbitrary heap space.

consume enormous amounts of resource

Are people really allocating terabyte buffers for networking?

because the user only cares about speed and fidelity.

I believe that is the reason we are having this discussion to begin with.

redknightlois commented 9 years ago

@OtherCrashOverride Probably not terabytes, but small buffers add up pretty fast if the API is not implemented properly. Case in point: https://github.com/dotnet/corefx/issues/1991

ayende commented 9 years ago

@OtherCrashOverride That would only work if the I/O operations are short. In practice, it is perfectly normal for an I/O operation to take many seconds. For example, when we are just listening on a socket.

ayende commented 9 years ago

@OtherCrashOverride also note that this isn't just about buffers. OverlappedData is also a common issue here.

OtherCrashOverride commented 9 years ago

That would only work if the I/O operations are short. In practice, it is perfectly normal for an I/O operation to take many seconds.

The longer the operation takes to complete, the more opportunity the GC has to relocate the other buffer. When the buffers swap which one is pinned, the GC then has the opportunity to relocate the other buffer too. The result is that, over time, both buffers are moved and no longer cause a fragmentation issue.

The suggestions were offered in the hope they would be helpful. If they are not of benefit, you may simply disregard them.

jkotas commented 9 years ago

BTW: https://github.com/dotnet/coreclr/blob/master/src/mscorlib/Common/PinnableBufferCache.cs is a helper designed to deal with the buffer pinning problem. It has methods to explicitly allocate and free buffers. Internally, it manages free buffers in a GC friendly way to avoid problems with pinned buffers described above.

Socket implementation is using it as well.
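
Going by the shape of the class in that file (a cache name plus a factory delegate, with Allocate/Free), usage would presumably look like this sketch; check the linked source for the exact signatures:

// Sketch only; PinnableBufferCache is internal, so this assumes the file
// has been copied into your own project (as suggested below).
var cache = new PinnableBufferCache("MyBuffers", () => new byte[4096]);

byte[] buffer = (byte[])cache.Allocate();
try
{
    // ... hand the buffer to pinned I/O ...
}
finally
{
    cache.Free(buffer); // returned buffers age into a GC-friendly pinned pool
}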

ayende commented 9 years ago

Yes, we are doing something similar, by pre-allocating buffers and trying to get them to Gen2 quickly. We would love to be able to use this directly, but it is internal :-( But this isn't just about pinned buffers; in general, more control over allocations would be great.

jkotas commented 9 years ago

You could just include the PinnableBufferCache.cs file in your project - it is not that much code. It is what we have been doing in the .NET Framework so far.

@brianrob Have we considered turning PinnableBufferCache.cs into something more official?

brianrob commented 9 years ago

Yes, we have discussed making PinnableBufferCache more official. We just haven't gotten there yet - it is in the plans.

benaadams commented 8 years ago

Specific NUMA-node allocation would be helpful, though it probably requires easier thread affinity to make use of it.

mgravell commented 8 years ago

Have you seen the corefxlab stuff? In particular System.Buffers; this is all just speculative ideas, of course, not actual production stuff: https://github.com/dotnet/corefxlab/tree/master/src/System.Buffers

clrjunkie commented 8 years ago

We and every other GC language out there have been working around this problem for the past decade.

http://blogs.msdn.com/b/brada/archive/2003/08/08/50218.aspx

The performance penalty is not only due to memory fragmentation but also because managed byte arrays are zeroed out upon allocation.

I always find it hilarious when people reference insane benchmark results of web frameworks that do custom memory allocation on pre-allocated byte arrays for processing 500-byte requests, but fall apart like a deck of cards (tested) once business-logic processing is introduced that actually requires doing something useful with the data (like passing it to another class :), or when traffic needs to be decrypted from SSL and COPIED to user buffers.

Don't get me wrong: if you want to build a router, there are actually very good frameworks out there for the task, but let's be clear – a router.

If you got to a place where you need to do manual memory management, sorry, "buffer pooling", then why not consider C++? I think "new" and "delete" already do a better job than "push" and "pop" or whatever, plus you will probably get far less INTERNAL fragmentation than any "off the top of the head invented heap"; C++ already has constructs in place to ease the pain of doing memory management, where in C# you have practically nothing to guard you from buffer pool leaks. Alternatively, if C++ language syntax is the problem, then let's do a "C# native edition" – game over.

ayende commented 8 years ago

Because while, for the most part, I get quite a lot from running in a managed context, being able to control where I'm allocating things would give me much better handling for a lot of hard scenarios. Note that what I would really like is to define a custom heap, and just drop the whole thing in one shot. That would be much easier and would reduce the time we spend in GC significantly.

masonwheeler commented 8 years ago

What I would really like is to define a custom heap, and just drop the whole thing in one shot. That would be much easier and reduce the time we spent in GC significantly.

That's an interesting idea, but how would the "dropping" work? Seems to me there are only two possibilities: it gets dropped when there are no live references to anything in the heap, in which case we're not taking the GC out of the picture after all; or it gets dropped when you call the .Drop() method, in which case you've just introduced the concept of dangling references into what used to be a memory-safe environment. It doesn't seem like either one is a good solution.

ayende commented 8 years ago

The idea is that we can explicitly drop the heap, yes. And yes, that comes with dangling pointers, but anyone doing the sort of complex work that requires it is already in a position where they are working with unsafe code (native calls, unmanaged memory, etc.), so that doesn't change much in that regard.

clrjunkie commented 8 years ago

I think what you would really like is to give the GC more opportunity to compact Gen0 by reducing the time memory is pinned while an async operation is pending… in other words, you want to get a notification when data has been received or is ready to be sent, so you can copy the data into or from your own buffers, as opposed to handing them out in advance (a.k.a. an event loop).

ayende commented 8 years ago

That would be nice, but that isn't actually what I want. Consider work that requires a lot of memory. Processing a file, for example. Once the file processing is done, all the memory used in processing it can be freed at once. Processing a file can take multiple minutes, resulting in a lot of that memory ending up in Gen1/Gen2.

I would like to be able to just kill all of that memory at once. It would also be useful to be able to limit how much memory I am using on a particular file, because right now, there is very little that I can do to even account for that.

clrjunkie commented 8 years ago

Why do you need the entire file in memory for processing? Are you resizing bitmaps? Why not process it in chunks? Why do you need to keep the memory pinned while you process it? What’s the problem with calling GC.Collect() when you actually need to?

My suggestion was based on your comment from dotnet/coreclr#1236:

“Example, socket.ReceiveAsync() this is going to be a pending operation (with pinned I/O) until the user send some data to us. This can be 15 seconds, or it can be a few hours, depending on the scenario.”

benaadams commented 8 years ago

You can allocate large chunks on the LOH, which you then partition for what are effectively custom allocators.
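
A minimal sketch of that approach, with invented sizes (arrays over roughly 85,000 bytes land on the LOH):

using System;

class LohArena
{
    // One big array that lives on the Large Object Heap and stays put
    // under the default (non-compacting) LOH policy.
    private readonly byte[] _chunk = new byte[1024 * 1024];
    private int _offset;

    public ArraySegment<byte> Allocate(int size)
    {
        if (_offset + size > _chunk.Length)
            throw new InvalidOperationException("arena exhausted (sketch only)");
        var segment = new ArraySegment<byte>(_chunk, _offset, size);
        _offset += size;
        return segment;
    }

    // Drop everything the arena handed out in one O(1) step.
    public void Reset()
    {
        _offset = 0;
    }
}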

ayende commented 8 years ago

Imagine that I'm reading a large JSON file. It is full of records that I need to process. Because I have a pipeline with multiple targets, and there are I/O costs associated with each target, I'm processing things in batches. So I load 128K records, process them (which has its own memory costs), and then I can discard the whole thing, because I'm done with it and the results are persisted.

clrjunkie commented 8 years ago

Sounds like your "batch" boundary is a file and you don't have a JSON parser that can parse file chunks... What if you get a JSON file with 100 million records? Seems like you have a parser problem...

ayende commented 8 years ago

The actual problem isn't the parser. We are reading documents from disk, and we need to do this in batches; the cost per batch is pretty high, so we increase the number of documents per batch to decrease the cost per document.

Luaancz commented 8 years ago

@OtherCrashOverride

Because as mentioned it allows the GC to relocate a buffer during the time the other is pinned and waiting to be filled.

The main problem with this is that this is not how the .NET GC works. Right now, it will either compact the whole heap or not. It doesn't relocate the objects that aren't pinned while leaving the pinned objects alone - if there's a pinned object in your heap, the heap will not be compacted. There are still some attempts by the GC to make this work, but the only thing it can do at this point is create more heaps.

On our (originally .NET 2.0) socket server, this resulted in brutal memory fragmentation, with hundreds of individual heaps that were 99% empty, with a pinned buffer or two. There are ways to avoid this now (using pre-allocated buffer pools and making sure they are big enough to be allocated on the LOH), but it's a huge pain.

It's not something trivially fixable. .NET allocations are almost free - you always allocate new objects on top of the heap (as if it were a stack), never in the middle (the exception is the LOH, of course, which is why you can get real benefits from ensuring all your pinned buffers are on the LOH - but that can get expensive real quick). Of course, to make this allocation model work, the heap must be compacted periodically, to squeeze out the free space. Pinning prevents that.
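
To make the pinning concrete, this is roughly what a user-level pin looks like; while the handle is alive, the GC cannot move the array (a small self-contained illustration, not from any of the code discussed here):

using System;
using System.Runtime.InteropServices;

static class PinningExample
{
    static void Demo()
    {
        byte[] buffer = new byte[4096];
        GCHandle handle = GCHandle.Alloc(buffer, GCHandleType.Pinned);
        try
        {
            // A stable address that can be handed to native code.
            IntPtr address = handle.AddrOfPinnedObject();
        }
        finally
        {
            handle.Free(); // un-pin, so the GC may move the array again
        }
    }
}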

The other cool trick is that, due to the stack-like + compacting heap model, you usually get great data locality - something that's very important in a heavily object- (and function-) oriented environment like .NET. Indirection often means "look 40 bytes to the left", which can save quite a bit of cache thrashing.

Enabling the option to create your own heaps and force allocation to them would help a lot, even if you didn't break memory safety as Oren's original suggestion does. The biggest problem with breaking the memory guarantees is that, unlike unsafe code (and weak references), there's no way to explicitly show that this reference in particular is potentially unsafe - and no way to check if the reference is still safe to use right now (or in the middle of a method call on that object).

A nice middle ground might be a set of APIs that are built closer to native - strings that can be stack-allocated, stream APIs that use byte* instead of byte[], etc. Of course, if you can sacrifice portability, you can do this yourself even now - as long as you only keep those for the critical paths, it's pretty manageable. It does mean some code duplication, but again, as long as you keep things small, it's not a big problem. Maybe there even is some library that allows you to do this?
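
For example, a scratch buffer via stackalloc never touches the GC heap and never needs pinning (a purely illustrative helper, assuming a non-negative input):

static unsafe int SumOfDigits(int value)
{
    // Stack memory: no GC allocation, no pinning, gone when the method returns.
    byte* digits = stackalloc byte[10];
    int count = 0;
    do { digits[count++] = (byte)(value % 10); value /= 10; } while (value > 0);

    int sum = 0;
    for (int i = 0; i < count; i++)
        sum += digits[i];
    return sum;
}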

dduerner-ycwang commented 7 years ago

@Luaancz

"A nice middle ground might be a set of APIs that are built closer to native - strings that can be stackallocated, stream APIs that use byte* instead of byte[] etc."

@ayende

"I would like to be able to just kill all of that memory at once."

"So I load 128K records, process them (which has its own memory costs), then I can discard the whole thing because I'm done with it..."

We weren't sure if this issue thread is still active, or if this is even still a problem.

We wanted to share our MemoryStreamNoLOH class just in case it might be useful in any way. Don't know if this was the kind of thing you were talking about, or not even close...just trying to help.

The MemoryStreamNoLOH class implements the stream interface (deriving from the base class Stream). It has:

stream.Read(byte[] ...)
stream.Read(byte* ...)
stream.Write(byte[] ...)
stream.Write(byte* ...)

It's backed by unmanaged memory: a list of unmanaged segment blocks. The segment size is adjustable; you could have one gigantic segment block (1GB, for example) or many small segment blocks (128KB, for example). The segments don't have to be contiguous. The reads and writes do modulus arithmetic, essentially wrapping from the end of one segment to the beginning of the next, presenting the memory as if it were one contiguous block.

It automatically expands; it's not a fixed size. The writes will automatically add additional segments when needed.

When it's disposed (like in a using block), all the memory is immediately released and returned to the OS. It basically gives you pretty good control over the memory (i.e., a deterministic end).

The memory backing the stream is completely outside the control of the GC. Being outside the GC can eliminate the pinning problem in some cases. But the nice thing is that the stream class still enjoys the benefit of a garbage-collected language (since the GC disposes it, there are no leaks).
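
The segment addressing described above presumably reduces to division/modulus, something like this sketch (all names invented; see the actual class in the download below):

using System;
using System.Collections.Generic;

static unsafe class SegmentedAddressing
{
    // Map a logical stream position onto (segment, offset); the unmanaged
    // segment blocks never need to be contiguous.
    public static byte* Locate(List<IntPtr> segments, int segmentSize, long position)
    {
        int segmentIndex = (int)(position / segmentSize);
        int offsetInSegment = (int)(position % segmentSize);
        return (byte*)segments[segmentIndex] + offsetInSegment;
    }
}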

We originally created it for binary serialization stuff, but it can also be used as a sort of bridge between the managed and unmanaged worlds, for example:

// Example 1: round-trip a managed byte[] through the unmanaged stream
// (this and the following examples need an unsafe context for the fixed blocks).
byte[] bytes = Encoding.ASCII.GetBytes("hello");
fixed (byte* b = &bytes[0])
{
    using (MemoryStreamNoLOH ms = new MemoryStreamNoLOH())
    {
        ms.Write(b, 0, bytes.GetLength(0));

        ms.Position = 0;
        byte[] b2 = new byte[ms.Length];
        ms.Read(b2, 0, b2.GetLength(0));

        Debug.WriteLine(Encoding.ASCII.GetString(b2));
    }
}

// Example 2: bridge to memory allocated directly from the OS with VirtualAlloc.
IntPtr ptr = IntPtr.Zero;
try
{
    byte[] bytes = Encoding.ASCII.GetBytes("hello");
    fixed (byte* b = &bytes[0])
    {
        ptr = Win32.VirtualAlloc(IntPtr.Zero, (IntPtr)(1024 * 64), Win32.MEM_RESERVE | Win32.MEM_COMMIT, Win32.PAGE_READWRITE);

        Win32.MoveMemory(ptr, (IntPtr)b, (IntPtr)bytes.GetLength(0));
        string s = Encoding.ASCII.GetString((byte*)ptr, bytes.GetLength(0));

        using (MemoryStreamNoLOH ms = new MemoryStreamNoLOH())
        {
            ms.Write((byte*)ptr, 0, bytes.GetLength(0));

            ms.Position = 0;
            byte[] b2 = new byte[ms.Length];
            ms.Read(b2, 0, b2.GetLength(0));

            Debug.WriteLine(Encoding.ASCII.GetString(b2));
        }
    }
}
finally
{
    if (ptr != IntPtr.Zero)
        Win32.VirtualFree(ptr, IntPtr.Zero, Win32.MEM_RELEASE);
}

// Example 3: copy the unmanaged stream's contents straight to a NetworkStream
// (Socket has no GetStream(), so wrap the socket in a NetworkStream).
using (NetworkStream stream = new NetworkStream(socket))
{
    using (MemoryStreamNoLOH ms = new MemoryStreamNoLOH())
    {
        ...

        ms.Position = 0;
        ms.CopyTo(stream);
    }
}

If you want to look at it, the MemoryStreamNoLOH class is inside the NoLOH library download at:

https://www.codeproject.com/Articles/1191534/To-Heap-or-not-to-Heap-That-s-the-Large-Object-Que

Also, the NoPin library download that's there lets you have a private unmanaged heap (Low-Fragmentation Heap) inside a C# program (basically letting you have a private heap outside the control of the GC that doesn't need pinning).

Don't know if this is useful or possibly sparks any new ideas...just thought we would share.

mattwarren commented 7 years ago

It seems like you may be able to achieve (some of) this if/when the work being done in the Snowflake project arrives in CoreCLR; see "Project Snowflake: Non-blocking Safe Manual Memory Management in .NET" (July 26, 2017) for more info.

The code sample below is from the paper; it shows the usage of Shield<T>, which implies that the allocation is on a different heap (i.e., no GC) and can be cleaned up when it's safe to do so:

T Find(Predicate<T> match) 
{
    using (Shield<T[]> s_items = _items.Defend())
    {
        for (int i = 0; i < _size; i++) 
        {
            if (match(s_items.Value[i]))
                return s_items.Value[i];
        } 
    }
    return default(T);
}

ZacLiveEarth commented 6 years ago

I'm also suffering from a problem with pinned memory that's being used by an unchangeable third-party library.

Here's a screenshot from DotMemory:

[DotMemory screenshot omitted]

Note in particular Gen 0, with 188 KB used and 1.54 GB free.

While I'd love something like Project Snowflake to provide manual memory management in the critical 3%, I wonder if there's a way for the framework to deal with this particular problem. Imagine, for example, that the JIT could statically determine in some cases that an allocation was going to be pinned. Maybe that allocation could go on a "Pinned Object Heap" - analogous to the Large Object Heap - that exists specifically to make GC more effective on the generational heaps. That may be a good 80% solution that works without any changes to existing code.

denisvlah commented 6 years ago

Check out this post: https://blog.adamfurmanek.pl/2016/05/07/custom-memory-allocation-in-c-part-3/ This guy was able to implement a custom memory allocator with a lot of hacks.

All networking server applications (including ASP.NET) would greatly benefit from custom memory allocators. Even Gen0 GC can be completely avoided.

Project Snowflake will be a great move for the .NET ecosystem, but it is only one kind of memory allocator. I would also like to see an arena allocator that can drop all objects with O(1) complexity and allocate new objects with O(1) complexity. So .NET should provide an API for implementing allocators and allow users to use them at their own risk. This would make it possible to write high-performance server apps without switching to C++.
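
For what it's worth, a minimal unmanaged arena along those lines might look like this sketch (no safety, no growth; purely illustrative):

using System;
using System.Runtime.InteropServices;

sealed unsafe class Arena : IDisposable
{
    private readonly byte* _start;
    private readonly long _capacity;
    private long _used;

    public Arena(long capacity)
    {
        _start = (byte*)Marshal.AllocHGlobal((IntPtr)capacity);
        _capacity = capacity;
    }

    // O(1) bump-pointer allocation.
    public byte* Allocate(long size)
    {
        if (_used + size > _capacity)
            throw new OutOfMemoryException("arena exhausted");
        byte* p = _start + _used;
        _used += size;
        return p;
    }

    // O(1) drop of every object the arena ever handed out.
    public void Dispose()
    {
        Marshal.FreeHGlobal((IntPtr)_start);
    }
}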