
More efficient MemoryAllocator #1596

Closed antonfirsov closed 2 years ago

antonfirsov commented 3 years ago

One way to address the concerns in #1590 is to come up with a new MemoryAllocator that meets the following two requirements even under very high load:

(A) Steady allocation patterns, instead of GC fluctuations
(B) A lower amount of memory being retained, at least "after some time"

Some ideas to explore:

  1. Allocate native memory over a certain threshold (~1MB), deferring the memory management to the OS.
     1.1. Consider pooling unmanaged memory, especially if we implement the next point.
  2. Go discontiguous, and build all large buffers from fixed-size blocks of memory, similarly to RecyclableMemoryStream. It's worth checking how this works with both pooled arrays and unmanaged memory.
  3. Have a threshold on the retained memory, and release some of the pooled arrays (or pooled unmanaged buffers) when we grow over it.
  4. Mentioning an idea I really hate for the sake of completeness: Have a synchronization mechanism around large buffer acquisition, similarly to System.Drawing (WIC?). To do this properly, memory allocation should become an asynchronous operation, otherwise SpinLocks will just add more fuel to the fire.

Point 1. seems to be very simple to prototype. We need an allocator that uses Marshal.AllocHGlobal over some threshold and an ArrayPool below it, and then see how the memory timeline looks with the bee heads MemoryStress benchmark in comparison to ArrayPoolMemoryAllocator.
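To make this concrete, here's a minimal sketch of such a threshold-based allocator (hypothetical class and names, not ImageSharp's actual API; note the deliberate lack of a finalizer or SafeHandle around the native block, a hazard discussed further down the thread):

using System;
using System.Buffers;
using System.Runtime.InteropServices;

// Hypothetical sketch of idea 1: pooled arrays below the threshold,
// OS-managed native memory above it.
internal static class ThresholdAllocator
{
    private const int NativeThreshold = 1 << 20; // ~1MB

    public static IMemoryOwner<byte> Allocate(int byteCount) =>
        byteCount < NativeThreshold
            ? MemoryPool<byte>.Shared.Rent(byteCount) // ArrayPool-backed
            : new NativeOwner(byteCount);

    // Owns a raw AllocHGlobal block. No finalizer: if the caller forgets
    // to Dispose, the native memory leaks.
    private sealed unsafe class NativeOwner : MemoryManager<byte>
    {
        private readonly IntPtr ptr;
        private readonly int length;

        public NativeOwner(int length)
        {
            this.ptr = Marshal.AllocHGlobal(length);
            this.length = length;
        }

        public override Span<byte> GetSpan() => new Span<byte>((void*)this.ptr, this.length);

        public override MemoryHandle Pin(int elementIndex = 0) =>
            new MemoryHandle((byte*)this.ptr + elementIndex);

        public override void Unpin()
        {
        }

        protected override void Dispose(bool disposing) => Marshal.FreeHGlobal(this.ptr);
    }
}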

@saucecontrol any thoughts or further ideas? (Especially on point 2.)

saucecontrol commented 3 years ago

Idea 2 is interesting. I wonder, though, if you can manage discontiguous buffers, is it much of a stretch from there to go fully streamed?

For the record, I also hate idea 4, but that may be necessary if you really do have to materialize large images and care at all about 32-bit hosts. The machine I ran my tests on was only a quad core as well. I wonder what would happen on a 16+ core machine... I reckon even in 64-bit you'd start paging.

antonfirsov commented 3 years ago

I wonder, though, if you can manage discontiguous buffers, is it much of a stretch from there to go fully streamed?

Implementing the ability to have discontiguous buffers was relatively easy, since our processing code usually goes row by row. We just had to upgrade Buffer2D.GetRowSpan(y) so it can take the row from different buffers (a buffer boundary always falls between rows). Going fully streamed needs much bigger architectural changes; I would rather start with "decode into target size" first, and see where we can get from there.
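To illustrate (hypothetical type, not the actual Buffer2D<T> internals): because a block boundary never splits a row, a row lookup is just a block index plus an offset within the block.

using System;

// Hypothetical illustration of a 2D buffer built from fixed-size blocks,
// where rows are never split across block boundaries.
internal sealed class DiscontiguousBuffer2D<T>
{
    private readonly T[][] blocks;
    private readonly int width;
    private readonly int rowsPerBlock;

    public DiscontiguousBuffer2D(int width, int height, int rowsPerBlock)
    {
        this.width = width;
        this.rowsPerBlock = rowsPerBlock;
        int blockCount = (height + rowsPerBlock - 1) / rowsPerBlock;
        this.blocks = new T[blockCount][];
        for (int i = 0; i < blockCount; i++)
        {
            int rowsInBlock = Math.Min(rowsPerBlock, height - (i * rowsPerBlock));
            this.blocks[i] = new T[rowsInBlock * width];
        }
    }

    // The requested row always lives entirely within a single block.
    public Span<T> GetRowSpan(int y) =>
        this.blocks[y / this.rowsPerBlock].AsSpan((y % this.rowsPerBlock) * this.width, this.width);
}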

Currently we have a very high limit for images being split into different buffers; what I'm considering is to check what happens if we aggressively lower that limit. I'm unable to assess whether it would make things worse or better.

but that may be necessary if you really do have to materialize large images and care at all about 32-bit hosts.

That's why I want to avoid materializing them :) Will open an issue for tracking the "decode into target size" stuff tomorrow.

The machine I ran my tests on was only a quad core as well. I wonder what would happen on a 16+ core machine...

I was wondering why you are using the basic Parallel.For overload that scales parallelism with the machine's CPU count. Is this a way of modeling typical server-side multi-threaded stress?

saucecontrol commented 3 years ago

I'm unable to assess whether it would make things worse or better.

I don't think it would make things worse. If I understand right, your current mode of processing is to apply each transform to the entire image before moving on to the next? If so, you're kind of thrashing the cache at each stage, so it wouldn't matter whether the memory is contiguous or not. I'm not sure it would make things better either, though. You might want to look at your LOH fragmentation under those high-memory conditions and see whether being able to stuff a smaller buffer into an LOH hole would gain you anything. Switching to unmanaged buffers may help a bit there, but heap fragmentation can be a problem in unmanaged as well.

Of course being able to aggregate small buffers in place of a single larger one would be of benefit, provided you don't need them at the same time. I assume if that works for RecyclableMemoryStream, it should work for ImageSharp 🤷‍♂️

Is this a way of modeling typical server-side multi-threaded stress?

I must admit I didn't put that much thought into it. But yeah, the point was to make sure there were enough threads running to check whether the libraries were capable of true linear scale-up. As long as there are free processors to hide work on (as Vips does today -- and ImageSharp did before your windowed convolution implementation), you can't be sure. If you think about a high-load web server scenario, you would assume the managed thread pool would fully occupy all processors, so I want to know how a library will perform in that environment -- and to be sure that a large number of image processing operations doesn't make for a self-DoS.

BTW, I suspect the anomalous CPU measurements I got for Vips running under the profiler might have actually come from an oversubscription problem. Could be the CPU overhead of the profiler was just enough to throw the whole system into chaos and drop the speed by several multiples. I plan on doing some followup testing to see if I can nail that down.

saucecontrol commented 3 years ago

A couple of other interesting stress cases you might want to look at:

  1. 3400+ frame GIF: http://i.imgur.com/Xv4UKYL.gif
  2. Lots of good giant JPEGs available on the ESO site, like: https://www.eso.org/public/usa/images/eso1719a/

antonfirsov commented 3 years ago

The reason I'm thinking about "discontiguous memory by default" is not cache coherence, but the fact that it may enable more efficient pooling, reducing the GC pressure (or the OS allocation traffic in case of unmanaged buffers). With large images it is very likely that the requested buffer will outgrow the pool, deferring to the GC/OS by allocating new giant arrays of non-uniform sizes, which would probably result in heap fragmentation even if we are using unmanaged memory.

I'm thinking of a solution that works as follows: we utilize a custom pool that provides uniform (~2MB) LOH buffers, and build the large discontiguous buffers aka. MemoryGroups out of these subbuffers, trimming some of the retained arrays when the GC signals memory pressure (the Gen2 callback trick).

@saucecontrol any thoughts on this particular plan? I wish I could pull in a GC expert to assess it ...

saucecontrol commented 3 years ago

That plan looks sound to me, but I'm no GC expert. I do expect that inability to grab a contiguous large block of memory due to LOH (or process heap) fragmentation is at play in the 32-bit OOM scenario, but it would be good to verify that before diving in to the work if possible.

It seems like if you end up doing the work to completely manage the allocations, you may as well go straight to unmanaged memory so that when you say to free something you don't have to then wait for GC to comply and then do its LOH compaction and all that. Plus, you'll probably always have to deal with some allocation requests that will be over whatever pooling threshold you set, and you'll want those to go straight to unmanaged so they are as short-lived as possible.

antonfirsov commented 3 years ago

you may as well go straight to unmanaged memory so that when you say to free something you don't have to then wait for GC to comply and then do its LOH compaction and all that

The problem is that I don't really know when to free up memory. With LOH + the Gen2Callback trick I can at least rely on GC heuristics for this. (Otherwise it feels like implementing my own GC.)

JimBobSquarePants commented 3 years ago

Is this useful?

https://github.com/sakno/dotNext/tree/567ac01b54783a2cb9ef167c6fe603dcb42235f5/src/DotNext.Unsafe/Buffers

antonfirsov commented 3 years ago

@JimBobSquarePants it just allocates / returns unmanaged memory without pooling. Maybe that's a good enough (or even preferable) approach with unmanaged memory, not sure.

My main concern about unmanaged memory is that I would be hesitant to use it as a default, because of the users who may implicitly depend on not disposing Image. An update would introduce a reliability / security issue for them in place of their existing performance issue.

saucecontrol commented 3 years ago

Instead of holding the raw IntPtr as a member of the MemoryManager<T>, that should hold a SafeHandle wrapping it. Something like:

using System;
using System.Runtime.InteropServices;

internal sealed class SafeHGlobalHandle : SafeHandle
{
    private readonly int byteCount;

    public override bool IsInvalid => handle == IntPtr.Zero;

    public SafeHGlobalHandle(int size) : base(IntPtr.Zero, true)
    {
        SetHandle(Marshal.AllocHGlobal(size));

        // Tell the GC about the native allocation so its heuristics can account for it.
        GC.AddMemoryPressure(byteCount = size);
    }

    protected override bool ReleaseHandle()
    {
        if (IsInvalid)
            return false;

        Marshal.FreeHGlobal(handle);
        GC.RemoveMemoryPressure(byteCount);
        handle = IntPtr.Zero;

        return true;
    }
}

Since SafeHandle has a finalizer, it won't leak, but you can call Dispose to free it immediately if the MemoryManager<T> is Disposed properly.

It might be worth implementing something like that for any allocations larger than your max pool buffer size just because you know those will happen and will have a large GC cost. It's also a lot easier to trial than the segmented buffer idea.
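For illustration, here's a sketch of the MemoryManager<T> side holding that handle (simplified; a production version would also guard against use-after-dispose):

using System;
using System.Buffers;

// Sketch: the MemoryManager<T> owns a SafeHGlobalHandle instead of a raw IntPtr,
// so the native block is freed by the SafeHandle finalizer even if nobody Disposes.
internal sealed unsafe class UnmanagedMemoryManager<T> : MemoryManager<T>
    where T : unmanaged
{
    private readonly SafeHGlobalHandle handle;
    private readonly int length; // element count

    public UnmanagedMemoryManager(int length)
    {
        this.length = length;
        this.handle = new SafeHGlobalHandle(length * sizeof(T));
    }

    public override Span<T> GetSpan() =>
        new Span<T>((void*)this.handle.DangerousGetHandle(), this.length);

    public override MemoryHandle Pin(int elementIndex = 0) =>
        new MemoryHandle((T*)this.handle.DangerousGetHandle() + elementIndex);

    public override void Unpin()
    {
    }

    protected override void Dispose(bool disposing)
    {
        if (disposing)
        {
            this.handle.Dispose(); // frees the native memory immediately
        }
    }
}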

saucecontrol commented 3 years ago

FYI https://github.com/dotnet/runtime/issues/52098 is tracking some possible scenarios to improve in ArrayPool/GC interaction. I think Jan's scenarios 1 and 3 (and maybe 2 as well) are applicable to ImageSharp.

antonfirsov commented 3 years ago

There are so many things I hate about ArrayPool, especially ArrayPool<T>.Shared, which is library-level public shared global state and as such something that shouldn't exist at all IMO.

All the magic ArrayPool.Shared is doing should be a runtime/GC implementation detail hidden behind standard new T[] allocations. (Edit: OK, that was mostly ranting; this approach would require an API like GC.UnsafeReleaseArray(...), but AFAIK array pooling is much less common in Java for example, so I assume the GC does a better job there.)

br3aker commented 3 years ago

Hi everybody! I have a couple of thoughts on this project I'd like to share. Thanks for such a good topic on memory performance.

First of all, the current state of pooling already supports discontiguous buffers. It's just hidden behind the MemoryGroup<T>.Allocate(...) call:

First it checks if a single stride can fit in a single contiguous buffer: https://github.com/SixLabors/ImageSharp/blob/a8cb71158493096121439d937b7e79aa5ea2727a/src/ImageSharp/Memory/DiscontiguousBuffers/MemoryGroup%7BT%7D.cs#L85

Then it does strange things: https://github.com/SixLabors/ImageSharp/blob/a8cb71158493096121439d937b7e79aa5ea2727a/src/ImageSharp/Memory/DiscontiguousBuffers/MemoryGroup%7BT%7D.cs#L99-L107

Details aside - this alloc call either throws or pools a single contiguous buffer each time. So here goes the first problem: most of the time the pool is stressed by huge arrays from the buckets at the higher end of the indices while the low indices sit empty, which makes the maxArraysPerBucket config value completely irrelevant. Edit: this can't be fixed by simply rewriting the MemoryGroup<T>.Allocate(...) call; at least the resizing processor assumes that the entire underlying Buffer2D of pixel data is a single contiguous buffer and throws a specific exception. Haven't checked other processors but I guess it might also be a problem somewhere else.

Edit: huge mistake in my calculations and in the hardcoded upper bound of the pool max size (it's hex, not dec); this still applies, but only for certain pool setups, see the post below. But there's a far more complicated problem: pooling doesn't work in certain situations. This is solely due to the current implementation of the pool in .NET, which silently clamps the maximum size of the array:

const int MinimumArrayLength = 0x10, MaximumArrayLength = 0x40000000;
if (maxArrayLength > MaximumArrayLength)
{
    maxArrayLength = MaximumArrayLength;
}

Which leads to unexpected LOH allocations when working with images up to ~40mb of pixel data, and that isn't very rare: an rgb24 1080p jpeg is 24 * 1920 * 1080 = ~49mb of decoded pixel data. The bee heads test completely ignores pooling because it consists of rgb24 JPEGs with at least 3000x3000 pixels.

Parallel.ForEach over the bee heads can skyrocket memory usage (I've been counting gigabytes of working set during execution) with frequent GC(2) collections.

saucecontrol commented 3 years ago

You've got a couple of translation errors in your analysis. First, the hard-coded MaximumArrayLength = 0x40000000 isn't 40MB -- it's hex, so that's 1GB. Second, RGB24 is 24 bits, or 3 bytes, per pixel. So a decoded 1920x1080 RGB image is 6MB.

But you're right about the LOH allocations. The default memory pool is limited to 24MB per array, so images over ~8 megapixels will cause allocations, and there are definitely images in the bee heads set that exceed that limit.

https://github.com/SixLabors/ImageSharp/blob/a8cb71158493096121439d937b7e79aa5ea2727a/src/ImageSharp/Memory/Allocators/ArrayPoolMemoryAllocator.CommonFactoryMethods.cs#L15

The challenge is in achieving a balance between avoiding GC allocations and holding a high steady-state memory level when large arrays get pooled.

br3aker commented 3 years ago

Oh gosh, you're right, math was not on my side tonight, thanks for pointing out.

The worst part about the built-in .NET pool is that it was developed for the general use case. Exponential growth of the internal buckets wastes a lot of space in certain conditions. Empty buckets are not a big deal, but they just sit there for no reason. A pixel buffer consisting of several unmanaged chunks is a much better choice, plus custom alignment and more reliable memory deallocation.

Although GC.AddMemoryPressure looks counterproductive in this particular situation - we don't want to affect the GC at all while using the pool.

br3aker commented 3 years ago

IMO all allocations should be divided into 2 categories: pixel data, which outlives any other allocation type in the context of this library, and everything else. The main memory consumers are pixel buffers, which cause massive hits on both the GC heaps and any theoretical custom memory pool based on unmanaged memory - it'll be much harder to support large buffers while also allowing small buffers of integers/small structs to be pooled.

This way we can delegate those small allocations to the GC (thanks to generations this won't really affect its performance) while concentrating on storing pixel buffers with proper alignment.

Another interesting concept mentioned by @antonfirsov - async/await for memory pooling. This approach could facilitate environments where memory is crucial, but it might be really tricky not to fall into a deadlock when there isn't enough memory for any of the awaiters. Anyway, I don't think it's a good solution for avoiding locks in the pooling logic - some cases simply don't want to work with the TPL: calling something like allocator.GetMemoryAsync(size).Result looks ugly and performs even worse than it looks. Asynchronous pooling could use a separate memory pool so it won't interfere with synchronous pooling. And I still don't think it's worth spending time implementing this atm; better to concentrate on the basic pool implementation.

antonfirsov commented 3 years ago

@br3aker reacting to https://github.com/SixLabors/ImageSharp/issues/1596#issuecomment-837742728 first:

this alloc call either throws or pools a single contiguous buffer each time.

at least the resizing processor assumes that the entire underlying Buffer2D of pixel data is a single contiguous buffer and throws a specific exception

If any of these are true, it's a bug. You should be able to limit the maximum contiguous buffer size by using the following full ArrayPoolMemoryAllocator constructor, and setting bufferCapacityInBytes:

https://github.com/SixLabors/ImageSharp/blob/a8cb71158493096121439d937b7e79aa5ea2727a/src/ImageSharp/Memory/Allocators/ArrayPoolMemoryAllocator.cs#L77-L90

~You can pass the default values for the rest of the parameters.~ -- Except the bucket sizes; you need to increase them significantly to make sure the pool is not drained.

Note that the default for bufferCapacityInBytes is a very high value today (equivalent of ~130 Megapixels of RGBA32), so this may explain that you see contiguous buffers without customization.

I believe lowering the capacity + setting a higher bucket size should help with the biggest "pooling doesn't work in certain situations" problem. If you have a repro proving that this fails on the load+resize+save path (= a contiguous buffer is allocated or an exception is thrown), can you please open a separate issue for it?
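For reference, configuring it looks roughly like the sketch below. Treat this as an approximation: the parameter names follow the linked constructor, but double-check them against the source.

using SixLabors.ImageSharp;
using SixLabors.ImageSharp.Memory;

// Approximate usage; verify parameter names/order against the linked constructor.
// ~4MB contiguous blocks, with larger buckets so the pool isn't drained.
Configuration.Default.MemoryAllocator = new ArrayPoolMemoryAllocator(
    maxPoolSizeInBytes: 64 * 1024 * 1024,
    poolSelectorThresholdInBytes: 1024 * 1024,
    maxArraysPerBucketLargePool: 16,
    maxArraysPerBucketNormalPool: 32,
    bufferCapacityInBytes: 4 * 1024 * 1024);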

antonfirsov commented 3 years ago

Regarding the rest of the comments:

I believe all the mentioned issues can be addressed either by a very basic unmanaged allocator or by the "smart" allocator described in https://github.com/SixLabors/ImageSharp/issues/1596#issuecomment-821146358. Neither of those would use the BCL ArrayPool<T> for large buffers, and therefore they won't suffer from the asymmetric bucket usage problem.

Main memory consumers are pixel buffers

I wish this was the case, but it is not: #1597. This is our top-priority memory issue now, affecting the most common load+resize+save path. Even if we manage to fully optimize it, the "fancy" (blur, sharpen etc.) processors may still keep allocating all kinds of large Buffer2D<T>-s of Width * Height size. I don't see a conceptual difference between the pixel buffers backing Image<T> and other internal buffers.

br3aker commented 3 years ago

Oh yeah, I haven't actually checked all the processors. What I meant is that the allocator shouldn't be bothered with small allocations like here: https://github.com/SixLabors/ImageSharp/blob/a8cb71158493096121439d937b7e79aa5ea2727a/src/ImageSharp/Formats/Png/Zlib/DeflaterHuffman.cs#L497-L507

Those buffers are not actually big: https://github.com/SixLabors/ImageSharp/blob/a8cb71158493096121439d937b7e79aa5ea2727a/src/ImageSharp/Formats/Png/Zlib/DeflaterHuffman.cs#L64-L66

https://github.com/SixLabors/ImageSharp/blob/a8cb71158493096121439d937b7e79aa5ea2727a/src/ImageSharp/Formats/Png/Zlib/DeflaterHuffman.cs#L19-L26

Without proper investigation I thought several places had this kind of small allocation, but I was wrong; that's the only place I could find with an allocator.Allocate call for small buffers (compared to other allocations). My bad, sorry for the noise.

https://github.com/SixLabors/ImageSharp/issues/1596#issuecomment-838175639 I changed the bucket size prior to experimenting with non-contiguous buffers, sorry for not mentioning that. Will try to reproduce and write up a proper issue/comment today.

hexawyz commented 3 years ago

Hello,

I saw this issue pop up in my twitter feed and I was curious about it; I hope you'll forgive me for butting in 👀

I didn't see it mentioned, but at least in the scenario of very large pixel buffers, an option could be to rely on MemoryMappedFile for allocating unmanaged memory in the form of raw memory pages. As an alternative to Marshal.AllocHGlobal, this would avoid fragmenting the native heap (used by AllocHGlobal) and the LOH, at least in a 64-bit process. In a 32-bit process, address space fragmentation could still be a problem.

Similarly to AllocHGlobal/FreeHGlobal, the memory you get from a MemoryMappedFile is allocated and deallocated immediately. However, the memory pages used by a memory-mapped file should be freed immediately, whereas a heap/LOH-based approach could keep the pages in memory (e.g. because they are in the middle of the heap or still partially used).

⚠ One possible drawback could be a performance hit associated with allocating/deallocating the memory pages for the memory-mapped file (this would need to be benchmarked), but it should still be worthwhile for very large buffers, as a heap-based mechanism would also need to allocate new pages.

Of course, this way of allocating memory could also be used with pooling and discontiguous buffers. What it does at the core is give you more control over how much physical memory is allocated, or at least reserved, in your address space. It could be worthwhile if you were to choose to pool your own unmanaged buffers.

For the sake of completeness, you can find a rough example of usage of MemoryMappedFile to allocate private memory here.
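As a rough, self-contained illustration of the technique (my own sketch, not the linked example):

using System;
using System.IO.MemoryMappedFiles;

// Hypothetical sketch: an anonymous (pagefile-backed) mapping used as a raw buffer.
// On Dispose, the pages go straight back to the OS.
internal sealed unsafe class MappedBuffer : IDisposable
{
    private readonly MemoryMappedFile file;
    private readonly MemoryMappedViewAccessor view;

    public byte* Pointer { get; }

    public MappedBuffer(long byteCount)
    {
        // A null name creates an anonymous map: no file on disk, backed by the pagefile.
        this.file = MemoryMappedFile.CreateNew(null, byteCount);
        this.view = this.file.CreateViewAccessor(0, byteCount);

        byte* ptr = null;
        this.view.SafeMemoryMappedViewHandle.AcquirePointer(ref ptr);
        this.Pointer = ptr;
    }

    public void Dispose()
    {
        this.view.SafeMemoryMappedViewHandle.ReleasePointer();
        this.view.Dispose();
        this.file.Dispose(); // unmaps the view and returns the pages to the OS
    }
}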

JimBobSquarePants commented 3 years ago

Regarding discontiguous buffers: if we went that route we would likely not be able to pin anything, and there are a couple of places where we currently do so.

JimBobSquarePants commented 3 years ago

@kkokosa @Maoni0 My sincerest apologies for the name drop but since you are both absolute experts in memory management and the GC it would be wonderful if we could have your opinion on the best design approach.

antonfirsov commented 3 years ago

@JimBobSquarePants in the current design MemoryAllocator.Allocate<T> will always allocate a contiguous buffer, while GetBufferCapacityInBytes works as a hint for building large (potentially discontiguous) 2D buffers. We never want to pin the contents of a Buffer2D; we should always process it row span by row span. These buffers are responsible for the majority of our allocations, so we need to find a good strategy for managing them. This shouldn't prevent us from enforcing contiguous buffers in special cases.

@GoldenCrystal using memory-mapped files is a good idea, but it's the most complex option of all, needs a lot of research, and has to be implemented with care. If possible, I would prefer to go with managed LOH arrays, because anything unmanaged is a breaking change (an application leaking IDisposable Image-s that updates to the new version of the library would introduce a security risk by actually starting to leak unmanaged memory).

antonfirsov commented 3 years ago

In case anyone chooses to chime in but finds the thread TL;DR:

I'm thinking about a custom pool of uniform (~2MB) LOH arrays that can be used to build larger discontiguous buffers to back ImageSharp's Image<TPixel> and other very large internal temporary "2D" buffers.

Q1: Is this a viable strategy to fight fragmentation and OOMs in memory-constrained environments, or are there some GC behaviors that make this inefficient?
Q2: What is the best way to detect high memory pressure and release the pooled buffers so the GC can clean them up? (Preferably cross-platform, or at least something that doesn't need .NET 5/6.)
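To make Q1 concrete, here's a bare-bones sketch of the pool I have in mind (hypothetical; the trimming policy is exactly what Q2 is about):

using System.Collections.Concurrent;

// Hypothetical sketch: a pool of uniform LOH blocks. Larger discontiguous
// buffers would be built by renting several of these and exposing them
// as a single MemoryGroup.
internal sealed class UniformArrayPool
{
    public const int BlockSize = 2 * 1024 * 1024; // ~2MB, always lands on the LOH

    private readonly ConcurrentBag<byte[]> free = new ConcurrentBag<byte[]>();

    public byte[] Rent() => this.free.TryTake(out byte[] block) ? block : new byte[BlockSize];

    public void Return(byte[] block) => this.free.Add(block);

    // Q2 territory: when high memory pressure is detected, dropping the
    // references here makes the blocks collectible by the next gen2 GC.
    public void Trim()
    {
        while (this.free.TryTake(out _))
        {
        }
    }
}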

JimBobSquarePants commented 3 years ago

@JimBobSquarePants in the current design MemoryAllocator.Allocate<T> will always allocate a contiguous buffer, while GetBufferCapacityInBytes works as a hint for building large (potentially discontiguous) 2D buffers. We never want to pin the contents of a Buffer2D; we should always process it row span by row span. These buffers are responsible for the majority of our allocations, so we need to find a good strategy for managing them. This shouldn't prevent us from enforcing contiguous buffers in special cases.

Yep but this had me concerned.

Go discontiguous, and build all large buffers from fixed-size blocks of memory, similarly to RecyclableMemoryStream.

Regarding GetBufferCapacityInBytes: ArrayPoolMemoryAllocator currently uses this as more than a hint, throwing an InvalidMemoryOperationException if the requested 1D buffer is larger than BufferCapacityInBytes, which I'm not currently a fan of.

br3aker commented 3 years ago

Q1: Is this a viable strategy to fight fragmentation and OOMs in memory-constrained environments, or are there some GC behaviors that make this inefficient?

Honestly, the current problems with OutOfMemory exceptions look like a bug rather than real memory management problems. I can't believe that even in parallel tests 8 (or even 16) cores can consume up to 8 GB at peak on such a simple task (source - bee heads demo by @saucecontrol).

Q2: What is the best way to detect high memory pressure and release the pooled buffers so the GC can clean them up? (Preferably cross-platform, or at least something that doesn't need .NET 5/6.)

One of the simplest yet most 'controllable' things you can do from the user's perspective (especially good for bulk processing):

using (var context = new Context()) 
{
    using Image img = Image.Load(..., context);
    using Image resized = img.Resize(..., context);
    resized.Save(..., context);
}

Unfortunately, this would lead to a lot of public API changes/additions, but it allows the user to manually control when processing is over and resources can be freed.

The problem with LOH allocations is that the LOH can't actually trigger GC as reliably as non-LOH allocations do. We have 2 rather rare scenarios for LOH collection:

  1. Low available memory for non-LOH allocations -> full GC collection cycle, freeing all memory
  2. Low available memory for allocating a new LOH buffer for pooling -> full GC collection cycle, as gen2 triggers both gen0 and gen1

(1) won't be caused by pooling if we use LOH buffers; (2) the pool won't allocate new buffers after some time - it would either use its buffers 100% for some processing or have a lot of free buffers.

What I wanted to state is that LOH collection is a really rare occasion unless triggered manually, which is not something library code can reliably control.

Yes, we can try to force a gen2 GC collection when a pool buffer is returned:

// somewhere in the pool class
public void Return(byte[] buffer) {
    // ... buffer return logic ...

    usedBuffersCount--;
    // cast to double: integer division would always round down to 0 here
    if ((double)usedBuffersCount / allocatedBuffersCount < somePercentage) {
        GC.Collect(2);
    }
}

But this should be done with care, as a gen2 collection could cause a significant performance hit.

JimBobSquarePants commented 3 years ago

Q2: What is the best way to detect high memory pressure and release the pooled buffers so the GC can clean them up? (Preferably cross-platform, or at least something that doesn't need .NET 5/6.)

GC.GetGCMemoryInfo is .NET Core 3.0+

For older runtimes, no idea currently. Probably a hard limit or configurable value.

br3aker commented 3 years ago

There's also a problem with memory usage patterns.

This is an approximation of what pool buffer usage looks like for load-resize-save:

[graph]

And this is an approximation of what the 'process something once and forget' use case looks like:

[graph]

Somehow the pooling logic should detect both of these cases and should not clear its internal buffers in the first case.

I don't think there's any solution other than statistical pooling, which would tweak behaviour at runtime based on how memory was pooled/returned over the lifetime of the entire app. Stats can be gathered either during rent/return calls or at fixed intervals using a Timer (or both?).
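A rough sketch of the kind of statistics I mean (hypothetical): track the peak of rented bytes over a window, and let a periodic trim release only what the pool retains above that peak, so the steady load-resize-save case keeps its buffers while the one-shot case gives them back.

using System.Threading;

// Hypothetical sketch: lock-free tracking of rented bytes and the window's peak.
internal sealed class PoolUsageStats
{
    private long rentedBytes;
    private long windowPeak;

    public void OnRent(long bytes)
    {
        long current = Interlocked.Add(ref this.rentedBytes, bytes);
        InterlockedMax(ref this.windowPeak, current);
    }

    public void OnReturn(long bytes) => Interlocked.Add(ref this.rentedBytes, -bytes);

    // Called periodically: anything the pool retains above the last window's
    // peak usage is a candidate for trimming; then a new window starts.
    public long GetPeakAndResetWindow() =>
        Interlocked.Exchange(ref this.windowPeak, Interlocked.Read(ref this.rentedBytes));

    private static void InterlockedMax(ref long location, long value)
    {
        long current = Interlocked.Read(ref location);
        while (value > current)
        {
            long previous = Interlocked.CompareExchange(ref location, value, current);
            if (previous == current)
            {
                break;
            }

            current = previous;
        }
    }
}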

JimBobSquarePants commented 3 years ago

@br3aker I've been experimenting with an implementation that should handle both scenarios in .NET Core 3.1+

https://github.com/SixLabors/ImageSharp/tree/ad96db757cc9b564bfd6db6d738fb648500e3f38

Check out the various ArrayPoolMemoryAllocator partials.

I've tried to keep things simple and have taken on board various suggestions, including @saucecontrol's one regarding using the shared pool for small buffers.

br3aker commented 3 years ago

@JimBobSquarePants unfortunately, there's simply no way to get MemoryLoadBytes / HighMemoryLoadThresholdBytes before GCMemoryInfo went public. Maybe that info could be gathered via reflection somewhere deep in the private details of the runtimes, but I doubt that would lead to a good implementation.

Yet another problem with the current implementation - a race condition. Neither Configuration.MemoryAllocator nor ArrayPoolMemoryAllocator locks in the rent/return code, which can lead to many different entities renting the same buffer.

Considering all the variables (different formats, use cases, image sizes), the chances of this causing problems are very low, but it's a very serious problem nonetheless. And that's where the Timer comes in - it also needs synchronization. While it wouldn't result in a race condition, it would result in a huge memory re-allocation:

  1. allocated but not rented buffers would be dead
  2. allocated and rented buffers would be lost because they'd have nowhere to be returned to <- this is far worse than the first point, because this memory was already allocated and would only be collected by the next gen2 collection; in other words, the pool won't behave like a pool every X minutes under heavy load

A slightly better approach is to trim the buffers instead of completely flushing them, which is impossible without locking.


Even with that in mind, the Timer approach is bad because flushing the buffers would not itself cause a gen2 collection; one would occur some unknown amount of time after the timer callback (which can be called multiple times before an actual gen2 collection occurs). A pre-gen2-collection callback is far preferable to a time-based callback. Right before a gen2 collection we know 2 things:

  1. Memory was under high pressure during the last quantum of time, and we should give away memory which we don't actually need (a)
  2. Right after this callback, a gen2 collection is guaranteed to occur without any delay

*We actually also know that all managed code is suspended at that moment, but right now I can't see how this info could help us.

(a) memory should be given away with caution, ideally using some background statistics about the app's runtime behavior, which is not an option for a general-use library. But we can gather stats during runtime; yes, it would fail to predict optimal numbers for some spans of execution, but in the long run it should be good.

I have some thoughts about a theoretical implementation, but I need some time to think and create a more 'readable' proposal so it won't look like a giant piece of text.

JimBobSquarePants commented 3 years ago

Yet another problem with the current implementation - a race condition. Neither Configuration.MemoryAllocator nor ArrayPoolMemoryAllocator locks in the rent/return code, which can lead to many different entities renting the same buffer.

@br3aker They're the same thing by default - Configuration.MemoryAllocator is an ArrayPoolMemoryAllocator unless overridden. Locking during Rent/Return occurs in the underlying ConfigurableArrayPool<T>.

I agree it would be much better to allow trimming but without completely reimplementing TlsOverPerCoreLockedStacksArrayPool<T> that would be impossible.

br3aker commented 3 years ago

@JimBobSquarePants I must be blind - thanks for clarifying about the locks; I didn't see that _lock.Enter(...) call at first glance.

Yep, the new solution would require an entirely new pool implementation, which I'm willing to try in the upcoming week(s). I'll open a draft if anything works out.

antonfirsov commented 3 years ago

@br3aker on .NET Core there is a trick to detect if a gen2 GC is happening:

https://github.com/dotnet/runtime/blob/01b7e73cd378145264a7cb7a09365b41ed42b240/src/libraries/System.Private.CoreLib/src/System/Gen2GcCallback.cs

What I would do is trim some of the retained buffers on every gen2 GC (eg. 1/2 or 3/4 or similar). I think this would handle the cases you pointed out in https://github.com/SixLabors/ImageSharp/issues/1596#issuecomment-860081499 without complex statistics.
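The trick boils down to a self-resurrecting finalizable object; roughly (a simplified version of the runtime's internal Gen2GcCallback linked above):

using System;

// Simplified version of the linked Gen2GcCallback: an unreferenced object's
// finalizer runs after the GC collects its generation; by re-registering for
// finalization it "resurrects" itself, so once it has been promoted to gen2
// the callback fires on (roughly) every gen2 collection.
internal sealed class Gen2GcCallback
{
    private readonly Func<bool> callback;

    private Gen2GcCallback(Func<bool> callback) => this.callback = callback;

    public static void Register(Func<bool> callback)
    {
        // Drop the reference immediately; the object now lives only
        // through the finalization queue.
        _ = new Gen2GcCallback(callback);
    }

    ~Gen2GcCallback()
    {
        // A false return stops future callbacks; true resurrects the object.
        if (this.callback())
        {
            GC.ReRegisterForFinalize(this);
        }
    }
}

// Usage, e.g. to trim a hypothetical pool on every gen2 GC:
// Gen2GcCallback.Register(() => { pool.Trim(); return true; });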

Note that the contextual batching API you referred to in https://github.com/SixLabors/ImageSharp/issues/1596#issuecomment-860066580 already exists: you can create and pass around a Configuration with its own MemoryAllocator and call configuration.MemoryAllocator.ReleaseRetainedResources(). We need something that works by default.

br3aker commented 3 years ago

@antonfirsov yeah, I understand the point about "better memory handling by default"; the contextual batching API idea was more of a feature around the new built-in allocator. The current API is a bit misleading with creating a new allocator and then manually calling ReleaseRetainedResources(), but that's a discussion for another day - sorry for the confusion in this already complicated topic.

JimBobSquarePants commented 3 years ago

Over that, we utilize a custom pool that provides uniform 2MB buffers. We then build the large discontiguous buffers aka. MemoryGroups out of these 2MB subbuffers.

@antonfirsov Would the 2MB uniform limit have any effect on our ability to work with wrapped buffers, e.g. SwapOrCopy? My memory is hazy there.

antonfirsov commented 3 years ago

SwapOrCopy is now implemented at the MemoryGroup level, so it should work fine:

https://github.com/SixLabors/ImageSharp/blob/97dde7fb2f0ac97b5b1f9e6b0555a53d2e270c46/src/ImageSharp/Memory/DiscontiguousBuffers/MemoryGroup%7BT%7D.cs#L179-L198

saucecontrol commented 3 years ago

I agree it would be much better to allow trimming but without completely reimplementing TlsOverPerCoreLockedStacksArrayPool that would be impossible.

This would seem to be the way in the short term. The way it trims in response to Gen2GcCallback is hacky but clever, and well-tested. You'd want to simplify it to not hold a bucket per size per thread, though, since we're talking about larger buffers. Maybe some hybrid between the BCL's TlsOverPerCoreLockedStacksArrayPool and ConfigurableArrayPool.

I think regardless, you'll end up with times when you need to allocate something over the pool limit, and falling back to unmanaged memory (with a SafeHandle to protect against failure to Dispose) would be a big improvement there. Those extra-large allocations living until the next full GC can be a killer.

antonfirsov commented 3 years ago

GC.GetGCMemoryInfo() doesn't look like a reliable thing: https://github.com/dotnet/runtime/issues/55126

We should not use it; instead, we should always trim the pool by a certain percentage on every gen2 GC.

JimBobSquarePants commented 3 years ago

Subscribed! Very curious to see what the issue is.

Sergio0694 commented 3 years ago

Very late to the party (super interesting conversation, by the way!), just thought I'd throw this out there too, as it seems somewhat relevant to the new issue Anton opened, and potentially useful:

https://github.com/dotnet/runtime/issues/53895

This is essentially meant to be a more reliable and less hacky way to get memory-pressure-based callbacks than what is achievable today with the Gen2Callback trick that Clint mentioned 😄

antonfirsov commented 3 years ago

@Sergio0694 yeah, seen that, cool stuff, thanks a lot for pushing it!

antonfirsov commented 3 years ago

I made some progress with a prototype on a branch. As a next step, I need to start some benchmarking to determine optimal parameters; however, I'm not sure what pool sizes are still considered "sane".

It could be anything from a few megabytes up to several gigabytes on many-core systems. Note that https://github.com/dotnet/runtime/pull/55621 went very aggressive for ArrayPool<T>.Shared.

Maybe we could define it as a percentage of TotalAvailableMemoryBytes? (However, that won't automatically work well with things like docker's --memory switch.)

JimBobSquarePants commented 3 years ago

Wow, the limit has been removed entirely!

So what's your current plan then? An admittedly very quick look at your branch reveals what seem to be 3 pools now:

  1. ArrayPool<byte>.Shared <= 1MB
  2. UniformByteArrayPool - which doesn't appear to do any cleanup? <= 48 MB
  3. UnmanagedMemoryAllocator > 48MB

Isn't using the UniformByteArrayPool in this manner going to lead to the same issue we are seeing now, where we hold on to memory long after it is useful?

I feel like the DefaultMemoryAllocator is currently offering too many options in its constructor, and users will find them confusing (I find the two contiguous options confusing, and I know how the library works). If someone wants to wrap memory or cares about contiguous buffer length, perhaps they should simply switch to the SimpleGcMemoryAllocator or roll their own?

antonfirsov commented 3 years ago

@JimBobSquarePants it's still in a prototype state. Trimming is not yet implemented; first I want to figure out the optimal parameters for the pool size and other things. The current WIP default is 256 MB; 48 MB is too low for stress cases.

I think we should hide the implementing type (DefaultMemoryAllocator or whatever) and make the public API implementation-agnostic this time:

public class MemoryAllocator
{
    public static MemoryAllocator CreateDefault(
        int maxContiguousBlockSizeInBytes = DefaultContiguousBlockSizeInBytes,
        int totalPoolSizeInBytes = DefaultTotalPoolSizeInBytes);
}

We can introduce utilities to help interop use cases, which we will likely break by scaling down the default contiguous buffer size (see #1307 and #1258):

public static class SizeCalculator
{
    public static int GetImageSizeOf<T>(int width, int height);
}

// Customize the allocator for GPU interop use cases:

Configuration.Default.MemoryAllocator = MemoryAllocator.CreateDefault(
    maxContiguousBlockSizeInBytes: SizeCalculator.GetImageSizeOf<Rgba32>(4096, 4096));

JimBobSquarePants commented 3 years ago

Yep. I like all these ideas!

antonfirsov commented 3 years ago

@JimBobSquarePants @saucecontrol I was already at the final steps of polishing my implementation when I discovered an issue with the UniformByteArrayPool approach. It looks like the GC is unable to decommit its LOH pages, despite trimming UniformByteArrayPool a couple of times (or even clearing it entirely with a call to .ReleaseRetainedResources()), if the pages are even slightly interleaved with other LOH allocations (typically coming from ArrayPool.Shared).

The following graph shows the committed memory for an experiment which attempts to free all the memory at the end:

[graph]

For comparison here is the same experiment, with a slight alteration: setting GCSettings.LargeObjectHeapCompactionMode = GCLargeObjectHeapCompactionMode.CompactOnce before GC.Collect(), showing that after compacting the LOH the memory is decommitted:

[graph]

What to do now?

We can do a better job by getting rid of the GC for large buffers entirely and refactoring UniformByteArrayPool into a pool of unmanaged buffers, but this could possibly prolong the PR by a couple of weeks.

To deliver something to users in the meanwhile, we can also consider PR-ing an unmanaged allocator without any pooling. From what I see now in large-scale stress benchmarks, UniformByteArrayPool adds about 5% to the throughput in comparison to a non-pooling unmanaged allocator, but a non-pooling unmanaged allocator already has 9% better throughput than our current ArrayPoolMemoryAllocator, and much better memory characteristics.

JimBobSquarePants commented 3 years ago

Oh wow! Interesting and slightly disappointing result. I'm glad you've shown your level of thoroughness. I would have likely missed something like this.

A few more weeks is no problem. It gives us time to review and merge the WebP code and fix a few more issues. Plus I'm trying to focus on Fonts and Drawing right now to get them closer to RC status.

antonfirsov commented 3 years ago

@jaddie there is a WIP PR #1730, and an API proposal in #1739; discussions have shifted there. The required change is so fundamental that we decided to do a major version bump, so our next release containing the memory fix will be ImageSharp 2.0.

The memory PR is expected to be finished within 1 or 2 months (not that much work is left, but I'm doing it in my free time); however, we also want to include WebP in that release, which may prolong things too. Hopefully we will still be able to deliver 2.0 before the end of 2021.

If you are considering helping: I need users with production-like environments to test and validate #1730. Let me know if you are interested.

cshung commented 3 years ago

@antonfirsov With regard to the issue that the LOH cannot decommit memory, have you tried the GCSettings.LargeObjectHeapCompactionMode property?

antonfirsov commented 3 years ago

Yes, the second graph in https://github.com/SixLabors/ImageSharp/issues/1596#issuecomment-890963976 was made with GCSettings.LargeObjectHeapCompactionMode = GCLargeObjectHeapCompactionMode.CompactOnce set at the end.

The problem is that I'm not sure it's good practice to touch that property from library code. The compaction comes with extra GC cost, unexpected by the user, and we don't know when / how often to compact. I'm not excluding that it may be a lesser evil in our case than going unmanaged and sacrificing the simplicity of our pixel manipulation APIs as a result (see #1739), but I can't commit to such a decision without consulting a GC expert.

@cshung if you think you have some time to chat about this, let me know, I would really appreciate the help!