Closed antonfirsov closed 2 years ago
Idea 2 is interesting. I wonder, though, if you can manage discontiguous buffers, is it much of a stretch from there to go fully streamed?
For the record, I also hate idea 4, but that may be necessary if you really do have to materialize large images and care at all about 32-bit hosts. The machine I ran my tests on was only a quad core as well. I wonder what would happen on a 16+ core machine... I reckon even in 64-bit you'd start paging.
I wonder, though, if you can manage discontiguous buffers, is it much of a stretch from there to go fully streamed?
Implementing the ability to have discontiguous buffers was relatively easy, since our processing code usually goes row by row. We just had to upgrade Buffer2D.GetRowSpan(y)
so it can take the row from different buffers (a buffer boundary always falls between rows). Going fully streamed needs much bigger architectural changes; I would rather start with the "decode into target size" work first, and see where we can get from there.
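The row lookup described above can be sketched like this (a hypothetical minimal illustration with made-up names, not ImageSharp's actual `Buffer2D`/`MemoryGroup` code). Because a buffer boundary always falls between rows, fetching a row is just a divide/modulo pair:

```csharp
using System;

// Sketch: a 2D pixel buffer backed by multiple fixed-size blocks,
// where each block holds a whole number of rows.
class RowMajorBufferGroup<T>
{
    private readonly T[][] blocks;
    private readonly int width;        // elements per row
    private readonly int rowsPerBlock;

    public RowMajorBufferGroup(int width, int height, int rowsPerBlock)
    {
        this.width = width;
        this.rowsPerBlock = rowsPerBlock;
        int blockCount = (height + rowsPerBlock - 1) / rowsPerBlock;
        this.blocks = new T[blockCount][];
        for (int i = 0; i < blockCount; i++)
        {
            int rows = Math.Min(rowsPerBlock, height - (i * rowsPerBlock));
            this.blocks[i] = new T[rows * width];
        }
    }

    public Span<T> GetRowSpan(int y)
    {
        // Pick the block containing row y, then slice the row out of it.
        int block = y / this.rowsPerBlock;
        int rowInBlock = y % this.rowsPerBlock;
        return this.blocks[block].AsSpan(rowInBlock * this.width, this.width);
    }
}
```

Row-by-row processing code keeps working unchanged against such a group, since it only ever sees one contiguous row span at a time.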
Currently we have a very high limit before images are split across different buffers; what I'm considering is to check what happens if we aggressively lower that limit. I'm unable to assess whether it would make things worse or better.
but that may be necessary if you really do have to materialize large images and care at all about 32-bit hosts.
That's why I want to avoid materializing them :) Will open an issue for tracking "decode into target size" stuff tomorrow.
The machine I ran my tests on was only a quad core as well. I wonder what would happen on a 16+ core machine...
I was wondering why you are using the basic Parallel.For
overload that scales parallelism with the machine's CPU count. Is this a way of modeling typical server-side multi-threaded stress?
I'm unable to assess whether it would make things worse or better.
I don't think it would make things worse. If I understand right, your current mode of processing is to apply each transform to the entire image before moving on to the next? If so, you're kind of thrashing the cache at each stage, so it wouldn't matter whether the memory is contiguous or not. I'm not sure it would make things better either, though. You might want to look at your LOH fragmentation under those high memory conditions and see whether being able to stuff a smaller buffer into an LOH hole would gain you anything. Switching to unmanaged buffers may help a bit there, but heap fragmentation can be a problem in unmanaged as well.
Of course being able to aggregate small buffers in place of a single larger one would be of benefit, provided you don't need them at the same time. I assume if that works for RecyclableMemoryStream, it should work for ImageSharp 🤷♂️
Is this a way of modeling typical server-side multi-threaded stress?
I must admit I didn't put that much thought into it. But yeah, the point was to make sure there were enough threads running to check whether the libraries were capable of true linear scale-up. As long as there are free processors to hide work on (as Vips does today -- and ImageSharp did before your windowed convolution implementation), you can't be sure. If you think about a high-load web server scenario, you would assume the managed thread pool would fully occupy all processors, so I want to know how a library will perform in that environment -- and to be sure that a large number of image processing operations doesn't make for a self-DoS.
BTW, I suspect the anomalous CPU measurements I got for Vips running under the profiler might have actually come from an oversubscription problem. Could be the CPU overhead of the profiler was just enough to throw the whole system into chaos and drop the speed by several multiples. I plan on doing some followup testing to see if I can nail that down.
Couple of other interesting stress cases you might want to look at.
1) 3400+ frame GIF: http://i.imgur.com/Xv4UKYL.gif
2) Lots of good giant JPEGs available on the ESO site, like: https://www.eso.org/public/usa/images/eso1719a/
The reason I'm thinking about "discontiguous memory by default" is not cache coherence, but the fact that it may enable more efficient pooling, reducing the GC pressure (or OS allocation traffic in case of unmanaged buffers). With large images it is very likely that the requested buffer will outgrow the pool, deferring to the GC/OS to allocate new giant arrays of non-uniform sizes, which would probably result in heap fragmentation even if we are using unmanaged memory.
I'm thinking of a solution that works as follows: build large buffers from a custom pool (separate from ArrayPool<byte>.Shared), and trim the retained memory on gen2 GCs using the Gen2GcCallback trick seen in TlsOverPerCoreLockedStacksArrayPool. For example we may halve the retained memory before each gen2 GC.

@saucecontrol any thoughts on this particular plan? I wish I could pull in a GC expert to assess it ...
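For reference, the finalizer trick behind that kind of gen2 callback works roughly like this (a hedged sketch modeled on the runtime's internal `Gen2GcCallback` helper, not the actual BCL source): a finalizable object resurrects itself each time its callback runs, so once it has been promoted to gen2, its finalizer effectively fires around each gen2 collection.

```csharp
using System;

// Sketch: after surviving a couple of collections the instance lives in gen2,
// so its finalizer (run on the finalizer thread) approximates a "gen2 GC
// happened" notification.
internal sealed class Gen2GcCallback
{
    private readonly Func<bool> callback;

    private Gen2GcCallback(Func<bool> callback) => this.callback = callback;

    // The instance is deliberately dropped; only finalization keeps it alive.
    public static void Register(Func<bool> callback) => _ = new Gen2GcCallback(callback);

    ~Gen2GcCallback()
    {
        // Returning true means "keep listening": resurrect for the next GC.
        if (this.callback())
        {
            GC.ReRegisterForFinalize(this);
        }
    }
}
```

A pool could then call something like `Gen2GcCallback.Register(() => { TrimHalf(); return true; })` (where `TrimHalf` is a hypothetical method releasing half the retained buffers) to implement the halving heuristic described above.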
That plan looks sound to me, but I'm no GC expert. I do expect that inability to grab a contiguous large block of memory due to LOH (or process heap) fragmentation is at play in the 32-bit OOM scenario, but it would be good to verify that before diving in to the work if possible.
It seems like if you end up doing the work to completely manage the allocations, you may as well go straight to unmanaged memory so that when you say to free something you don't have to then wait for GC to comply and then do its LOH compaction and all that. Plus, you'll probably always have to deal with some allocation requests that will be over whatever pooling threshold you set, and you'll want those to go straight to unmanaged so they are as short-lived as possible.
you may as well go straight to unmanaged memory so that when you say to free something you don't have to then wait for GC to comply and then do its LOH compaction and all that
The problem is that I don't really know when to free up memory. With LOH + the Gen2GcCallback
trick I can at least rely on GC heuristics for this. (Otherwise it feels like implementing my own GC.)
@JimBobSquarePants it just allocates / returns unmanaged memory without pooling. Maybe it's a good enough (or even preferable) approach with unmanaged memory, not sure.
My main concern about unmanaged memory is that I would be hesitant to use it as a default, because of users who may implicitly depend on not disposing Image
. An update would introduce a reliability / security issue for them instead of their existing performance issue.
Instead of holding the raw IntPtr
as a member of the MemoryManager<T>
, it should hold a SafeHandle
wrapping it. Something like:
using System;
using System.Runtime.InteropServices;

internal sealed class SafeHGlobalHandle : SafeHandle
{
    private readonly int byteCount;

    public override bool IsInvalid => handle == IntPtr.Zero;

    public SafeHGlobalHandle(int size) : base(IntPtr.Zero, true)
    {
        SetHandle(Marshal.AllocHGlobal(size));

        // Tell the GC about the unmanaged allocation so it can factor it
        // into its collection heuristics.
        GC.AddMemoryPressure(byteCount = size);
    }

    protected override bool ReleaseHandle()
    {
        if (IsInvalid)
            return false;

        Marshal.FreeHGlobal(handle);
        GC.RemoveMemoryPressure(byteCount);
        handle = IntPtr.Zero;
        return true;
    }
}
Since SafeHandle
has a finalizer, it won't leak, but you can call Dispose
to free it immediately if the MemoryManager<T>
is Disposed properly.
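To connect the dots, a `MemoryManager<T>` over such a handle might look roughly like this (an illustrative sketch reusing the `SafeHGlobalHandle` class above; the name `HGlobalMemoryManager` and the details are hypothetical):

```csharp
using System;
using System.Buffers;

// Sketch: exposes an unmanaged allocation as Memory<T>. Disposing the manager
// frees the memory immediately; otherwise the SafeHandle finalizer cleans up.
internal sealed unsafe class HGlobalMemoryManager<T> : MemoryManager<T>
    where T : unmanaged
{
    private readonly SafeHGlobalHandle handle;
    private readonly int length; // in elements

    public HGlobalMemoryManager(int length)
    {
        this.length = length;
        this.handle = new SafeHGlobalHandle(length * sizeof(T));
    }

    public override Span<T> GetSpan() =>
        new Span<T>((void*)this.handle.DangerousGetHandle(), this.length);

    public override MemoryHandle Pin(int elementIndex = 0) =>
        new MemoryHandle((T*)this.handle.DangerousGetHandle() + elementIndex);

    public override void Unpin()
    {
        // Nothing to do: unmanaged memory is never moved by the GC.
    }

    protected override void Dispose(bool disposing) => this.handle.Dispose();
}
```

Consumers would grab `manager.Memory` as usual and dispose the manager (or let the handle's finalizer run) when done.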
It might be worth implementing something like that for any allocations larger than your max pool buffer size just because you know those will happen and will have a large GC cost. It's also a lot easier to trial than the segmented buffer idea.
FYI https://github.com/dotnet/runtime/issues/52098 is tracking some possible scenarios to improve in ArrayPool/GC interaction. I think Jan's scenarios 1 and 3 (and maybe 2 as well) are applicable to ImageSharp.
There are so many things I hate about ArrayPool, especially ArrayPool<T>.Shared
, which is library-level public shared global state, and as such something that shouldn't exist at all IMO.
All the magic ArrayPool does could arguably be handled by the GC itself for plain new T[]
allocations. (Edit: OK, it was mostly ranting; this approach would require an API like GC.UnsafeReleaseArray(...)
, but AFAIK array pooling is much less common in Java for example, I assume the GC does a better job there.)
Hi everybody! Have a couple of thoughts on this project I'd like to share. Thanks for such a good topic on memory performance.
First of all, the current state of pooling already supports discontiguous buffers. It's just hidden behind the MemoryGroup<T>.Allocate(...)
call:
First it checks if a single stride can fit in a single contiguous buffer: https://github.com/SixLabors/ImageSharp/blob/a8cb71158493096121439d937b7e79aa5ea2727a/src/ImageSharp/Memory/DiscontiguousBuffers/MemoryGroup%7BT%7D.cs#L85
Then it does strange things: https://github.com/SixLabors/ImageSharp/blob/a8cb71158493096121439d937b7e79aa5ea2727a/src/ImageSharp/Memory/DiscontiguousBuffers/MemoryGroup%7BT%7D.cs#L99-L107
Details aside, this alloc call either throws or pools a single contiguous buffer each time. So here goes the first problem: most of the time the pool is stressed by huge arrays from the buckets at the high end of the indices while the low indices stay empty, which makes the maxArraysPerBucket
config value completely irrelevant.
Edit: this can't be fixed by simply rewriting the MemoryGroup<T>.Allocate(...)
call; at least the resizing processor assumes that the entire underlying Buffer2D of pixel data is a single contiguous buffer and throws a specific exception. I haven't checked other processors, but I guess it might also be a problem somewhere else.
Edit: huge mistake in my calculations and in the hardcoded upper bound of the pool max size (it's hex, not dec); this still applies, but only for certain pool setups, see the post below. But there's a far more complicated problem: pooling doesn't work in certain situations. This is solely due to the current implementation of the pool in .NET, which silently clamps the maximum size of the array:
const int MinimumArrayLength = 0x10, MaximumArrayLength = 0x40000000;
if (maxArrayLength > MaximumArrayLength)
{
maxArrayLength = MaximumArrayLength;
}
Which leads to unexpected LOH allocations when working with images up to ~40mb of pixel data and that isn't very rare:
rgb24 1080p jpeg is 24 * 1920 * 1080 = ~49mb
of decoded pixel data.
The bee heads test completely ignores pooling because it consists of rgb24 jpegs with at least 3000x3000 pixels.
Parallel.ForEach over the bee heads can skyrocket memory usage (I've been counting gigabytes of working set during execution) with frequent GC(2) collections.
You've got a couple of translation errors in your analysis. First, the hard-coded MaximumArrayLength = 0x40000000
isn't 40MB -- it's hex, so that's 1GB. Second, RGB24 is 24 bits, or 3 bytes, per pixel. So a decoded 1920x1080 RGB image is 6MB.
But you're right about the LOH allocations. The default memory pool is limited to 24MB per array, so images over ~8 megapixels will cause allocations, and there are definitely images in the bee heads set that exceed that limit.
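The corrected arithmetic, spelled out (the 24MB per-array pool limit is the figure cited above):

```csharp
using System;

long maximumArrayLength = 0x40000000;             // hex: 1 GiB, not 40 MB
long fullHdRgb24 = 1920L * 1080 * 3;              // RGB24 = 3 bytes/pixel => ~6 MB
long defaultPoolLimit = 24L * 1024 * 1024;        // ~24 MB per-array pool limit
long maxPooledRgb24Pixels = defaultPoolLimit / 3; // ~8.3 megapixels fit under the limit

Console.WriteLine(maximumArrayLength); // 1073741824
Console.WriteLine(fullHdRgb24);        // 6220800
```

So any RGB24 image over roughly 8 megapixels falls out of the default pool and lands on the LOH, which matches the bee heads observations.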
The challenge is in achieving a balance between avoiding GC allocations and holding a high steady-state memory level when large arrays get pooled.
Oh gosh, you're right, math was not on my side tonight, thanks for pointing out.
The worst part about the built-in .NET pool is that it was developed for the general use case. Exponential growth of internal buffers would waste a lot of space in certain conditions. Empty buckets are not a big deal, but they just sit there for no reason. A pixel buffer consisting of several unmanaged chunks is a much better choice, plus custom alignment and more reliable memory deallocation.
Although GC.AddMemoryPressure
looks counterproductive in this particular situation -- we don't want to affect the GC at all while using the pool.
IMO all allocations should be divided into 2 categories: pixel data, which outlives any other allocation type in the context of this library, and everything else. The main memory consumers are pixel buffers, which cause massive hits on both the GC heaps and a theoretical custom memory pool based on unmanaged memory -- it'll be much harder to support large buffers while also allowing pooling of small buffers of integers/small structs.
This way we can delegate those small allocations to the GC (thanks to tiers this won't really affect its performance) while concentrating on storing pixel buffers with proper alignment.
Another interesting concept mentioned by @antonfirsov - async/await for memory pooling. This approach could facilitate environments where memory is crucial. But it might be really tricky not to fall into the deadlock when there won't be enough memory for any of the awaiters.
Anyway, I don't think that avoiding locks in the pooling logic is a good solution -- some cases simply don't want to work with the TPL, and calling something like allocator.GetMemoryAsync(size).Result
looks ugly and performs even worse than it looks.
Asynchronous pooling could use a separate memory pool so it won't interfere with synchronous pooling. And I still don't think it's worth spending time implementing this atm; better to concentrate on the basic pool implementation.
@br3aker reacting to https://github.com/SixLabors/ImageSharp/issues/1596#issuecomment-837742728 first:
this alloc call either throws or pools a single contiguous buffer each time.
at least the resizing processor assumes that the entire underlying Buffer2D of pixel data is a single contiguous buffer and throws a specific exception
If any of these are true, it's a bug. You should be able to limit the maximum contiguous buffer size by using the full ArrayPoolMemoryAllocator
constructor and setting bufferCapacityInBytes
:
~You can pass the default values for the rest of the parameters.~ -- Except bucket sizes, you need to increase them significantly to make sure the pool is not drained.
Note that the default for bufferCapacityInBytes
is a very high value today (the equivalent of ~130 megapixels of RGBA32), so this may explain why you see contiguous buffers without customization.
I believe lowering the capacity + setting a higher bucket size should help with the biggest "pooling doesn't work in certain situations" problem. If you have a repro proving this fails on the load+resize+save path (= a contiguous buffer is allocated or an exception is thrown), can you please open a new separate issue for it?
Regarding the rest of the comments:
I believe all mentioned issues can be addressed either by a very basic unmanaged allocator or by the "smart" allocator described in https://github.com/SixLabors/ImageSharp/issues/1596#issuecomment-821146358. None of those would use the BCL ArrayPool<T>
for large buffers, and therefore won't suffer from the asymmetric bucket usage problem.
Main memory consumers are pixel buffers
I wish this were the case, but it is not: #1597. This is our top-priority memory issue now, affecting the most common load+resize+save path. Even if we manage to fully optimize it, the "fancy" (blur, sharpen etc.) processors may still keep allocating all kinds of large Buffer2D<T>
-s of Width * Height
size. I don't see a conceptual difference between the pixel buffers backing Image<T>
and other internal buffers.
Oh yeah, I haven't actually checked all the processors. What I meant is that the allocator shouldn't be bothered with small allocations like here: https://github.com/SixLabors/ImageSharp/blob/a8cb71158493096121439d937b7e79aa5ea2727a/src/ImageSharp/Formats/Png/Zlib/DeflaterHuffman.cs#L497-L507
Those buffers are not actually big: https://github.com/SixLabors/ImageSharp/blob/a8cb71158493096121439d937b7e79aa5ea2727a/src/ImageSharp/Formats/Png/Zlib/DeflaterHuffman.cs#L64-L66
Without proper investigation I thought several places had this kind of small allocation, but I was wrong; that's the only place I could find with an allocator.Allocate
call for small buffers (compared to other allocations). My bad, sorry for the noise.
https://github.com/SixLabors/ImageSharp/issues/1596#issuecomment-838175639 I changed the bucket size prior to experimenting with non-contiguous buffers, sorry for not mentioning that. Will try to reproduce and write a proper issue/comment today.
Hello,
I saw this issue pop in my twitter feed and I was curious about it, I hope you'll forgive me for butting in 👀
I didn't see it mentioned, but at least in the scenario of very large pixel buffers, an option could be to rely on MemoryMappedFile
for allocating unmanaged memory in the form of raw memory pages.
As an alternative to Marshal.AllocHGlobal
, this would avoid fragmenting the native heap (used by AllocHGlobal
) and the LOH, at least in a 64 bit process. In a 32 bit process, address space fragmentation could still be a problem.
Similarly to AllocHGlobal
/FreeHGlobal
, the memory you get from MemoryMappedFile
is allocated and deallocated immediately.
However, the memory pages used by a memory mapped file should be freed immediately, whereas a heap/LOH based approach could keep the pages in memory (e.g. because they are in the middle of the heap or still partially used).
⚠ One possible drawback could be a performance hit associated with allocating/deallocating the memory pages for the memory mapped file (this would need to be benchmarked), but this should still be worthwhile for very large buffers, as a heap-based mechanism would also need to allocate new pages.
Of course, this way of allocating memory could also be used with pooling and discontiguous buffers. What it does at the core is give you more control over how much physical memory is allocated, or at least reserved in your address space. It could be worthwhile if you were to choose to pool your own unmanaged buffers.
For the sake of completeness, you can find a rough example of usage of MemoryMappedFile
to allocate private memory here.
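A minimal sketch of that approach (an anonymous, pagefile-backed map; the 1 MiB capacity is illustrative):

```csharp
using System;
using System.IO.MemoryMappedFiles;

// Allocate 1 MiB of private, page-backed memory. The pages come straight from
// the OS and are returned when the map is disposed, bypassing both the
// process heap (AllocHGlobal) and the LOH. mapName must be null on Unix.
using (MemoryMappedFile mmf = MemoryMappedFile.CreateNew(mapName: null, capacity: 1024 * 1024))
using (MemoryMappedViewAccessor view = mmf.CreateViewAccessor())
{
    view.Write(0, (byte)42);
    Console.WriteLine(view.ReadByte(0)); // 42
}
```

For zero-copy pixel access one would typically acquire a pointer from the view's `SafeMemoryMappedViewHandle` and wrap it in a `MemoryManager<T>`, much like the unmanaged-memory approach discussed earlier.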
Regarding discontiguous buffers: if we went that route we would likely not be able to pin anything. There's a couple of places we currently do so.
@kkokosa @Maoni0 My sincerest apologies for the name drop but since you are both absolute experts in memory management and the GC it would be wonderful if we could have your opinion on the best design approach.
@JimBobSquarePants in the current design MemoryAllocator.Allocate<T>
will always allocate a contiguous buffer, while GetBufferCapacityInBytes
works as a hint for building large (potentially discontiguous) 2D buffers. We never want to pin the contents of a Buffer2D
, we should always process it row span by row span. These buffers are responsible for the majority of our allocations; we need to find a good strategy for managing them. This shouldn't prevent us from enforcing contiguous buffers in special cases.
@GoldenCrystal using MemoryMapped files is a good idea, but it's the most complex option of all; it needs a lot of research and has to be implemented with care. If possible, I would prefer to go with managed LOH arrays, because anything unmanaged is a breaking change (an application leaking IDisposable
Image
-s that updates to a new version of the library would have a security risk introduced by actually starting to leak unmanaged memory).
In case anyone chooses to chime in, but the thread is TLDR:
I'm thinking about a custom pool of uniform (~2MB) LOH arrays that can be used to build larger discontiguous buffers to back ImageSharp's Image<TPixel>
and other very large internal temporary "2D" buffers.
Q1: Is this a viable strategy to fight fragmentation and OOMs in memory constrained environments, or are there some GC behaviors that make this inefficient? Q2: What is the best way to detect high memory pressure and release the pooled buffers so GC can clean them up? (Preferably cross platform, or at least something that doesn't need .NET 5/6)
@JimBobSquarePants in the current design
MemoryAllocator.Allocate<T>
will always allocate a contiguous buffer, while GetBufferCapacityInBytes
works as a hint for building large (potentially discontiguous) 2D buffers. We never want to pin the contents of a Buffer2D
, we should always process it row span by row span. These buffers are responsible for the majority of our allocations; we need to find a good strategy for managing them. This shouldn't prevent us from enforcing contiguous buffers in special cases.
Yep but this had me concerned.
Go discontiguous, and build all large buffers from fixed-size blocks of memory, similarly to RecyclableMemoryStream.
Regarding GetBufferCapacityInBytes
: ArrayPoolMemoryAllocator
currently uses this as more than a hint and throws an InvalidMemoryOperationException
if the requested 1D buffer is larger than BufferCapacityInBytes
, which I'm not currently a fan of.
Q1: Is this a viable strategy to fight fragmentation and OOMs in memory constrained environments, or are there some GC behaviors that make this inefficient?
Honestly, the current problems with OutOfMemory exceptions look like a bug rather than real memory management problems. I can't believe that even in parallel tests, 8 (or even 16) cores can spend up to 8gb at peak on such a simple task (source: the bee heads demo by @saucecontrol).
Q2: What is the best way to detect high memory pressure and release the pooled buffers so GC can clean them up? (Preferably cross platform, or at least something that doesn't need .NET 5/6)
One of the simplest yet most 'controllable' things from the user perspective you can do (especially good for bulk processing):
using (var context = new Context())
{
using Image img = Image.Load(..., context);
using Image resized = img.Resize(..., context);
resized.Save(..., context);
}
Unfortunately, this would lead to a lot of public API changes/additions, but it allows manual control over when processing is over and resources can be freed.
The problem with LOH allocations is that the LOH can't trigger GC as reliably as non-LOH allocations do. We have 2 rather rare scenarios of LOH collection:

1. Low available memory for non-LOH allocations -> full GC collection cycle, freeing all memory
2. Low available memory for a new LOH pooling buffer -> full GC collection cycle, as gen2 triggers both gen0 and gen1

(1) won't be caused by pooling if we use LOH buffers; (2) the pool won't allocate new buffers after some time -- it would either use its buffers 100% for some processing or have a lot of free buffers.

What I wanted to state is that LOH collection is a really rare occasion unless triggered manually, which is not something library code can reliably control.
Yes, we can try to force a gen2 GC collection when a pool buffer is returned:
// somewhere in pool class
public void Return(byte[] buffer) {
    // buffer return logic
    usedBuffersCount--;

    // note: cast to double -- with integer division the ratio would always round to 0
    if ((double)usedBuffersCount / allocatedBuffersCount < somePercentage) {
        GC.Collect(2);
    }
}
But this should be done with care as gen2 collection could cause a significant performance hit.
Q2: What is the best way to detect high memory pressure and release the pooled buffers so GC can clean them up? (Preferably cross platform, or at least something that doesn't need .NET 5/6)
GC.GetGCMemoryInfo is .NET Core 3.0+
For older runtimes, no idea currently. Probably a hard limit or configurable value.
There's also a problem with memory usage patterns.
This is an approximation of how load-resize-save pool buffers usage looks like:
This is an approximation of how 'process something once and forget' use case looks like:
Somehow pooling logic should detect both of these cases and should not clear internal buffers in the first case.
I don't think there's any solution other than statistical pooling, which would tweak behaviour at runtime based on how memory was pooled/returned over the lifetime of the entire app. Stats can be gathered either during pool/return calls or at equal quanta of time using a Timer
(or both?).
@br3aker I've been experimenting with an implementation that should handle both scenarios in .NET Core 3.1+
https://github.com/SixLabors/ImageSharp/tree/ad96db757cc9b564bfd6db6d738fb648500e3f38
Check out the various ArrayPoolMemoryAllocator
partials.
I've tried to keep things simple and have taken on board various suggestions, including @saucecontrol's one regarding using the shared pool for small buffers.
@JimBobSquarePants unfortunately, there's simply no way to get MemoryLoadBytes
/ HighMemoryLoadThresholdBytes
on runtimes predating the public GCMemoryInfo
. Maybe with reflection somewhere deep in the private details of the runtimes that info could be gathered, but I doubt that would lead to a good implementation.
Yet another problem with the current implementation: a race condition. Neither Configuration.MemoryAllocator
nor ArrayPoolMemoryAllocator
locks in the rent/return code, which can lead to many different entities renting the same buffer.
Considering all variables (different formats, use cases, image sizes), the chances of this causing problems are very low, but it's a very serious problem nonetheless. And that's where a Timer
comes in -- it also needs synchronization. While it wouldn't result in a race condition, it would result in a huge memory re-allocation:
A slightly better approach is to trim buffers instead of completely flushing them, which is impossible without locking.
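Trimming under a lock might look like this (an illustrative sketch with hypothetical names, not the thread's actual implementation):

```csharp
using System;
using System.Collections.Generic;

// Sketch: drop a fraction of the retained arrays instead of flushing all of
// them, so steady-state workloads keep warm buffers while memory is released.
internal sealed class TrimmablePool
{
    private readonly object sync = new object();
    private readonly Stack<byte[]> free = new Stack<byte[]>();

    public void Return(byte[] buffer)
    {
        lock (this.sync)
        {
            this.free.Push(buffer);
        }
    }

    // Returns how many arrays were dropped; they become garbage for the next gen2 GC.
    public int Trim(double fractionToDrop = 0.5)
    {
        lock (this.sync)
        {
            int toDrop = (int)(this.free.Count * fractionToDrop);
            for (int i = 0; i < toDrop; i++)
            {
                this.free.Pop();
            }

            return toDrop;
        }
    }
}
```

Repeated trims decay the retained memory geometrically, which plays well with the "halve on each gen2 GC" heuristic discussed earlier in the thread.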
Even with that in mind, the Timer
approach is bad because flushed buffers would not immediately cause a gen2 collection; it would happen some unknown amount of time after the timer callback (which can be called multiple times before an actual gen2 collection occurs). A pre-gen2-collection callback is far preferable to a time-based callback. Right before a gen2 collection we know 2 things:
*We actually also know that all managed code is suspended at that moment but right now I can't see if this info can help us.
(a) memory should be given out with caution, ideally with some background statistics about the app runtime, which is not an option for a general-use-case library. But we can gather stats during runtime; yes, it would fail to predict optimal numbers for some of the execution time, but in the long run it should be good.
I have some thoughts about theoretical implementation but I need some time to think and create a more 'readable' proposal so it won't look like a giant piece of text.
Yet another problem with the current implementation: a race condition. Neither Configuration.MemoryAllocator nor ArrayPoolMemoryAllocator locks in the rent/return code, which can lead to many different entities renting the same buffer.
@br3aker They're the same thing if set, which is the default behavior. Locking during Rent/Return occurs in the underlying ConfigurableArrayPool<T>
.
I agree it would be much better to allow trimming but without completely reimplementing TlsOverPerCoreLockedStacksArrayPool<T>
that would be impossible.
@JimBobSquarePants I must be blind, thanks for clarifying about locks, didn't see that _lock.Enter(...)
call at first glance.
Yep, new solution would require entirely new pool implementation which I'm willing to try in the upcoming week(s). I'll open a draft if anything works out.
@br3aker on .NET Core there is a trick to detect if a gen2 GC is happening:
What I would do is to trim some of the retained buffers on every gen2 GC (eg. 1/2 or 3/4 or similar). I think this would handle the cases you pointed out in https://github.com/SixLabors/ImageSharp/issues/1596#issuecomment-860081499 without complex statistics.
Note that the contextual batching API you referred to in https://github.com/SixLabors/ImageSharp/issues/1596#issuecomment-860066580 already exists: you can create and pass around a Configuration
with its own MemoryAllocator
and call configuration.MemoryAllocator.ReleaseRetainedResources()
. We need something that works by default.
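For reference, the existing pattern looks roughly like this (a sketch against the ImageSharp 1.x API as referenced in this thread; exact overload names may differ between versions):

```csharp
using SixLabors.ImageSharp;
using SixLabors.ImageSharp.Memory;
using SixLabors.ImageSharp.Processing;

// A private Configuration with its own allocator scopes the pooled memory
// to this batch instead of the global default.
var configuration = Configuration.Default.Clone();
configuration.MemoryAllocator = ArrayPoolMemoryAllocator.CreateDefault();

using (var image = Image.Load(configuration, "input.jpg"))
{
    image.Mutate(x => x.Resize(image.Width / 2, image.Height / 2));
    image.Save("output.jpg");
}

// When the batch is done, hand the retained buffers back to the GC.
configuration.MemoryAllocator.ReleaseRetainedResources();
```

The point being made above is that this opt-in pattern already exists; the open question is good behavior for users who never touch `Configuration` at all.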
@antonfirsov yeah, I understand about "better memory handling by default"; the contextual batching API idea was more of a feature around the new built-in allocator. The current API is a bit misleading with creating a new allocator and then manually calling ReleaseRetainedResources()
, but that's a discussion for another day; sorry for the confusion in this already complicated topic.
Over that, we utilize a custom pool that provides uniform 2MB buffers. We then build the large discontiguous buffers aka. MemoryGroups out of these 2MB subbuffers.
@antonfirsov Would the 2MB uniform limit have any effect on our ability to work with wrapped buffers, e.g. SwapOrCopy
? My memory is hazy there.
SwapOrCopy
is now implemented at the MemoryGroup
level, so it should work fine:
I agree it would be much better to allow trimming but without completely reimplementing TlsOverPerCoreLockedStacksArrayPool
that would be impossible.
This would seem to be the way in the short term. The way it trims in response to Gen2GcCallback
is hacky but clever and well-tested. You'd want to simplify it to not hold a bucket per size per thread since we're talking about larger buffers, though. Maybe some hybrid between the BCL's TlsOverPerCoreLockedStacksArrayPool
and ConfigurableMemoryPool
.
I think regardless, you'll end up with times you need to allocate something over the pool limit, and falling back to unmanaged memory (with a SafeHandle to protect against failure to Dispose) would be a big improvement there. Those extra large allocations living until the next full GC can be a killer.
GC.GetGCMemoryInfo()
doesn't look like a reliable thing: https://github.com/dotnet/runtime/issues/55126
We should not use it, instead, we should always trim the pool by a certain percentage on every Gen2 GC.
Subscribed! Very curious to see what the issue is.
Very late to the party (super interesting conversation by the way!), just thought I'd throw this out there too as it seems somewhat relevant to the new issue Anton opened, and potentially useful?
https://github.com/dotnet/runtime/issues/53895
This is essentially meant to be a more reliable and less hacky way to get memory-pressure-based callbacks than what is achievable today by using the Gen2GcCallback
trick that Clint mentioned 😄
@Sergio0694 yeah, seen that, cool stuff, thanks a lot for pushing it!
I made some progress with a prototype on a branch. As a next step, I need to start some benchmarking to determine optimal parameters; however, I'm not sure what pool sizes are still considered "sane".
It can be any value between a few megabytes, up to several gigabytes on many-core systems. Note that https://github.com/dotnet/runtime/pull/55621 went very aggressive for ArrayPool<T>.Shared
.
Maybe we could define it as a percentage of TotalAvailableMemoryBytes
? (However, it won't work well automatically with stuff like docker's --memory
switch.)
Wow, the limit has been removed entirely!
So what's your current plan then? An admittedly very quick look at your branch yields what seems to be 3 pools now:

1) ArrayPool<byte>.Shared for buffers <= 1MB
2) UniformByteArrayPool (which doesn't appear to do any cleanup?) for buffers <= 48 MB
3) UnmanagedMemoryAllocator for buffers > 48MB

Isn't using the UniformByteArrayPool in this manner going to lead to the same issue as we are seeing, where we hold on to memory long after it is useful?
I feel like the DefaultMemoryAllocator
is currently offering too many options in its constructor and that users will find them confusing (I find the two contiguous options confusing, and I know how the library works). If someone wants to wrap memory or cares about contiguous buffer length, perhaps they should simply switch to the SimpleGcMemoryAllocator
or roll their own?
@JimBobSquarePants it's still in a prototype state. Trimming is not yet implemented, first I want to figure out the optimal parameter sizes for the pool size and other things. Current WIP default is 256 MB, 48 MB is too low for stress cases.
I think we should hide the implementing type (DefaultMemoryAllocator
or whatever) and make the public API implementation-agnostic this time:
public class MemoryAllocator
{
public static MemoryAllocator CreateDefault(
int maxContiguousBlockSizeInBytes = DefaultContiguousBlockSizeInBytes,
int totalPoolSizeInBytes = DefaultTotalPoolSizeInBytes);
}
We can introduce utilities to help interop use cases, which we will likely break by scaling down the default contiguous buffer size (see #1307 and #1258):
public static class SizeCalculator
{
    public static int GetImageSizeOf<T>(int width, int height);
}
// Customize the allocator for GPU interop use cases:
Configuration.Default.MemoryAllocator = MemoryAllocator.CreateDefault(
maxContiguousBlockSizeInBytes: SizeCalculator.GetImageSizeOf<Rgba32>(4096, 4096));
Yep. I like all these ideas!
@JimBobSquarePants @saucecontrol I was at the final steps of polishing my implementation when I discovered an issue with the UniformByteArrayPool
approach. It looks like the GC is unable to decommit its LOH pages despite trimming UniformByteArrayPool
a couple of times (or even clearing it entirely with a call to .ReleaseRetainedResources()
) if the pages are even slightly interleaved with other LOH allocations (typically coming from ArrayPool.Shared
).
The following graph shows the committed memory for an experiment which attempts to free all the memory in the end:
For comparison here is the same experiment, with a slight alteration: setting GCSettings.LargeObjectHeapCompactionMode = GCLargeObjectHeapCompactionMode.CompactOnce
before GC.Collect()
, showing that after compacting the LOH the memory is decommitted:
We can do a better job by getting rid of the GC for large buffers entirely, refactoring UniformByteArrayPool
into a pool of unmanaged buffers, but this could prolong the PR by a couple of weeks.
To deliver something to the users in the meanwhile, we can also consider PR-ing an unmanaged allocator without any pooling. From what I see now in large-scale stress benchmarks, UniformByteArrayPool
adds about 5% to the throughput in comparison to a non-pooling unmanaged allocator, but a non-pooling unmanaged allocator already has 9% better throughput than our current ArrayPoolMemoryAllocator
, and much better memory characteristics.
Oh wow! Interesting and slightly disappointing result. I'm glad you've shown your level of thoroughness. I would have likely missed something like this.
A few more weeks is no problem. It gives us time to review and merge the WebP code and fix a few more issues. Plus I'm trying to focus on Fonts and Drawing right now to get them closer to RC status.
@jaddie there is a WIP PR #1730, and an API proposal in #1739, discussions shifted there. The required change is so fundamental, that we decided to do a major version bump, so our next release containing the memory fix will be ImageSharp 2.0.
The memory PR is expected to be finished within 1 or 2 months (not that much work left, but I'm doing it in my free time), however we also want to include WebP in that release, which may prolong things too. Hopefully we will be still able to deliver 2.0 before the end of 2021.
If you are considering helping: I need users with production-like environments to test and validate #1730. Let me know if you are interested.
@antonfirsov With regard to the issue that LOH cannot decommit memory, have you tried the GCSettings.LargeObjectHeapCompactionMode Property?
Yes, the second graph in https://github.com/SixLabors/ImageSharp/issues/1596#issuecomment-890963976 is made with GCSettings.LargeObjectHeapCompactionMode = GCLargeObjectHeapCompactionMode.CompactOnce
set in the end.
The problem is that I'm not sure if it's a good practice to touch that property from library code. The compaction comes with extra GC cost, unexpected by the user, and we don't know when / how often to compact. I'm not excluding that it can be a lesser evil in our case than going unmanaged and sacrificing the simplicity of our pixel manipulation APIs as a result (see #1739), but I can't commit to such a decision without consulting a GC expert.
@cshung if you think you have some time to chat about this, let me know, I would really appreciate the help!
One way to address the concerns in #1590 is to come up with a new MemoryAllocator that meets the following two requirements even under very high load:

(A) Steady allocation patterns, instead of GC fluctuations
(B) A lower amount of memory being retained, at least "after some time"

Some ideas to explore:

Point 1. seems to be very simple to prototype. We need an allocator that uses Marshal.AllocHGlobal over some threshold and an ArrayPool below it, and see how the memory timeline goes with the bee heads MemoryStress benchmark in comparison to ArrayPoolMemoryAllocator.

@saucecontrol any thoughts or further ideas? (Especially on point 2.)