dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

list of usage examples that illustrate current problems with ArrayPool and what the desired behavior should be #52098

Open Maoni0 opened 3 years ago

Maoni0 commented 3 years ago

this issue is for folks who want ArrayPool (or other memory pools) to be improved. please explain it with usage examples. a usage example should consist of -

- pseudo code snippet to illustrate your scenario
- description of current problems
- description of your desired behavior, eg, your pool is obviously part of the memory usage in a process; do you want your pool to have an individual threshold that you get notified about when it's exceeded so you can do something about it, or do you want to go with a default size and leave it up to whenever the GC decides to collect
- any other perf implications you want to mention, eg, does your pool only manage buffers that will need to be pinned; what kind of objects do you want to have in this pool, things like byte[] only or any type

then we have a discussion about what kinds of behavior our pooling should offer, and then we can decide if we need support from the GC and what that would be.

dotnet-issue-labeler[bot] commented 3 years ago

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

jkotas commented 3 years ago

Problem 1: ArrayPool does not ever release pooled memory, unless there is memory pressure.

using System;
using System.Buffers;

DoStuff<int>();
DoStuff<uint>();
DoStuff<long>();
DoStuff<ulong>();
DoStuff<DateTime>();
DoStuff<DateTimeOffset>();

// Let's say the above was done during startup and the rest of the application won't use those ArrayPools again

Console.WriteLine(GC.GetTotalMemory(true));

void DoStuff<T>()
{
    ArrayPool<T>.Shared.Return(ArrayPool<T>.Shared.Rent(1000000));
}

Actual behavior: Array pools are keeping 50MB forever
Expected behavior: The memory is released after it is not used for a while

dotnet/runtime libraries are reluctant to use ArrayPools for anything but byte and char to avoid hitting this issue.
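One way to get the "released after it is not used for a while" behavior is an idle timer layered over a plain cache. This is a hedged sketch, not the BCL implementation; `TrimmingPool`, its `idleTimeout` parameter, and the single-bag cache are all illustrative (timer disposal is omitted for brevity):

```csharp
using System;
using System.Buffers;
using System.Collections.Concurrent;
using System.Threading;

public sealed class TrimmingPool<T>
{
    private readonly ConcurrentBag<T[]> _cache = new();
    private readonly Timer _timer;
    private long _lastUseMs;

    public TrimmingPool(TimeSpan idleTimeout)
    {
        _lastUseMs = Environment.TickCount64;
        _timer = new Timer(_ =>
        {
            long idle = Environment.TickCount64 - Interlocked.Read(ref _lastUseMs);
            if (idle >= (long)idleTimeout.TotalMilliseconds)
                _cache.Clear(); // idle long enough: let the GC reclaim the arrays
        }, null, idleTimeout, idleTimeout);
    }

    public T[] Rent(int minimumLength)
    {
        Interlocked.Exchange(ref _lastUseMs, Environment.TickCount64);
        while (_cache.TryTake(out var array))
            if (array.Length >= minimumLength)
                return array; // too-small cached arrays are simply discarded
        return new T[minimumLength];
    }

    public void Return(T[] array) => _cache.Add(array);
}
```

With ArrayPool's tiered caches the trimming has to be more surgical, but the shape is the same: track last use and let a periodic callback (the shared pool uses a Gen2 GC callback rather than a timer) release idle buffers.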

jkotas commented 3 years ago

Problem 2: Array pool does not pool enough

using System;
using System.Buffers;

for (int i = 0; i < 1000000; i++)
    Nested(100);

Console.WriteLine(GC.CollectionCount(0));

void Nested(int depth)
{
   if (depth == 0) return;

   var a = ArrayPool<byte>.Shared.Rent(10);
   Nested(depth-1);
   ArrayPool<byte>.Shared.Return(a);
}

Expected: 0 or very few GCs
Actual: Hundreds of GCs

We have to watch for using the array pool too much, which would lead to hitting the thresholds. The array pool should dynamically resize based on the application behavior.

jkotas commented 3 years ago

Problem 3: ArrayPool does not work for very large arrays

using System;
using System.Buffers;

for (int i = 0; i < 1000; i++)
   ArrayPool<byte>.Shared.Return(ArrayPool<byte>.Shared.Rent(10_000_000));

Console.WriteLine(GC.CollectionCount(2));

Expected: 0 or very few GCs
Actual: Hundreds of Gen2 GCs

This implementation limitation largely exists to reduce severity of Problem 1.

A side-effect is that code tends to use array pool limits for buffer sizes instead of figuring out what the right buffer size should actually be. Example from this week: https://github.com/dotnet/runtime/pull/49304/files#diff-7721d7d6b3bf50b13337accfc3a3857f235a3901e99c6f0a9ff4c2c5f84c2784R49

jkotas commented 3 years ago

cc @davidfowl

ghost commented 3 years ago

Tagging subscribers to this area: @tannergooding See info in area-owners.md if you want to be subscribed.

Issue Details
this issue is for folks who want ArrayPool (or other memory pools) to be improved. please explain it with usage examples. a usage example should consist of -

- pseudo code snippet to illustrate your scenario
- description of current problems
- description of your desired behavior, eg, your pool is obviously part of the memory usage in a process; do you want your pool to have an individual threshold that you get notified about when it's exceeded so you can do something about it, or do you want to go with a default size and leave it up to whenever the GC decides to collect
- any other perf implications you want to mention, eg, does your pool only manage buffers that will need to be pinned; what kind of objects do you want to have in this pool, things like byte[] only or any type

then we have a discussion about what kinds of behavior our pooling should offer, and then we can decide if we need support from the GC and what that would be.
Author: Maoni0
Assignees: -
Labels: `area-System.Runtime`, `tenet-performance`
Milestone: Future

Maoni0 commented 3 years ago

thanks @jkotas! I will take a look at these hopefully in the next couple of weeks.

antonfirsov commented 3 years ago

An important aspect of Problem 1 (https://github.com/dotnet/runtime/issues/52098#issuecomment-830106074) is that ArrayPools are per-type. A third-party library may unintentionally retain a bunch of memory by using ArrayPool<SomeLargeUniqueStruct>.Shared (this actually happened in an early alpha of ImageSharp).
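The per-type retention cost is easy to make concrete. In this sketch, `Rgba64Wide` is a hypothetical 32-byte struct standing in for SomeLargeUniqueStruct; a single rent-and-return during startup parks tens of MB in a pool nothing else in the process will ever touch:

```csharp
using System;
using System.Buffers;

long before = GC.GetTotalMemory(forceFullCollection: true);

// a dependency rents once during startup and never touches this pool again;
// Rent(1_000_000) rounds up to a 1,048,576-element bucket (~32MB here)
Rgba64Wide[] arr = ArrayPool<Rgba64Wide>.Shared.Rent(1_000_000);
ArrayPool<Rgba64Wide>.Shared.Return(arr);

long after = GC.GetTotalMemory(forceFullCollection: true);
Console.WriteLine($"~{(after - before) / (1024 * 1024)} MB retained by ArrayPool<Rgba64Wide>.Shared");

// hypothetical large element type; 4 x 8 bytes = 32 bytes per element
struct Rgba64Wide { public long R, G, B, A; }
```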

davidfowl commented 3 years ago

Another thing I would like to do is get rid of the Kestrel memory pool, since at this point it's a glorified bunch of pooled 4K buffers on the pinned object heap. Today these ArrayPool buffers aren't pinned, making them less than ideal for networking scenarios.

Maoni0 commented 3 years ago

Problem 2: Array pool does not pool enough

using System;
using System.Buffers;

for (int i = 0; i < 1000000; i++)
    Nested(100);

Console.WriteLine(GC.CollectionCount(0));

void Nested(int depth)
{
   if (depth == 0) return;

   var a = ArrayPool<byte>.Shared.Rent(10);
   Nested(depth-1);
   ArrayPool<byte>.Shared.Return(a);
}

Expected: 0 or very few GCs
Actual: Hundreds of GCs

We have to watch for using the array pool too much, which would lead to hitting the thresholds. The array pool should dynamically resize based on the application behavior.

@jkotas for this problem I cannot repro what you saw. I printed out a bit more info -

elapsed 28222 ms, 0 gen0 GCs, 0 gen1, 0 gen2, allocated 17,264 bytes

it allocates very little and no GCs were triggered. this is with the current .net 6 build.

jkotas commented 3 years ago

The array pool global cache size is 8 * Environment.ProcessorCount arrays per size bucket. I guess you ran it on a 16+ core machine, so the pool was big enough to avoid allocations. Here is a modification that should produce the result reliably:

using System;
using System.Threading;
using System.Buffers;

int processors = Environment.ProcessorCount;
for (int i = 0; i < processors; i++)
   new Thread(Work).Start();
Work();

Console.WriteLine(GC.CollectionCount(0));

void Work()
{
    for (int i = 0; i < 1000000; i++)
        Nested(100);
}

void Nested(int depth)
{
   if (depth == 0) return;

   var a = ArrayPool<byte>.Shared.Rent(10);
   Nested(depth-1);
   ArrayPool<byte>.Shared.Return(a);
}

Maoni0 commented 3 years ago

I see. yes I did run it on a big machine (way more than 16 cores). I can repro with your new example. thanks!

danmoseley commented 3 years ago

@geoffkizer @stephentoub

br3aker commented 3 years ago

Problem 4: ConfigurableArrayPool rent logic isn't too memory-friendly with this try-next-bucket scenario

The current pooling implementation can waste up to (currentBucketSize / 2) - 1 "elements" in the worst case when renting an array one element larger than a bucket size, e.g. rent(17) will fit in a bucket of 32 elements and waste 15 elements of space.

https://github.com/dotnet/runtime/blob/e39196ffc9e50f3d8927622daaedcd50681d56f2/src/libraries/System.Private.CoreLib/src/System/Buffers/ConfigurableArrayPool.cs#L83-L98

With those jumps of up to 2 buckets, memory waste can skyrocket, e.g. rent(17) could possibly miss the 32 bucket and hit 64, wasting up to 47 "elements" (73.4%) in the worst case scenario. Without this edge case, the potential pooled resource size can still be off by a factor of 2, e.g. rent(33) -> bucket: 64. This can easily escalate to 1KB -> 2KB waste, 1MB -> 2MB and so on.

Not only does it waste memory, it blocks higher-order memory requests and forces them to be allocated (which is a thing for LOH allocations atm).

*This behaviour is restricted to ConfigurableArrayPool only; TlsOverPerCoreLockedStacksArrayPool from the ArrayPool<T>.Shared property is not affected.

One of the use cases affected by this problem: batch image thumbnail generation

// Takes large buffer as expected
var original = LoadImage(...);
// Can potentially take another large buffer 
// Because last medium buffer is blocked
var medium = original.Clone(new Size(...));

This code can be run for multiple images in a batch which can cause said problem.
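The basic bucket rounding (though not the worst-case next-bucket jump, which needs a contended bucket) can be observed directly on a ConfigurableArrayPool created via ArrayPool<byte>.Create(); the exact lengths below are current implementation behavior, not a contract:

```csharp
using System;
using System.Buffers;

// ArrayPool<T>.Create() returns a ConfigurableArrayPool instance
ArrayPool<byte> pool = ArrayPool<byte>.Create();

foreach (int request in new[] { 16, 17, 33, 1025 })
{
    byte[] rented = pool.Rent(request);
    double waste = 100.0 * (rented.Length - request) / rented.Length;
    // e.g. Rent(17) -> length 32, 46.9% unused
    Console.WriteLine($"Rent({request}) -> length {rented.Length}, {waste:F1}% unused");
    pool.Return(rented);
}
```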

antonfirsov commented 3 years ago

Speaking of ConfigurableArrayPool, I would also call it a problem that the closed implementation of ArrayPool<T>.Shared has now strongly deviated from what is publicly available for third parties to instantiate. @Maoni0 if there will be a fix for Problem 1, it would be great to also make it available for ConfigurableArrayPool.


[off] @br3aker --> https://gist.github.com/antonfirsov/d6d8457d09636692dba3b169c1e7790a

geoffkizer commented 3 years ago

Thanks for making these issues known @jkotas. I was unaware of these.

Naively I would expect that (a) every allocation is pooled when returned and (b) the pool is scavenged periodically to trim recently unused items.

This seems like it would solve problems (1)-(3) above. Perhaps more importantly, I think it would match user expectations as well.

Maoni0 commented 3 years ago

@geoffkizer I agree, and (b) is done to some extent currently. I was able to spend some time on this and will have a write-up soon that illustrates both the problems and possible solutions.

antonfirsov commented 3 years ago

Note that #55621 fixed "Problem 2: ArrayPool does not pool enough", but made "Problem 1: ArrayPool does not ever release pooled memory, unless there is a memory pressure." much worse.

This may lead to user complaints similar to https://github.com/SixLabors/ImageSharp/discussions/1590 (ArrayPool unexpectedly retains large blocks of memory forever). What would the runtime/BCL fix to this problem look like? Is it still expected to land in 6.0? Any recommendations for 3rd party library authors implementing their own pooling?

/cc @Maoni0 @stephentoub and also @sebastienros as OP of the ImageSharp discussion post.

stephentoub commented 3 years ago

but made "Problem 1: ArrayPool does not ever release pooled memory, unless there is a memory pressure." much worse.

Do you have evidence of that? All of the arrays that PR enabled to be pooled would already be on the LOH.

antonfirsov commented 3 years ago

It's already on the LOH, but when (let's say) a byte[] of 500MB is returned to ArrayPool<byte>.Shared, it might be retained there in its bucket forever, with a chance that no code will ever reuse it. Or am I misunderstanding how it works?

stephentoub commented 3 years ago

My point is that whatever memory pressure would result in a Gen2 collection to collect the array that had been allocated on the LOH would also trigger trimming of arrays stored in the pool. At that point, it's just a matter of tuning as to how much such trimming culls back.

antonfirsov commented 3 years ago

If I'm getting https://github.com/dotnet/runtime/issues/52098#issuecomment-830106074 right, trimming doesn't really trim, unless the process gets into medium/high memory pressure state. --> Users may see the BCL retaining large arrays for no reason.

stephentoub commented 3 years ago

trimming doesn't really trim

It does; it just doesn't trim all the way to nothing under low pressure, in particular because today it doesn't trim the thread-local cache unless there's high pressure. And the cited example would have all of the returned arrays in the thread-local cache, which can store one array of each size per thread. Run this example instead:

using System;
using System.Buffers;
using System.Threading;

DoStuff<int>();
DoStuff<uint>();
DoStuff<long>();
DoStuff<ulong>();
DoStuff<DateTime>();
DoStuff<DateTimeOffset>();
Console.WriteLine(GC.GetTotalMemory(false));

// Let's say the above was done during startup and the rest of the application won't use those ArrayPools again
while (true)
{
    Console.WriteLine(GC.GetTotalMemory(true));
    Thread.Sleep(10_000);
}

void DoStuff<T>()
{
    T[] one = ArrayPool<T>.Shared.Rent(1_000_000);
    T[] two = ArrayPool<T>.Shared.Rent(1_000_000);
    ArrayPool<T>.Shared.Return(two);
    ArrayPool<T>.Shared.Return(one);
}

the difference just being that it's creating two arrays of each size rather than one. Run it and wait 60 seconds... you'll see the total memory get cut in half when one of each array gets trimmed away.

This is why I said it's about tuning.

antonfirsov commented 3 years ago

I think this issue is about revisiting the current tuning. On the current master, the following code will retain 2GB of memory, without ever freeing it up, using ArrayPool<byte>.Shared only:

for (long size = 1024 * 1024; size <= 1073741824; size *= 2)
{
    byte[] a = ArrayPool<byte>.Shared.Rent((int)size);
    ArrayPool<byte>.Shared.Return(a);
}

While this might seem like an artificial scenario, there are use cases which could make it very real. Imagine an application working with very large images (of different sizes), assuming the underlying imaging library relies heavily on ArrayPool<byte>.Shared. The application might be very heavy on imaging operations during a warmup period, but not do them anymore for the rest of its execution time (eg. because of caching), so retaining the arrays doesn't make sense anymore. This is essentially what SixLabors/ImageSharp#1590 is about, with the difference that currently we are using a custom instance of ConfigurableArrayPool<byte> to pool very large arrays.

stephentoub commented 3 years ago

I think this issue is about revisiting the current tuning.

You asked "What would the runtime/BCL fix to this problem look like?" and I'm answering that question: tune the Trim method.

jkotas commented 3 years ago

https://github.com/dotnet/runtime/issues/58974 is another example of unexpected behavior that comes as a surprise: ArrayPool does not always work well when renting/returning happens on different threads.

Bio2hazard commented 3 years ago

The memory pressure is determined through GC.GetGCMemoryInfo().HighMemoryLoadThresholdBytes ( https://github.com/dotnet/runtime/blob/main/src/libraries/System.Private.CoreLib/src/System/Buffers/Utilities.cs#L38-L56 ).

On docker containers that enforce a cgroup memory limit, the value for HighMemoryLoadThresholdBytes is actually greater than the TotalAvailableMemoryBytes. ( https://github.com/dotnet/runtime/issues/58974#issuecomment-917308118 )

This means that when running inside a container with a cgroup limit, the app will always OOM before Utilities.GetMemoryPressure() returns MemoryPressure.High (because the threshold for MemoryPressure.High exceeds the amount of memory available). The window for it to return MemoryPressure.Medium is also much smaller.

Perhaps consider either applying the fix proposed in https://github.com/dotnet/runtime/issues/58974#issuecomment-917308118 or changing the logic to use TotalAvailableMemoryBytes instead. 👍
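For reference, the pressure check in the linked Utilities.cs boils down to comparing the current load against fractions of HighMemoryLoadThresholdBytes (0.90 for High, 0.70 for Medium at the time of writing). A minimal sketch of that logic, which also prints the values that go wrong under a cgroup limit:

```csharp
using System;

GCMemoryInfo info = GC.GetGCMemoryInfo();

// thresholds mirror Utilities.GetMemoryPressure at the time of writing
const double HighPressureThreshold = 0.90;
const double MediumPressureThreshold = 0.70;

double high = HighPressureThreshold * info.HighMemoryLoadThresholdBytes;
double medium = MediumPressureThreshold * info.HighMemoryLoadThresholdBytes;

string pressure = info.MemoryLoadBytes >= high ? "High"
                : info.MemoryLoadBytes >= medium ? "Medium"
                : "Low";

// in a cgroup-limited container, HighMemoryLoadThresholdBytes can exceed
// TotalAvailableMemoryBytes, making "High" unreachable before the OOM kill
Console.WriteLine($"load: {info.MemoryLoadBytes}, high threshold: {high:F0}, " +
                  $"available: {info.TotalAvailableMemoryBytes}, pressure: {pressure}");
```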

tmds commented 3 years ago

The memory pressure is determined through GC.GetGCMemoryInfo().HighMemoryLoadThresholdBytes ( https://github.com/dotnet/runtime/blob/main/src/libraries/System.Private.CoreLib/src/System/Buffers/Utilities.cs#L38-L56 ).

This should also consider the load of the managed heap in a container, and that will trigger aggressive trimming in @Bio2hazard's example application: https://gist.github.com/Bio2hazard/ee353c1042ee56a97c0d0b3d62c590bc.

dennis-yemelyanov commented 2 years ago

I was surprised that ArrayPool<T>.Shared doesn't actually seem to share pooled arrays between threads. I would expect the following code to print "Got uninitialized array" just once, but it prints this message 10 times:

using System;
using System.Buffers;
using System.Threading;

var pool = ArrayPool<int?>.Shared;

// Uncomment next line to get expected behavior
//pool = ArrayPool<int?>.Create(1024 * 1024, 50);

for (int i = 0; i < 10; i++)
{
    var th = new Thread(() =>
    {
        var arr = pool.Rent(1);

        if (arr[0] == null)
        {
            Console.WriteLine("Got uninitialized array");

            arr[0] = 0;
        }

        pool.Return(arr);
    });

    th.Start();

    th.Join();
}

Interestingly, a pool created using ArrayPool<T>.Create() works as expected.

So what does the "Shared" name mean in this context?

saucecontrol commented 2 years ago

ArrayPool<T>.Shared is implemented by TlsOverPerCoreLockedStacksArrayPool, which maintains a pool per thread in order to avoid having to take a lock when renting/returning in the majority case. The comments explain the setup:

The implementation uses a tiered caching scheme, with a small per-thread cache for each array size, followed by a cache per array size shared by all threads, split into per-core stacks meant to be used by threads running on that core. Locks are used to protect each per-core stack, because a thread can migrate after checking its processor number, because multiple threads could interleave on the same core, and because a thread is allowed to check other core's buckets if its core's bucket is empty/full.

You can, as you discovered, create your own pool if that behavior is not desirable for your use case.
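The per-thread tier is also visible from the other direction: returning and re-renting on the same thread typically hands back the very same instance, whereas each fresh thread in the example above starts with an empty thread-local cache. This reflects current implementation behavior, not a contract:

```csharp
using System;
using System.Buffers;

byte[] first = ArrayPool<byte>.Shared.Rent(1);
ArrayPool<byte>.Shared.Return(first);

// same thread: the thread-local slot for this size bucket still holds `first`
byte[] second = ArrayPool<byte>.Shared.Rent(1);
Console.WriteLine(ReferenceEquals(first, second)); // typically True
ArrayPool<byte>.Shared.Return(second);
```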