Open Maoni0 opened 3 years ago
Problem 1: ArrayPool never releases pooled memory unless there is memory pressure.
```csharp
using System;
using System.Buffers;

DoStuff<int>();
DoStuff<uint>();
DoStuff<long>();
DoStuff<ulong>();
DoStuff<DateTime>();
DoStuff<DateTimeOffset>();

// Let's say the above was done during startup and the rest of the application won't use those ArrayPools again
Console.WriteLine(GC.GetTotalMemory(true));

void DoStuff<T>()
{
    ArrayPool<T>.Shared.Return(ArrayPool<T>.Shared.Rent(1000000));
}
```
Actual behavior: the array pools keep ~50 MB alive forever.
Expected behavior: the memory is released after it has not been used for a while.
The dotnet/runtime libraries are reluctant to use ArrayPool for anything but `byte` and `char` to avoid hitting this issue.
Problem 2: Array pool does not pool enough
```csharp
using System;
using System.Buffers;

for (int i = 0; i < 1000000; i++)
    Nested(100);

Console.WriteLine(GC.CollectionCount(0));

void Nested(int depth)
{
    if (depth == 0) return;
    var a = ArrayPool<byte>.Shared.Rent(10);
    Nested(depth - 1);
    ArrayPool<byte>.Shared.Return(a);
}
```
Expected: 0 or very few GCs. Actual: hundreds of GCs.
We have to watch out for using the array pool so much that we hit its thresholds. The array pool should dynamically resize based on application behavior.
Problem 3: ArrayPool does not work for very large arrays
```csharp
using System;
using System.Buffers;

for (int i = 0; i < 1000; i++)
    ArrayPool<byte>.Shared.Return(ArrayPool<byte>.Shared.Rent(10_000_000));

Console.WriteLine(GC.CollectionCount(2));
```
Expected: 0 or very few GCs. Actual: hundreds of Gen2 GCs.
This implementation limitation largely exists to reduce the severity of Problem 1.
A side effect is that code tends to use the array pool limits for buffer sizes instead of figuring out what the right buffer size should actually be. Example from this week: https://github.com/dotnet/runtime/pull/49304/files#diff-7721d7d6b3bf50b13337accfc3a3857f235a3901e99c6f0a9ff4c2c5f84c2784R49
cc @davidfowl
| Author | Maoni0 |
|---|---|
| Assignees | - |
| Labels | `area-System.Runtime`, `tenet-performance` |
| Milestone | Future |
Thanks @jkotas! I will take a look at these, hopefully in the next couple of weeks.
An important aspect of Problem 1 (https://github.com/dotnet/runtime/issues/52098#issuecomment-830106074) is that ArrayPools are per-type. A third-party library may unintentionally retain a large amount of memory by using `ArrayPool<SomeLargeUniqueStruct>.Shared` (this actually happened in an early alpha of ImageSharp).
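One way a library can sidestep the per-type retention (a hedged sketch, not an official recommendation; `SomeLargeUniqueStruct` is a hypothetical stand-in for such a third-party type) is to rent from the single `ArrayPool<byte>.Shared` and reinterpret the span with `MemoryMarshal.Cast`, so an `ArrayPool<SomeLargeUniqueStruct>` pool is never populated:

```csharp
using System;
using System.Buffers;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

// Rent raw bytes from the one shared byte pool...
int count = 1000;
byte[] raw = ArrayPool<byte>.Shared.Rent(count * Unsafe.SizeOf<SomeLargeUniqueStruct>());

// ...and reinterpret them as the struct type; no per-struct pool is created.
Span<SomeLargeUniqueStruct> items = MemoryMarshal.Cast<byte, SomeLargeUniqueStruct>(raw);
items[0] = new SomeLargeUniqueStruct();
Console.WriteLine(items.Length >= count); // Rent may return a larger buffer

ArrayPool<byte>.Shared.Return(raw);

// Hypothetical large blittable struct standing in for a 3rd-party type.
struct SomeLargeUniqueStruct { public long A, B, C, D; }
```

This only works for types without object references (a requirement of `MemoryMarshal.Cast`), and the caller must track lengths in element counts rather than relying on `Array.Length`.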
Another thing I would like to do is get rid of the Kestrel memory pool, since at this point it's just a glorified set of pooled 4K buffers on the pinned object heap. Today these ArrayPool buffers aren't pinned, making them less than ideal for networking scenarios.
> Problem 2: Array pool does not pool enough
>
> ```csharp
> using System;
> using System.Buffers;
>
> for (int i = 0; i < 1000000; i++)
>     Nested(100);
>
> Console.WriteLine(GC.CollectionCount(0));
>
> void Nested(int depth)
> {
>     if (depth == 0) return;
>     var a = ArrayPool<byte>.Shared.Rent(10);
>     Nested(depth - 1);
>     ArrayPool<byte>.Shared.Return(a);
> }
> ```
>
> Expected: 0 or very few GCs. Actual: hundreds of GCs.
>
> We have to watch out for using the array pool so much that we hit its thresholds. The array pool should dynamically resize based on application behavior.
@jkotas for this problem I cannot repro what you saw. I printed out a bit more info -
```
elapsed 28222 ms, 0 gen0 GCs, 0 gen1, 0 gen2, allocated 17,264 bytes
```
It allocates very little and no GCs were triggered. This is with the current .NET 6 build.
The array pool global cache size is `8 * Environment.ProcessorCount` arrays per size bucket. I guess you ran it on a 16+ core machine, so the pool was big enough to avoid allocations. Here is a modification that should reproduce the result reliably:
```csharp
using System;
using System.Threading;
using System.Buffers;

int processors = Environment.ProcessorCount;
for (int i = 0; i < processors; i++)
    new Thread(Work).Start();
Work();

Console.WriteLine(GC.CollectionCount(0));

void Work()
{
    for (int i = 0; i < 1000000; i++)
        Nested(100);
}

void Nested(int depth)
{
    if (depth == 0) return;
    var a = ArrayPool<byte>.Shared.Rent(10);
    Nested(depth - 1);
    ArrayPool<byte>.Shared.Return(a);
}
```
I see. Yes, I did run it on a big machine (way more than 16 cores). I can repro with your new example. Thanks!
@geoffkizer @stephentoub
Problem 4: `ConfigurableArrayPool` rent logic isn't very memory-friendly due to its try-next-bucket behavior.

The current pooling implementation can waste up to `(currentBucketSize / 2) - 1` elements in the worst case when renting arrays one element larger than a bucket size: e.g. `Rent(17)` fits in the 32-element bucket and wastes 15 elements of space. With jumps of up to 2 buckets, the waste can skyrocket: `Rent(17)` could miss the 32 bucket and hit 64, wasting up to 47 elements (73.4%) in the worst case. Even without this edge case, the pooled size can be off by a factor of 2, e.g. `Rent(32)` -> bucket: 64. This easily escalates to 1 KB -> 2 KB of waste, 1 MB -> 2 MB, and so on.

Not only does this waste memory, it also blocks higher-order memory requests and forces them to be allocated (which matters for LOH allocations at the moment).

*This behaviour is restricted to `ConfigurableArrayPool` only; `TlsOverPerCoreLockedStacksArrayPool`, behind the `ArrayPool<T>.Shared` property, is not affected.*
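To make the figures concrete, the waste percentages above can be reproduced with a few lines of arithmetic (illustrative only, assuming the power-of-two bucket sizes described above):

```csharp
using System;

int requested = 17;

// Rent(17) fits the 32-element bucket: 15 elements wasted.
int waste32 = 32 - requested;
Console.WriteLine($"{waste32} wasted ({waste32 * 100.0 / 32:F1}%)"); // 15 wasted (46.9%)

// Rent(17) misses 32 and lands in the 64-element bucket: 47 elements wasted.
int waste64 = 64 - requested;
Console.WriteLine($"{waste64} wasted ({waste64 * 100.0 / 64:F1}%)"); // 47 wasted (73.4%)
```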
One of the use cases affected by this problem: batch image thumbnail generation.

```csharp
// Takes a large buffer, as expected
var original = LoadImage(...);

// Can potentially take another large buffer,
// because the last medium buffer is blocked
var medium = original.Clone(new Size(...));
```

This code can be run for multiple images in a batch, which can trigger said problem.
Speaking of `ConfigurableArrayPool`, I would also call it a problem that the closed implementation of `ArrayPool<T>.Shared` has now strongly deviated from what is publicly available for 3rd parties to instantiate. @Maoni0, if there will be a fix for Problem 1, it would be great to also make it available for `ConfigurableArrayPool`.
[off] @br3aker --> https://gist.github.com/antonfirsov/d6d8457d09636692dba3b169c1e7790a
Thanks for making these issues known @jkotas. I was unaware of these.
Naively I would expect that (a) every allocation is pooled when returned, and (b) the pool is scavenged periodically to trim recently unused items.
This seems like it would solve problems (1)-(3) above. Perhaps more importantly, I think it would match user expectations as well.
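A minimal sketch of that (a)/(b) shape, using a hypothetical `ScavengingPool<T>` type rather than anything in the BCL, might look like this (single size class, crude trim policy, purely illustrative):

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

// Illustrative sketch only -- a hypothetical pool, not the BCL implementation:
// (a) every returned array is pooled; (b) a timer periodically scavenges
// buffers when the pool has gone unused since the previous sweep.
sealed class ScavengingPool<T>
{
    private readonly ConcurrentBag<T[]> _buffers = new();
    private readonly Timer _timer;
    private int _usedSinceLastSweep;

    public ScavengingPool(TimeSpan sweepInterval) =>
        _timer = new Timer(_ => Sweep(), null, sweepInterval, sweepInterval);

    public T[] Rent(int minimumLength)
    {
        Interlocked.Exchange(ref _usedSinceLastSweep, 1);
        // Too-small buffers are simply dropped in this sketch.
        while (_buffers.TryTake(out T[]? buffer))
            if (buffer.Length >= minimumLength)
                return buffer;
        return new T[minimumLength];
    }

    public void Return(T[] buffer) => _buffers.Add(buffer); // (a) always pooled

    private void Sweep()
    {
        // (b) if nothing was rented since the last sweep, release everything.
        if (Interlocked.Exchange(ref _usedSinceLastSweep, 0) == 0)
            while (_buffers.TryTake(out _)) { }
    }
}
```

A real implementation would trim gradually (e.g. oldest buffers first) and per size bucket, but the shape is the same: pool unconditionally on return, reclaim on an idle timer rather than only under memory pressure.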
@geoffkizer I agree, and (b) is done to some extent currently. I was able to spend some time on this and will have a write-up soon that illustrates both the problems and possible solutions.
Note that #55621 fixed "Problem 2: ArrayPool does not pool enough", but made "Problem 1: ArrayPool does not ever release pooled memory, unless there is a memory pressure." much worse.
This may lead to user complaints similar to https://github.com/SixLabors/ImageSharp/discussions/1590 (ArrayPool unexpectedly retains large blocks of memory forever). What would the runtime/BCL fix to this problem look like? Is it still expected to land in 6.0? Any recommendations for 3rd party library authors implementing their own pooling?
/cc @Maoni0 @stephentoub and also @sebastienros as OP of the ImageSharp discussion post.
> but made "Problem 1: ArrayPool does not ever release pooled memory, unless there is a memory pressure." much worse.
Do you have evidence of that? All of the arrays that PR enabled to be pooled would already be on the LOH.
It's already on the LOH, but when a `byte[]` of, let's say, 500MB is returned to `ArrayPool<byte>.Shared`, it might be retained in its bucket forever, with a chance that no code will ever reuse it. Or am I misunderstanding how it works?
My point is that whatever memory pressure would result in a Gen2 collection to collect the array that had been allocated on the LOH would also trigger trimming of arrays stored in the pool. At that point, it's just a matter of tuning as to how much such trimming culls back.
If I'm getting https://github.com/dotnet/runtime/issues/52098#issuecomment-830106074 right, trimming doesn't really trim, unless the process gets into medium/high memory pressure state. --> Users may see the BCL retaining large arrays for no reason.
> trimming doesn't really trim
It does, it just doesn't trim all the way to nothing in low pressure, in particular because it doesn't trim the thread-local cache unless there's high pressure today. And the cited example would have all of the returned arrays in the thread-local cache, which can store one array of each size in a thread. Run this example instead:
```csharp
using System;
using System.Buffers;
using System.Threading;

DoStuff<int>();
DoStuff<uint>();
DoStuff<long>();
DoStuff<ulong>();
DoStuff<DateTime>();
DoStuff<DateTimeOffset>();
Console.WriteLine(GC.GetTotalMemory(false));

// Let's say the above was done during startup and the rest of the application won't use those ArrayPools again
while (true)
{
    Console.WriteLine(GC.GetTotalMemory(true));
    Thread.Sleep(10_000);
}

void DoStuff<T>()
{
    T[] one = ArrayPool<T>.Shared.Rent(1_000_000);
    T[] two = ArrayPool<T>.Shared.Rent(1_000_000);
    ArrayPool<T>.Shared.Return(two);
    ArrayPool<T>.Shared.Return(one);
}
```
the difference just being that it's creating two arrays of each size rather than one. Run it and wait 60 seconds... you'll see the total memory get cut in half when one of each array gets trimmed away.
This is why I said it's about tuning.
I think this issue is about revisiting the current tuning. On the current master, the following code will retain 2GB of memory, without ever freeing it up, using only `ArrayPool<byte>.Shared`:

```csharp
using System.Buffers;

for (long size = 1024 * 1024; size <= 1073741824; size *= 2)
{
    byte[] a = ArrayPool<byte>.Shared.Rent((int)size);
    ArrayPool<byte>.Shared.Return(a);
}
```
While this might seem an artificial scenario, there are use cases which could make it very real. Imagine an application working with very large images (of different sizes), where the underlying imaging library relies heavily on `ArrayPool<byte>.Shared`. The application might be very heavy on imaging operations during a warmup period, but not do them anymore for the rest of its execution time (e.g. because of caching), so retaining the arrays no longer makes sense. This is essentially what SixLabors/ImageSharp#1590 is about, with the difference that currently we are using a custom instance of `ConfigurableArrayPool<byte>` to pool very large arrays.
> I think this issue is about revisiting the current tuning.
You asked "What would the runtime/BCL fix to this problem look like?" and I'm answering that question: tune the Trim method.
https://github.com/dotnet/runtime/issues/58974 is another example of unexpected behavior that comes as a surprise: ArrayPool does not always work well when renting/returning happens on different threads.
The memory pressure is determined through `GC.GetGCMemoryInfo().HighMemoryLoadThresholdBytes` (https://github.com/dotnet/runtime/blob/main/src/libraries/System.Private.CoreLib/src/System/Buffers/Utilities.cs#L38-L56).

On docker containers that enforce a cgroup memory limit, the value of `HighMemoryLoadThresholdBytes` is actually greater than `TotalAvailableMemoryBytes` (https://github.com/dotnet/runtime/issues/58974#issuecomment-917308118).

This means that when running inside a container with a cgroup limit, the app will always OOM before `Utilities.GetMemoryPressure()` returns `MemoryPressure.High` (because the threshold for `MemoryPressure.High` exceeds the amount of memory available). The window for it to return `MemoryPressure.Medium` is also much smaller.

Perhaps consider either applying the fix proposed in https://github.com/dotnet/runtime/issues/58974#issuecomment-917308118 or changing the logic to use `TotalAvailableMemoryBytes` instead. 👍
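The inversion is easy to check from inside a container with a quick diagnostic (a sketch using only the public `GCMemoryInfo` properties; the printed values will of course depend on the cgroup limit):

```csharp
using System;

// Print both values to see whether the high-pressure threshold is reachable.
GCMemoryInfo info = GC.GetGCMemoryInfo();
Console.WriteLine($"HighMemoryLoadThresholdBytes: {info.HighMemoryLoadThresholdBytes}");
Console.WriteLine($"TotalAvailableMemoryBytes:    {info.TotalAvailableMemoryBytes}");

if (info.HighMemoryLoadThresholdBytes > info.TotalAvailableMemoryBytes)
    Console.WriteLine("Threshold exceeds available memory: MemoryPressure.High is unreachable before OOM.");
```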
> The memory pressure is determined through GC.GetGCMemoryInfo().HighMemoryLoadThresholdBytes ( https://github.com/dotnet/runtime/blob/main/src/libraries/System.Private.CoreLib/src/System/Buffers/Utilities.cs#L38-L56 ).
This should also consider load of the managed heap in a container, and that will trigger aggressive trimming in @Bio2Hazzard's example application: https://gist.github.com/Bio2hazard/ee353c1042ee56a97c0d0b3d62c590bc.
I was surprised that `ArrayPool<T>.Shared` doesn't actually seem to share pooled arrays between threads. I would expect the following code to print "Got uninitialized array" just once, but it prints this message 10 times:
```csharp
using System;
using System.Buffers;
using System.Threading;

var pool = ArrayPool<int?>.Shared;
// Uncomment the next line to get the expected behavior
//pool = ArrayPool<int?>.Create(1024 * 1024, 50);

for (int i = 0; i < 10; i++)
{
    var th = new Thread(() =>
    {
        var arr = pool.Rent(1);
        if (arr[0] == null)
        {
            Console.WriteLine("Got uninitialized array");
            arr[0] = 0;
        }
        pool.Return(arr);
    });
    th.Start();
    th.Join();
}
```
Interestingly, a pool created using `ArrayPool<T>.Create()` works as expected. So what does the "Shared" name mean in this context?
`ArrayPool<T>.Shared` is implemented by `TlsOverPerCoreLockedStacksArrayPool`, which maintains a pool per thread in order to avoid having to take a lock when renting/returning in the majority case. The comments explain the setup:

> The implementation uses a tiered caching scheme, with a small per-thread cache for each array size, followed by a cache per array size shared by all threads, split into per-core stacks meant to be used by threads running on that core. Locks are used to protect each per-core stack, because a thread can migrate after checking its processor number, because multiple threads could interleave on the same core, and because a thread is allowed to check other cores' buckets if its core's bucket is empty/full.
You can, as you discovered, create your own pool if that behavior is not desirable for your use case.
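For intuition, the tiered scheme in that comment can be sketched roughly like this (illustrative names and a single size bucket only; this is not the real `TlsOverPerCoreLockedStacksArrayPool`):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;

static class TieredPoolSketch
{
    // Tier 1: a per-thread slot, accessed without any lock.
    [ThreadStatic] private static byte[]? t_cached;

    // Tier 2: per-core stacks shared by all threads, each guarded by a lock.
    private static readonly Stack<byte[]>[] s_perCore =
        Enumerable.Range(0, Environment.ProcessorCount)
                  .Select(_ => new Stack<byte[]>())
                  .ToArray();

    public static byte[] Rent(int size)
    {
        byte[]? local = t_cached;
        if (local != null && local.Length >= size) { t_cached = null; return local; }

        // A thread may migrate after checking its processor id, hence the lock.
        Stack<byte[]> stack = s_perCore[Thread.GetCurrentProcessorId() % s_perCore.Length];
        lock (stack)
            if (stack.Count > 0 && stack.Peek().Length >= size)
                return stack.Pop();

        return new byte[size]; // Tier 3: fall back to allocation.
    }

    public static void Return(byte[] array)
    {
        if (t_cached == null) { t_cached = array; return; } // prefer the thread-local slot
        Stack<byte[]> stack = s_perCore[Thread.GetCurrentProcessorId() % s_perCore.Length];
        lock (stack) stack.Push(array);
    }
}
```

This shape would also explain the repro above: each short-lived thread returns its array into its own tier-1 slot, which is lost when the thread exits, so the next thread misses the pool and allocates a fresh, zeroed array.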
This issue is for folks who want ArrayPool (or other memory pools) to be improved. Please explain it with usage examples. A usage example should consist of -
Then we will have a discussion about what kinds of behavior our pooling should offer, and then we can decide whether we need support from the GC and what that would be.