The output for !fq has two parts: "finalizable objects" and "ready for finalization" objects. All "finalizable objects" tells you is that an object is finalizable; since the GC has not looked at those objects, it doesn't know whether they are dead or not, so it hasn't marked them as "ready for finalization". If you haven't done a gen2 GC for a while, your finalizable objects in gen2 are not going away, because the GC hasn't looked at them. When you do do a gen2 GC, if there are indeed no user references to them, they will be "ready for finalization", which means that as soon as the GC is over, the finalizer thread will pick them off the queue and run their finalizers (and the next time a gen2 GC happens, they will be collected).
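To illustrate that life cycle, here is a tiny sketch of my own (not from the original answer); exact timing depends on the GC, but the phases are the same:

```csharp
using System;

class Finalizable
{
    // Having a finalizer puts every instance on the "finalizable objects" list.
    ~Finalizable() => Console.WriteLine("finalizer ran");
}

class Program
{
    static void Main()
    {
        new Finalizable();               // unreachable, but the GC hasn't looked at it yet
        GC.Collect();                    // GC finds no user references -> "ready for finalization"
        GC.WaitForPendingFinalizers();   // finalizer thread picks it off the queue and runs ~Finalizable
        GC.Collect();                    // the next GC actually reclaims the memory
    }
}
```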
We are seeing a very similar situation in a Service Fabric cluster. Each node runs several apps (ASP.NET Core APIs) relying on Kestrel. When the cluster nodes go live with several thousand users hitting them, we observe a constant memory increase within all the APIs. The node eventually becomes almost unresponsive as all the physical memory is used and the paging file almost doubles the RAM (20GB x 2). The APIs are completely stateless and data is disposed after use, so it is surprising to see individual APIs consuming up to 8-9GB each.
Data we have managed to collect so far in the dev environment does not show the application under such a stressful workload; however, we could observe a similar memory usage pattern. What concerns us most right now is that Kestrel's Libuv handles seem to keep tons of other objects alive (via Connection and HttpContext refs), which is causing the high memory usage.
A very remarkable difference is obtained by running the apps with the WKS vs. SRV GC. Indeed, the SRV GC seems way too conservative even when memory pressure rises. The WKS GC, on the other hand, runs much more often (as expected) but definitely helps keep lots of garbage from sticking around.
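For anyone comparing the two flavors: the GC mode can be switched per app, one common way being the project file (a minimal sketch; note that ASP.NET Core apps default to server GC):

```xml
<PropertyGroup>
  <!-- false selects workstation (WKS) GC; true selects server (SRV) GC -->
  <ServerGarbageCollection>false</ServerGarbageCollection>
  <!-- keep background/concurrent GC enabled -->
  <ConcurrentGarbageCollection>true</ConcurrentGarbageCollection>
</PropertyGroup>
```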
Some numbers
============================================================= SRV mode

```
MEM_COMMIT    2693    0349e9000 ( 841.910 MB)
MEM_RESERVE    719    213a7b000 (   8.307 GB)
```

Total 1,198,950 Object(s), Total Size: 157.19 MB, Free Objects 28,381 (157.12 MB)

Finalizable objs:

```
Heap 0
generation 0 has 6367 finalizable objects (000000002d7ae118->000000002d7ba810)
generation 1 has 25 finalizable objects (000000002d7ae050->000000002d7ae118)
generation 2 has 1538 finalizable objects (000000002d7ab040->000000002d7ae050)
Ready for finalization 0 objects (000000002d7ba810->000000002d7ba810)
Heap 1
generation 0 has 12616 finalizable objects (000000002df20fa8->000000002df399e8)
generation 1 has 19 finalizable objects (000000002df20f10->000000002df20fa8)
generation 2 has 2010 finalizable objects (000000002df1d040->000000002df20f10)
Ready for finalization 0 objects (000000002df399e8->000000002df399e8)
Statistics for all finalizable objects (including all objects ready for finalization):
Total 22575 objects
```
Most consuming allocs:
Related "Way to increase x64 GC aggressiveness? (as per x86)" https://github.com/dotnet/coreclr/issues/11338
You are doing gen2s based on the perf counter value. Of course, the perf counter's gen2 collection count just doesn't tell you whether it was a blocking gen2 or a background one. You can tell by looking at the total committed bytes right after a gen2: if it drops a lot, you are doing blocking gen2s, and if you are, it simply means you have that much memory that needs to be kept alive, unless your finalization survivors actually hold onto the majority of the memory (which would be very bad to begin with). The number of finalization survivors does not tell us how much memory actually survived due to finalization. An ETW trace would be a lot more useful.
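(For what it's worth, one lightweight way to capture such a trace is PerfView's GC-only mode, which is cheap enough to leave running on a busy server; as far as I know, the switches are:)

```
PerfView.exe /GCCollectOnly /AcceptEULA /nogui collect
```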
Thanks @Maoni0. We eventually ran a PerfView trace and tracked this down to a leak of custom objects hooked up to System.ServiceModel and never released after performing WCF client calls. Bottom line: the memory consumption was actually legitimate, and gen2 survivors were starving the whole heap, eventually causing very bad performance due to frequent GCing.
It appears this has been answered, hence closing this issue. Please reopen if further clarification is required.
Opening on behalf of @ayende - issue moved from https://github.com/dotnet/corefx/issues/26659 to the coreclr repo:
I'm investigating high memory utilization in one of our systems, and I'm seeing very strange results:
The final tally is 76,869 objects in the queue. I am doing a lot to try to reuse instances, so it is not surprising that a lot of them are actually in gen 2. What is surprising to me is that in many cases, I have memory hanging around and the only reference to it is in the finalization queue.
I checked the finalizer thread, and it is not blocked. In fact, it appears to be idle.
It looks like it is here: https://github.com/dotnet/coreclr/blob/master/src/vm/finalizerthread.cpp#L462
What is really strange is that I have a large number of objects that are hanging in the finalizer queue that I'm quite sure are properly disposed. And my dispose for them includes `GC.SuppressFinalize(this)`. I don't have any calls to `GC.ReRegisterForFinalization`, and I'm not patching references back to my objects from beyond the grave. I dumped the state of some of the objects, and the plot thickens, because their state shows that they have been disposed properly and that the suppress finalize was called.

I'm using `UvTcpHandle` because it inherits from `SafeHandle`, and I assume that this one is properly used :-). Tracing the code, this instance is disposed (its state is `3`, which means that it has been closed). And when it is disposed, it is marked as skip finalizer: https://github.com/dotnet/coreclr/blob/60222780cde77346bff1eb8979846769c223f833/src/vm/safehandle.cpp#L240

So the question is, what is it still doing in the finalizer queue?
I expected that when this is called (or the more usual `GC.SuppressFinalize`), it would be removed from this queue.

In particular, I think what I'm asking about is the expected behavior of the system. At what point will the GC collect such instances? And what is the expected state of the system in such a scenario?
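For context, here is a minimal sketch of the usual dispose pattern in play here (the type and field names are hypothetical stand-ins, not the actual UvTcpHandle code):

```csharp
using System;

class NativeResource : IDisposable
{
    private IntPtr _handle;   // hypothetical native handle
    private bool _disposed;

    public void Dispose()
    {
        Dispose(true);
        // Only sets a bit on the object's finalization record so the
        // finalizer is skipped; it does not pull the object off the queue.
        GC.SuppressFinalize(this);
    }

    protected virtual void Dispose(bool disposing)
    {
        if (_disposed) return;
        // release _handle here
        _handle = IntPtr.Zero;
        _disposed = true;
    }

    ~NativeResource() => Dispose(false);
}
```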
Another note here. Some of the objects that I would expect to be removed from the finalizer queue are holding a reference to a byte array that is 256KB, so it is on the LOH; might that be related? OTOH, the other instances are not holding LOH objects...
Okay, I think I had a major misunderstanding on my part. I expected `SuppressFinalize` to remove the object from the finalization queue, and it doesn't do that; it only sets a bit saying that the `Finalize` method shouldn't be called. Given that, the rest of the behavior is quite clear.

One thing that still remains a question for me is what the expected behavior is from the rest of the system. We have a lot of items in the finalization queue, many of them only reachable via the queue, and many of them marked as not needing `Finalize` called on them. The question is, when will the finalizer queue start removing these values?
I assume that this happens as part of a GC run? And in that case, given that we try to write code that pools and doesn't allocate too much, this is going to be deferred; so we essentially created a lot of garbage, then avoided creating more, so the garbage man never comes to clean it up?
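That dynamic can be demonstrated in isolation; a sketch of my own (results may vary with JIT and build settings): an object whose finalizer was suppressed is still only reclaimed when a GC actually examines its generation, so a gen2 object can linger as long as only gen0/gen1 collections happen:

```csharp
using System;

class Pooled : IDisposable
{
    public void Dispose() => GC.SuppressFinalize(this);
    ~Pooled() { }   // suppressed after Dispose, so it never runs
}

class Program
{
    static void Main()
    {
        var obj = new Pooled();
        GC.Collect();                             // survives -> promoted
        GC.Collect();                             // survives -> promoted again
        Console.WriteLine(GC.GetGeneration(obj)); // expect 2

        obj.Dispose();
        var weak = new WeakReference(obj);
        obj = null;

        GC.Collect(0, GCCollectionMode.Forced);   // gen0-only GC never examines gen2
        Console.WriteLine(weak.IsAlive);          // likely True: still on the heap

        GC.Collect(2, GCCollectionMode.Forced);   // a full GC finally looks at gen2
        Console.WriteLine(weak.IsAlive);          // False: reclaimed, finalizer skipped
    }
}
```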