dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

Multiple 'System.OutOfMemoryException' errors in .NET 7 #78959

Open theolivenbaum opened 1 year ago

theolivenbaum commented 1 year ago

I'm seeing an issue very similar to this one when running a memory-heavy app on a linux container with memory limit >128GB RAM.

The app started throwing random OutOfMemoryException in many unexpected places since we migrated to net70, while under no memory pressure (usually with more than 30% free memory).

I can see the original issue was closed, but I'm not sure if it was fixed on the final net70 release or if the suggestion to set COMPlus_GCRegionRange=10700000000 is the expected workaround.
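
For a sense of scale, the sketch below converts the GCRegionRange value quoted above into bytes. It assumes the value is read as hexadecimal, which is how numeric COMPlus_ settings are generally parsed; the thread does not state this, so the decimal reading is printed for comparison. The function name is hypothetical.

```python
def region_range_bytes(value: str, base: int = 16) -> int:
    """Interpret a GCRegionRange setting string as a byte count.

    Assumption: numeric COMPlus_ settings are parsed as hex (base 16).
    """
    return int(value, base)


GIB = 1024 ** 3
setting = "10700000000"  # value quoted in the comment above

as_hex = region_range_bytes(setting, 16)
as_dec = region_range_bytes(setting, 10)

print(f"hex reading:     {as_hex} bytes = {as_hex / GIB:.0f} GiB")
print(f"decimal reading: {as_dec} bytes = {as_dec / GIB:.2f} GiB")
```

Under the hex reading the range is 1052 GiB (about 1 TiB) of reserved address space, well above the container limit mentioned here; under the decimal reading it would be only about 10 GiB.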

Maoni0 commented 1 year ago

I'm confused, if you are still using libclrgc and not getting OOM, and you want to know why it gets OOM without libclrgc, wouldn't you want to get rid of libclrgc and repro the OOM, and then do analysis there?

the corresponding name of coreclr on linux would be libcoreclr.so. so if you want to look at this in windbg, you'd do libcoreclr instead of coreclr.

dave-yotta commented 1 year ago

Sorry for the confusion, let me try to clear it up.

Using the older libclrgc solved the problem described in this comment above.

But we have another problem; we are still using libclrgc in .net7.0.4, and have a lot of allocated native memory and GC time we can't pin down as seen in this comment. eeheap is giving:

GC Allocated Heap Size:    Size: 0x182061e8 (404775400) bytes.
GC Committed Heap Size:    Size: 0x29f58000 (703954944) bytes.

for a process with around 1.8gb resident (same scenario as above).
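
To make the gap concrete, here is a rough sketch that parses the two eeheap lines above and compares the committed GC heap with the resident size. It assumes "1.8gb" means 1.8 GiB, which the comment does not specify, and the helper name is hypothetical.

```python
import re

# Matches the "Size: 0x<hex> (<decimal>) bytes" part of an eeheap summary line.
_SIZE_RE = re.compile(r"Size:\s+0x([0-9a-fA-F]+)\s+\((\d+)\)\s+bytes")


def parse_eeheap_size(line: str) -> int:
    """Return the byte count from an eeheap summary line, checking hex == decimal."""
    match = _SIZE_RE.search(line)
    if match is None:
        raise ValueError(f"no size found in: {line!r}")
    hex_value = int(match.group(1), 16)
    dec_value = int(match.group(2))
    if hex_value != dec_value:
        raise ValueError("hex and decimal sizes disagree")
    return dec_value


allocated = parse_eeheap_size(
    "GC Allocated Heap Size:    Size: 0x182061e8 (404775400) bytes."
)
committed = parse_eeheap_size(
    "GC Committed Heap Size:    Size: 0x29f58000 (703954944) bytes."
)

resident = int(1.8 * 1024 ** 3)  # assumption: "1.8gb" read as 1.8 GiB
outside_gc = resident - committed  # resident memory the GC heap does not explain

print(f"allocated: {allocated / 1024 ** 2:.0f} MiB")
print(f"committed: {committed / 1024 ** 2:.0f} MiB")
print(f"resident minus committed: {outside_gc / 1024 ** 2:.0f} MiB")
```

Whichever unit convention is used, more than 1 GB of resident memory falls outside the committed GC heap, which is consistent with the memory in question being native rather than managed.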

I also noticed (for one of our other processes in this scenario) that using workstation GC gave better GC performance (or at least the resident memory observed did not fluctuate to high values).

This led me to wonder if you were actually seeing a problem common to both GCs. Not sure if this has been helpful in the end, however! Windbg is showing 0 for all those gc_heap values:

0:000> ?? libcoreclr!SVR::gc_heap::global_regions_to_decommit
SVR::region_free_list [3] 0x00007fdb`44ffaea0
   +0x000 num_free_regions : 0
   +0x008 size_free_regions : 0
   +0x010 size_committed_in_free_regions : 0
   +0x018 num_free_regions_added : 0
   +0x020 num_free_regions_removed : 0
   +0x028 head_free_region : (null) 
   +0x030 tail_free_region : (null) 
0:000> ?? libcoreclr!SVR::gc_heap::global_regions_to_decommit[0]
SVR::region_free_list
   +0x000 num_free_regions : 0
   +0x008 size_free_regions : 0
   +0x010 size_committed_in_free_regions : 0
   +0x018 num_free_regions_added : 0
   +0x020 num_free_regions_removed : 0
   +0x028 head_free_region : (null) 
   +0x030 tail_free_region : (null) 
0:000> ?? libcoreclr!SVR::gc_heap::global_regions_to_decommit[1]
SVR::region_free_list
   +0x000 num_free_regions : 0
   +0x008 size_free_regions : 0
   +0x010 size_committed_in_free_regions : 0
   +0x018 num_free_regions_added : 0
   +0x020 num_free_regions_removed : 0
   +0x028 head_free_region : (null) 
   +0x030 tail_free_region : (null) 

Sorry if this is unrelated/unhelpful, can open a different issue. I'll double check against .NET6, very possibly something we've caused here too.

Maoni0 commented 1 year ago

hi @dave-yotta, if your heap is actually growing, then it's a distinctly different issue from what I mentioned above. if you could open a new issue so we can track them better, that'd be great!

would it be possible to capture a top level GC trace? that's the first step at diagnosing a memory problem. it's described here. it's very low overhead so you can keep it on for a long time. if this problem shows up pretty quickly you could start capturing right before the process is started and terminate tracing when it's exhibited the "memory not being released and the heap size is too large" behavior.

if you cannot repro with libclrgc, that's most likely a problem in GC so we'd like to track this down with your help. thanks!

dave-yotta commented 1 year ago

hey @Maoni0, the (used) heap isn't growing, unmanaged memory is growing. not sure if that's actually the heap free space or something else - but there's a lot of GC time and we found a lot of allocations/deallocations totalling 12gb (but never exceeding about 300mb at any one point), will try reducing the memory traffic...and I'll run that gc-collect trace before I do. It'll take a while to get around to though, sorry! :D

Maoni0 commented 1 year ago

no worries. whenever you get a chance, a gc-collect trace would be very helpful to us.

theolivenbaum commented 1 year ago

@Maoni0 quick update, just tested the latest runtime without setting COMPlus_GCName=libclrgc.so, and the container in question always crashes with OOM when starting (there's a memory-intensive load phase when starting, but there's also enough memory for it to happen). With libclrgc.so it starts without issues.

Maoni0 commented 1 year ago

@theolivenbaum do you have a dump when it gets OOM that you could share? if there's privacy concerns, could you capture a top level GC trace so we can at least understand if "when starting" means "when starting and still in the initialization phase" or "after it's done some GCs"?

theolivenbaum commented 1 year ago

@Maoni0 I'm having issues capturing a dump inside a container. Managed to install the dotnet tools but gcdump gives incomplete results, and dump just fails with an error related to not running as root user

Update: This is the error message from dotnet-dump: Problem launching createdump (may not have execute permissions): execve(/app/createdump) FAILED Permission denied (13)
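
The error text points at execute permissions on /app/createdump. Below is a minimal sketch for inspecting that from inside the container, assuming Python is available there (it may not be in a slim runtime image). The function name is hypothetical.

```python
import os
import stat


def describe_executable(path: str) -> dict:
    """Report permission details relevant to an execve() 'Permission denied'."""
    st = os.stat(path)
    exec_bits = stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH
    return {
        "mode": stat.filemode(st.st_mode),          # e.g. an ls-style mode string
        "owner_uid": st.st_uid,
        "current_uid": os.getuid(),
        "any_exec_bit": bool(st.st_mode & exec_bits),
        "executable_by_current_user": os.access(path, os.X_OK),
    }


if __name__ == "__main__":
    target = "/app/createdump"  # path taken from the error message above
    if os.path.exists(target):
        for key, value in describe_executable(target).items():
            print(f"{key}: {value}")
    else:
        print(f"{target} not found")
```

If an execute bit is set but the file is still not executable by the current user, the cause may lie elsewhere, for example a filesystem mounted with noexec, which also produces EACCES (13) from execve.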

Maoni0 commented 1 year ago

what about dotnet trace?

theolivenbaum commented 1 year ago

How can I get a memory dump using dotnet-trace?

Maoni0 commented 1 year ago

you don't. you capture a GC trace -

if there's privacy concerns, could you capture a top level GC trace so we can at least understand if "when starting" means "when starting and still in the initialization phase" or "after it's done some GCs"?

hoyosjs commented 1 year ago

@Maoni0 I'm having issues capturing a dump inside a container. Managed to install the dotnet tools but gcdump gives incomplete results, and dump just fails with an error related to not running as root user

Update: This is the error message from dotnet-dump: Problem launching createdump (may not have execute permissions): execve(/app/createdump) FAILED Permission denied (13)

@theolivenbaum can you make sure /app/createdump:

theolivenbaum commented 1 year ago

@Maoni0 @hoyosjs good news: found the issue and it was not related to the .NET runtime. The memory allocator used by RocksDB by default on Linux can severely leak memory, and switching to Jemalloc fixed the issue on the server where we were observing the problem. Thanks again for the support and we can close the issue now!

NKnusperer commented 1 year ago

Has this really been resolved? We observed the same issue and have mitigated it since then using COMPlus_GCName=libclrgc.so, and we are not using RocksDB (or is this some kind of embedded dependency for the .NET runtime?).

Maoni0 commented 1 year ago

have you tried preview 5? if you are still seeing OOM without using libclrgc.so, is it possible to share a dump with us?

NKnusperer commented 1 year ago

Do you mean .NET 8 Preview 5? I'm talking about .NET 7. If this has been fixed with .NET 8, do we get a backport to .NET 7?

Maoni0 commented 1 year ago

yeah, .net 8 preview 5. if you cannot try it, could you share a dump from .net 7 but without using libclrgc.so? you may or may not be hitting the same issue that other people hit so there's no guarantee that even if we backported it would fix the issue you hit.

you could also look at the symbols I mentioned above in a dump yourself.

mangod9 commented 1 year ago

Also @NKnusperer might make sense to create a separate issue for it, since there could be different reasons for OOMs.

markples commented 2 months ago

Reopening - repro given in https://github.com/dotnet/runtime/issues/78959#issuecomment-1453461890 and derivatives from it (all 16MB allocations) are not all solved