dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

Multiple 'System.OutOfMemoryException' errors in .NET 7 #78959

Open theolivenbaum opened 1 year ago

theolivenbaum commented 1 year ago

I'm seeing an issue very similar to [this one](https://github.com/dotnet/runtime/issues/70718) when running a memory-heavy app in a Linux container with a memory limit above 128 GB of RAM.

The app started throwing random OutOfMemoryExceptions in many unexpected places since we migrated to .NET 7, while under no memory pressure (usually with more than 30% free memory).

I can see the [original issue](https://github.com/dotnet/runtime/issues/70718) was closed, but I'm not sure if it was fixed in the final .NET 7 release or if the suggestion to set COMPlus_GCRegionRange=10700000000 is the expected workaround.
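
For reference, a minimal diagnostic sketch of how one might confirm the "no memory pressure" observation from inside the app (the helper name and output format are illustrative, not part of the original report): log `GC.GetGCMemoryInfo()` wherever the `OutOfMemoryException` is caught.

```csharp
using System;

static class OomDiagnostics
{
    // Hypothetical helper: log what the GC observed at its last collection,
    // to check whether the process was genuinely under memory pressure.
    public static void LogGcMemoryInfo(OutOfMemoryException ex)
    {
        GCMemoryInfo info = GC.GetGCMemoryInfo();
        Console.Error.WriteLine(
            $"OOM caught: {ex.Message}; " +
            $"heap size = {info.HeapSizeBytes / (1024 * 1024)} MB, " +
            $"committed = {info.TotalCommittedBytes / (1024 * 1024)} MB, " +
            $"memory load = {info.MemoryLoadBytes / (1024 * 1024)} MB, " +
            $"GC memory limit = {info.TotalAvailableMemoryBytes / (1024 * 1024)} MB");
    }
}
```

Calling this from a `catch (OutOfMemoryException ex)` block before rethrowing shows whether the free memory reported by the OS matches what the runtime itself sees.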

dotnet-issue-labeler[bot] commented 1 year ago

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

ghost commented 1 year ago

Tagging subscribers to this area: @dotnet/gc See info in area-owners.md if you want to be subscribed.

Issue Details
Author: theolivenbaum
Assignees: -
Labels: `area-GC-coreclr`, `untriaged`
Milestone: -
mangod9 commented 1 year ago

Thanks for reporting this issue. This looks like it's separate from the original issue -- we are investigating something similar with another customer. Would it be possible to share a dump when the OOM happens?

theolivenbaum commented 1 year ago

Unfortunately not, as this is running within a customer's infrastructure and the dump would most probably contain confidential data. Is there an issue here on GitHub I can subscribe to?

mangod9 commented 1 year ago

We don't have an issue yet, so we'll use this one to provide updates. It's most likely something that is already fixed in main and might need porting to 7: https://github.com/dotnet/runtime/pull/77478

mangod9 commented 1 year ago

Hi @theolivenbaum, would it be possible for you to try out a private build to ensure the fix resolves your issue?

Thanks

theolivenbaum commented 1 year ago

That might be hard, as it would involve changing how we build our Docker images. But we're fine waiting until this is backported to 7 - any idea on a timeline for the next servicing release?

Quppa commented 1 year ago

We're also seeing a lot of OOM exceptions since migrating to .NET 7 from .NET 5 (we're now testing .NET 6). In our case, we're running under Windows via Azure App Services. Reported memory usage is low - perhaps lower than what it was under .NET 5. The project in question loads large-ish files in memory.

mangod9 commented 1 year ago

Can you try whether setting COMPlus_GCName=clrgc.dll (on Windows) or COMPlus_GCName=libclrgc.so (on Linux) makes the OOMs go away? We are working on a fix, but hoping this could be a temporary workaround. Thx.

Quppa commented 1 year ago

We'll try to find time to test this.

jeremyosterhoudt commented 1 year ago

@mangod9 We're seeing a similar issue with .NET 7 on Ubuntu when loading larger files (10+ MB) with File.ReadAllBytes. This works fine on .NET 6.

Setting COMPlus_GCName=libclrgc.so resolves the issue for our setup with .NET 7.
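
A minimal sketch of the allocation pattern described above (the file path, size, and loop count are illustrative, not taken from the report):

```csharp
using System;
using System.IO;

class ReadAllBytesRepro
{
    static void Main()
    {
        // Repeatedly load a larger (10+ MB) file. Each call allocates one byte[]
        // on the large object heap, which is the pattern reported to hit OOMs on
        // .NET 7 unless COMPlus_GCName=libclrgc.so is set.
        for (int i = 0; i < 1000; i++)
        {
            byte[] data = File.ReadAllBytes("/data/large-input.bin"); // hypothetical path
            Console.WriteLine($"iteration {i}: read {data.Length / (1024 * 1024)} MB");
        }
    }
}
```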

Maoni0 commented 1 year ago

it'd be helpful to see what !ao displays (it's an sos extension). would that be possible? that's always the 1st step if you have a dump.

mangod9 commented 1 year ago

> Setting COMPlus_GCName=libclrgc.so resolves the issue for our setup with .NET 7.

Ok, good to know. Yeah, as Maoni suggests, getting a dump or trace can help confirm whether it's the same issue. We hope to get it fixed in an upcoming servicing release.

jeremyosterhoudt commented 1 year ago

Hopefully I did this right. I followed this guide. Here is the output:

```
---------Heap 1 ---------
Managed OOM occurred after GC #4 (Requested to allocate 6028264 bytes)
Reason: Could not do a full GC
```

mangod9 commented 1 year ago

thanks, it does look similar to other cases we have seen.

Maoni0 commented 1 year ago

would it be possible to try out a private fix? we could deliver a libclrgc.so to you and you could use it the same way you used the shipped version. that would be really helpful.

theolivenbaum commented 1 year ago

That would probably be possible! Also while we're at it, is there any recommendation on how to get memory dumps from within a container?
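
The question about capturing dumps from inside a container isn't answered directly in the thread. One option, sketched here assuming the Microsoft.Diagnostics.NETCore.Client package is referenced and a writable volume is mounted at /dumps (both assumptions, not from the thread), is to trigger a dump programmatically:

```csharp
using System;
using Microsoft.Diagnostics.NETCore.Client;

static class DumpHelper
{
    // Illustrative: write a full dump of the current process to a mounted volume,
    // e.g. right after an OutOfMemoryException has been observed.
    public static void WriteDumpForCurrentProcess()
    {
        var client = new DiagnosticsClient(Environment.ProcessId);
        client.WriteDump(DumpType.Full,
            $"/dumps/oom-{Environment.ProcessId}-{DateTime.UtcNow:yyyyMMddHHmmss}.dmp");
    }
}
```

Installing the dotnet-dump global tool in the image and running it against the process id is another route.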

mangod9 commented 1 year ago

I have copied a private libcoreclr.so at https://1drv.ms/u/s!AtaveiZOervriJhkWC64gVEV8dAHug?e=IyBaP3, if you want to give that a try. You will want to remove the COMPlus_GCName config.

theolivenbaum commented 1 year ago

@mangod9 @Maoni0 just got the chance today to test the library you sent, and after a day of usage under load there have been no issues so far!

mangod9 commented 1 year ago

ok thanks for trying it out. We will do additional validation and add it to a .NET 7 servicing release (due to the holidays it might be in Feb).

theolivenbaum commented 1 year ago

@mangod9 meanwhile what would you recommend? Keep using the version you shared, or use some of the flags suggested above?

mangod9 commented 1 year ago

You could keep using the private build if that works for your scenario. If you pick up a new servicing release it might not work, however. Using COMPlus_GCName might be ok temporarily too.

theolivenbaum commented 1 year ago

Thanks! Will keep that in mind then! Just out of curiosity, how come the COMPlus_GCName flag works as a workaround? Does the runtime include two copies of the GC?

mangod9 commented 1 year ago

In .NET 7 we have enabled the new Regions functionality within the GC. Here are the details: https://github.com/dotnet/runtime/issues/43844. Since this was a foundational change, we also shipped a separate GC which keeps the previous "Segments" functionality, in case there are issues like this one. Going forward, we plan to use a similar mechanism to release newer GC changes and could have multiple GC implementations sometime in the future.

theolivenbaum commented 1 year ago

That makes a lot of sense and is what I imagined happened! Looking forward to the servicing release next year, then!

qwertoyo commented 1 year ago

👋 Hello! We recently upgraded a series of console apps/generic hosts (and one ASP.NET web host), running on Alpine Linux, from .NET 6 to .NET 7, and this issue (OOM while there's plenty of memory available) started happening on some of them when under load.

From what I can tell, it is not happening in the apps where we have set the GC mode to server with ENV DOTNET_gcServer 1 in the Dockerfile, but only in the console apps that don't set that flag (i.e. workstation GC mode). It's also not happening in the ASP.NET one, which has that flag set by default AFAIK.

I will now try setting ENV COMPlus_GCName libclrgc.so on the affected apps and retest, but a question: do you think enabling server GC mode on them could be another workaround?
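
As a side note (not from the thread), a quick way to confirm which GC mode a given container actually ended up with is to log GCSettings at startup; a minimal sketch:

```csharp
using System;
using System.Runtime;

class GcModeCheck
{
    static void Main()
    {
        // True when DOTNET_gcServer/ServerGarbageCollection enabled the server GC;
        // false means the workstation GC is in use.
        Console.WriteLine($"Server GC:    {GCSettings.IsServerGC}");
        Console.WriteLine($"Latency mode: {GCSettings.LatencyMode}");
    }
}
```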

theolivenbaum commented 1 year ago

I hit this issue with server GC, so I don't think that will help.

mangod9 commented 1 year ago

Correct, it shouldn't be related to the WKS/SVR (workstation/server GC) config. @qwertoyo, it might be worthwhile to try the private build shared above. We are hoping to release the fix in the next month. Thx!

marcovr commented 1 year ago

We are experiencing a similar issue and setting COMPlus_GCName=libclrgc.so seems to resolve it.

> You could keep using the private build if that works for your scenario. If you pick up a new servicing release it might not work, however. Using COMPlus_GCName might be ok temporarily too.

I was wondering: is the provided preview version of libcoreclr.so still compatible with the latest .NET version (7.0.1, which was released afterwards), or should we stick with setting the env var?

mangod9 commented 1 year ago

@marcovr, it most likely would be, but it would be good to validate your scenario on the latest release just in case. Btw, we merged the fix into the 7.0 servicing branch yesterday, so it should be available with the Feb servicing release. Thx.

dave-yotta commented 1 year ago

Not sure if this is what we are getting, but the symptoms are:

```csharp
try { ... } catch (OutOfMemoryException) { Environment.FailFast("Aborting to trap and dump OOM"); }
```

- [createdump] ...dumping a core dump
- analyzeoom: could not do a full GC
- heaps: 200 MB allocated total
- GC heap memory limit: ~3 GB (75%)
- (heap) dump file size: ~1.7 GB
- Kubernetes pod memory limit: 4Gi
- Kubernetes node: 16 GB

so at very low memory bounds

mangod9 commented 1 year ago

This will be fixed in 7.0.3.

NKnusperer commented 1 year ago

Does the 7.0.3 release include a fix?

Quppa commented 1 year ago

7.0.3 includes the fix mentioned above.

We're just waiting for Azure App Services to roll out 7.0.3 so we can confirm if it fixes the issue we encountered.

paulj1010 commented 1 year ago

Just found this thread; we have had similar issues in our production environments since upgrading to .NET 7.

Our containers are all based on the standard Docker image mcr.microsoft.com/dotnet/aspnet:7.0.

I checked what version they are currently on and it appears to be 7.0.3, but I can confirm we are still seeing this.

I'm going to try the suggestion of using COMPlus_GCName=libclrgc.so, as others seemed to have success with that.

mangod9 commented 1 year ago

@paulj1010, would you be able to share a dump and/or repro for your scenario? That would help determine if this is the same issue. Thanks.

paulj1010 commented 1 year ago

@mangod9 - That could be a challenge, as we have a large number of pods in production spread across all regions that are randomly getting this. I'm trying to reliably repro this in a QA lab where we have more control, but that's not proving easy (it seems to need large volumes of traffic).

I'm going to push out the temporary workaround mentioned above to one of our regions, and if we see a dramatic drop after that it will at least confirm there is some common ground here.

If I confirm the drop (using the workaround) I will take this off GitHub and raise a premium support issue (I can ping you the ID directly if you want to jump on that) so that we can get a formal response from Microsoft etc.

For what it's worth, there is a direct correlation between the upgrade of our services to .NET 7 and a huge increase in this error.

mangod9 commented 1 year ago

ok, yeah trying with libclrgc should help narrow it down. Let us know how it goes. Thx

paulj1010 commented 1 year ago

@mangod9 - hi, I pushed out that GCName override update last night to one of our smaller regions and it had a big effect, dropping the hourly OOM rate by a couple of orders of magnitude. That was very helpful - thanks for having the foresight to ship the older-style GC engine!

It's not going to be feasible to get dumps from production pods, so I'm going to see if we can get this repro'd in a QA lab environment. I'm not 100% familiar with that process - what would be helpful for you and your team? A dotnet-dump before/after one of these events? Or is there something we can put in code to generate the same data at the point of exception? (As it's lab-based we can deploy custom code versions etc.)

mangod9 commented 1 year ago

Ok, that's interesting. Yeah, a dotnet-dump after an OOM should be helpful. @Quppa @qwertoyo @theolivenbaum have you been able to upgrade to 7.0.3 and check if the OOMs have been fixed for your specific cases?

tverboon commented 1 year ago

@mangod9 7.0.3 fixed the OOMs for us. We don't have nearly the traffic and scale Paul is working with. We had 10-100 OOMs per day after upgrading to .NET 7, and we have had 0 since we released a build based on 7.0.3 (the day after the .NET 7.0.3 release).

mangod9 commented 1 year ago

> @mangod9 7.0.3 fixed the OOMs for us. We don't have nearly the traffic and scale Paul is working with. We had 10-100 OOMs per day after upgrading to .NET 7, and we have had 0 since we released a build based on 7.0.3 (the day after the .NET 7.0.3 release).

ok thanks @tverboon for the update. Good to know that the fix solved it for your case.

mangod9 commented 1 year ago

@paulj1010 it would also be helpful to capture GCOnly traces (these shouldn't add much overhead), which might also provide info on why the OOMs are being triggered.

arian2ashk commented 1 year ago

I created a repo with one simple endpoint that calls into a lib which easily shows memory usage increasing constantly when using .NET 7.0.3, but when using the COMPlus_GCName: clrgc.dll variable, memory usage stays stable. In the repo I am using K6 to run a load test to cause the memory increase. I think if I let it run for a while it will eventually run out of memory and throw out-of-memory exceptions, so I think the issue is not fixed in 7.0.3. I hope this will be useful in finding the issue.
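
The linked repro isn't reproduced here; purely as a rough, hypothetical illustration of the shape described (one endpoint whose handler allocates large, short-lived buffers, so a K6 load test drives constant large-object-heap churn), such an app might look like this - the route, buffer size, and structure are invented:

```csharp
using System;
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Http;

var builder = WebApplication.CreateBuilder(args);
var app = builder.Build();

// Hypothetical endpoint: each request allocates and touches an 8 MB buffer,
// which lands on the large object heap and becomes garbage when the request ends.
app.MapGet("/allocate", () =>
{
    byte[] buffer = new byte[8 * 1024 * 1024];
    Array.Fill(buffer, (byte)1);
    return Results.Ok(buffer.Length);
});

app.Run();
```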

Maoni0 commented 1 year ago

sorry, I was not aware of this till now... apologize for the delay and thank you so much for the repro, @arian2ashk.

I took a quick look and can definitely repro. This is because we are retaining WAY more memory with the default implementation (which we shouldn't). The actual heap doesn't differ that much, but after a run with k6 the default impl is retaining a lot of memory in free_regions[1] and a lot in global_regions_to_decommit, e.g. it retains ~5 GB in decommit:

```
0:077> ?? coreclr!SVR::gc_heap::global_regions_to_decommit[1].size_committed_in_free_regions
unsigned int64 0x00000001`27127000
```

The app does almost exclusively BGCs. @PeterSolMS, could you please take a look?

PeterSolMS commented 1 year ago

I was able to repro this as well - debugging it I found there is indeed a flaw in our logic that causes us to stop decommitting regions if we do BGCs almost exclusively as in the test case. The fix to this issue should be fairly simple.

Not clear yet why so many regions end up in free_regions[1], perhaps there is a second issue. I will investigate.

PeterSolMS commented 1 year ago

I had another look at the free_regions[1] issue, and the amount of memory there appears to be normal for the repro scenario. The amount of memory in the LOH fluctuates a lot, and so we retain some memory rather than incurring the overhead of decommitting/recommitting the memory.

So that leaves the issue that we are not decommitting the memory we were planning to decommit - I will work out a fix for this.

dave-yotta commented 1 year ago

The original issue we reported above (OutOfMemoryException at very low memory bounds) has gone away with COMPlus_GCName=libclrgc.so (we've not removed that yet). But we might be experiencing something similar to the above now. Not sure how to get coreclr!SVR::gc_heap::global_regions_to_decommit[1].size_committed_in_free_regions, but we have a few core dumps with a lot of native resident memory (screenshot attached), and also a warning from dotMemory that frequent GCs are happening (taking 85% of the time, it thinks).

Edit: It's also interesting that I noticed tighter GC behavior on one of our processes by turning off server GC: it was shooting up to about 2 Gi in a 4 Gi-limited container, but with <ServerGarbageCollection>false</ServerGarbageCollection> it restricted itself much better to around 0.6 Gi, with no noticeable performance impact. Since you mention background GC...

But I'm not sure if we've got other problems kicking around: we're multi-process in a container and are seeing some OOM-killer hits on our processes. I'm not sure how the GC accounts for HeapHardLimitPercent (which will default to 75% of the cgroup limit) for both processes in terms of the amount of allocated memory - I don't think there is a way for it to tell whether the allocated memory reported by the system belongs to other processes within the running container, so I'd expect problems in this multi-process scenario anyway....
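
To make the multi-process concern concrete, here is a small illustrative sketch (not from the thread) that each process in the container could run at startup to log the memory limit its GC actually observed; if every process sizes its heap against the same container limit, their combined heaps can exceed the cgroup budget and invite the OOM killer:

```csharp
using System;

class GcLimitReport
{
    static void Main()
    {
        GCMemoryInfo info = GC.GetGCMemoryInfo();
        // TotalAvailableMemoryBytes is the limit the GC derived for this process
        // (a configured hard limit if set, otherwise the container/machine limit).
        Console.WriteLine($"PID {Environment.ProcessId}: " +
            $"GC limit = {info.TotalAvailableMemoryBytes / (1024.0 * 1024 * 1024):F2} GiB, " +
            $"high memory load threshold = {info.HighMemoryLoadThresholdBytes / (1024.0 * 1024 * 1024):F2} GiB");
    }
}
```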

Maoni0 commented 1 year ago

coreclr!SVR::gc_heap::global_regions_to_decommit[1].size_committed_in_free_regions is a symbol that you can dump in a debugger. in windbg you can use ?? to dump it like I showed above.

dave-yotta commented 1 year ago

Sorry, I just get

```
0:000> coreclr!SVR::gc_heap::global_regions_to_decommit[1].size_committed_in_free_regions
Couldn't resolve error at 'oreclr!SVR::gc_heap::global_regions_to_decommit[1].size_committed_in_free_regions'
```

in windbg; I'm not familiar with native debugging.

We're using the old libclrgc, e.g. [0x2] libclrgc!SVR::gc_heap::gc_thread_function+0x72 calls exist on some threads - different version?

Edit: I misread your message, ?? is a command, oops! I get a similar error though; I don't think we're talking about the same GC here:

```
0:000> ?? coreclr!SVR::gc_heap::global_regions_to_decommit[1].size_committed_in_free_regions
Unable to load image /Alloy/AlloyForgeWebApi/Hangfire.JobsLogger.dll, Win32 error 0n2
Unable to load image /Alloy/AlloyForgeWebApi/MongoDB.Driver.Core.dll, Win32 error 0n2
Couldn't resolve error at 'coreclr!SVR::gc_heap::global_regions_to_decommit[1].size_committed_in_free_regions'
```