Possible memory leak in .Net 8.0.2

cr0fters commented 4 months ago

Description

I have a .Net MVC web app that I recently upgraded to 8.0.2, and it has started having memory issues. The app runs on 2 AWS ECS Fargate tasks and in normal usage sits steadily at 10-15% memory on average. When the memory issue kicks in, usage jumps quickly to ~80%, flat-lines, and then slowly increases to 100% (at which point the task crashes).

This behaviour only really started when we upgraded from .Net 7 to .Net 8.0.0, and we had hoped the 8.0.2 release a few weeks ago would fix it. It hadn't occurred for a few weeks until today, when it has happened each time I force a new deployment.

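Here's a screenshot showing the memory usage for the running task today:

![image](https://github.com/dotnet/runtime/assets/1754858/c0cb0853-6d73-436e-beae-e5f0309daa9e)

I took a dump of one of the running tasks and downloaded the file locally. The file is 2.2 GB (with ECS reporting 80% usage); however, when I analyse the dump (via JetBrains DotMemory and also `dumpheap -stat`), both report just over 100 MB on the heap, which is expected after the app has only been running for around 30 minutes.

Here are a few screenshots of the results from these tools:

![image](https://github.com/dotnet/runtime/assets/1754858/dcf06da7-325f-4fb3-bca3-149287031429)
![image](https://github.com/dotnet/runtime/assets/1754858/f076ae6b-9027-4ca4-9a40-904177af6f97)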

Reproduction Steps

Unfortunately I've been unable to reproduce this locally. It seems to be very intermittent, in that it hasn't happened in weeks, but when it does it happens back to back a few times.

Expected behavior

Steady memory usage over time

Actual behavior

Memory usage rises suddenly, plateaus at ~80%, before slowly increasing to 100%

Regression?

It previously worked fine in .Net 6 and 7

Known Workarounds

No response

Configuration

No response

Other information

No response

ghost commented 4 months ago

Tagging subscribers to this area: @dotnet/gc See info in area-owners.md if you want to be subscribed.

Author:	cr0fters
Assignees:	-
Labels:	`area-GC-coreclr`, `untriaged`
Milestone:	-

debracey commented 4 months ago

I am also seeing this issue; in my case it is causing our microservices (running in Docker) to run OOM and crash. I was able to reproduce the bug as follows:

1. Set up a new ASP.NET Core Web API application using .NET 8 as the target framework.
2. Modify the auto-generated Program.cs to include the following block of code:

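        // Requires: using Microsoft.AspNetCore.Diagnostics.HealthChecks;
        // 'builder' is the WebApplicationBuilder already created in the template's Program.cs.
        builder.Services.AddHealthChecks();

        var app = builder.Build();

        // Expose the health check endpoint at /hc with response caching disabled.
        app.MapHealthChecks("/hc", new HealthCheckOptions
        {
            AllowCachingResponses = false,
        });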

From there, run the program with a memory profiler (I used DotMemory version 2023.3.3). Notice that each time /hc is called, the health check leaks ~100KB of RAM.

Now modify the *.csproj file to change the target framework to .NET 6:

<TargetFramework>net6.0</TargetFramework>

Note:

- You will need to remove the OpenAPI reference in the *.csproj
- You will also need to remove the .WithOpenApi(); extension in Program.cs

Now repeat the profiling exercise. Notice that, although the memory is not 100% constant, it doesn't really exhibit "leaking" behavior.

Please advise on how to proceed. If you set up a Docker container to poll the service at ~10 second intervals, you'll end up with hundreds of MB of RAM leaked over a few hours.
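The polling scenario described above could be approximated with a small console client like the one below. This is only a sketch under stated assumptions, not the setup used in the report: the endpoint URL and process ID are placeholders passed on the command line, `PollAsync`, `ReadWorkingSet` and `GrowthPerCallKb` are hypothetical helper names, and it assumes the poller runs on the same machine as the service so it can read the service's working set. Working set is a coarse measure and will also move for reasons unrelated to any leak.

```csharp
using System;
using System.Diagnostics;
using System.Net.Http;
using System.Threading.Tasks;

// Hypothetical usage: dotnet run -- http://localhost:5000/hc 12345
// (endpoint URL and PID of the service under test; both are placeholders).
var cli = Environment.GetCommandLineArgs();
if (cli.Length == 3
    && Uri.TryCreate(cli[1], UriKind.Absolute, out var endpoint)
    && int.TryParse(cli[2], out var servicePid))
{
    // 360 calls at 10-second intervals is roughly one hour of polling.
    await PollAsync(endpoint, servicePid, calls: 360, interval: TimeSpan.FromSeconds(10));
}

// Average working-set growth per request in KB (negative if memory shrank).
static double GrowthPerCallKb(long startBytes, long endBytes, int calls) =>
    calls <= 0 ? 0 : (endBytes - startBytes) / 1024.0 / calls;

// Reads the current working set of another process on the same machine.
static long ReadWorkingSet(int pid)
{
    using var process = Process.GetProcessById(pid);
    return process.WorkingSet64;
}

// Calls the endpoint repeatedly and prints the service's working set after each call.
static async Task PollAsync(Uri endpoint, int pid, int calls, TimeSpan interval)
{
    using var http = new HttpClient();
    long start = ReadWorkingSet(pid);
    for (int i = 1; i <= calls; i++)
    {
        using var response = await http.GetAsync(endpoint);
        long now = ReadWorkingSet(pid);
        Console.WriteLine(
            $"{i,4} status={(int)response.StatusCode} workingSet={now / 1024} KB " +
            $"avg={GrowthPerCallKb(start, now, i):F1} KB/call");
        await Task.Delay(interval);
    }
}
```

Running the same client against both the net8.0 and net6.0 builds would give a rough side-by-side view of growth per call without needing a profiler attached.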

Framework information:

SDK: 8.0.102
Runtime: 8.0.2

Severity: SEVERE; risk of production crash. Framework is effectively not usable.

mangod9 commented 4 months ago

Just checking whether the scenario requires MapHealthChecks to be enabled? If so, we might have to move this to the asp.net repo.

cr0fters commented 4 months ago

I'm not using MapHealthChecks in the app I'm seeing the issue on. Also if that were the case I assume that would appear in managed heap (and visible in DotMemory).

See my screenshots above: according to ECS, I'm using 80% of available memory (2 GB), but when I analyse a dotnet dump it only shows around 100 MB of usage.
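One way to watch for this kind of gap from inside the process is to log the managed heap size next to the process working set. The following is a minimal sketch, not code from the app in question: `CaptureMemory` and `FormatMiB` are hypothetical helper names, and it assumes a .NET 5 or later runtime for `GCMemoryInfo.TotalCommittedBytes`.

```csharp
using System;
using System.Diagnostics;
using System.Globalization;

// Print one snapshot; in a real service this could be logged periodically.
var snapshot = CaptureMemory();
Console.WriteLine($"Managed heap (GC.GetTotalMemory): {FormatMiB(snapshot.ManagedBytes)}");
Console.WriteLine($"GC committed:                     {FormatMiB(snapshot.GcCommittedBytes)}");
Console.WriteLine($"Process working set:              {FormatMiB(snapshot.WorkingSetBytes)}");

// Hypothetical helper: managed-heap figures next to the OS-level working set.
static (long ManagedBytes, long GcCommittedBytes, long WorkingSetBytes) CaptureMemory()
{
    using var process = Process.GetCurrentProcess();
    return (
        // Bytes currently thought to be allocated on the managed heap.
        GC.GetTotalMemory(forceFullCollection: false),
        // Reflects the most recent GC, so it may be 0 before the first collection.
        GC.GetGCMemoryInfo().TotalCommittedBytes,
        // Physical memory in use by the whole process, managed and unmanaged.
        process.WorkingSet64);
}

// Formats a byte count as mebibytes with one decimal place.
static string FormatMiB(long bytes) =>
    (bytes / (1024.0 * 1024.0)).ToString("F1", CultureInfo.InvariantCulture) + " MiB";
```

If the working set keeps climbing while the two GC figures stay flat, the growth is likely outside the managed heap, which would be consistent with a dump that is far larger than the heap size the analysis tools report.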

debracey commented 4 months ago

@mangod9 If I don't call MapHealthChecks there's no health check endpoint for me to query for testing purposes. If I instead use the sample weather forecast endpoint, I do not see the leak.

Is there some way to activate the health check via a REST endpoint without first setting it up via MapHealthChecks?

mangod9 commented 4 months ago

thanks for clarifying @cr0fters. So in your case you don't see the managed heap growing much as the memory increases? Are you able to share a dump of before / after so we could investigate further?

cr0fters commented 4 months ago

I could share a dump from after if that helps? I didn't get a before dump, and we've since deployed it on a different dotnet base image to see if it makes a difference (8.0-alpine).

The dump file is however 2.2Gb, and also I'd not be comfortable sharing it in public either way. Do you have a more secure way I could share with you?

debracey commented 4 months ago

Did some more tests here, again using the sample API with a GET endpoint of /weatherforecast

If I poll /weatherforecast at ~10 second intervals:
- No memory leak is observed
- Memory usage is more or less consistent, regardless of whether builder.Services.AddHealthChecks(); is called

If I poll the /hc endpoint (see sample code above):
- Memory leak is observed
- The average leak per call to /hc is ~100KB (some health checks leak more than this, some less, some none at all)

The problem here is not the ~100KB that is leaked, the problem is the compounding nature of the requests. If the HC is polled by container infrastructure where the container orchestration specifies a max memory allocation, we'll eventually run out of RAM and crash.

mangod9 commented 4 months ago

So it appears there are two separate issues here. @debracey, since yours looks related to health checks, it might make sense to create a new issue in the asp.net repo.

mangod9 commented 4 months ago

> I could share a dump from after if that helps? I didn't get a before dump, and we've since deployed it on a different dotnet base image to see if it makes a difference (8.0-alpine).
>
> The dump file is however 2.2Gb, and also I'd not be comfortable sharing it in public either way. Do you have a more secure way I could share with you?

Yeah, we can provide a share for you to upload the dump. Can you please start an email thread so we can coordinate over it? My email should be in my profile. Thx.

Neme12 commented 3 months ago

@debracey Is your issue similar to the one from @cr0fters in that the leaked memory also isn't on the managed heap? Or is it on the managed heap in your case?

Neme12 commented 3 months ago

Does anyone else have this issue as well? I assume people do because of all those thumbs up on the issue. If so, could anyone provide any details?

debracey commented 3 months ago

@Neme12 Yes, the leaked memory appears to be in the unmanaged space. A few other engineers and I have been working through this issue and we're now trying to debug the unmanaged memory to gain more details. We haven't gotten very far on that yet.

Although I went ahead and opened the linked bug with the asp .net core project, I think this is actually the same bug. I'm just using a health check to trigger the bug whereas @cr0fters triggered it via a different path.

Neme12 commented 3 months ago

@mangod9 I think this should be prioritized. It's not just an inconvenience of applications taking more memory than necessary; it's causing apps to crash and preventing some people from upgrading to .NET 8. And multiple people are having the issue, both in this thread and in dotnet/aspnetcore#54405.

mangod9 commented 3 months ago

yeah @cr0fters was going to send a dump. But if there are other folks who can share dumps we can investigate. I will let the asp.net team look into the specific healthcheck issue

KamilRaszkiewicz commented 3 months ago

I noticed similar behaviour recently in diagnostic tools. I found lots of duplicated strings in the Datagrams Received event log after refreshing the /hc endpoint for a while. It's possible that those strings are only duplicated when diagnostic tools are in use, but I'll leave that for you to check.

I think that the cause of problem is here - we are building the same string over and over. https://github.com/dotnet/runtime/blob/ca48a0d0f733e3477738041b28a624411ee9afd6/src/libraries/System.Private.CoreLib/src/System/Diagnostics/Tracing/PollingCounter.cs#L66

mangod9 commented 2 months ago

Is this issue still a concern after 8.0.4 and perhaps a fix for https://github.com/dotnet/runtime/issues/100502 which would be made in 8.0.5?

cr0fters commented 2 months ago

Hi @mangod9 - apologies I didn’t share the dump above - I was waiting for the issue to happen again, but around the same time I switched to an Alpine based image and never had the same problem.

This leads me to believe it’s specific to Debian based images as mentioned elsewhere

mangod9 commented 2 months ago

yeah, if switching to Alpine fixes the issue it's very likely the same. Ok to close this for now? You can always reopen if it occurs again.

debracey commented 2 months ago

When my team was seeing this before, we were seeing this on photon based images. We don’t use alpine or Debian.

We are still receiving reports from development teams that they're seeing un-reclaimed strings, matching the reports/patterns from this ticket.

I am working on narrowing down the possible variables to see if I can pinpoint which SDK versions are working properly. So far my team has seen an improvement with 8.0.204 - but some leaks are still occurring, just at a much slower growth rate.

debracey commented 1 month ago

I am still trying to narrow down what could be causing this. Some of my services are no longer leaking as of 8.0.204, but other services still leak as described in this ticket. There doesn't seem to be a clear pattern. The services are no longer able to downgrade to dot net 6 as they've introduced code changes requiring dot net 8.

@cr0fters do your containers make use of Java for anything, or do you include the JRE (if so which one?) in your container image?

debracey commented 1 month ago

Did further research and updated my findings here

tl;dr Issue is still not resolved as of 8.0.4 with SDK 8.0.300

cr0fters commented 1 month ago

> I am still trying to narrow down what could be causing this. Some of my services are no longer leaking as of 8.0.204, but other services still leak as described in this ticket. There doesn't seem to be a clear pattern. The services are no longer able to downgrade to dot net 6 as they've introduced code changes requiring dot net 8.
>
> @cr0fters do your containers make use of Java for anything, or do you include the JRE (if so which one?) in your container image?

No, we don’t make any use of Java.

mangod9 commented 1 week ago

Checking if this is still an issue with latest 8 servicing release. There have been a few memory leak fixes since 8.0.2

mangod9 commented 1 day ago

Closing since we have fixed a few memory related issues in the latest servicing releases. Please reopen if a leak still exists
