dotnet / aspnetcore

ASP.NET Core is a cross-platform .NET framework for building modern cloud-based web applications on Windows, Mac, or Linux.
https://asp.net
MIT License

Health Check Causes Memory Leak in .NET 8 #54405

Open debracey opened 7 months ago

debracey commented 7 months ago

Is there an existing issue for this?

Describe the bug

In a .NET 8 application, health checks executed on a regular interval eventually leak all available memory, causing the container to crash.

Expected Behavior

Health check should not leak memory. Memory usage should remain relatively consistent or at least be reclaimed by garbage collection.

Steps To Reproduce

Reproduced with the following system configuration:

  1. Set up a new ASP.NET Core Web API application using .NET 8 as the target framework
  2. Modify the auto-generated Program.cs to include the following block of code
        using Microsoft.AspNetCore.Diagnostics.HealthChecks;

        var builder = WebApplication.CreateBuilder(args);
        builder.Services.AddHealthChecks();

        var app = builder.Build();
        app.MapHealthChecks("/hc", new HealthCheckOptions
        {
            AllowCachingResponses = false,
        });

        app.Run();

From there, run the program with a memory profiler (I used dotMemory 2023.3.3). Notice that each time /hc is called, the health check leaks ~100 KB of RAM.

Now, modify the *.csproj file to change the target framework to .NET 6:

<TargetFramework>net6.0</TargetFramework>

Now repeat the profiling exercise. Note that, although memory usage is not 100% constant, it does not really exhibit "leaking" behavior.

Please advise on how to proceed. If you set up a Docker container to poll the service at ~10-second intervals, you'll end up with hundreds of MB of RAM leaked over a few hours.
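
For reference, a minimal C# poller sketch (not from the original report; the localhost port is an assumption) that stands in for the orchestrator's ~10-second probe:

    using System.Net.Http;

    // Stand-in for the container orchestrator's probe: calls /hc every 10 seconds.
    // Top-level Program.cs for a throwaway console app; adjust the base address
    // to your launch profile.
    using var client = new HttpClient { BaseAddress = new Uri("http://localhost:5000") };
    using var timer = new PeriodicTimer(TimeSpan.FromSeconds(10));

    while (await timer.WaitForNextTickAsync())
    {
        using var response = await client.GetAsync("/hc");
        Console.WriteLine($"{DateTime.UtcNow:O} /hc -> {(int)response.StatusCode}");
    }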

Severity: SEVERE; risk of production crash. Framework is effectively not usable.

Exceptions (if any)

Memory should not leak during basic health check.

.NET Version

8.0.102

Anything else?

Similar to an issue reported in the runtime itself: https://github.com/dotnet/runtime/issues/99304

debracey commented 7 months ago

Additional notes:

Did some more tests here, again using the sample API with a GET endpoint of /weatherforecast

The problem here is not the ~100 KB leaked per request; the problem is the compounding nature of the requests. If the health check is polled by container infrastructure where the orchestrator specifies a maximum memory allocation, we'll eventually run out of RAM and crash.

Kiechlus commented 6 months ago

I'm trying to reproduce this, as we are also experiencing memory problems with .NET 8, most of which seemed to disappear with 8.0.2. In Startup.cs, we do:

var healthChecksBuilder = services.AddHealthChecks();
healthChecksBuilder.AddCheck(...)
...
app.UseHealthChecks("/health", new HealthCheckOptions() {
  AllowCachingResponses = false,
});

I then run this locally in an alpine:8.0 container and call the health endpoint in a loop. When I run docker stats, I don't see the memory rising. Is there anything I am missing?

debracey commented 6 months ago

@Kiechlus I realized after I posted this that the SDK we're using is actually 8.0.1 - I'm not sure why that would make a difference, but I will try to upgrade the SDK to 8.0.2 and see if that fixes the issue.

Also, my container base image isn't Alpine; it's Photon. Again, not sure if that makes a difference.

I am out of the office this week but should be able to test this out next week.

GimmeDaKitty commented 6 months ago

@debracey If you could keep us posted after you are back on whether upgrading to 8.0.2 fixed the issue, that would be appreciated. We are seeing unexplained memory 'leaks' after upgrading all our (Kubernetes) services to .NET 8. Any data point will be helpful for us.

debracey commented 6 months ago

@GimmeDaKitty, @Kiechlus I went back and looked at my notes on this now that I am back in the office. In the test described above I wasn't using a container at all; I was just running the service directly on Windows from my IDE (Rider 2023.3.3).

I upgraded my SDK to 8.0.203 and the issue seems to be resolved. I no longer see memory increasing (in the unmanaged memory section) with each call to HC.

I'll reach out to my infrastructure team to see what SDK our build pipelines use. I'll report back again once I can retest with an actual productionized service.

danmoseley commented 4 months ago

@debracey is this still resolved for you?

@GimmeDaKitty @Kiechlus are you still seeing this on latest versions (8.0.4) ?

Wondering whether this can be closed. Part of the reason I'm here is that several issues seem to have contributed to https://github.com/dotnet/runtime/issues/95922, and to make progress on the whole we need to look at them all.

mangod9 commented 4 months ago

Adding @andreiZi, who might still have a repro for this.

danmoseley commented 4 months ago

Linking that comment: https://github.com/dotnet/runtime/issues/95922#issuecomment-2105189870

From what I observed, my assumption is that the microservice receives health check requests every 15 seconds, and it seems that something is not being cleaned up properly with each request.

@andreiZi is there a way for you to repeat the experiment but with health checks disabled?

andreiZi commented 4 months ago

Hello @danmoseley,

I've set up two scenarios as follows:

Scenario 1: Build services with ASP.NET 8.0.4 and SDK 8.0.203 (health checks enabled)

Here is the Dockerfile used:

FROM mcr.microsoft.com/dotnet/aspnet:8.0.4 AS base
WORKDIR /app
EXPOSE 80
EXPOSE 443

ENV ASPNETCORE_URLS=http://*:80

FROM --platform=$BUILDPLATFORM mcr.microsoft.com/dotnet/sdk:8.0.203  AS build
WORKDIR /src
ARG TARGETARCH

COPY ["BuildingBlocks/BuildingBlocks.Common/BuildingBlocks.Common.csproj", "BuildingBlocks/BuildingBlocks.Common/"]
COPY ["BuildingBlocks/BuildingBlocks.DataAccess/BuildingBlocks.DataAccess.csproj", "BuildingBlocks/BuildingBlocks.DataAccess/"]

COPY ["Services/User/User.Api/User.Api.csproj", "Services/User/User.Api/"]
COPY ["Services/User/User.Application/User.Application.csproj", "Services/User/User.Application/"]
COPY ["Services/User/User.Core/User.Core.csproj", "Services/User/User.Core/"]
COPY ["Services/User/User.DataAccess/User.DataAccess.csproj", "Services/User/User.DataAccess/"]

RUN dotnet restore -a ${TARGETARCH} "Services/User/User.Api/User.Api.csproj"

COPY . ./
WORKDIR "Services/User/User.Api/"
RUN dotnet build "User.Api.csproj" -c Release -o /app/build

FROM build AS publish
RUN dotnet publish "User.Api.csproj" -c Release -o /app/publish

FROM base AS final
WORKDIR /app
COPY --from=publish /app/publish .

ENTRYPOINT ["dotnet", "User.Api.dll"]

Scenario 2: Deploy Services Completely Without Health Checks

My implementation of the health checks is as follows:

  services.AddHealthChecks();

  app.UseEndpoints(endpoints =>
  {
      endpoints.MapControllers().AddAsyncValidationFilter();
      endpoints.MapHealthChecks("/health");
  });

I will now observe and monitor the behavior and will report my observations in a few hours.

andreiZi commented 4 months ago

Unfortunately, this approach did not solve the problem.

This morning, both services reached nearly 500-700 MiB of RAM consumption. I managed to extract dumps from both services. Here is the SharePoint link where you can access them: SharePoint Link.

Below are the outputs from dumpheap -stat showing the largest memory consumers for each system:

System with Health Checks Completely Disabled (Running in Kubernetes with ASP.NET 8.0.4 and SDK 8.0.203):

7fc5f52a00c8  21,366     683,712 Serilog.Events.StructureValue
7fc5f311bf28  21,369     854,760 Serilog.Sinks.Grafana.Loki.Models.LokiLogEvent
7fc5f3ae8948  21,573     864,880 Serilog.Events.LogEventProperty[]
7fc5f5299c70  22,369   1,789,520 Serilog.Events.LogEvent
7fc5f52ab110  22,369   1,789,520 System.Collections.Generic.Dictionary<System.String, Serilog.Events.LogEventPropertyValue>
7fc5f5a2db20   5,051   2,207,624 System.Collections.Generic.KeyValuePair<System.String, Serilog.Events.LogEventPropertyValue>[]
7fc5f07b89e0  27,007   2,640,180 System.Int32[]
7fc5f4e2e680 181,134   4,347,216 Serilog.Events.ScalarValue
7fc5f2e6b438 137,286   4,393,152 Serilog.Events.LogEventProperty
7fc5f08863b8   6,140   8,878,194 System.Char[]
7fc5f52abf90  26,471  11,840,640 System.Collections.Generic.Dictionary<System.String, Serilog.Events.LogEventPropertyValue>+Entry[]
7fc5f07bd7c8 184,071  43,804,328 System.String
7fc5f1349800   4,028  50,552,143 System.Byte[]
55ffb238c7a0  35,601 180,890,880 Free
Total 800,018 objects, 320,912,524 bytes

System with Health Checks Enabled (Running in Kubernetes with ASP.NET 8.0.4 and SDK 8.0.203):

Example 1:

7f4d61560980    54,053   1,297,272 System.Int32
7f4d66854398     5,001   2,104,024 System.Collections.Generic.KeyValuePair<System.String, Serilog.Events.LogEventPropertyValue>[]
7f4d63f3ce10        18   2,408,488 Serilog.Sinks.Grafana.Loki.Models.LokiLogEvent[]
7f4d660c4840   137,406   4,396,992 Serilog.Events.StructureValue
7f4d64909150   145,247   5,665,856 Serilog.Events.LogEventProperty[]
7f4d63c8b438   263,604   8,435,328 Serilog.Events.LogEventProperty
7f4d63f3bf28   251,316  10,052,640 Serilog.Sinks.Grafana.Loki.Models.LokiLogEvent
7f4d616a63b8    15,677  11,471,428 System.Char[]
7f4d615d89e0   263,406  16,636,700 System.Int32[]
7f4d660bd788   252,316  20,185,280 Serilog.Events.LogEvent
7f4d660cf888   252,316  20,185,280 System.Collections.Generic.Dictionary<System.String, Serilog.Events.LogEventPropertyValue>
7f4d62169800     4,579  30,148,160 System.Byte[]
7f4d65c625b8 1,616,214  38,789,136 Serilog.Events.ScalarValue
7f4d660f0810   261,573  67,245,936 System.Collections.Generic.Dictionary<System.String, Serilog.Events.LogEventPropertyValue>+Entry[]
7f4d615dd7c8   311,686  74,194,268 System.String
55716dec4620   280,214 116,343,616 Free
Total 4,349,002 objects, 441,525,143 bytes

Example 2:

Microsoft.AspNetCore.Server.Kestrel.Core.Internal.ConnectionLogScope+<GetEnumerator>d__7
7f49cb441260    14,776     354,624 Microsoft.Extensions.Diagnostics.HealthChecks.HealthStatus
7f49cb730638     8,460     473,760 System.IO.StringWriter
7f49c7e63c38    13,205     633,840 System.Text.StringBuilder
7f49c6c140f0    37,077     889,848 System.Double
7f49c6ba0980    96,629   2,319,096 System.Int32
7f49c957ce10         7   4,751,336 Serilog.Sinks.Grafana.Loki.Models.LokiLogEvent[]
7f49cb703148   247,891   7,932,512 Serilog.Events.StructureValue
7f49c92cb438   282,483   9,039,456 Serilog.Events.LogEventProperty
7f49c9f49150   253,965   9,734,272 Serilog.Events.LogEventProperty[]
7f49c6ce63b8    14,298  12,677,154 System.Char[]
7f49c957bf28   494,089  19,763,560 Serilog.Sinks.Grafana.Loki.Models.LokiLogEvent
7f49c6c189e0   499,985  29,983,368 System.Int32[]
7f49c77a9800     4,419  34,278,207 System.Byte[]
7f49cb6fcc78   494,139  39,531,120 Serilog.Events.LogEvent
7f49cb70e190   494,139  39,531,120 System.Collections.Generic.Dictionary<System.String, Serilog.Events.LogEventPropertyValue>
7f49cb2a25b8 3,117,157  74,811,768 Serilog.Events.ScalarValue
7f49cb70f010   498,426 118,985,088 System.Collections.Generic.Dictionary<System.String, Serilog.Events.LogEventPropertyValue>+Entry[]
7f49c6c1d7c8   492,305 126,014,400 System.String
andreiZi commented 4 months ago

I've continued to observe the systems and have noticed a significant difference in memory consumption.

On the system where health checks are enabled, there is definitely higher memory usage. Conversely, the development system shows a much more stable memory consumption curve. Compared to an environment running .NET 6, this represents a slight increase of about 50-100 MB.

Here, you can see the API gateway's memory usage over the last 9 hours, which stabilized around 8 PM: image

In contrast, this is the system with health checks enabled, which saw memory usage ramp up to 762 MB over 4 hours: image

I suspect that .NET 8 generally consumes more memory than .NET 6. I don't mind the increase, but can you confirm whether that is expected?

I will continue to monitor the situation and hope that the workaround on the development system proves effective. Although health checks are crucial for us, resolving the memory issue would be highly beneficial.

debracey commented 4 months ago

I can confirm what @andreiZi posted above. The leak appears relatively slow at first; eventually GC kicks in (albeit much later than under .NET 6) and stabilizes things for a while before memory creeps up slowly again, repeating the cycle. You can forcibly trigger the leak by hammering the health check at < 10-second intervals. The leak is made worse if your HealthCheckResult returns a large amount of string data in the data dictionary (e.g., the name and version of every DLL your app uses).
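
To make that scenario concrete, here is a hypothetical check (invented for illustration, not the poster's actual code) that puts one assembly name/version string per loaded DLL into the result's data dictionary:

    using System.Linq;
    using Microsoft.Extensions.Diagnostics.HealthChecks;

    // Hypothetical check illustrating the "large data dictionary" case described above.
    public sealed class AssemblyInventoryHealthCheck : IHealthCheck
    {
        public Task<HealthCheckResult> CheckHealthAsync(
            HealthCheckContext context, CancellationToken cancellationToken = default)
        {
            // One string entry per loaded assembly: name -> version.
            var data = AppDomain.CurrentDomain.GetAssemblies()
                .GroupBy(a => a.GetName().Name ?? "unknown")
                .ToDictionary(
                    g => g.Key,
                    g => (object)(g.First().GetName().Version?.ToString() ?? "unknown"));

            return Task.FromResult(HealthCheckResult.Healthy("OK", data));
        }
    }

    // Registered via:
    // builder.Services.AddHealthChecks().AddCheck<AssemblyInventoryHealthCheck>("assemblies");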

My earlier statement about this problem being solved with 8.0.203 is only partially correct. Some of our microservices are still leaking even on 8.0.203, but it's difficult for me to track down because they may have other memory issues that we're only now finding as a byproduct of this investigation. That said, I stand by my prior statement that 8.0.203 improves the situation significantly.

However, in speaking with my colleagues who handle overall environment scaling, we believe this issue still needs to be investigated, because even a slow leak will throw off the pod-scaling algorithms we have in place, and we'll eventually be paying for a bunch of running pods unnecessarily.

andreiZi commented 4 months ago

I need to correct my previous assumption that disabling health checks resolved our issue; the services still ran out of RAM. As @debracey mentioned, there could be multiple reasons causing the leak. In my dump, System.Collections.Generic.Dictionary<System.String, Serilog.Events.LogEventPropertyValue>+Entry[] is one of the largest consumers of memory.

@debracey, we are facing the same issue. Unfortunately, it was my oversight in missing the memory leak problem when upgrading the services to .NET 8 in production. Now, we find ourselves in a situation where our HPAs are scaling the pods excessively, much like wildfire. Manual intervention is required because we do not wish to change our deployment configuration.

image

andreiZi commented 4 months ago

Here you can see the service I posted about 1 hour ago. Just as @debracey described, it stabilizes the memory consumption for a while, but now we can clearly see that it starts to increase again. image

Here are the last relevant symbols from the output of dumpheap -s:

7fca8bca63b8     312    455,082 System.Char[]
7fca906bdb58  22,620    723,840 Serilog.Events.StructureValue
7fca8e28b438  23,462    750,784 Serilog.Events.LogEventProperty
7fca8e53bf28  22,623    904,920 Serilog.Sinks.Grafana.Loki.Models.LokiLogEvent
7fca8ef08948  23,187    932,320 Serilog.Events.LogEventProperty[]
7fca906b7718  22,623  1,809,840 Serilog.Events.LogEvent
7fca906d8bb0  22,623  1,809,840 System.Collections.Generic.Dictionary<System.String, Serilog.Events.LogEventPropertyValue>
7fca8bbd89e0  23,440  1,967,548 System.Int32[]
7fca9024e6b0 191,221  4,589,304 Serilog.Events.ScalarValue
7fca906d9a30  22,905  8,306,688 System.Collections.Generic.Dictionary<System.String, Serilog.Events.LogEventPropertyValue>+Entry[]
7fca8bbdd7c8 134,245 26,716,442 System.String
7fca8c769800   4,299 50,249,261 System.Byte[]
560b529c1790  13,560 75,716,040 Free
debracey commented 4 months ago

Yeah. Same issue. My graphs look basically the same. (I'm not able to post dumps/graphs of my services due to company policy).

danmoseley commented 4 months ago

Thank you both. Did you try looking at what is rooting some of those dictionaries? Try !gcroot on a few randomly selected ones.

andreiZi commented 4 months ago

Hey @danmoseley
Yes, I consistently receive the following output: Found 0 unique roots.

debracey commented 4 months ago

Yes, we also see the same. I thought the output was "No roots found", but I'm not at my work computer, so presumably @andreiZi has the correct output.

andreiZi commented 4 months ago

@danmoseley

I've obtained a dump from one of my previous .NET 6 applications for reference. These applications are essentially the same as the .NET 8 versions, with only upgraded packages and a few syntax adjustments.

Here is the output (note: the service has been running for over 156 days):

7fc38b6d5270 41,696  1,000,704 Serilog.Events.ScalarValue
7fc38862e830 22,357  1,073,136 System.Text.StringBuilder
7fc3898ee320 38,939  1,246,048 Serilog.Events.LogEventProperty
7fc38beb6460  7,854  1,958,448 System.Collections.Generic.Dictionary<System.String, Serilog.Events.LogEventPropertyValue>+Entry[]
7fc387cc9190 10,529  3,262,729 System.Byte[]
7fc3872f67d8 22,508 11,251,310 System.Char[]
7fc3872a1038 40,739 19,927,984 System.String
Total 461,985 objects, 57,327,493 bytes

I can observe the same symbols as in .NET 8, but with significantly less memory consumption. As I mentioned, the service has been running for an extended period of time.


Here is the graph showing the memory consumption over the last 24 hours. The service experienced moderate usage, but not much load. (Don't mind the small spike at the end; it occurred when I created a dump.) image

I have the opportunity to share as much information as necessary since the development system of the application is very young and doesn't contain any sensitive information. If you need more details, please let me know.

danmoseley commented 4 months ago

Found 0 unique roots.

@mangod9 can you advise?

andreiZi commented 4 months ago

Removing Serilog somehow resolved the issue. I'll now try re-enabling the health checks to see if that was the cause.

image

@debracey are you also using Serilog at your company?

mangod9 commented 4 months ago

OK, interesting. Did you have to upgrade Serilog when moving to 8, or is it the same version across 6 and 8? We will also look at the dump you shared on the other issue to figure out whether we can find what is rooting those Serilog dictionaries.

debracey commented 4 months ago

We don't use Serilog; we use NLog. I'm not sure of the version or its upgrade history. I'll check on Monday to see when we upgraded it and whether a newer version is available.

andreiZi commented 4 months ago

I suspect that the package Serilog.Sinks.Grafana.Loki might be causing the memory leak. After removing it, the memory usage trends look much better. Before jumping to any conclusions, I'll continue to monitor the situation. I had considered removing this package anyway but wanted to share this information here for anyone experiencing similar memory issues and possibly using this package.

@mangod9, does .NET's garbage collector clean up memory that might be improperly handled by a package?

Memory Usage Trends (Health checks & Serilog now enabled)

image

I also updated Serilog when transitioning to .NET 8. The Serilog.Sinks.Grafana.Loki package is on the latest version.

My assumption is that this package caches logs when Loki is unreachable (we haven’t set up Loki in the system yet), but this is just a hypothesis.

andreiZi commented 4 months ago

In the previous dumps, we also observed the symbol:

System.Collections.Generic.Dictionary<System.String, Serilog.Events.LogEventPropertyValue>+Entry[]

This symbol was not present in .NET 6 and appears to be growing. Previously, we only saw:

System.Collections.Generic.Dictionary<System.String, Serilog.Events.LogEventPropertyValue>

After removing the Serilog.Sinks.Grafana.Loki package, this symbol also vanished.

andreiZi commented 4 months ago

My particular issue was resolved using the approach I described above. The package had an infinite queue limit when Loki was unreachable, which filled up the memory until an OutOfMemoryException was thrown.

Initially, it was a combination of both health checks and the leaking package. However, with ASP.NET 8.0.4 and SDK 8.0.203, along with the proper adjustments to the logger configuration, all my issues were resolved.
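
For anyone hitting the same thing, a sketch of the kind of logger adjustment described above; it assumes your version of the Loki sink exposes a queueLimit option to bound its in-memory buffer, so check the sink's actual signature before copying:

    using Serilog;
    using Serilog.Sinks.Grafana.Loki;

    // Sketch only: bound the sink's buffer instead of relying on an unbounded queue,
    // so an unreachable Loki endpoint cannot accumulate log events in memory forever.
    // The endpoint URL and the queueLimit parameter name are assumptions about this sink version.
    Log.Logger = new LoggerConfiguration()
        .WriteTo.GrafanaLoki(
            "http://loki:3100",
            queueLimit: 1000)
        .CreateLogger();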

@debracey, if your error still persists, I recommend getting a dump of the process and inspecting the highest consumers. In my case, it was System.Collections.Generic.Dictionary<System.String, Serilog.Events.LogEventPropertyValue>+Entry[] which pinpointed the issue for me.

After weeks of debugging and head-scratching, I can finally use those C# 12 features 🚀. Thank you for the help, everyone; you do a magnificent job!

debracey commented 4 months ago

I went back and tested this again. I've now updated my local SDK to 8.0.300, and the test environment now reports the information below. My services don't use Serilog, so there was nothing to update there. NLog is what we use, and that was at 5.2.4 (latest = 5.3.2).

I tested at 5.2.4 to keep things consistent. I also tested at 5.3.2 for comparison.

The situation is much improved, but we're still seeing leaks; it's similar to the other post about .NET 8 leaks.

Unfortunately, I cannot post the full debug screenshot due to company policy. However, here's an overview of what it looks like just for String:

Generation   Number of Objects
Gen 0        69,991
Gen 1        96
Gen 2        4,345
Total        74,432

It is possible that there's something in our own codebase that is exacerbating the issue. But it seems really strange that we're retaining 70K strings in Gen 0 when all we're doing is sitting around health checking.

Test Environment:

Host:
  Version:      8.0.4
  Architecture: x64
  Commit:       2d7eea2529
  RID:          linux-x64

.NET SDKs installed:
  No SDKs were found.

.NET runtimes installed:
  Microsoft.AspNetCore.App 8.0.4 [/usr/share/dotnet/shared/Microsoft.AspNetCore.App]
  Microsoft.NETCore.App 8.0.4 [/usr/share/dotnet/shared/Microsoft.NETCore.App]

cc/ @andreiZi , @danmoseley

debracey commented 4 months ago

Did further research on this, specifically looking at the issues linked here:

  1. [Guard against -1 returned from sysconf](https://github.com/dotnet/runtime/pull/100502#issue-2219207285)
     1. Output of `getconf -a | grep LEVEL4_CACHE` on my photon base image returns 0
     2. So it would seem that I am not hitting that specific issue
  2. [Upgrading to .NET 8 causes a sudden increase in LOH size](https://github.com/dotnet/runtime/issues/101906#issuecomment-2096685440)
     1. Adding an `ENV DOTNET_GCgen0Size=0x1E00000` command immediately before starting my application caused the results outlined below.

At first, adding DOTNET_GCgen0Size=0x1E00000 looked promising. But after a while (30 minutes or so) the same problem occurred: memory started creeping up, and even after an apparent GC event where memory dropped back down, it never returned to the baseline and would creep back up. The number of objects stuck in the various generations is higher than in my previous post. Again looking only at strings, Gen 0 had 175K unreachable strings; the other generations were similar.

So, still basically stuck. Something seems to be causing a huge number of objects -- mostly strings -- to remain in Gen 0.
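
In case it helps others watch the same creep without a profiler attached, here is a small sketch (not something anyone in this thread ran) that periodically logs heap size and collection counts using only public GC APIs; it could run as a top-level console loop or a background task inside the service:

    // Background monitor sketch: logs managed heap size and GC collection counts
    // every 30 seconds so slow growth is visible in plain console output.
    using var timer = new PeriodicTimer(TimeSpan.FromSeconds(30));
    while (await timer.WaitForNextTickAsync())
    {
        var info = GC.GetGCMemoryInfo();
        Console.WriteLine(
            $"{DateTime.UtcNow:O} heap={info.HeapSizeBytes / (1024 * 1024)} MiB " +
            $"gen0={GC.CollectionCount(0)} gen1={GC.CollectionCount(1)} gen2={GC.CollectionCount(2)}");
    }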

jbroumels commented 4 months ago


@debracey shouldn't `DOTNET_GCgen0Size` be `DOTNET_GCgen0size`? (lowercase s)

Reference: https://github.com/search?q=repo%3Adotnet%2Fruntime%20GCgen0size&type=code

debracey commented 4 months ago

@jbroumels The other post mentioned a capital S. I can try again with a lowercase s and will report back.