dotnet / aspnetcore


Memory leak in Health Checks #48240

Closed: henridd closed this issue 1 year ago

henridd commented 1 year ago

### Describe the bug

Every request to the health check endpoint leaks a bit of memory. Eventually, if the application is run 24/7 with multiple applications constantly requesting health checks, memory consumption can get very high.

### Expected Behavior

The memory should be freed after the resources have been used.

### Steps To Reproduce

1. Download the sample from https://github.com/dotnet/AspNetCore.Docs/tree/main/aspnetcore/host-and-deploy/health-checks/samples. You can also use https://github.com/henridd/health-check-mem-leaking, which leaks faster.
2. Edit Program.cs to look like the following:

using HealthChecksSample.HealthChecks;

var builder = WebApplication.CreateBuilder(args);

// SampleHealthCheck is the check type included in the downloaded sample.
builder.Services.AddHealthChecks()
    .AddCheck<SampleHealthCheck>("Sample");

var app = builder.Build();

app.MapHealthChecks("/healthz");

app.Run();

3. Run the application.
4. Open a tool to monitor the application's memory usage (for example, the process manager).
5. Wait a few seconds for the application's memory usage to stabilize.
6. Issue multiple requests to the health check endpoint (https://localhost:1482/healthz). In my case, I used Postman to send 500 requests; a small HttpClient loop like the sketch below works as well.
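
A minimal load driver (not part of the original sample; the URL, request count, and a trusted ASP.NET Core dev certificate are assumptions taken from the steps above) could look like this:

using System;
using System.Net.Http;

// Fire repeated requests at the health endpoint and print each result.
using var client = new HttpClient();

for (var i = 0; i < 500; i++)
{
    var response = await client.GetAsync("https://localhost:1482/healthz");
    var body = await response.Content.ReadAsStringAsync();
    Console.WriteLine($"{i + 1}: {(int)response.StatusCode} {body}");
}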

### Exceptions (if any)

_No response_

### .NET Version

7.0.100

### Anything else?

The initial memory consumption was 20,684 K. After 2,000 requests made via Postman, it was 35,412 K. Forcing a GC didn't clear up any memory.
You can find the memory dump below. The first snapshot was taken after a few dozen requests, and the second after 2,000. I forgot to take a snapshot immediately before starting the requests, so if needed I can run another profile.
[Attach to HealthChecksSample.zip](https://github.com/dotnet/aspnetcore/files/11484497/Attach.to.HealthChecksSample.zip)

.NETCore.App 7.0.0
AspNetCore.App 7.0.0
davidfowl commented 1 year ago

What does it leak? Where's the memory profile? What is it leaking? Did the garbage collector run?

henridd commented 1 year ago

I have edited the ticket with the profile file.

What do you mean by "what does it leak" / "what is it leaking"? I don't have much experience troubleshooting memory leaks, sorry. If you have an article to point me to, I'd appreciate it.

The GC did run. I tried two different tests with similar code that might give some insight.

using Microsoft.AspNetCore.Diagnostics.HealthChecks;
using Microsoft.Extensions.Diagnostics.HealthChecks;

var builder = WebApplication.CreateBuilder(args);

builder.Services.AddEndpointsApiExplorer();
builder.Services.AddSwaggerGen();
AddAppHealthChecks(builder.Services);

var app = builder.Build();

// Configure the HTTP request pipeline.
if (app.Environment.IsDevelopment())
{
    app.UseSwagger();
    app.UseSwaggerUI();
}

app.UseHttpsRedirection();

UseAppHealthChecks(app);

app.Run();

void AddAppHealthChecks(IServiceCollection services)
{
    _ = services.AddHealthChecks()
        .AddCheck<TestHealthCheck>(nameof(TestHealthCheck), HealthStatus.Unhealthy, new List<string>
        {
            nameof(TestHealthCheck)
        });
}

void UseAppHealthChecks(WebApplication webApplication)
{
    webApplication.MapHealthChecks("/health/test", new HealthCheckOptions
    {
        Predicate = healthCheck => healthCheck.Tags.Contains(nameof(TestHealthCheck)),
    });
}

public class TestHealthCheck : IHealthCheck
{
    public Task<HealthCheckResult> CheckHealthAsync(HealthCheckContext context, CancellationToken cancellationToken = default)
    {
        //GC.Collect();
        return Task.FromResult(HealthCheckResult.Healthy());
    }
}

If you uncomment the GC.Collect() call in the CheckHealthAsync method, the memory consumption becomes much more stable: it no longer increases with every call to the health check endpoint, and it almost seems as if the memory leak is gone. But if you comment that line out again and instead add a Timer that calls GC.Collect() every 30 seconds, for example, the problem persists.
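
For reference, here is a minimal sketch of the timer variant described above (the exact wiring is an assumption; only the 30-second interval and the GC.Collect() call come from the description):

using System;
using System.Threading;

// Sketch of the periodic-collection variant: a timer that forces a blocking
// collection every 30 seconds, independent of health check traffic.
public static class PeriodicGc
{
    // Held in a static field so the timer itself is not garbage collected.
    private static Timer? _timer;

    public static void Start() =>
        _timer = new Timer(_ => GC.Collect(), state: null,
                           dueTime: TimeSpan.FromSeconds(30),
                           period: TimeSpan.FromSeconds(30));
}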

davidfowl commented 1 year ago

This doesn't sound like a leak then. Please refer to this https://github.com/Maoni0/mem-doc/blob/master/doc/.NETMemoryPerformanceAnalysis.md to get a better idea of what might be happening.

henridd commented 1 year ago

Thanks for the great article. I'll take a look and update the ticket when I'm done.

henridd commented 1 year ago

I read the article and I'm still confused about the behavior. Looking at how GCs are triggered, collections should be happening, since we are explicitly calling GC.Collect in this test code. But when looking at the GCStats from PerfView (screenshot attached), even though I'm forcing a collection, nothing seems to happen. I then tried forcing a collection every 50 requests, but PerfView failed to open the GCStats view:

Exception Occurred: System.IndexOutOfRangeException: Index was outside the bounds of the array.
   at Stats.GCProcess.<>c__DisplayClass0_0.<Collect>b__13(GCMarkWithTypeTraceData data)
   at Microsoft.Diagnostics.Tracing.Parsers.Clr.GCMarkWithTypeTraceData.Dispatch()
   at Microsoft.Diagnostics.Tracing.TraceEventDispatcher.Dispatch(TraceEvent anEvent)
   at Microsoft.Diagnostics.Tracing.Etlx.TraceLogEventSource.Process()
   at Stats.GCProcess.Collect(TraceEventDispatcher source, Single sampleIntervalMSec, ProcessLookup`1 perProc, MutableTraceEventStackSource stackSource, Boolean _doServerGCReport, TraceLog traceLog)
   at PerfView.PerfViewGCStats.WriteHtmlBody(TraceLog dataFile, TextWriter writer, String fileName, TextWriter log)
   at PerfView.PerfViewHtmlReport.<>c__DisplayClass16_0.<Open>b__0()
   at PerfView.StatusBar.<>c__DisplayClass19_0.<StartWork>b__0(Object <state>)
An exceptional condition occurred, see log for details.

I've tried making code changes and even creating a similar solution from scratch, but hit the same exception over and over.

Also, sorry for not providing the full generated dump here; I'm not sure which kinds of data we are allowed to share. But if there is any specific data you would like, let me know and I'll provide it.

Anyway, I then tried to do some small-scale GC monitoring of my own, with a few modifications to this code. This is the snippet, executed every 200 requests:

// Executed every 200 requests; Tracker is a simple static counter holder (see the sketch below).
GC.Collect();
var info = GC.GetGCMemoryInfo();
Tracker.TotalPromotions += info.PromotedBytes;

Console.WriteLine($"{Tracker.NumberOfRequests} requests. " + Environment.NewLine +
    $"Promoted bytes: {info.PromotedBytes}. Total: {Tracker.TotalPromotions}" + Environment.NewLine +
    $"FinalizationPendingCount: {info.FinalizationPendingCount}" + Environment.NewLine +
    $"FragmentedBytes: {info.FragmentedBytes}" + Environment.NewLine +
    $"TotalCommitedBytes: {info.TotalCommittedBytes}" + Environment.NewLine +
    $"HeapSize: {info.HeapSizeBytes}. Growth: {info.HeapSizeBytes - Tracker.PreviousHeapSizeBytes}");

Tracker.PreviousHeapSizeBytes = info.HeapSizeBytes;
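
The snippet above assumes a Tracker helper that is not shown in the thread; a minimal hypothetical version could look like this:

// Hypothetical static holder for the counters used by the monitoring snippet above.
public static class Tracker
{
    public static int NumberOfRequests;
    public static long TotalPromotions;
    public static long PreviousHeapSizeBytes;
}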

These were the results:

200 requests.
Promoted bytes: 1789880. Total: 1789880
FinalizationPendingCount: 21
FragmentedBytes: 1554000
TotalCommitedBytes: 7925760
HeapSize: 3343880. Growth: 3343880

400 requests.
Promoted bytes: 1760784. Total: 3550664
FinalizationPendingCount: 17
FragmentedBytes: 1538736
TotalCommitedBytes: 10219520
HeapSize: 3299520. Growth: -44360

600 requests.
Promoted bytes: 1765328. Total: 5315992
FinalizationPendingCount: 4
FragmentedBytes: 1539048
TotalCommitedBytes: 10809344
HeapSize: 3304376. Growth: 4856

800 requests.
Promoted bytes: 1740568. Total: 7056560
FinalizationPendingCount: 4
FragmentedBytes: 1914240
TotalCommitedBytes: 11333632
HeapSize: 3654808. Growth: 350432

1000 requests.
Promoted bytes: 1742088. Total: 8798648
FinalizationPendingCount: 4
FragmentedBytes: 1924272
TotalCommitedBytes: 11988992
HeapSize: 3666360. Growth: 11552

1200 requests.
Promoted bytes: 1740568. Total: 10539216
FinalizationPendingCount: 4
FragmentedBytes: 1539496
TotalCommitedBytes: 12709888
HeapSize: 3280064. Growth: -386296

1400 requests.
Promoted bytes: 1743336. Total: 12282552
FinalizationPendingCount: 4
FragmentedBytes: 1539568
TotalCommitedBytes: 13037568
HeapSize: 3282904. Growth: 2840

1600 requests.
Promoted bytes: 1741288. Total: 14023840
FinalizationPendingCount: 4
FragmentedBytes: 1539544
TotalCommitedBytes: 13168640
HeapSize: 3280832. Growth: -2072

So it seems that bytes are promoted at a steady rate. According to the "What makes an object survive" section, they shouldn't be surviving, since I'm not keeping any references, but those bytes could be something entirely unrelated, I suppose. Looking at the HeapSize, it doesn't seem to be increasing, so going by the document, I believe this confirms your belief that this is not a managed memory leak.
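
One quick cross-check from inside the app (not something done in the thread) is to compare the managed heap size after a full blocking collection with the overall process working set; if only the working set grows, the growth is happening outside the managed heap. A minimal sketch:

using System;
using System.Diagnostics;

// Compare the managed heap (after a full, blocking collection) with the total
// process working set. A flat managed number next to a growing working set
// points away from a managed-memory leak.
long managedBytes = GC.GetTotalMemory(forceFullCollection: true);
long workingSetBytes = Process.GetCurrentProcess().WorkingSet64;

Console.WriteLine($"Managed heap: {managedBytes / 1024} KB, working set: {workingSetBytes / 1024} KB");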

Looking at the dotProfile results, however, the unmanaged memory definitely does seem to be increasing (screenshots attached).

I've tried running a Visual Studio profiling session, but there was nothing meaningful in there. Tomorrow I'll get started with WinDbg and see if I can get better results.

henridd commented 1 year ago

It seems that the bigger the response, the bigger the leak. After making the changes below, it took fewer than 60 requests for the memory consumption to double.

void UseAppHealthChecks(WebApplication webApplication)
{
    webApplication.MapHealthChecks("/healthz", new HealthCheckOptions
    {
        Predicate = healthCheck => healthCheck.Tags.Contains(nameof(TestHealthCheck)),
        ResponseWriter = ResponseWriter
    });
}

Task ResponseWriter(HttpContext context, HealthReport healthReport)
{
    var lengthyString = "";
    for (int i = 0; i < 100; i++)
    {
        lengthyString += Guid.NewGuid().ToString();
    }

    return context.Response.WriteAsync(lengthyString);
}

I've downloaded Deleaker, and it seems that this MetaDataGetDispenser is mentioned several times (screenshot attached).

When comparing the first snapshot with the last one, it shows up again (screenshot attached).

With WinDbg, the output from !heap -stat stays very similar, even though the memory consumption is increasing a lot. The only thing that differs is the external fragmentation. Could that be relevant? The first command was run right after startup.

0:034> !heap -stat
NtGlobalFlag enables following debugging aids for new heaps:
    stack back traces
LFH Key                   : 0x1338f8338a5f6faf
Termination on corruption : ENABLED
          Heap     Flags   Reserv  Commit  Virt   Free  List   UCR  Virt  Lock  Fast 
                            (k)     (k)    (k)     (k) length      blocks cont. heap 
-------------------------------------------------------------------------------------
000001f82f930000 08000002   16756  10028  16364    119   122     5    2      3   LFH
000001f8276c0000 08008000      64      4     64      2     1     1    0      0      
000001f831380000 08001002    1472     92   1080     23     2     2    0      0   LFH
000001f8315e0000 08001002      60      8     60      2     1     1    0      0      
000001f831720000 08041002      60      8     60      5     1     1    0      0      
-------------------------------------------------------------------------------------
0:034> g
(1d00.5dc0): Break instruction exception - code 80000003 (first chance)
ntdll!DbgBreakPoint:
00007ffe`463d0b70 cc              int     3
0:034> !heap -stat
NtGlobalFlag enables following debugging aids for new heaps:
    stack back traces
LFH Key                   : 0x1338f8338a5f6faf
Termination on corruption : ENABLED
          Heap     Flags   Reserv  Commit  Virt   Free  List   UCR  Virt  Lock  Fast 
                            (k)     (k)    (k)     (k) length      blocks cont. heap 
-------------------------------------------------------------------------------------
000001f82f930000 08000002   16756  10028  16364   2076   129     5    2      3   LFH
    External fragmentation  20 % (129 free blocks)
000001f8276c0000 08008000      64      4     64      2     1     1    0      0      
000001f831380000 08001002    1472     92   1080     23     2     2    0      0   LFH
000001f8315e0000 08001002      60      8     60      2     1     1    0      0      
000001f831720000 08041002      60      8     60      5     1     1    0      0      
-------------------------------------------------------------------------------------
0:034> g
(1d00.3a1c): Break instruction exception - code 80000003 (first chance)
ntdll!DbgBreakPoint:
00007ffe`463d0b70 cc              int     3
0:023> !heap -stat
NtGlobalFlag enables following debugging aids for new heaps:
    stack back traces
LFH Key                   : 0x1338f8338a5f6faf
Termination on corruption : ENABLED
          Heap     Flags   Reserv  Commit  Virt   Free  List   UCR  Virt  Lock  Fast 
                            (k)     (k)    (k)     (k) length      blocks cont. heap 
-------------------------------------------------------------------------------------
000001f82f930000 08000002   16756  10028  16364   2303   145     5    2      3   LFH
    External fragmentation  23 % (145 free blocks)
000001f8276c0000 08008000      64      4     64      2     1     1    0      0      
000001f831380000 08001002    1472     92   1080     23     2     2    0      0   LFH
000001f8315e0000 08001002      60      8     60      2     1     1    0      0      
000001f831720000 08041002      60      8     60      5     1     1    0      0      
-------------------------------------------------------------------------------------

I'm not sure where I should go from here. Is there a cleanup method that I'm missing?

mitchdenny commented 1 year ago

I'm not seeing evidence of a managed or unmanaged memory leak here. If there were, I would expect that as you executed more requests, the amount of memory consumed would balloon and eventually the process would crash (assuming you kept firing load into it).

Using your repro I used Postman to fire 3000+ requests to the health-check endpoint and memory utilization didn't go above 60MB.

We'll probably need to see stronger evidence of an issue here to keep investigating.

henridd commented 1 year ago

> I'm not seeing evidence of a managed or unmanaged memory leak here. If there were, I would expect that as you executed more requests, the amount of memory consumed would balloon and eventually the process would crash (assuming you kept firing load into it).
>
> Using your repro I used Postman to fire 3000+ requests to the health-check endpoint and memory utilization didn't go above 60MB.
>
> We'll probably need to see stronger evidence of an issue here to keep investigating.

Hey @mitchdenny, thanks for looking into this. That's precisely what happens here. Did you try my latest example, where the health check writes a bunch of GUIDs into the response? That caused the memory consumption to rise super fast. I've set up a git repository with the code: https://github.com/henridd/health-check-mem-leaking.

I thought it could somehow be something faulty on my machine, but we began this investigation after finding a leak on the server.

Video.webm

I've attached a video which shows the memory consumption increasing. It does seem to stabilize somewhat once it reaches 200 MB, but I suppose it shouldn't increase that much, right?

mitchdenny commented 1 year ago

That isn't surprising in your example, because you are allocating a bunch of strings every time you do a health check, so there will be more memory being consumed. Each iteration of your string generation logic is going to generate > 300 KB of memory usage (that is a very wasteful way to generate a long string, by the way, but that may have been intentional).

Do a hundred requests and you've allocated > 30 MB of strings alone, and that is before we get to any .NET or ASP.NET overhead.
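
For comparison, here is a sketch of a less allocation-heavy way to produce the same payload (purely illustrative, not from the thread): a pre-sized StringBuilder avoids allocating a new, longer string on every loop iteration.

// Drop-in alternative to the concatenating ResponseWriter above (requires using System.Text).
// A pre-sized StringBuilder keeps per-request allocations close to the size of the
// final ~3.6 KB payload instead of a quadratic series of temporary strings.
Task ResponseWriter(HttpContext context, HealthReport healthReport)
{
    var builder = new StringBuilder(capacity: 36 * 100); // 100 GUIDs, 36 characters each
    for (var i = 0; i < 100; i++)
    {
        builder.Append(Guid.NewGuid().ToString());
    }

    return context.Response.WriteAsync(builder.ToString());
}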

You might like to read a bit more background on the different GC modes we have in .NET and how they impact when memory is collected:

https://learn.microsoft.com/en-us/dotnet/standard/garbage-collection/workstation-server-gc
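
As a quick sanity check (not from the thread), the app can also log which GC flavor it is actually running under, since workstation and server GC retain memory quite differently:

using System;
using System.Runtime;

// Print the active GC configuration for the current process.
Console.WriteLine($"Server GC: {GCSettings.IsServerGC}, latency mode: {GCSettings.LatencyMode}");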

henridd commented 1 year ago

I see. I assumed that the memory consumption shouldn't exceed the initially consumed amount, since nothing is being stored, but I should probably investigate this further to understand it better. To be fair, from the beginning it seemed strange to me that there would be a leak in something so widely used, in a normal scenario.

I will keep investigating and reading up on memory behavior, and try to reduce the memory consumption.

Thanks @davidfowl and @mitchdenny for your input.