dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

.Net 7 + 8 Garbage Collection Regression Issue #95191

Open crollins7 opened 10 months ago

crollins7 commented 10 months ago

Description

I had a .NET 6 app that experienced a very slow memory leak. Because of this behavior, I set the app pool to recycle automatically every 2 hours. I upgraded the app to .NET 7 hoping that would resolve the issue, and instead the memory leak exploded (14 GB/hour). When I take a memory dump and analyze it, almost all of the memory is listed as "Free". I recently upgraded the app to .NET 8 and the memory leak persists.

Following @Maoni0's blog post https://maoni0.medium.com/is-the-regression-in-gc-or-something-else-38f10018dd21, I referenced clrgc.dll in the runtime config, and the symptoms are significantly relieved, which I believe indicates a GC regression starting in .NET 7.

Configuration

- .NET 8
- Windows Server 2016 14393.6351
- x64
- VM with an 8-core Xeon 4309Y and 32 GB of RAM

Regression

As described above, there was a slower memory leak originally on .NET 6; on .NET 7 and .NET 8 the regression became far more severe.

Data

Additional stats gathered when I was on .NET 7 without clrgc.dll:

- GC committed bytes increase like crazy throughout the app's runtime.
- GC fragmentation > 95%
- GC heap size seems relatively stable around 1 GB
- The large fragmented memory blocks are all System.Byte[]

The Link Below has Top Level GC Traces and CPU Sample Traces both taken for 1 hour with and without clrgc.dll referenced. https://drive.google.com/file/d/1nJ8lROr4pYN4IQf4E4KoQDxBN0M5XdfA/view?usp=sharing

This highlights the performance impact experienced with and without clrgc.dll referenced in runtime settings. [screenshot]

Analysis

I have no idea which code might be causing this. I am happy to do any additional testing that would help identify the cause.
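
For reference, the kind of numbers cited in the Data section above can also be sampled in-process via the public GC.GetGCMemoryInfo() API. A minimal sketch (illustrative only, not code from the affected app):

using System;

GCMemoryInfo info = GC.GetGCMemoryInfo();

// Committed bytes, heap size and fragmentation roughly correspond to the
// counters discussed in this issue.
double fragmentationPct = info.HeapSizeBytes > 0
    ? 100.0 * info.FragmentedBytes / info.HeapSizeBytes
    : 0;

Console.WriteLine($"GC committed:     {info.TotalCommittedBytes / 1_000_000} MB");
Console.WriteLine($"GC heap size:     {info.HeapSizeBytes / 1_000_000} MB");
Console.WriteLine($"GC fragmentation: {fragmentationPct:F1}%");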

ghost commented 10 months ago

Tagging subscribers to this area: @dotnet/gc See info in area-owners.md if you want to be subscribed.

En3Tho commented 10 months ago

Just curious: what was the heap fragmentation with .NET 6 / clrgc?

Also, does the memory leak result in an OOM? Or is it just "the GC takes more memory because it can"? If the latter, have you tried explicit memory limits?

crollins7 commented 10 months ago

Unfortunately, when I was on .NET 6 I did not collect those stats before going to .NET 7. That was also 6 months ago, so I can't easily recreate it. I can definitively say the memory leak exploded when I went to .NET 7.

The memory leak will result in OOM, which is why I have IIS recycle the app pool on a 2-hour schedule. Below is the same screenshot as above with an additional 10 minutes of time, showing that when I do an app pool recycle the ballooned RAM is freed.

I have not tried an explicit memory limit. clrgc.dll being referenced in runtime settings has been my workaround for the past 2 months as it provides significant relief.

[screenshot]

En3Tho commented 10 months ago

I wonder what your app does. It's quite curious that your GC handle count grows indefinitely. The .NET 8 GC is doing gen1 collections more aggressively, but it's quite obvious that gen2 is the problem here, as it grows really fast and its survival rate is almost 100%. Do you have caches that hold those objects forever?

crollins7 commented 10 months ago

The application is an MVC web app utilizing EF Core 8. There is some usage of minimal APIs, and there are also Hangfire jobs (both recurring and background-enqueued) that run from the same web app. There is light usage of MemoryCache within the app.

The top-level GC traces and CPU sample traces in the opening post may be helpful to someone who has a deeper understanding of how to interpret them than I do.

cshung commented 10 months ago

Which process is the process of interest in the trace?

crollins7 commented 10 months ago

Which process is the process of interest in the trace?

For the traces with clrgc.dll enabled, the PID is 6108. For the traces without clrgc.dll, the PID is 1700.

markoweb2 commented 9 months ago

I believe I just wasted quite a few hours on the exact same issue. After upgrading to .NET 8, I noticed my app did not release memory after a large data export task like it used to with .NET 6. I spent a lot of hours going over my code, thinking maybe I had made a change that caused a leak or something... Finally I tried a simple proof-of-concept console application and confirmed that it is indeed a GC regression introduced with .NET 7.

Here is a super simple console application:

public static void Main(string[] args)
{
    var list = new List<string>();

    for (int i = 0; i < 20000000; i++)
    {
        list.Add(i.ToString());
    }

    Console.WriteLine("done");
    list.Clear();
    list = null;
    GC.Collect();

    Console.ReadLine();
}

Here is what the memory usage is supposed to look like with .NET 6: [screenshot]

And here is the memory usage with .NET 7/8: [screenshot]

(This is all in Release mode; you can simply change the target framework, recompile, and see immediate results.) The memory does not get released no matter how long you wait or how many GC.Collect() calls you make.

If I output GC.GetTotalMemory(false) before and after the GC.Collect(), it is clear that the GC does free up the memory, but it appears that it doesn't let the OS actually reclaim the memory from the process. [screenshot]
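
The gap between "the GC freed it" and "the OS got it back" can be made explicit by printing both the managed heap size and the committed/working-set numbers. A small sketch along the lines of the repro above (illustrative, not the original code):

using System;

long heapBefore = GC.GetTotalMemory(forceFullCollection: false);

GC.Collect();
GC.WaitForPendingFinalizers();
GC.Collect();

long heapAfter  = GC.GetTotalMemory(forceFullCollection: false);
long committed  = GC.GetGCMemoryInfo().TotalCommittedBytes; // memory the GC still holds committed
long workingSet = Environment.WorkingSet;                   // what the OS sees for the process

Console.WriteLine($"Managed heap: {heapBefore / 1_000_000} MB -> {heapAfter / 1_000_000} MB");
Console.WriteLine($"GC committed: {committed / 1_000_000} MB, working set: {workingSet / 1_000_000} MB");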

Please fix this regression!

crollins7 commented 9 months ago

Great find @markoweb2! I just tested your sample code with .NET 6, .NET 7, and .NET 8, and the result is that anything newer than .NET 6 does not fully release the memory. I suspect the root cause affects your sample as well as my production code.

I am very interested to test when there is a fix.

Maoni0 commented 9 months ago

apologies for the delay. and really sorry this is causing problems😥 I just got back from vacation and took a look at the simple repro that @markoweb2 provided. I haven't looked at the traces @crollins7 provided - I'll check with @cshung if he's looked at them tomorrow. I'll give an update on that.

with @markoweb2's repro, I'm seeing that with segments (what the GC uses in 6.0) the committed size for the process is ~40 MB after the induced GC, whereas with regions (what the GC uses in 7.0+) it's ~800 MB. it's worth pointing out that if you don't do this induced GC, the committed sizes are very similar. so I should ask: are you inducing GCs in your production scenario? if so, this would explain the difference; otherwise it may not.

what causes this big difference after the induced GC is the fact that regions does more management of the memory it's already got than segments does. with segments, especially segments in LOH, if they are empty we just decommit the whole thing, whereas regions will hold onto them and decide whether to decommit some, all or none. and the way it decides this is by looking at various factors, most predominantly the allocation budget, which generally indicates how much you'd be allocating before a GC is done to collect it. it makes sense to keep that memory so we don't need to decommit it and then have to re-commit it.

the more equivalent thing to do in 8.0 is to induce a GC with GCCollectionMode.Aggressive. however, I just discovered a bug that causes part of what the GC retains to not be returned, which I'm fixing via this PR. without the fix I'm seeing ~100 MB, with the fix it's ~10 MB. that's less than with segments because this is a mode you use to tell the GC you will not be making allocations for a while, so the GC lets go of everything it can (barring the bug). so I'm not suggesting that you replace every GC.Collect() call with GC.Collect(2, GCCollectionMode.Aggressive). however, it does make me think we should at least make GC.Collect() behave similarly for regions as it did for segments.
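
For anyone trying the aggressive mode mentioned above, a minimal sketch of the call (the overload and parameter names are from the public GC.Collect API; whether it's appropriate for a given app is exactly the judgement call just described):

// Tells the GC the app will not be allocating for a while, so it should
// return as much memory as it can. Not a drop-in replacement for GC.Collect().
GC.Collect(
    generation: GC.MaxGeneration,
    mode: GCCollectionMode.Aggressive,
    blocking: true,
    compacting: true);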

crollins7 commented 9 months ago

Thank you for providing insight into what is happening under the hood as well as the PR you submitted.

I did review my code, and I can confirm that I do not call GC.Collect anywhere.

Maoni0 commented 9 months ago

I just took a brief look at the traces you provided (thanks so much for collecting these!) and I have a guess as to what's causing what you saw. most likely it has to do with pinning. there's a lot of pinning going on, and there's a known problem with regions for this case that we knew about from our internal testing but I just haven't had time to fix. I'll look in more detail to confirm whether that's the case (and obviously provide a fix if it is). meanwhile, if you wouldn't mind collecting a trace that shows where the pinning comes from, that'd be great! you could use this cmdline -

perfview /nogui /KernelEvents=Process+Thread+ImageLoad /ClrEvents:GC+Stack+GCHandle /clrEventLevel=Informational /MaxCollectSec:1800 /BufferSize:3000 /CircularMB:3000 collect

crollins7 commented 9 months ago

Sounds like you are hot on the trail! I removed clrgc.dll from my runtimeconfig.json, recycled the app pool, and then ran the trace as requested. Link is below and PID is 536.

https://drive.google.com/file/d/1zkOFvFy7ImPUCImLm8S5g7-_1yL4pWn5/view?usp=sharing

Maoni0 commented 8 months ago

almost all of the pinning is from this stack -

         coreclr!MarshalNative::GCHandleInternalAlloc
          system.private.corelib.il!Overlapped.AllocateNativeOverlapped
           system.io.filesystem.watcher.il!FileSystemWatcher.StartRaisingEvents
            microsoft.extensions.fileproviders.physical.il!PhysicalFilesWatcher.TryEnableFileSystemWatcher
             microsoft.extensions.fileproviders.physical.il!PhysicalFileProvider.Watch
              microsoft.extensions.configuration.il!ConfigurationBuilder.Build
               data!SJRContext.OnConfiguring
                microsoft.entityframeworkcore!DbContext.get_ContextServices
                 microsoft.entityframeworkcore!Microsoft.EntityFrameworkCore.Internal.InternalDbSet`1[System.__Canon].get_EntityType()
                 microsoft.entityframeworkcore!DbContext.get_DbContextDependencies

@stephentoub, any idea if this is avoidable?

stephentoub commented 8 months ago

On Windows, FileSystemWatcher creates and pins a buffer when it's started. If that pinning is causing a problem, that suggests you're creating lots of FileSystemWatchers and starting/stopping them frequently? There are things we could do in FileSystemWatcher to tweak this a bit, but I'd first want to understand and address the usage, as from the stack above I'm wondering if the transient nature of these is actually expected.
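
To illustrate the usage pattern in question, a sketch of how a watcher is normally kept long-lived and disposed once (the path and filter here are placeholders, not taken from the app in this thread):

using System;
using System.IO;

// EnableRaisingEvents is the point at which the buffer is allocated and pinned,
// so a watcher should live for as long as it's needed rather than being
// created and started repeatedly.
using var watcher = new FileSystemWatcher(AppContext.BaseDirectory, "appsettings*.json");
watcher.Changed += (_, e) => Console.WriteLine($"changed: {e.FullPath}");
watcher.EnableRaisingEvents = true;

// ... keep the single watcher alive for the lifetime of the component ...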

crollins7 commented 8 months ago

@stephentoub, I just searched our entire solution for "FileSystemWatcher" and came up empty, so it's not coming directly from our code. It must be part of some other compiled dependency.

We have quite a few NuGet packages; however, the most notable ones being utilized in the app are:

Any thoughts on how to identify the source without cutting code to determine the cause?

stephentoub commented 8 months ago

Based purely on the call stack in https://github.com/dotnet/runtime/issues/95191#issuecomment-1882420961, it's most likely from one of:

- https://github.com/dotnet/runtime/blob/e0ecb3a6af984e12f2ba67e63bf9eff9f09de47c/src/libraries/Microsoft.Extensions.Configuration.FileExtensions/src/FileConfigurationProvider.cs#L35
- https://github.com/dotnet/aspnetcore/blob/c53f7e474ba80124fb9134ebf23d1d626a19b458/src/Configuration.KeyPerFile/src/KeyPerFileConfigurationProvider.cs#L32

or something of a similar ilk.

@Maoni0, you mentioned you see a lot of pinning from here. Is it all rooted? My guess would be that something is triggering a bunch of configuration changes, leading to new watchers being created frequently, leading to new buffers getting pinned. That's just a guess, though.

Maoni0 commented 8 months ago

sorry, GH didn't give me notification that there were new comments after my last comment so I didn't know folks already responded till my teammates told me.

@Maoni0, you mentioned you see a lot of pinning from here. Is it all rooted?

@stephentoub, yes, GC is observing new pinned objects getting created. for example, these are the # of pinned objects GC saw between GC#498 and GC#528 -

GC Index Pause Start Trigger Reason Gen Promotion Pinned Obj
498 1,090,432.99 AllocSmall 1N yes 515
499 1,090,709.72 AllocSmall 1N no 223
500 1,090,926.81 AllocSmall 1N no 235
501 1,091,137.96 AllocSmall 1N no 247
502 1,091,462.55 AllocSmall 1N no 258
503 1,091,674.78 AllocSmall 1N no 268
504 1,091,886.47 AllocSmall 1N no 279
505 1,092,106.27 AllocSmall 1N no 289
506 1,092,311.50 AllocSmall 1N no 301
507 1,092,521.42 AllocSmall 1N no 312
508 1,092,723.39 AllocSmall 1N no 322
509 1,092,946.16 AllocSmall 1N no 333
510 1,093,162.04 AllocSmall 1N no 343
511 1,093,377.63 AllocSmall 1N no 354
512 1,093,580.60 AllocSmall 1N no 364
513 1,093,786.75 AllocSmall 1N no 375
514 1,093,990.06 AllocSmall 1N no 385
515 1,094,199.88 AllocSmall 1N no 395
516 1,094,415.38 AllocSmall 1N no 406
517 1,094,624.44 AllocSmall 1N no 416
518 1,094,844.35 AllocSmall 1N no 427
519 1,095,050.95 AllocSmall 1N no 437
520 1,095,279.03 AllocSmall 1N no 447
521 1,095,480.26 AllocSmall 1N no 458
522 1,095,699.76 AllocSmall 1N no 468
523 1,095,945.71 AllocSmall 1N yes 479
524 1,096,212.26 AllocSmall 1N yes 489
525 1,096,449.88 AllocSmall 1N no 192
526 1,096,678.73 AllocSmall 1N no 202
527 1,096,891.55 AllocSmall 1N no 213
528 1,097,109.60 AllocSmall 1N no 223

I added a Promotion column denoting whether that GC promoted. not promoting means all objects that survived a generation will stay in that generation. so right after a promoting GC (GC#498, which observed 515 pinned objects), GC#499 observed many fewer - 223. the following GCs are all non-promoting, but the # of pinned objects kept increasing, which means new pinned objects kept getting created, until a couple of promoting GCs happened (GC#523 and #524); then we see this pattern again.

are these file watchers only created if something changed? could we be detecting changes that aren't there? totally guessing here. what would you need to diagnose this?

crollins7 commented 8 months ago

We regularly dependency inject IConfiguration for the purpose of utilizing the "GetSection" method. This can be seen throughout the application including in middleware.

I am not finding any direct usage of "FileConfigurationProvider". If there is anything additional you guys need just let me know.

stephentoub commented 8 months ago

@roji, the above stack has:

             microsoft.extensions.fileproviders.physical.il!PhysicalFileProvider.Watch
              microsoft.extensions.configuration.il!ConfigurationBuilder.Build
               data!SJRContext.OnConfiguring
                microsoft.entityframeworkcore!DbContext.get_ContextServices
                 microsoft.entityframeworkcore!Microsoft.EntityFrameworkCore.Internal.InternalDbSet`1[System.__Canon].get_EntityType()
                 microsoft.entityframeworkcore!DbContext.get_DbContextDependencies

Do you know what would cause DbContextDependencies to frequently trigger configuration changes?

roji commented 8 months ago

On EF's role here... Every time a new EF DbContext gets instantiated, the OnConfiguring method on it gets invoked - that's by design (different DbContext instances may have different EF configuration, etc.). The odd thing here is that FileSystemWatcher is being invoked/started - by user code - from inside OnConfiguring. I suspect the user assumption here is that OnConfiguring is called only once at program startup, and so some ConfigurationBuilder-related init code was placed there - but that assumption is wrong.

@crollins7 can you confirm what exactly you have in the OnConfiguring method of your SJRContext type? Is there anything in there which could be calling into ConfigurationBuilder config/init, or similar?

crollins7 commented 8 months ago

Here is a copy of our SJRContext.OnConfiguring method @roji. Nothing screams out to me looking at this but I hope it helps.


protected override void OnConfiguring(DbContextOptionsBuilder options)
{
    options.UseLazyLoadingProxies()
           .ConfigureWarnings(x => x.Ignore(SqlServerEventId.DecimalTypeKeyWarning))
           .ConfigureWarnings(x => x.Ignore(RelationalEventId.BoolWithDefaultWarning))
           .UseSqlServer(AppSettingsConfig.Configuration.GetConnectionString("SJRContext"),
               sqlServerOptionsAction: sqlOptions =>
               {
                   sqlOptions.EnableRetryOnFailure(
                       maxRetryCount: 10,
                       maxRetryDelay: System.TimeSpan.FromSeconds(30),
                       errorNumbersToAdd: null);
                   sqlOptions.CommandTimeout((int)TimeSpan.FromMinutes(5).TotalSeconds);
               });
    base.OnConfiguring(options);
}

roji commented 8 months ago

What's "AppSettingsConfig"? That isn't familiar to me as a standard package, is that maybe calling into Microsoft.Extensions.Configuration and triggering this whole thing?

At least as an experiment, I'd advise extracting that out to some place in the program initialization, and then accessing the resulting connection string from OnConfiguring. If the memory issue disappears at that point, at least we know for sure where it came from.
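
A minimal sketch of that experiment (SJRContext and the "SJRContext" connection-string name come from this thread; StartupConfig is a hypothetical name, and the exact wiring is an assumption): read the configuration once during program initialization, cache the connection string, and have OnConfiguring consume only the cached value.

using Microsoft.EntityFrameworkCore;
using Microsoft.Extensions.Configuration;

// Hypothetical startup-time cache: configuration is read once, not per DbContext.
public static class StartupConfig
{
    public static string SJRConnectionString { get; private set; } = "";

    // Called once from application startup with the host's IConfiguration.
    public static void Initialize(IConfiguration configuration) =>
        SJRConnectionString = configuration.GetConnectionString("SJRContext")!;
}

public class SJRContext : DbContext
{
    protected override void OnConfiguring(DbContextOptionsBuilder options)
    {
        // No ConfigurationBuilder.Build() here - only the cached string is used.
        options.UseSqlServer(StartupConfig.SJRConnectionString);
        base.OnConfiguring(options);
    }
}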

crollins7 commented 8 months ago

That does call Microsoft.Extensions.Configuration to grab our connection string from our appsettings.json. We did this so as not to hard code our connection string into our context. I will try removing it and replacing it with a hard coded connection string and see if the memory issue persists. I will report back after I have tested.

crollins7 commented 8 months ago

@roji, you were wondering what the AppSettingsConfig class did, so I pasted it below. Ultimately, we were using it to grab a different appsettings.json depending on the configuration the application was running in. When I removed the use of this class and hard-coded a connection string in our OnConfiguring override, initial testing indicates it resolves the memory leak.

Right now we are working on eliminating any direct instantiation of SJRContext in the app and relying fully on dependency injection. This should allow us to get rid of the OnConfiguring override as well as the AppSettingsConfig class.

It appears we shot ourselves in the foot.


namespace Data.EntityHelpers;

public static class AppSettingsConfig
{
    public static IConfiguration Configuration
    {
        get
        {
            string environmentName;
#if RELEASE
            environmentName = "production";
#elif BETA
            environmentName = "beta";
#else
            environmentName = "Development";
#endif
            return new ConfigurationBuilder()
                .SetBasePath(Directory.GetCurrentDirectory())
                .AddJsonFile("appsettings.json", optional: false, reloadOnChange: true)
                .AddJsonFile($"appsettings.{environmentName.ToLower()}.json", optional: false, reloadOnChange: true)
                .Build();
        }
    }
}
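
If a class like this needs to stay, one hedged way to keep its shape while avoiding the per-access rebuild is to cache the built configuration. A sketch (not the code actually deployed in this thread):

using System;
using System.IO;
using Microsoft.Extensions.Configuration;

namespace Data.EntityHelpers;

public static class AppSettingsConfig
{
    // Built once, on first access, instead of on every read of Configuration.
    private static readonly Lazy<IConfiguration> _configuration = new(() =>
    {
        string environmentName;
#if RELEASE
        environmentName = "production";
#elif BETA
        environmentName = "beta";
#else
        environmentName = "Development";
#endif
        return new ConfigurationBuilder()
            .SetBasePath(Directory.GetCurrentDirectory())
            .AddJsonFile("appsettings.json", optional: false, reloadOnChange: true)
            .AddJsonFile($"appsettings.{environmentName.ToLower()}.json", optional: false, reloadOnChange: true)
            .Build();
    });

    public static IConfiguration Configuration => _configuration.Value;
}

With this, the reloadOnChange file watchers should be created once per process rather than once per DbContext instantiation, which matches the pinning pattern observed in the traces above.
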
roji commented 8 months ago

Thanks for the confirmation @crollins7.

@stephentoub and others, regardless of EF the remaining question is still whether repeated execution of the Configuration property above should result in the described behavior (with OOM etc.)...

Danielcc0907 commented 8 months ago

@crollins7 do you use DbContext pooling by any chance?

roji commented 8 months ago

@Danielcc0907 if anything, I'd expect DbContext pooling to mitigate this, since the same DbContext instances would get pooled and reused, meaning that OnConfiguring() wouldn't keep getting called.

crollins7 commented 8 months ago

@Danielcc0907, we do not use DbContext pooling today. We are currently working on eliminating usage of non-dependency-injected contexts so that we can turn on DbContext pooling. I expect we will have that complete within a week.
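
For reference, a sketch of what that DI registration could look like once the OnConfiguring override is gone (type and connection-string names are reused from this thread; the specific options shown are illustrative, and pooling assumes SJRContext gets a constructor taking DbContextOptions<SJRContext>):

using Microsoft.EntityFrameworkCore;
using Microsoft.Extensions.DependencyInjection;

var builder = WebApplication.CreateBuilder(args);

// Pooled, DI-provided contexts: configuration is read once here at startup,
// and DbContext instances are reused instead of being constructed per use.
builder.Services.AddDbContextPool<SJRContext>(options =>
    options.UseLazyLoadingProxies()
           .UseSqlServer(
               builder.Configuration.GetConnectionString("SJRContext"),
               sql => sql.EnableRetryOnFailure(maxRetryCount: 10)));

var app = builder.Build();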

crollins7 commented 8 months ago

We created a new version of our site with the following changes:

We then deployed and monitored the results. The app consumed 21 GB of steadily climbing memory over 29 hours, which is significantly better than the 14 GB/hour rate we were experiencing before. From the graph it appears there is still some memory leak, but it is manageable without clrgc.dll or an aggressive app pool recycle period, which was the purpose of this issue.

Thank you everyone for your help.

[screenshot]

roji commented 8 months ago

@crollins7 thanks for confirming.

The odd thing here is that FileSystemWatcher is being invoked/started - by user code - from inside OnConfiguring.

Although I'm assuming FileSystemWatcher isn't meant to be instantiated repeatedly as it was in this case, we may still want to make sure there isn't an actual problem here - who'd be the right person for this @stephentoub?

En3Tho commented 8 months ago

@crollins7 I guess you could share a GC trace again, seeing as you still have a memory leak. Who knows, it might be another issue like that one, or it might be something else. Worth a look.

roji commented 8 months ago

It's important to distinguish between an actual memory leak - which results in an OutOfMemoryException, implying that the leaked memory is still referenced/rooted - and simply allocated memory not getting freed, which could be a result of the GC simply not kicking in yet. So I'd be interested in knowing if you're still seeing an OOM.

crollins7 commented 7 months ago

I doubled the app pool's recycling time to 58 hours and observed. I did not experience an OOM exception, and generally when memory approached 85-90% of the machine's limit, the GC would perform a substantial cleanup of 30-50% of the app's utilized memory. I believe everything is stable and I consider my issue resolved. I will leave the issue in an open state until @roji or @stephentoub finish any review.

Thank you again everyone.

roji commented 7 months ago

Looking at this again, FileSystemWatcher is disposable and overrides Dispose(bool). I'm assuming that the above code - which was continuously doing ConfigurationBuilder.Build() - was probably causing FileSystemWatcher to get instantiated each time but it was never disposed, possibly leading to the leak; if so, what was happening here seems clear.

Though the IConfigurationRoot returned by Build() isn't IDisposable, so I'm wondering how one would go about disposing the FileSystemWatcher if one wanted to rebuild the configuration etc (but I know very little about M.E.Configuration).
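
For what it's worth, a hedged sketch of one way the watcher's lifetime can be controlled today if configuration does need to be rebuilt: create (and own) the PhysicalFileProvider explicitly, and dispose both it and the built root when done. (In recent versions of M.E.Configuration the concrete ConfigurationRoot implements IDisposable even though the IConfigurationRoot interface does not; the exact behavior is worth verifying for the package version in use.)

using System;
using System.IO;
using Microsoft.Extensions.Configuration;
using Microsoft.Extensions.FileProviders;

// Owning the file provider makes the FileSystemWatcher's lifetime explicit.
using var fileProvider = new PhysicalFileProvider(Directory.GetCurrentDirectory());

IConfigurationRoot config = new ConfigurationBuilder()
    .AddJsonFile(fileProvider, "appsettings.json", optional: false, reloadOnChange: true)
    .Build();

// ... use config ...

// The interface isn't IDisposable, but the concrete root is in recent versions;
// disposing the PhysicalFileProvider (via 'using' above) stops its watcher.
(config as IDisposable)?.Dispose();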

piju3 commented 3 months ago

There seem to be two issues in this thread:

  1. A reference leak in EF core
  2. .NET 7 + 8 not releasing memory even after it's garbage collected

You can see the second issue with the simple code that @markoweb2 posted. The memory usage stays high even after clearing the list, but it does get internally reused for new objects, and eventually reclaimed if you open another process that also fills up memory. So the team might not consider that a bug, but I can see how most people would consider that undesirable behavior.

Calling GC.Collect with GCCollectionMode.Aggressive seems to clear the memory, but in .NET 8 it still leaves some memory allocated (between 1 and 2 GB in some tests I did). This has been fixed in the commit by @Maoni0 which has been merged into .NET 9, so you'll just have to wait for that.

There is a Retain VM setting which I would initially have expected to control this behavior, but it appears to currently have no effect (its documentation talks about "segments", not regions); in fact it's already set to "false" by default. Maybe both things could be fixed by having that setting control this behavior?

mangod9 commented 3 months ago

Doesn't look like this issue is actionable. Can we close?

roji commented 3 months ago

There's in theory still my question here, which may be relevant for the owner of M.E.Configuration. All the rest seems resolved.

ProfiseeJoshma commented 1 month ago

I believe #86456 covers that remaining question.