djluck / prometheus-net.DotNetRuntime

Exposes .NET core runtime metrics (GC, JIT, lock contention, thread pool) using the prometheus-net package
MIT License

Memory Leak, Increasing CPU usage #6

Open bdrupieski opened 5 years ago

bdrupieski commented 5 years ago

This is an awesome library. Thanks for building this.

I dropped this into two .NET Core 2.2 services running in Kubernetes yesterday morning, setting it up to run all of the collectors like this:

DotNetRuntimeStatsBuilder.Customize()
    .WithContentionStats()
    .WithJitStats()
    .WithThreadPoolSchedulingStats()
    .WithThreadPoolStats()
    .WithGcStats()
    .StartCollecting();

Throughout the day I saw memory and CPU usage kept climbing. It looks like there's a memory leak somewhere, and something is causing CPU usage to increase over time.

I already have a metric derived from Process.GetCurrentProcess().WorkingSet64. This is what that metric did yesterday:

[image: working set memory graph]

The green box shows memory usage when I had all of this library's collectors running. This is for one of the two services, and there are three instances of each, which is why there are three lines on the graph. Each line is Process.GetCurrentProcess().WorkingSet64 over time for that instance. Each time the colors change is from a deployment when the pod names change; the query I'm using for that chart treats them as a completely different metric name at that point.

Here's a graph showing the summed rate of Process.GetCurrentProcess().PrivilegedProcessorTime.TotalSeconds and Process.GetCurrentProcess().UserProcessorTime.TotalSeconds for all instances of each of the two services over the past 2 days:

[image: processor time graph]

Sorry for not including the legend -- it has names I don't want to share. The yellow and green lines on the bottom are privileged processor time. The other two are user processor time. One of them is so spiky because it's running a pretty intense scheduled job every 20 minutes. The other is only responding to HTTP requests.

I turned everything off except the thread pool scheduling stats last night. The most recent parts of the above charts are from only running this:

DotNetRuntimeStatsBuilder.Customize()
    .WithThreadPoolSchedulingStats()
    .StartCollecting();

If there's a problem, then it's not with the thread pool scheduling stats.

I'll add the others back one at a time to try to find where the problem is. I'll report back to this issue with my findings.

Thanks again for this library. I didn't know about the CLR events this library records. I'm thankful you wrote this both to learn about them and also to get insight I didn't know was so readily available.

djluck commented 5 years ago

Hey @bdrupieski, thanks for the report, I'll take a look into this over the next week at some point. This library makes use of a number of internal caches and I suspect that these aren't being evicted correctly, leading to higher memory consumption. I also suspect that the increased CPU use could be due to more frequent/ aggressive GC activity- if you could provide a screenshot of the number of GC collections for the impacted time period, that would help confirm this theory.

bdrupieski commented 5 years ago

Here's the past 14 days of the garbage collection rate obtained from calling GC.CollectionCount:

[image: GC collection rate graph]

A little bit higher but not too far outside the norm.

CPU and memory remain stable with only ThreadPoolSchedulingStatsCollector. I tried adding ThreadPoolStatsCollector back in but it caused the same problem. I removed that one then tried it for a while with ContentionStatsCollector and it had the same problem, though CPU usage and memory didn't climb so quickly.

Here's a graph showing this for CPU usage from the past 4 days:

[image: CPU usage graph]

djluck commented 5 years ago

@bdrupieski I believe I've fixed the issue: I discovered that EventPairTimer wasn't removing items from its cache.

I've pushed v2.0.9-beta and v3.0.9-beta to NuGet; hopefully this fixes the issue.

bdrupieski commented 5 years ago

Thanks for digging into this and for the quick turnaround. I'll give this a try. I'll turn them all back on and see what happens. I'll report back here with what I find.

bdrupieski commented 5 years ago

I upgraded to v3.0.9-beta and turned all the collectors back on. It seems to still be happening.

Here's memory usage. The green box is when they were all on:

[image: memory usage graph]

CPU:

[image: CPU usage graph]

I want to try to dig into this myself with a CPU or memory profiler some weekend to figure out what's going on here. Or, to set up remote debugging/profiling to peek into what it's doing exactly when running in Kubernetes and not only locally. I'll try to get to this over the next few days (maybe weeks...).

djluck commented 5 years ago

Hmm, interesting. Let me know what you find out. In the meantime, information around the following might help us understand what's going on:

djluck commented 5 years ago

So I'm reasonably confident that there's no memory leak in the strictest sense - cached items are being cleaned up correctly. However, there is an issue with cache growth. Internally, I use ConcurrentDictionary to track pairs of events. If there are thousands or tens of thousands of events being generated a second (which can happen when using the thread pool/GC stats collectors), the ConcurrentDictionarys have to grow. Because the dictionaries grow at an exponential rate, memory consumption can increase without the memory ever being reclaimed (a ConcurrentDictionary will never shrink unless it's collected by the GC).

I think the solution might be to offer a configurable sampling ratio. This will allow a certain percentage of events to be discarded and reduce cache storage requirements.
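
To illustrate the idea, here's a rough sketch of how a sampling ratio could bound the cache. The names and types here are hypothetical (this is not the library's actual EventPairTimer implementation):

using System;
using System.Collections.Concurrent;

// Hypothetical sketch only: discard all but every Nth event pair so the
// internal ConcurrentDictionary stays small regardless of the event rate.
public sealed class SampledEventPairTimer
{
    private readonly ConcurrentDictionary<long, DateTime> _startTimes = new ConcurrentDictionary<long, DateTime>();
    private readonly int _sampleEvery;

    public SampledEventPairTimer(int sampleEvery) => _sampleEvery = sampleEvery;

    public void OnStartEvent(long eventPairId, DateTime timestamp)
    {
        // Sampling: only track a subset of event pairs to bound cache storage.
        if (eventPairId % _sampleEvery != 0)
            return;

        _startTimes[eventPairId] = timestamp;
    }

    public bool TryGetDuration(long eventPairId, DateTime endTimestamp, out TimeSpan duration)
    {
        duration = default;
        if (!_startTimes.TryRemove(eventPairId, out var start))
            return false; // either sampled out or the start event was never seen

        duration = endTimestamp - start;
        return true;
    }
}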

bdrupieski commented 5 years ago

Here's sum(dotnet_gc_heap_size_bytes{service="name-of-one-of-the-services"} / 1024 / 1024) by (gc_generation) for the short period I was collecting dotnet_gc_heap_size_bytes recently:

[image: GC heap size by generation graph]

Yes, there are memory limits. This is what I have:

resources:
  limits:
    cpu: 1.0
    memory: 1526Mi
  requests:
    cpu: 0.25
    memory: 1024Mi

I'll look into getting a dump.

djluck commented 4 years ago

@bdrupieski I've pushed a new version to NuGet - 3.1.0-beta. Give it a whirl; hopefully it fixes those memory consumption issues.

djluck commented 4 years ago

@bdrupieski, I've also included some optional debugging metrics - you can enable these when you set up the collectors:

DotNetRuntimeStatsBuilder.Customize()
    .WithDebuggingMetrics(true)
    ...
    .StartCollecting()

thirus commented 4 years ago

I'm still seeing a very small memory leak: only the process working set grows, while the GC heap is stable. The app has very little to no load (no HTTP requests apart from health and metrics). The app is running in Kubernetes with a memory limit of 512MB. The Prometheus scrape interval is set to 10s.

Docker base image is mcr.microsoft.com/dotnet/core/aspnet:2.2:latest
prometheus-net.DotNetRuntime 3.1.1-beta
GC Mode: Workstation (forcibly set in csproj, Server GC does not work)

App Memory Working Set: [image]

App Managed Heap size only: [image]

This seems to be related to CoreCLR 2.2 issue https://github.com/dotnet/coreclr/issues/23562

I'm trying to run the app with the environment variable COMPlus_EventPipeCircularMB=128, keeping all of the above the same. Will update in a few days.

TheAngryByrd commented 4 years ago

Can confirm @thirus's issue, it's the same in my app. Looking at memory dumps, there's a lot of XML reminiscent of the EventPipe xml.

djluck commented 4 years ago

@thirus - I think that issue perfectly explains why memory consumption gradually rises (thanks for bringing it to my attention):

When running on .NET Core 2.2, I saw the same gradual memory consumption issue, but on 3.0 memory use remained stable: [image] The poorly-drawn 1.) represents when I started benchmarking a test ASP.NET Core app (with prometheus-net.DotNetRuntime enabled) on .NET Core 2.2 and the barely-legible 2.) is when I started running on .NET Core 3.0.0-preview7.

Another issue chalked up to bugs in 2.2!

bdrupieski commented 4 years ago

Looking at memory dumps, there's a lot of XML reminiscent of the EventPipe xml.

@TheAngryByrd When I was profiling memory to get an idea of what was going on, this is what I saw too.

@djluck Good find that it's a 2.2 thing. I'm looking forward to upgrading soon, once it's GA or RC and no longer in preview.

TheAngryByrd commented 4 years ago

@bdrupieski in the comments of the preview 7 release blog, Richard mentioned they're not doing an "RC" anymore and we can consider preview 7 to be the RC.

thirus commented 4 years ago

Setting ENV COMPlus_EventPipeCircularMB=128 did not help; same behavior. The non-managed memory increases by around ~12-15MB per hour (without any load other than /health and /metrics), so the app restarts with an OOM exception depending on the max memory set in Docker/k8s. This is really unstable and unfortunate. Does anyone have luck with .NET Core 2.1? I don't want to upgrade to 3.0 until it's GA.

sywhang commented 4 years ago

Hi all - I'm from the .NET runtime team at Microsoft and I work on EventPipe/LTTng/ETW/counters part of the runtime. I saw this library being used by many of our customers so I decided to come over and see what things we need to improve on, and also wanted to clarify a few things that are going on in this thread.

I'm still seeing a very small memory leak, only the process working set grows, while the GC heap is stable

This is probably the growth in the buffer that the runtime uses to store events internally. This library seems to be using EventListener, which creates an EventPipe session with a buffer size limit of 1024MB in the .NET Core 2.2 runtime. This behavior was fixed in 3.0 (by decreasing the size limit to 10MB), and hence you won't see the constant increase in memory in 3.0. The constant growth in memory shown in some of the graphs above is due to this - it hasn't hit the maximum limit it thinks it can have, so it keeps growing. However, if it does grow beyond 1 GB, then we might be looking at a runtime bug. I have https://github.com/dotnet/coreclr/issues/23562 currently filed as a CoreCLR bug that I'm having a hard time reproducing, but if someone can repro this and send a heap dump for me to debug, I would be more than happy to try to debug it and potentially file a servicing fix for it.
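
For context, the EventListener-based collection being described looks roughly like this (a minimal sketch with a made-up class name, not the library's actual listener):

using System.Diagnostics.Tracing;

// Minimal sketch of an EventListener subscribing to the CLR's runtime event provider.
// Every event delivered to OnEventWritten is first buffered by the runtime's EventPipe
// session, which is where the buffer growth on .NET Core 2.2 comes from.
public sealed class RuntimeEventListener : EventListener
{
    protected override void OnEventSourceCreated(EventSource eventSource)
    {
        if (eventSource.Name == "Microsoft-Windows-DotNETRuntime")
        {
            // Keyword 0x1 = GC events; higher verbosity levels produce far more events.
            EnableEvents(eventSource, EventLevel.Informational, (EventKeywords)0x1);
        }
    }

    protected override void OnEventWritten(EventWrittenEventArgs eventData)
    {
        // A real collector would inspect eventData.EventName / eventData.Payload here.
    }
}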

Setting ENV COMPlus_EventPipeCircularMB=128 did not help same behavior

This environment variable is only valid if you are using the environment variable COMPlus_EnableEventPipe to have the runtime write the events directly into a trace file. It doesn't apply to EventListeners which is what this library uses. Unfortunately there isn't a way to configure how much memory EventListener can use for its event buffers.

Sorry that you are all running into issues in 2.2 - While I can't guarantee that I can make the fix go into 2.2 servicing release (the decision is above my pay grade), I'm hoping to come up with a reasonable fix that can be ported back to 2.2 servicing release for those of you who are still planning to use 2.2 after the 3.0 GA.

If there are any issues/suggestions to EventPipe/Counters part of the runtime/API, please feel free to reach out to me at suwhang AT microsoft.com and I'll be happy to chat!

djluck commented 4 years ago

Hey @sywhang, thanks for taking the time to provide us with that level of detail, it's really appreciated. Also I just want to take this opportunity to thank you guys for providing these new telemetry hooks- they're definitely appreciated 👍

Marusyk commented 4 years ago

I have the same issue with the memory leak. Any updates?

tylerohlsen commented 4 years ago

This issue is now being tracked here: dotnet/coreclr#26344. But also know that the issue is not present in .NET Core 3.0, which is currently supported in production environments and will be fully released on September 23.

semyon2105 commented 4 years ago

We're experiencing the same issue with CPU increase / memory leak on .NET Core 3.0-3.1. The memory supposedly leaks slowly over time but the CPU usage increase is more apparent. Once we turned off the DotNetRuntime collectors, these problems went away.

Here are the usage graphs of our service that uses the DotNetRuntime collectors on .NET Core 3.0: [image]

I made a small repro with a blank .NET Core 3.1 app that illustrates the issue here: https://github.com/semyon2105/leak-repro


Might be related to https://github.com/dotnet/runtime/issues/31900

sywhang commented 3 years ago

Hi @semyon2105, thanks for the detailed repro steps. I tried to repro your issue by running the app you shared, and here is what I have so far after running it for ~10 hours:

[image: working set graph over ~10 hours]

The metric I used to draw the chart above was collected using dotnet-counters. As you can see, the working set (which, on Linux, is equivalent to the RSS) converges to roughly ~10MB more than what it started from. This is expected, since the internal buffers used by the runtime to store events for an EventListener session are capped at 10 MB. If it grew beyond the 10MB and showed unbounded growth, I would be worried and would have to investigate further into the cause, but what I have seen so far does not point to that : )

The increase in memory usage in the graph you posted above may be related to an issue I fixed in .NET Core 3.1.5 servicing release, where the native buffer allocation pattern in EventPipe was causing an internal heap fragmentation in glibc's internal data structure used in the implementation of malloc. So it is possible for you to see an unbounded growth if you are on .NET Core version 3.1.4 or below.

Regarding the CPU usage increase, I haven't done a full CPU trace and analyzed it, but from my knowledge on how .NET runtime eventing works, the CPU usage increase is probably caused by the way metrics are captured in this library (which is via EventListener to turn on runtime event providers). These runtime event providers are normally used for in-depth tracing of the runtime components' behavior such as the GC, JIT, ThreadPool, Assembly Loader, etc. These are not cheap, since the runtime has to populate the event payload to send to the listener, and that alone can slow down things by quite a bit.

The exact performance overhead will vary largely depending on the characteristics of your application and the specific metrics you turn on. For instance, if your app is allocating a lot of objects and you turn on the GcStats collector in this library, that will turn on GC eventing in the runtime at a verbose level. That will give you a lot of useful GC info, but it comes at a cost. For instance, the GC has to populate an event payload and fire an event every time it marks a Type for collection during its mark phase, every time it allocates a new chunk of managed memory, etc. These are not things you want to be firing all the time in a performance-sensitive environment, since it will slow down the GC quite a bit. The verbose keyword is usually used when you are actively diagnosing some GC issue in your app.

Changing the sampling rate could help alleviate a little bit, but I wouldn't expect it to change the behavior that much since much of the cost is probably coming from the runtime populating the event payload (not parsing the event payload on the consumer side does not mean the runtime is not populating those events' payloads). The performance impact, as you might expect, would be even greater on an app that's doing actual work (i.e. a heavily loaded server) compared to a blank app that's not doing anything (since the runtime components like GC or ThreadPool would be stressed further in a loaded server and the perf hit of populating the event payload makes them run slower).

This is why the runtime metrics exposed by the System.Runtime provider are not computed using EventListener or the runtime events, but by querying internal data structures in the runtime directly.

Part of the effort we are working on for .NET 5 and .NET 6 is to convince area-owners of various runtime components in .NET to expose APIs for developers to get some of these "more in-depth" metrics without having to start expensive tracing sessions. For example, the GC has added the GCMemoryInfo API for you to get a lot of useful GC-related metrics through an API call. The ThreadPool also has a few APIs you can invoke to get the metrics directly instead of having to start a tracing session. We're hoping to increase the scope of metrics we cover by adding similar APIs for other runtime components as well, so if you have feedback on metrics you'd like to see, please feel free to come to the dotnet/runtime repo and submit feature requests.
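
As a quick illustration of that direct-query approach, the GC and thread pool metrics mentioned above can be read from .NET Core 3.0+ APIs without any tracing session (the class name below is just for the example):

using System;
using System.Threading;

// Sketch: read runtime metrics directly from .NET Core 3.0+ APIs instead of
// starting an event tracing session.
static class RuntimeMetricsSnapshot
{
    public static void Print()
    {
        GCMemoryInfo gcInfo = GC.GetGCMemoryInfo();
        Console.WriteLine($"Heap size: {gcInfo.HeapSizeBytes} bytes, fragmented: {gcInfo.FragmentedBytes} bytes");
        Console.WriteLine($"Gen 0/1/2 collections: {GC.CollectionCount(0)}/{GC.CollectionCount(1)}/{GC.CollectionCount(2)}");

        Console.WriteLine($"Thread pool threads: {ThreadPool.ThreadCount}");
        Console.WriteLine($"Queued work items: {ThreadPool.PendingWorkItemCount}, completed: {ThreadPool.CompletedWorkItemCount}");
    }
}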

If you have any more questions regarding the behavior of EventListener, EventPipe, or how the .NET runtime emits the runtime events, I am happy to explain more so please let me know : )

simbadltd commented 3 years ago

I have the same issue after adding Prometheus monitoring with default metrics to a .NET Core 3.1 app. I can see that most thread pool threads wait for CPU about 10-15% of the total time. After disabling Prometheus monitoring, threads wait about 5% of the total time.

djluck commented 3 years ago

What version of .NET 3.1 is the container using? Versions < 3.1.5 have a memory leak that can cause the CPU consumption you're experiencing (see this MR: dotnet/runtime#35924).

rwkarg commented 3 years ago

We saw the same issue with 3.1.7

djluck commented 3 years ago

I think this comes down to what @sywhang was saying above - this library collects too many events from the .NET runtime. This isn't intentional, but to get at the few "interesting" events it has to enable a level of verbosity that can see hundreds/thousands of events generated a second. In V4 (which I'm struggling to find time to dedicate to) I'm trying to move towards an approach with less impact - selectively enabling traces when it's useful to do so, but otherwise relying on event counters. I do think there is some underlying issue in the .NET runtime, but from what I can tell it wasn't designed for these kinds of long-running collections.
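
For reference, the lighter-weight event counters approach looks roughly like this (a sketch only; V4's actual implementation may differ). The System.Runtime provider publishes pre-aggregated counters at a configurable interval instead of raw trace events:

using System.Collections.Generic;
using System.Diagnostics.Tracing;

// Sketch: consume the runtime's pre-aggregated counters ("System.Runtime" provider)
// rather than verbose trace events.
public sealed class RuntimeCounterListener : EventListener
{
    protected override void OnEventSourceCreated(EventSource eventSource)
    {
        if (eventSource.Name == "System.Runtime")
        {
            // Request a counter snapshot every 10 seconds.
            EnableEvents(eventSource, EventLevel.Informational, EventKeywords.All,
                new Dictionary<string, string> { ["EventCounterIntervalSec"] = "10" });
        }
    }

    protected override void OnEventWritten(EventWrittenEventArgs eventData)
    {
        // Counter values arrive as "EventCounters" events whose payload is a
        // dictionary of name/value pairs (e.g. cpu-usage, gc-heap-size).
    }
}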

simbadltd commented 3 years ago

I had 3.1.3. Now I've upgraded to 3.1.5. It seems that threads are spinning on the EventPipe lock. Need to verify.

sywhang commented 3 years ago

The issue I fixed resolved the memory issue but likely won't impact CPU usage. For CPU usage, the main issue is that this library uses EventListener to consume the events, which is not the mainline scenario for tracing. The CPU usage increase is due to the lock used by the runtime to emit the events in order. Apart from the issue I described above with the verbosity of events being consumed, EventListeners also aren't really that performant by design, since they weren't designed to be something you'd use for a verbose and long-running session.

That being said, EventPipe does have a scalability issue, especially when events are written from many threads (e.g. more than 30), and we're currently working on improving its scalability for the upcoming .NET 6 release.

In general, the performance of the app versus the verbosity of events/depth of diagnostics you get is a trade-off. Some of the events being turned on by this library are very expensive, and those events weren't meant to be used in a long-running session.

simbadltd commented 3 years ago

@sywhang Thanks for the detailed answer. Could you please clarify: is disabling/reducing the number of metrics the only solution I can go with? Is it correct that the default settings of the prometheus-net library are not suitable for a long-running process? In other words, should I suppress DotNetStats metrics for a production-ready app by default and only enable them for some time intervals when investigating concrete problems?

djluck commented 3 years ago

@simbadltd I think that would be appropriate. I'm trying to reproduce the issue myself so I can investigate what impact briefly enabling/ disabling the trace listeners might have.

djluck commented 3 years ago

I believe I have reproduced the issue and have a workaround.

After (partially) listening to @sywhang's observation that event sessions were not meant to be long-running, I tried disposing of the collector and then re-enabling it shortly after. This was the result: [image]

The blue line marks the point where I disposed of and re-created the runtime stats collector- CPU immediately drops back to levels seen at the start of the application.

I will be incorporating periodic recycling of the collector into a later release but for now, you can always Dispose() and re-create the stats collector.

simbadltd commented 3 years ago

@djluck Thanks for your time. I can confirm the proposed workaround is working. CPU load has stabilized, along with the service throughput and thread pool scheduling latency: [images]

The yellow line marks the point where the workaround was applied.

djluck commented 3 years ago

@simbadltd thanks for confirming this fixes things for you, appreciate the report mate.

Obviously disabling metric collection is not ideal, but a quick enable/disable should see minimal disruption to the metrics collected. I'm going on holiday for the next week but will be keen to get back to work on V4 and this fix once I'm back.

rwkarg commented 3 years ago

CPU is looking stable so far with the workaround after two days.

I added the following in the ASP.NET Core Startup.Configure method:


Task.Run(async () =>
  {
    var delay = TimeSpan.FromMinutes(5);
    while (!applicationLifetime.ApplicationStopping.IsCancellationRequested)
    {
      var collector = DotNetRuntimeStatsBuilder.Default().StartCollecting();
      await Task.Delay(delay);
      collector.Dispose();
    }
  });

simbadltd commented 3 years ago

I've used the same solution, but wrapped it as a BackgroundService, which is designed for such long-running tasks:

    public class CollectorService : BackgroundService
    {
        protected override async Task ExecuteAsync(CancellationToken stoppingToken)
        {
            while (!stoppingToken.IsCancellationRequested)
            {
                using var collector = StartCollection();
                await Task.Delay(TimeSpan.FromHours(3), stoppingToken);
            }
        }

        private IDisposable StartCollection()
        {
            var builder = DotNetRuntimeStatsBuilder.Customize()
                .WithContentionStats()
                .WithJitStats()
                .WithThreadPoolSchedulingStats()
                .WithThreadPoolStats()
                .WithGcStats();

            return builder.StartCollecting();
        }
    }

djluck commented 3 years ago

So I found a fairly serious bug that you should be aware of if you're using the "dispose after X duration" workaround: https://github.com/dotnet/runtime/issues/49804. This can cause events to stop being produced and even application crashes.

This has delayed the release of V4 but I think I have a half-decent workaround for now.

djluck commented 3 years ago

I released V4 today, which should resolve all performance issues if you're using DotNetRuntimeStatsBuilder.Default().StartCollecting() - by default, only event counters are used to generate metrics.

The option remains to enable the existing EventListeners to get more detailed metrics (see https://github.com/djluck/prometheus-net.DotNetRuntime/blob/master/docs/metrics-exposed.md). If you do choose to collect more detailed metrics I would suggest: