DataDog / dd-trace-dotnet

.NET Client Library for Datadog APM
https://docs.datadoghq.com/tracing/
Apache License 2.0

Memory leak when using MongoDb integration #1602

Closed: tylerohlsen closed this issue 3 years ago

tylerohlsen commented 3 years ago

Describe the bug

When the MongoDb integrations are present in the dotnet-integrations.json, our memory usage slowly increases at an approximate average rate of 5MB per hour. This continues until the tasks in ECS/Fargate run out of memory and our containers are killed and restarted.

To Reproduce

Steps to reproduce the behavior:

  1. Create a task in AWS ECS/Fargate with a .NET 5 service auto-instrumented with dd-trace-dotnet and standard dotnet-integrations.json
  2. Hook up the periodic health check to run the Ping command on the Mongo server (see the sketch after this list)
  3. Let the service sit idle for many days
  4. See the memory usage of the task increase over time from within CloudWatch (or other monitoring tool)
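
For illustration, a periodic health check along these lines might look roughly like the following. This is a minimal sketch using MongoDB.Driver and the ASP.NET Core health check abstractions; the class name and target database are illustrative, not necessarily the exact code in use:

using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Diagnostics.HealthChecks;
using MongoDB.Bson;
using MongoDB.Driver;

// Illustrative health check that runs the Mongo "ping" command, similar to
// the periodic check described in step 2.
public class MongoPingHealthCheck : IHealthCheck
{
    private readonly IMongoDatabase _database;

    public MongoPingHealthCheck(IMongoClient client)
    {
        _database = client.GetDatabase("admin");
    }

    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context, CancellationToken cancellationToken = default)
    {
        // { ping: 1 } is the server's lightweight liveness command.
        await _database.RunCommandAsync<BsonDocument>(
            new BsonDocument("ping", 1), cancellationToken: cancellationToken);
        return HealthCheckResult.Healthy();
    }
}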

Screenshots

[screenshot: DdTraceDotnet-MongoMemoryLeak-1]

Runtime environment (please complete the following information):

andrewlock commented 3 years ago

Hi, thanks for the detailed analysis! We've just released a pre-release version of the tracer (1.28.1-prerelease) which addresses a memory leak in the native portion of the tracer. Would it be possible for you to update to the latest and see if that has an impact? If not, I'll try and reproduce it locally using the Ping command.

tylerohlsen commented 3 years ago

Excellent! I'll get that deployed to a test environment tomorrow. I should have some results by tomorrow afternoon (7/15).

tylerohlsen commented 3 years ago

Looks like that did not fix it.

[screenshot: DdTraceDotnet-MongoMemoryLeak-2]

andrewlock commented 3 years ago

Thanks for the info @tylerohlsen, which version of the MongoDB.Driver are you using? Also, would it be possible to share the "ping" code you're using to call the mongo db? Thanks

tylerohlsen commented 3 years ago

I'm using 2.12.2

andrewlock commented 3 years ago

Hi @tylerohlsen, I'm having trouble reproducing this locally. I tried creating a simple web API that loads an object from a MongoDB database, and called it repeatedly in a loop. I left it hitting that API 10 times a second, but couldn't see an appreciable increase in memory.
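
Roughly, the load loop looked something like this (a sketch; the endpoint name and timing are illustrative, not the exact code used):

using System;
using System.Net.Http;
using System.Threading.Tasks;

// Illustrative load loop: call the test API roughly 10 times per second and
// watch the process memory over time.
class LoadLoop
{
    static async Task Main()
    {
        using var client = new HttpClient();
        while (true)
        {
            // Hypothetical endpoint that loads a single document from MongoDB.
            await client.GetAsync("http://localhost:5000/items/1");
            await Task.Delay(TimeSpan.FromMilliseconds(100));
        }
    }
}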

I think it may be best if you could please create a support ticket by emailing support@datadoghq.com, so we can track down the problem. Please include a link to this github issue if you do. It would be especially useful if you can provide a memory dump of the problematic service, as well as code snippets. Thanks!

devbrains-com commented 3 years ago

It looks like we face the same issue.

We see the memory leak and also strange traces:

[screenshot of the strange traces]

We tried the latest 1.28.2 release.

Any updates on that?

hiiru commented 3 years ago

I did some analysis on it last week since we noticed the issue.

Setup:
  CLR: .NET Core 3.1
  MongoDb.Driver: 2.12.0
  DataDog.Trace: 1.28.2
  OS: Linux (inside docker)

Usage: An API controller that executes a simple Mongo Find() query, running in a Linux docker container in K8s. The MongoClient + MongoCollection are kept in a singleton class as recommended (roughly as sketched below). The MongoDb cluster is a PSA setup (Primary, Secondary and Arbiter).
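
In code terms, the setup is roughly the following (an illustrative sketch of the pattern described above; the type, database and collection names are made up):

using System.Collections.Generic;
using System.Threading.Tasks;
using MongoDB.Bson;
using MongoDB.Driver;

// Singleton holder for the MongoClient/collection, kept for the lifetime of
// the application as recommended by the MongoDB driver documentation.
public class MongoDbStore
{
    private readonly IMongoCollection<MyDocument> _collection;

    public MongoDbStore(string connectionString)
    {
        var client = new MongoClient(connectionString);
        _collection = client.GetDatabase("mydb").GetCollection<MyDocument>("documents");
    }

    // Simple Find() query executed from an API controller.
    public Task<List<MyDocument>> GetByNameAsync(string name) =>
        _collection.Find(d => d.Name == name).ToListAsync();
}

public class MyDocument
{
    public ObjectId Id { get; set; }
    public string Name { get; set; }
}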

Behaviour: The application grows in memory over time and doesn't report mongodb stats correctly. It looks like not all mongodb metrics are tracked, and some reported spans are weird (e.g. a single web request with 2+ hours (sum) of isMaster checks).

[screenshot: datadog-mongodb-weird-metrics]

Note: other DataDog metrics are working fine in the same application, only mongodb is affected.

Analysis: In gcdumps it's visible that the number of Datadog Scope instances keeps increasing, all of them holding MongoDbTags. Their share of the heap grows over time since they aren't garbage collected.

[flame graph: datadog-mongodb-perfview-flamegraph]

In a full memory dump I noticed a Datadog.Trace.Span[] of size 65536 (I assume some buffer) slowly being populated. Initially all items are null, but they keep being filled with Datadog.Trace.Span instances (holding Datadog.Trace.ClrProfiler.Integrations.MongoDbTags). This looks like an issue with flushing the data or with the lifetime management of a Scope.

The gcroot chain of that Span[] goes through the MongoCollection itself:

[... MVC / App noise ... ]
->  00007FBF306E8768 MongoDB.Driver.MongoCollectionImpl`1[[MyApplication.MongoDbModel, MyApplication]]
->  00007FBEF056EF40 MongoDB.Driver.Core.Clusters.MultiServerCluster
->  00007FBEF056F680 System.Collections.Generic.List`1[[MongoDB.Driver.Core.Servers.IClusterableServer, MongoDB.Driver.Core]]
->  00007FBEF05703A8 MongoDB.Driver.Core.Servers.IClusterableServer[]
->  00007FBEF056F928 MongoDB.Driver.Core.Servers.Server
->  00007FBEF056FDA0 MongoDB.Driver.Core.Servers.ServerMonitor
->  00007FBF31472658 MongoDB.Driver.Core.Connections.BinaryConnection
->  00007FBF31472F58 System.Net.Sockets.NetworkStream
->  00007FBF31470FC0 System.Net.Sockets.Socket
->  00007FBF31471048 System.Net.Sockets.SafeSocketHandle
->  00007FBF31472E00 System.Net.Sockets.SocketAsyncContext
->  00007FBF31473778 System.Net.Sockets.SocketAsyncContext+BufferMemoryReceiveOperation
->  00007FBF31473738 System.Action`5[[System.Int32, System.Private.CoreLib],[System.Byte[], System.Private.CoreLib],[System.Int32, System.Private.CoreLib],[System.Net.Sockets.SocketFlags, System.Net.Sockets],[System.Net.Sockets.SocketError, System.Net.Primitives]]
->  00007FBF31473628 System.Net.Sockets.Socket+AwaitableSocketAsyncEventArgs
->  00007FBF319298F8 System.Threading.Tasks.ValueTask`1+ValueTaskSourceAsTask[[System.Int32, System.Private.CoreLib]]
->  00007FBF31929948 System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1+AsyncStateMachineBox`1[[System.Int32, System.Private.CoreLib],[MongoDB.Driver.Core.Misc.StreamExtensionMethods+<ReadAsync>d__1, MongoDB.Driver.Core]]
->  00007FBF31929A08 System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1+AsyncStateMachineBox`1[[System.Threading.Tasks.VoidTaskResult, System.Private.CoreLib],[MongoDB.Driver.Core.Misc.StreamExtensionMethods+<ReadBytesAsync>d__4, MongoDB.Driver.Core]]
->  00007FBF31929AA8 System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1+AsyncStateMachineBox`1[[MongoDB.Bson.IO.IByteBuffer, MongoDB.Bson],[MongoDB.Driver.Core.Connections.BinaryConnection+<ReceiveBufferAsync>d__54, MongoDB.Driver.Core]]
->  00007FBF31929B48 System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1+AsyncStateMachineBox`1[[MongoDB.Bson.IO.IByteBuffer, MongoDB.Bson],[MongoDB.Driver.Core.Connections.BinaryConnection+<ReceiveBufferAsync>d__55, MongoDB.Driver.Core]]
->  00007FBF31929BF8 System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1+AsyncStateMachineBox`1[[MongoDB.Driver.Core.WireProtocol.Messages.ResponseMessage, MongoDB.Driver.Core],[MongoDB.Driver.Core.Connections.BinaryConnection+<ReceiveMessageAsync>d__57, MongoDB.Driver.Core]]
->  00007FBF31929C98 System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1+AsyncStateMachineBox`1[[MongoDB.Bson.BsonDocument, MongoDB.Bson],[MongoDB.Driver.Core.WireProtocol.CommandUsingCommandMessageWireProtocol`1+<ExecuteAsync>d__19[[MongoDB.Bson.BsonDocument, MongoDB.Bson]], MongoDB.Driver.Core]]
->  00007FBF31929E70 System.Threading.Tasks.AwaitTaskContinuation
->  00007FBF319293F0 System.Threading.ExecutionContext
->  00007FBF31929370 System.Threading.AsyncLocalValueMap+MultiElementAsyncLocalValueMap
->  00007FBF31929388 System.Collections.Generic.KeyValuePair`2[[System.Threading.IAsyncLocal, System.Private.CoreLib],[System.Object, System.Private.CoreLib]][]
->  00007FBF31929340 Datadog.Trace.Scope
->  00007FBF31929218 Datadog.Trace.Span
->  00007FBF319291D0 Datadog.Trace.SpanContext
->  00007FBEF0554E88 Datadog.Trace.SpanContext
->  00007FBEF0554E30 Datadog.Trace.TraceContext
->  00007FBF701A2B80 Datadog.Trace.Span[]
andrewlock commented 3 years ago

Thanks @hiiru for providing all the extra details, this is really useful.

The source of the problem

We think we've tracked down the source of the issue; unfortunately, there appear to be multiple facets to it.

  1. The MongoClient (incorrectly) flows the execution context from the current thread when it is being created. That means if there's an active trace at the time the MongoClient is created, the background thread that checks the status of the cluster inherits that active trace (see the sketch after this list). Even if the scope is closed on another thread, the background thread will keep adding spans to the trace.
  2. If you register your MongoClient as a singleton service in ASP.NET Core's DI container, it is "lazily" created when it is first needed. That's very application-dependent, but it's likely to be inside a request trace or something similar, which means it's likely to capture a trace.
  3. If the parent trace is "long-lived", as opposed to a typical "short-lived" request trace, then the trace will not be flushed, and the background thread will continue adding spans to it, causing the memory leak.
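
To make the mechanism in point 1 concrete, here is a small self-contained sketch using only AsyncLocal<T> and ExecutionContext from the BCL (the AsyncLocal stands in for the tracer's ambient active scope; it is not the tracer's actual implementation):

using System;
using System.Threading;

class ExecutionContextFlowDemo
{
    // Stands in for the tracer's ambient "active scope".
    private static readonly AsyncLocal<string> ActiveTrace = new AsyncLocal<string>();

    static void Main()
    {
        ActiveTrace.Value = "request-trace-1234"; // an "active trace" exists here

        // A background thread started while the trace is active inherits the
        // execution context (and therefore the AsyncLocal value) by default.
        new Thread(() =>
            Console.WriteLine("monitor 1 sees: " + (ActiveTrace.Value ?? "(none)")))
            .Start(); // prints "request-trace-1234"

        // With flow suppressed, the background thread starts with a default
        // context and does not inherit the trace.
        using (ExecutionContext.SuppressFlow())
        {
            new Thread(() =>
                Console.WriteLine("monitor 2 sees: " + (ActiveTrace.Value ?? "(none)")))
                .Start(); // prints "(none)"
        }
    }
}

This is why creating the MongoClient (which starts its cluster-monitoring threads) inside an active scope results in the monitor's isMaster spans being attached to that trace.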

Solutions

Depending on the exact behaviour you're seeing, and the nature of the issue, there are various ways to mitigate or fix it.

1. Stop flowing execution context in MongoClient

This is the real solution to the problem, which would involve an update to the Mongo Csharp Driver to prevent flowing the execution context. I'd suggest filing a support case with Mongo - as a customer hopefully this will be prioritised!

2. Create the MongoClient ahead of time

Depending on your application, you may find that you can solve the problem by registering your singleton MongoClient instance in DI directly. For example, if you were previously doing this:

public void ConfigureServices(IServiceCollection services)
{
    services.AddSingleton<IMongoClient>(_ => new MongoClient(Configuration.GetConnectionString("mongo")));
}

Do this instead:

public void ConfigureServices(IServiceCollection services)
{
    var mongoClient = new MongoClient(Configuration.GetConnectionString("mongo"));
    services.AddSingleton<IMongoClient>(mongoClient);
}

This ensures that the context is captured when building Startup, instead of when the client is first used. As long as you aren't creating long-lived scopes that encompass this part of the app lifecycle (which we would not recommend anyway), this may solve your issue.

More generally, if you're not using this DI pattern, ensure that you're not creating the MongoClient inside a "long-lived" trace/scope. If you are creating the client inside an existing scope, you can ensure you don't capture the context by calling ExecutionContext.SuppressFlow(), for example:

public void ConfigureServices(IServiceCollection services)
{
    MongoClient mongoClient;

    // Suppress ExecutionContext flow so the MongoClient's background threads
    // don't inherit the currently active trace/scope.
    using (System.Threading.ExecutionContext.SuppressFlow())
    {
        mongoClient = new MongoClient(Configuration.GetConnectionString("mongo"));
    }

    services.AddSingleton<IMongoClient>(mongoClient);
}

3. Enable partial flush

We generally advise against creating long-lived traces. By design, traces remain in memory until the last span is closed, so if you have a long-lived trace with many child spans, these will use an increasing amount of memory as the app lifetime increases.
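
As a contrived illustration of the pattern described above, using the tracer's custom instrumentation API (Tracer.Instance.StartActive); the operation names here are made up:

using Datadog.Trace;

class LongLivedTraceExample
{
    static void Main()
    {
        // A long-lived parent scope: by default, nothing in this trace is sent
        // to the agent until this outer scope is closed.
        using (var appScope = Tracer.Instance.StartActive("app.lifetime"))
        {
            for (var i = 0; i < 1_000_000; i++)
            {
                // Every child span is kept in memory until the parent completes
                // (unless partial flush is enabled), so memory grows over time.
                using (var child = Tracer.Instance.StartActive("background.work"))
                {
                    child.Span.SetTag("iteration", i.ToString());
                }
            }
        } // only now is the whole trace flushed
    }
}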

In some cases, you can mitigate this issue by enabling partial flush for the .NET Tracer, as described in the documentation. If the above solution isn't possible or doesn't resolve the issue, it may be worth trying DD_TRACE_PARTIAL_FLUSH_ENABLED=true.

Other possible enhancements

There seems to be little utility in seeing these infrastructural isMaster spans in Datadog, as they mostly seem to clutter up traces (completely aside from the memory issue described). Would you be interested in an option in the tracer that excludes these spans (and other "infrastructural" commands) from automatic instrumentation by default, allowing them to be enabled with an environment variable?

tylerohlsen commented 3 years ago

That's good news @andrewlock!

I have traces/spans that run inside Configure and ConfigureServices so I can monitor how long those take, so workaround 2 won't help me directly. But I also have a clear spot where I create the MongoClient for the first time. If I add a call to ExecutionContext.SuppressFlow() before I create the first MongoClient, would that clear the trace? Or is there a different explicit call to this Datadog.Trace library I can make to clear the trace? And is SuppressFlow() what the MongoDB driver team would need to add to fix it?

Also, I see in the documentation that DD_TRACE_PARTIAL_FLUSH_ENABLED should only be used "when you have long-lived traces or traces with many spans". Why? What artifacts would be introduced if we turned this on? More memory pressure because of more frequent (but smaller) allocations of spans?

andrewlock commented 3 years ago

I have traces/spans that run inside Configure and ConfigureServices so I can monitor how long those take, so workaround 2 won't help me directly. But I also have a clear spot where I create the MongoClient for the first time. If I add a call to ExecutionContext.SuppressFlow() before I create the first MongoClient, would that clear the trace?

Good point, yes, that should work πŸ‘ I'll update the above comment to mention that as an option for others in the thread.

And is SuppressFlow() what the MongoDB driver team would need to add to fix it?

Yes, I believe so (testing indicates that solves the issue too). I've created a PR for the fix, but if you could raise a support issue as a customer, that will very likely help it get prioritised there 🙂

Also, I see in the documentation that DD_TRACE_PARTIAL_FLUSH_ENABLED should only be used "when you have long-lived traces or traces with many spans". Why? What artifacts would be introduced if we turned this on?

Partial flush is only supported in recent versions of the agent, so you'll need to make sure you're using version 7.26.0+. The main artifact you'll see is that child spans will appear in Datadog, with the parent span only arriving later. If you had a span that lasts the lifetime of the application, for example, it wouldn't be flushed until after the app shuts down.

tylerohlsen commented 3 years ago

I went with both suppressing execution context flow and setting DD_TRACE_PARTIAL_FLUSH_ENABLED=true, and I re-enabled the MongoDB integration in the integrations.json with version 1.28.2 of the library. With both in place, I do not see the memory leak!

Also, we use SignalR. By default, each SignalR connection (per user) has one long running trace where each call to the SignalR hub is a span and the trace doesn't complete until the user disconnects (which could be a long time). Turning on DD_TRACE_PARTIAL_FLUSH_ENABLED seems to have dramatically reduced our memory overhead in this SignalR service. So there's a bonus!

hiiru commented 3 years ago

I deployed a fix with SuppressFlow and re-enabled the integration a few hours ago, and it looks better. However, the timeframe is still too short to say definitively; I will give an update later this week.

Also, the mongodb metrics are now tracked as expected. 👍

There seems to be little utility in seeing these infrastructural isMaster spans in Datadog, as they mostly seem to clutter up traces (completely aside from the memory issue described). Would you be interested in an option in the tracer that excludes these spans (and other "infrastructural" commands) from automatic instrumentation by default, allowing them to be enabled with an environment variable?

Yes, that would be great. There are a lot of isMaster spans and a few saslContinue and buildInfo spans, which are only background noise (we haven't noticed any other unexpected commands in our case yet). Alternatively, a way to specify what to ignore would be fine as well (not sure if there is one already; I didn't check).

However, from your linked list, please never hide these: createUser and updateUser (though sensitive content like the password should possibly be censored; I didn't check what's traced here). I assume they aren't used often in normal applications anyway, but in case there is a security incident, traces showing them could be valuable for security forensics (not really an APM topic, but any information source might help, especially for smaller environments without a dedicated SIEM).

andrewlock commented 3 years ago

@tylerohlsen that's great news, thanks for the update! SignalR does indeed seem like a good case for partial flush. Let us know if you run into any more issues.

I deployed a fix with SuppressFlow and re-enabled the integration few hours ago, it looks better.

Thanks @hiiru, hopefully this continues to look good after running for a while

Yes, that would be great

Thanks for the feedback, and I agree regarding the list. I think this would be a better candidate list to ignore:

hiiru commented 3 years ago

After 2 days with the SuppressFlow fix, the memory still looks good. This definitely fixed the issue. 👍

[screenshot: memory usage after the fix]

@andrewlock Thank you for the solution and your MongoDb PR to fix it.

andrewlock commented 3 years ago

That's great news @hiiru, thanks for the update. In which case, I'll close this issue. Feel free to reopen if the issue reappears. Thanks.