Hi, thanks for the detailed analysis! We've just released a pre-release version of the tracer (1.28.1-prerelease) which addresses a memory leak in the native portion of the tracer. Would it be possible for you to update to the latest and see if that has an impact? If not, I'll try and reproduce it locally using the Ping command.
Excellent! I'll get that deployed to a test environment tomorrow. I should have some results by tomorrow afternoon (7/15).
Looks like that did not fix it.
1.26.1 with MongoDB integrations
1.28.0 with MongoDB integrations
1.28.0 without MongoDB integrations
1.28.1-prerelease with MongoDB integrations (datadog-integrations.json for version 1.28.0 - oops!)
1.28.1-prerelease with MongoDB integrations (datadog-integrations.json for version 1.28.1-prerelease)
Thanks for the info @tylerohlsen, which version of the MongoDB.Driver are you using? Also, would it be possible to share the "ping" code you're using to call the mongo db? Thanks
I'm using 2.12.2
Hi @tylerohlsen, I'm having trouble reproducing this locally. I tried creating a simple web API that loads an object from a MongoDB and called it repeatedly in a loop. I left it hitting that API 10x a second, but couldn't see an appreciable increase in memory.
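For reference, the repro attempt was roughly along these lines (a minimal sketch; the controller, database, and collection names are illustrative, not from the original):

// Hypothetical minimal repro endpoint: one Find() per request, hit in a loop from a client.
using System.Threading.Tasks;
using Microsoft.AspNetCore.Mvc;
using MongoDB.Bson;
using MongoDB.Driver;

[ApiController]
[Route("[controller]")]
public class PingController : ControllerBase
{
    private readonly IMongoCollection<BsonDocument> _collection;

    public PingController(IMongoClient client)
    {
        _collection = client.GetDatabase("test").GetCollection<BsonDocument>("items");
    }

    [HttpGet]
    public async Task<IActionResult> Get()
    {
        // Load a single document, mirroring the "ping" style query described above.
        var doc = await _collection.Find(FilterDefinition<BsonDocument>.Empty).FirstOrDefaultAsync();
        return Ok(doc?.ToJson());
    }
}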
I think it may be best if you could please create a support ticket by emailing support@datadoghq.com, so we can track down the problem. Please include a link to this github issue if you do. It would be especially useful if you can provide a memory dump of the problematic service, as well as code snippets. Thanks!
It looks like we face the same issue.
We see the memory leak and also strange traces:
We tried the latest 1.28.2 release.
Any updates on that?
I did some analysis on it last week since we noticed the issue.
Setup:
CLR: .NET Core 3.1
MongoDb.Driver: 2.12.0
DataDog.Trace: 1.28.2
OS: Linux (inside Docker)
Usage: An API controller that executes a simple Mongo Find() query, running in a Linux Docker container on K8s. The MongoClient + MongoCollection are kept in a singleton class, as recommended. The MongoDB cluster is a PSA setup (Primary, Secondary, and Arbiter).
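For illustration, the setup described above is roughly this shape (a sketch under my own naming; the class, database, and collection names are placeholders, not the actual application code):

// Hypothetical singleton holder: one MongoClient/IMongoCollection reused for the app lifetime.
using System.Threading.Tasks;
using MongoDB.Driver;

public class MyDocument
{
    public string Id { get; set; }
}

public class MongoDbStore
{
    private readonly IMongoCollection<MyDocument> _collection;

    public MongoDbStore(string connectionString)
    {
        // Created once; the driver's cluster-monitoring background threads live as long as this client.
        var client = new MongoClient(connectionString);
        _collection = client.GetDatabase("mydb").GetCollection<MyDocument>("documents");
    }

    // Called from the API controller on each request.
    public Task<MyDocument> FindByIdAsync(string id) =>
        _collection.Find(d => d.Id == id).FirstOrDefaultAsync();
}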
Behaviour: The application grows in memory over time and doesn't report MongoDB stats correctly. It looks like not all MongoDB metrics are tracked, and some reported spans are odd (e.g. a single web request with 2+ hours (summed) of isMaster checks).
Note: other DataDog metrics are working fine in the same application, only mongodb is affected.
Analysis: In gcdumps it's visible that some Datadog Scope instances keep accumulating, all of them with MongoDbTags. Their share of the heap increases over time since they aren't garbage collected.
With a full memory dump I noticed a Datadog.Trace.Span[] of size 65536 slowly being populated (I assume it is some kind of buffer). Initially all items are null, but they keep getting filled with Datadog.Trace.Span instances (holding Datadog.Trace.ClrProfiler.Integrations.MongoDbTags). This looks like an issue with flushing the data or with the lifetime management of a Scope.
The gcroot of that Span[] goes through the MongoCollection itself:
[... MVC / App noise ... ]
-> 00007FBF306E8768 MongoDB.Driver.MongoCollectionImpl`1[[MyApplication.MongoDbModel, MyApplication]]
-> 00007FBEF056EF40 MongoDB.Driver.Core.Clusters.MultiServerCluster
-> 00007FBEF056F680 System.Collections.Generic.List`1[[MongoDB.Driver.Core.Servers.IClusterableServer, MongoDB.Driver.Core]]
-> 00007FBEF05703A8 MongoDB.Driver.Core.Servers.IClusterableServer[]
-> 00007FBEF056F928 MongoDB.Driver.Core.Servers.Server
-> 00007FBEF056FDA0 MongoDB.Driver.Core.Servers.ServerMonitor
-> 00007FBF31472658 MongoDB.Driver.Core.Connections.BinaryConnection
-> 00007FBF31472F58 System.Net.Sockets.NetworkStream
-> 00007FBF31470FC0 System.Net.Sockets.Socket
-> 00007FBF31471048 System.Net.Sockets.SafeSocketHandle
-> 00007FBF31472E00 System.Net.Sockets.SocketAsyncContext
-> 00007FBF31473778 System.Net.Sockets.SocketAsyncContext+BufferMemoryReceiveOperation
-> 00007FBF31473738 System.Action`5[[System.Int32, System.Private.CoreLib],[System.Byte[], System.Private.CoreLib],[System.Int32, System.Private.CoreLib],[System.Net.Sockets.SocketFlags, System.Net.Sockets],[System.Net.Sockets.SocketError, System.Net.Primitives]]
-> 00007FBF31473628 System.Net.Sockets.Socket+AwaitableSocketAsyncEventArgs
-> 00007FBF319298F8 System.Threading.Tasks.ValueTask`1+ValueTaskSourceAsTask[[System.Int32, System.Private.CoreLib]]
-> 00007FBF31929948 System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1+AsyncStateMachineBox`1[[System.Int32, System.Private.CoreLib],[MongoDB.Driver.Core.Misc.StreamExtensionMethods+<ReadAsync>d__1, MongoDB.Driver.Core]]
-> 00007FBF31929A08 System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1+AsyncStateMachineBox`1[[System.Threading.Tasks.VoidTaskResult, System.Private.CoreLib],[MongoDB.Driver.Core.Misc.StreamExtensionMethods+<ReadBytesAsync>d__4, MongoDB.Driver.Core]]
-> 00007FBF31929AA8 System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1+AsyncStateMachineBox`1[[MongoDB.Bson.IO.IByteBuffer, MongoDB.Bson],[MongoDB.Driver.Core.Connections.BinaryConnection+<ReceiveBufferAsync>d__54, MongoDB.Driver.Core]]
-> 00007FBF31929B48 System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1+AsyncStateMachineBox`1[[MongoDB.Bson.IO.IByteBuffer, MongoDB.Bson],[MongoDB.Driver.Core.Connections.BinaryConnection+<ReceiveBufferAsync>d__55, MongoDB.Driver.Core]]
-> 00007FBF31929BF8 System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1+AsyncStateMachineBox`1[[MongoDB.Driver.Core.WireProtocol.Messages.ResponseMessage, MongoDB.Driver.Core],[MongoDB.Driver.Core.Connections.BinaryConnection+<ReceiveMessageAsync>d__57, MongoDB.Driver.Core]]
-> 00007FBF31929C98 System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1+AsyncStateMachineBox`1[[MongoDB.Bson.BsonDocument, MongoDB.Bson],[MongoDB.Driver.Core.WireProtocol.CommandUsingCommandMessageWireProtocol`1+<ExecuteAsync>d__19[[MongoDB.Bson.BsonDocument, MongoDB.Bson]], MongoDB.Driver.Core]]
-> 00007FBF31929E70 System.Threading.Tasks.AwaitTaskContinuation
-> 00007FBF319293F0 System.Threading.ExecutionContext
-> 00007FBF31929370 System.Threading.AsyncLocalValueMap+MultiElementAsyncLocalValueMap
-> 00007FBF31929388 System.Collections.Generic.KeyValuePair`2[[System.Threading.IAsyncLocal, System.Private.CoreLib],[System.Object, System.Private.CoreLib]][]
-> 00007FBF31929340 Datadog.Trace.Scope
-> 00007FBF31929218 Datadog.Trace.Span
-> 00007FBF319291D0 Datadog.Trace.SpanContext
-> 00007FBEF0554E88 Datadog.Trace.SpanContext
-> 00007FBEF0554E30 Datadog.Trace.TraceContext
-> 00007FBF701A2B80 Datadog.Trace.Span[]
Thanks @hiiru for providing all the extra details, this is really useful.
We think we've tracked down the source of the issue; unfortunately, there appear to be multiple facets to this.
MongoClient (incorrectly) flows the execution context from the current thread when it is being created. That means if there's an active trace at the time the MongoClient is created, the background thread that checks the status of the cluster inherits that active trace. Even if the scope is closed on another thread, the background thread will keep adding spans to the trace.
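To illustrate the mechanism with a standalone sketch (this is not the tracer's or the driver's actual code; the AsyncLocal below merely stands in for the tracer's active-scope storage): a value set in an AsyncLocal flows via the captured ExecutionContext to any work started afterwards, unless flow is suppressed.

// Demonstrates ExecutionContext/AsyncLocal flow to background work.
using System;
using System.Threading;
using System.Threading.Tasks;

class AsyncLocalFlowDemo
{
    // Stand-in for the tracer's "active scope" storage.
    private static readonly AsyncLocal<string> ActiveScope = new AsyncLocal<string>();

    static void Main()
    {
        ActiveScope.Value = "request-trace"; // an "active trace" exists at this point

        // The captured ExecutionContext flows to the background work: prints "flowed: request-trace".
        Task.Run(() => Console.WriteLine($"flowed: {ActiveScope.Value ?? "<null>"}"));

        using (ExecutionContext.SuppressFlow())
        {
            // With flow suppressed, the background work sees no active scope: prints "suppressed: <null>".
            Task.Run(() => Console.WriteLine($"suppressed: {ActiveScope.Value ?? "<null>"}"));
        }

        Thread.Sleep(500); // crude wait so the demo's background tasks can print
    }
}

This is why suppressing flow around the MongoClient creation (see the workarounds below) prevents the cluster-monitoring thread from inheriting the trace.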
When MongoClient is registered as a singleton service in ASP.NET Core's DI container, it is "lazily" created when it is first needed. That's very application-dependent, but it's likely to be inside a request trace or something similar, which means it's likely to capture a trace. Depending on the exact behaviour you're seeing, and the nature of the issue, there are various ways to mitigate/fix the issue.
1. Fix MongoClient: This is the real solution to the problem, which would involve an update to the Mongo Csharp Driver to prevent flowing the execution context. I'd suggest filing a support case with Mongo - as a customer, hopefully this will be prioritised!
2. Create the MongoClient ahead of time: Depending on your application, you may find that you can solve the problem by registering your singleton MongoClient instance in DI directly. For example, if you were previously doing this:
public void ConfigureServices(IServiceCollection services)
{
    services.AddSingleton<IMongoClient>(_ => new MongoClient(Configuration.GetConnectionString("mongo")));
}
Do this instead:
public void ConfigureServices(IServiceCollection services)
{
    var mongoClient = new MongoClient(Configuration.GetConnectionString("mongo"));
    services.AddSingleton<IMongoClient>(mongoClient);
}
This ensures that the context is captured when building Startup, instead of when the client is first used. As long as you aren't creating long-lived scopes that encompass this part of the app lifecycle (which we would not recommend anyway), this may solve your issue.
More generally, if you're not using this DI pattern, ensure that you're not creating the MongoClient inside a "long-lived" trace/scope. If you are creating the client inside an existing scope, you can ensure you don't capture the context by calling ExecutionContext.SuppressFlow(), for example:
public void ConfigureServices(IServiceCollection services)
{
    IMongoClient mongoClient;

    // Suppress ExecutionContext flow so the client's background (cluster monitoring)
    // threads don't inherit the currently active trace/scope.
    using (_ = System.Threading.ExecutionContext.SuppressFlow())
    {
        mongoClient = new MongoClient(Configuration.GetConnectionString("mongo"));
    }

    services.AddSingleton<IMongoClient>(mongoClient);
}
We generally advise against creating long-lived traces. By design, traces remain in memory until the last span is closed, so if you have a long-lived trace with many child spans, these will use an increasing amount of memory as the app lifetime increases.
In some cases, you can mitigate this issue by enabling partial flush for the .NET Tracer, as described in the documentation. If the above solution isn't possible or doesn't resolve the issue, it may be worth trying DD_TRACE_PARTIAL_FLUSH_ENABLED=true.
There seems to be little utility in seeing these infrastructural isMaster spans in DataDog, as they mostly seem to clutter up traces (completely aside from the memory issue described). Would you be interested in an option in the tracer that excludes these spans (and other "infrastructural" commands) from automatic instrumentation by default, allowing you to enable them with an environment variable?
That's good news @andrewlock!
I have traces/spans that run inside Configure and ConfigureServices so I can monitor how long those take, so workaround 2 won't help me directly. But I also have a clear spot where I create the MongoClient for the first time. If I add a call to ExecutionContext.SuppressFlow() before I create the first MongoClient, would that clear the trace? Or is there a different explicit call to this Datadog.Trace library I can make to clear the trace? And is SuppressFlow() what the MongoDB driver team would need to add to fix it?
Also, I see in the documentation that DD_TRACE_PARTIAL_FLUSH_ENABLED should only be used "when you have long-lived traces or traces with many spans". Why? What artifacts would be introduced if we turned this on? More memory pressure because of more frequent (but smaller) allocations of spans?
I have traces/spans that run inside Configure and ConfigureServices so I can monitor how long those take, so workaround 2 won't help me directly. But I also have a clear spot where I create the MongoClient for the first time. If I add a call to ExecutionContext.SuppressFlow() before I create the first MongoClient, would that clear the trace?
Good point, yes, that should work. I'll update the above comment to mention that as an option for others in the thread.
And is SuppressFlow() what the MongoDB driver team would need to add to fix it?
Yes, I believe so (testing indicates that solves the issue too). I've created a PR for the fix, but if you could raise a support issue as a customer, that will very likely increase priorities there.
Also, I see in the documentation that DD_TRACE_PARTIAL_FLUSH_ENABLED should only be used "when you have long-lived traces or traces with many spans". Why? What artifacts would be introduced if we turned this on?
Partial flush is only supported in recent versions of the agent, so you'll need to make sure you're using version 7.26.0+. The main artifact you'll find is that child spans will appear in Datadog, with the parent span only arriving later. If you had a span that lasts the lifetime of the application, for example, it wouldn't be flushed until after the app shuts down.
I went with both suppressing execution context flow and DD_TRACE_PARTIAL_FLUSH_ENABLED=true, and I re-enabled the MongoDB integration in the integrations.json with version 1.28.2 of the library. With those both in place, I do not see the memory leak!
Also, we use SignalR. By default, each SignalR connection (per user) has one long-running trace where each call to the SignalR hub is a span, and the trace doesn't complete until the user disconnects (which could be a long time). Turning on DD_TRACE_PARTIAL_FLUSH_ENABLED seems to have dramatically reduced our memory overhead in this SignalR service. So there's a bonus!
I deployed a fix with SuppressFlow and re-enabled the integration a few hours ago, and it looks better. However, the timeframe is still too short to say definitively; I will give an update later this week.
Also, the MongoDB metrics are now tracked correctly, as expected.
There seems to be little utility in seeing these infrastructural isMaster spans in DataDog, as they mostly seem to clutter up traces (completely aside from the memory issue described). Would you be interested in an option in the tracer that exclude these spans from automatic instrumentation by default (and other "infrastructural" commands), allowing enabling them with an environment variable?
Yes, that would be great. There are a lot of isMaster spans and a few saslContinue and buildInfo spans which are only background noise (I didn't notice any other unexpected commands in our case yet). Alternatively, a way to specify what to ignore would be fine as well (not sure if there is one already; I didn't check).
However, from your linked list, please never hide these: createUser and updateUser (but possibly censor sensitive content like the password; I didn't check what's traced here). I assume those aren't used often in normal applications anyway, but in case there is a security incident, traces showing them might be an advantage for security forensics (not really an APM topic, but any information source might help, especially for smaller environments without a dedicated SIEM).
@tylerohlsen that's great news, thanks for the update! SignalR does indeed seem like a good case for the partial flush. Let us know if you run into any more issues.
I deployed a fix with SuppressFlow and re-enabled the integration few hours ago, it looks better.
Thanks @hiiru, hopefully this continues to look good after running for a while
Yes, that would be great
Thanks for the feedback, and I agree regarding the list. I think this would be a better candidate list to ignore:
hello
isMaster
saslStart
saslContinue
getnonce
buildInfo
getLastError
authenticate
After 2 days with the SuppressFlow fix, the memory still looks good. This definitely fixed the issue.
@andrewlock Thank you for the solution and your MongoDb PR to fix it.
That's great news @hiiru, thanks for the update. In which case, I'll close this issue. Feel free to reopen if the issue reappears. Thanks.
Describe the bug
When the MongoDb integrations are present in the dotnet-integrations.json, our memory usage slowly increases at an approximate average rate of 5MB per hour. This continues until it has accumulated enough memory for our tasks in ECS/Fargate to run out of memory and kill/restart our containers.
To Reproduce
Steps to reproduce the behavior:
Screenshots
Runtime environment (please complete the following information):
mcr.microsoft.com/dotnet/aspnet:5.0, which is a Debian Buster-Slim distro