elastic / ecs-dotnet

https://www.elastic.co/guide/en/ecs-logging/dotnet/current/setup.html

[BUG] Memory leak #341

Closed urielginsburg closed 5 months ago

urielginsburg commented 10 months ago

ECS integration/library project(s) (e.g. Elastic.CommonSchema.Serilog): Elastic.Serilog.Sinks

ECS schema version (e.g. 1.4.0): 8.6.1

ECS .NET assembly version (e.g. 1.4.2): 8.6.1

Elasticsearch version (if applicable): 8.6.2

.NET framework / OS: Core 7 / Any OS

During a soak test on an AKS cluster, during which we enabled verbose logging, our services grew to staggering memory footprints. I captured a dump with dotnet-dump, and it turns out there are huge numbers of EcsDocument objects remaining in the generation 2 heap, even after they had been sent to Elasticsearch (I verified that several of the document instances actually existed in Elasticsearch).

Steps to reproduce: As far as I can tell, simply logging a large number of log lines reproduces this behavior. I created a sample program to demonstrate this:

using Elastic.Serilog.Sinks;
using Serilog;

var memoryBefore = GC.GetTotalMemory(false);
Console.WriteLine($"Memory before: {memoryBefore} bytes");

Serilog.Debugging.SelfLog.Enable(s => Console.WriteLine(s));

Log.Logger = new LoggerConfiguration()
    .WriteTo.Elasticsearch(new[] { new Uri("http://localhost:9200") }, o =>
    {
        o.BootstrapMethod = Elastic.Ingest.Elasticsearch.BootstrapMethod.Failure;
        o.MinimumLevel = Serilog.Events.LogEventLevel.Verbose;
        o.DataStream = new Elastic.Ingest.Elasticsearch.DataStreams.DataStreamName("logs");
    }) 
    .CreateLogger();

Parallel.For(0, 10000, i =>
{
    // Random.Shared is thread-safe, unlike a shared Random instance.
    var randomData = $"Logging information {i} - Random value: {Random.Shared.NextDouble()}";
    Log.Information(randomData);
});

Log.CloseAndFlush();

await Task.Delay(TimeSpan.FromMinutes(1));

// Force garbage collection
GC.Collect();
GC.WaitForPendingFinalizers();
GC.Collect();

var memoryAfter = GC.GetTotalMemory(false);
Console.WriteLine($"Memory after: {memoryAfter} bytes");

var memoryUsed = memoryAfter - memoryBefore;
Console.WriteLine($"Memory used: {memoryUsed} bytes");

Here is the output of 'dotnet-dump analyze' (gcroot) for one of the instances:

> gcroot 0x7F09547ABCD8
HandleTable:
    00007f4904e113a8 (strong handle)
          -> 7f0904c04428     System.Object[]
          -> 7f092d05caf8     System.Buffers.TlsOverPerCoreLockedStacksArrayPool<Elastic.CommonSchema.EcsDocument>
          -> 7f092d05cb60     System.Buffers.TlsOverPerCoreLockedStacksArrayPool<Elastic.CommonSchema.EcsDocument>+PerCoreLockedStacks[]
          -> 7f092712fea0     System.Buffers.TlsOverPerCoreLockedStacksArrayPool<Elastic.CommonSchema.EcsDocument>+PerCoreLockedStacks
          -> 7f092712feb8     System.Buffers.TlsOverPerCoreLockedStacksArrayPool<Elastic.CommonSchema.EcsDocument>+LockedStack[]
          -> 7f0927130298     System.Buffers.TlsOverPerCoreLockedStacksArrayPool<Elastic.CommonSchema.EcsDocument>+LockedStack
          -> 7f09271302b8     Elastic.CommonSchema.EcsDocument[][]
          -> 7f09567f2568     Elastic.CommonSchema.EcsDocument[]
          -> 7f09547abcd8     Elastic.CommonSchema.EcsDocument
pdeluca93 commented 10 months ago

I'm experiencing a very similar problem with .NET 6 dockerized apps:

My stack:

The problem:

At random points, my containers begin to consume an excessive amount of memory, even during periods of inactivity. The memory consumption steadily increases without limit.

I've observed the following error is logged when this problem begins:

2023-10-21T22:32:53.5989859Z Observed an exception while writing to logs-dotnet-default
2023-10-21T22:32:53.5990190Z System.ArgumentNullException: Value cannot be null. (Parameter 'source')
   at System.Linq.ThrowHelper.ThrowArgumentNullException(ExceptionArgument argument)
   at System.Linq.Enumerable.Where[TSource](IEnumerable`1 source, Func`2 predicate)
   at Elastic.Serilog.Sinks.SelfLogCallbackListener`1.<>c.<.ctor>b__42_1(BulkResponse response, IWriteTrackingBuffer _)
   at Elastic.Channels.BufferedChannelBase`3.ExportBufferAsync(ArraySegment`1 items, IOutboundBuffer`1 buffer)

As a workaround that decreases the impact (in my context) but doesn't solve the underlying issue, I've set this environment variable to disable server GC:

DOTNET_gcServer: 0

And applied memory limits to my containers.
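
For reference, a quick way to confirm at runtime which GC mode the container actually ended up with (a minimal sketch, not part of my original setup):

using System;
using System.Runtime;

// Prints whether the runtime selected server or workstation GC, which is
// useful for verifying that DOTNET_gcServer=0 was honoured.
Console.WriteLine($"Server GC: {GCSettings.IsServerGC}");
Console.WriteLine($"Latency mode: {GCSettings.LatencyMode}");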

As a side note, it's very frustrating that the sink lacks documentation on how to use Serilog.Settings.Configuration to apply all the settings from the appsettings files. I've figured out how to set some properties, but not how to configure BootstrapMethod, ConfigureChannel, and DataStream.

Mpdreamz commented 5 months ago

Thank you for reporting this to us and apologies for getting to it late.

This memory leak should now be fixed as of: https://github.com/elastic/ecs-dotnet/pull/371

All the documents are owned by the array pools backing System.Threading.Channels and get promoted to gen 2 rather quickly. We also have a high default for InboundBufferMaxSize of 100k, which will eat memory for caching purposes. Moreover, since we use BoundedChannelFullMode.Wait, the channel maintains its own internal queue of blocked writes to be pushed later, so in cases where we heavily overproduce it can allocate far more than expected; see: https://github.com/dotnet/runtime/blob/fadd8313653f71abd0068c8bf914be88edb2c8d3/src/libraries/System.Threading.Channels/src/System/Threading/Channels/BoundedChannel.cs#L609
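
As a rough illustration of that behaviour, here is a minimal sketch using System.Threading.Channels directly (not the sink's actual code; the capacities are arbitrary). Once a bounded channel in Wait mode is full, each additional WriteAsync caller is parked in the channel's internal queue of blocked writers together with its item, so sustained overproduction keeps far more objects alive than the configured capacity suggests:

using System;
using System.Linq;
using System.Threading.Channels;
using System.Threading.Tasks;

// Minimal sketch: a bounded channel in Wait mode.
var channel = Channel.CreateBounded<string>(new BoundedChannelOptions(capacity: 10)
{
    FullMode = BoundedChannelFullMode.Wait
});

// Start far more writes than the capacity. The first 10 complete immediately;
// the rest remain pending inside the channel, each holding on to its message.
var pendingWrites = Enumerable.Range(0, 10_000)
    .Select(i => channel.Writer.WriteAsync($"message {i}").AsTask())
    .ToArray();

Console.WriteLine($"Writes still pending: {pendingWrites.Count(t => !t.IsCompleted)}");

// Draining the channel releases the blocked writers (and their messages).
for (var drained = 0; drained < 10_000; drained++)
    await channel.Reader.ReadAsync();

await Task.WhenAll(pendingWrites);
Console.WriteLine("All writes completed once the reader drained the channel.");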

While we were already disposing the buffers, we now localise their lifetimes more tightly so the GC can release them sooner: https://github.com/elastic/elastic-ingest-dotnet/pull/52

The callbacks also no longer get passed an instance of the buffer, only a copy of its metadata: https://github.com/elastic/elastic-ingest-dotnet/pull/53

Both of these greatly ease the memory pressure.
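
To make the shape of those two changes concrete, here is a hypothetical sketch (none of these types are the library's actual API; it only illustrates keeping the pooled buffer's lifetime local to the export call and handing callbacks a metadata copy instead of the buffer itself):

using System;
using System.Buffers;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical types for illustration only.
public readonly record struct BufferMetadata(int Count, DateTimeOffset ExportedAt);

public sealed class Exporter<TDocument>
{
    // Callbacks receive a copy of the buffer's metadata rather than the
    // buffer itself, so they cannot accidentally keep the pooled array alive.
    public event Action<BufferMetadata>? OnExported;

    public async Task ExportAsync(IReadOnlyList<TDocument> documents, Func<ArraySegment<TDocument>, Task> send)
    {
        // The pooled array is rented, used and returned inside this method,
        // keeping its lifetime as short as possible.
        var array = ArrayPool<TDocument>.Shared.Rent(documents.Count);
        try
        {
            for (var i = 0; i < documents.Count; i++)
                array[i] = documents[i];

            await send(new ArraySegment<TDocument>(array, 0, documents.Count));
            OnExported?.Invoke(new BufferMetadata(documents.Count, DateTimeOffset.UtcNow));
        }
        finally
        {
            // Clear on return so the pool does not keep the documents reachable.
            ArrayPool<TDocument>.Shared.Return(array, clearArray: true);
        }
    }
}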

Some more background on how the buffer sizes play into memory usage:

https://github.com/elastic/elastic-ingest-dotnet/pull/54
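
For anyone who cannot upgrade immediately, lowering that inbound buffer ceiling should also reduce the footprint. A sketch only: ConfigureChannel and InboundBufferMaxSize are both referenced above, but the exact property path (BufferOptions) is an assumption here, so verify it against the current API:

using System;
using Elastic.Serilog.Sinks;
using Serilog;

Log.Logger = new LoggerConfiguration()
    .WriteTo.Elasticsearch(new[] { new Uri("http://localhost:9200") }, o =>
    {
        o.ConfigureChannel = channelOptions =>
        {
            // The default reported above is 100k documents; a lower ceiling
            // caps how many EcsDocument instances the channel keeps buffered.
            channelOptions.BufferOptions.InboundBufferMaxSize = 10_000;
        };
    })
    .CreateLogger();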