Closed: pharring closed this issue 4 years ago.
The thought occurred to me that PerfView is just the tool for tracking down the heap allocations. Indeed, here are the relevant allocating stacks:
Name | Inc % |
---|---|
|| + ntdll!RtlpAllocateHeapInternal | 98.6 |
|| + coreclr!EEHeapAllocInProcessHeap | 98.6 |
|| + coreclr!operator new[] | 50.2 |
|| |+ coreclr!EventPipeEvent::EventPipeEvent | 50.2 |
|| ||+ coreclr!EventPipeProvider::AddEvent | 50.2 |
|| || + coreclr!EventPipeInternal::DefineEvent | 50.2 |
|| || + system.private.corelib!IEventProvider.DefineEventHandle | 50.2 |
|| || + system.private.corelib!NameInfo.GetOrCreateEventHandle | 50.2 |
|| || + system.private.corelib!EventSource.WriteImpl | 50.2 |
|| || + ?!? | 50.2 |
|| |+ coreclr!CStructArray::Grow | 0.0 |
|| + coreclr!operator new | 48.4 |
|| + coreclr!EventPipeProvider::AddEvent | 48.4 |
|| + coreclr!EventPipeInternal::DefineEvent | 33.8 |
|| |+ system.private.corelib!IEventProvider.DefineEventHandle | 33.8 |
|| | + system.private.corelib!NameInfo.GetOrCreateEventHandle | 33.8 |
|| | + system.private.corelib!EventSource.WriteImpl | 33.8 |
|| | + ?!? | 33.8 |
|| + coreclr!EventPipeProvider::AddEvent | 14.6 |
|| + coreclr!EventPipeInternal::DefineEvent | 14.6 |
|| + system.private.corelib!IEventProvider.DefineEventHandle | 14.6 |
|| + system.private.corelib!NameInfo.GetOrCreateEventHandle | 14.6 |
|| + system.private.corelib!EventSource.WriteImpl | 14.6 |
|| + ?!? | 14.6 |
@pharring We should probably move this bug to the coreclr repo since it is almost certainly going to be fixed by modifying the .NET Core runtime.
@pharring I'm taking a look.
FYI, I was able to eliminate PerfView from the repro by building my own TraceEvent-based listener. I also got a simplified repro in Azure App Service. They all have the same symptoms, pointing to this as the root cause.
Thanks @pharring, I am able to repro it locally with the code above.
Including @brianrob who wrote this.
I took a quick look and the key place is this stack:
Name
--
|| |||||| |+ system.private.corelib!System.Diagnostics.Tracing.EventSource.Write[System.__Canon](System.String, System.Diagnostics.Tracing.EventSourceOptions, System.__Canon)
|| |||||| ||+ system.private.corelib!EventSource.WriteImpl
|| |||||| |||+ system.private.corelib!NameInfo.GetOrCreateEventHandle
|| |||||| ||||+ system.private.corelib!IEventProvider.DefineEventHandle
|| |||||| |||||+ coreclr!EventPipeInternal::DefineEvent
|| |||||| ||||| + coreclr!EventPipeProvider::AddEvent
This stack says that, starting with an event write, we try to create a handle for it (for EventPipe), and this seems to allocate ALL THE TIME. However, a handle to an event is about its meta-data, and thus there should be a lookup, and MOST of the time we should NOT be creating new event meta-data.
Thus the basic issue is that we are missing our 'lookup' that would find that we already have meta-data for this event.
@brianrob - Can you quickly tell us where it SHOULD be happening?
There is a secondary issue that we should not need this meta-data at all unless EventPipe is turned on (which it is not), so it would be nice if we avoided even generating the handle if EventPipe is off.
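To illustrate the kind of lookup being described, here is a sketch of the pattern only, not the actual NameInfo/EventPipe code; the class and member names are made up:

```csharp
using System;
using System.Collections.Concurrent;

// Illustrative sketch of the expected behavior: the native metadata handle for
// an event is created once and then reused, so DefineEvent (which allocates a
// native metadata blob) does not run on every Write call.
static class EventHandleCache
{
    // Hypothetical key: the event name (the real cache hangs off NameInfo).
    private static readonly ConcurrentDictionary<string, IntPtr> s_handles =
        new ConcurrentDictionary<string, IntPtr>();

    public static IntPtr GetOrCreateEventHandle(string eventName, Func<string, IntPtr> defineEvent)
    {
        // Fast path: return the cached handle. Slow path (first write only):
        // call DefineEvent and remember the result.
        return s_handles.GetOrAdd(eventName, defineEvent);
    }
}
```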
@jorive - We should also check the scenario where we are collecting with EventPipe instead of ETW before we call this bug fixed.
I'm unable to repro in netcoreapp3.0, and https://github.com/dotnet/coreclr/pull/18355 seems to have fixed it.
@jorive - I assume you are going to try back-porting that fix (it is not that big).
@pharring - assuming that the fix above fixes things, the main thing we need to figure out is the impact of the bug. Can you detail the impact so that @jorive can make the case to the triage group?
In particular, in a 'normal' use of Application Insights, how bad is the leak? I am assuming that it is big enough that it forces customers to turn off App-Insights. It will also affect any other data collection (thus things leak if you also collect with PerfView).
We should err on the side of fixing this. Looking at https://github.com/dotnet/core/blob/master/release-notes/download-archive.md, we have already released 3 updates to 2.2, and presumably we will have another within a month. That release will likely live at least 6 more months before V3.0 becomes available, so I think it is worth it.
@vancem, you're right that this is about metadata, and prior to dotnet/coreclr#18355, the cache was essentially busted and so we always created a new metadata blob. With dotnet/coreclr#18355 this should no longer happen and only a single blob per event should be created.
It is likely possible to avoid creating the metadata if EventPipe isn't enabled, but if I recall correctly, I did the metadata generation regardless because it was hard to get to a place outside of event initialization that had enough of the data required to produce the metadata.
FWIW, our team had to remove all usage of EventPipe used for diagnostics in our product, as the memory leak was extreme and would OOM kill our process every couple of hours. Backporting any fixes to 2.2 would be much appreciated.
> as the memory leak was extreme and would OOM kill our process every couple of hours. Backporting any fixes to 2.2 would be much appreciated.
+1. Waiting for 3.0 to be fully shipped => fully tested for our app => available in Azure is a long path to go.
> I am assuming that it is big enough that it forces customers to turn off App-Insights.
It actually depends on the load. The leak is about ~200 bytes per event, as I understand it.
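As a rough illustration (my own arithmetic, not a measurement from this thread): at ~200 bytes per event, a service writing 1,000 events per second would leak about 200 KB/s, i.e. roughly 700 MB of native heap per hour.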
@jorive - it sounds like this bug is rather severe. I think we have a strong case to back-port it to V2.2. We should pursue that.
Agreed
I'm preparing port to 2.2 right now.
Any idea if this could be available on Azure at all? I mean deployed to Azure so it could be used in App Service? If yes, how long does that usually take? :)
Seems like the issue is still reproducible on 2.2.5. I'm using an EventListener to monitor events, and when it is in use, memory usage in dotnet.exe climbs very fast.
Host dotnet --info:
Host (useful for support):
  Version: 2.2.5
  Commit: 0a3c9209c0
.NET Core SDKs installed:
  No SDKs were found.
.NET Core runtimes installed:
  Microsoft.AspNetCore.All 2.2.5 [/usr/share/dotnet/shared/Microsoft.AspNetCore.All]
  Microsoft.AspNetCore.App 2.2.5 [/usr/share/dotnet/shared/Microsoft.AspNetCore.App]
  Microsoft.NETCore.App 2.2.5 [/usr/share/dotnet/shared/Microsoft.NETCore.App]
We are also still seeing this issue on 2.2.5. Exact same dotnet --info output as above.
@Spaniol Does it cap at around ~1GB? (That is the default internal circular buffer that gets allocated/used.)
@jorive Yeah, I believe that may be the case. I left the service for 3 days without any traffic and it used around 900 MB of memory, while usually it uses around 150 MB.
That memory is reused to buffer events before writing to file, and it should not grow beyond the specified value. It's not a memory leak.
Is this memory not used when you disable EventPipe?
> That memory is reused to buffer events before writing to file, and it should not grow beyond the specified value. It's not a memory leak.
How big is the buffer in 2.2?
PS: Has this changed in 3.0, taking into account the Docker-related changes?
@mjsabby Memory isn't used when the pipe is disabled; when it's enabled, memory usage climbs like crazy under traffic while heap usage remains at the same level. @jorive Is the buffer size configurable? 1 GB of memory seems like a bit too much if you just want to gather some basic metrics for a dashboard and monitoring.
@Spaniol Yes, there are a few ways to configure the size of the buffer, depending on how you are enabling EventPipe tracing in your project:
- The COMPlus_EventPipeCircularMB environment variable (the default is COMPlus_EventPipeCircularMB=1024).
- The App.eventpipeconfig file drop: you can add a CircularMB entry to the file (if not provided, it defaults to CircularMB=1024).
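For example (only the variable and key names come from the list above; the 256 value and file placement are illustrative), capping the buffer at 256 MB would look like:

```
# Environment variable (default is 1024):
COMPlus_EventPipeCircularMB=256

# Or an entry in the <AppName>.eventpipeconfig file dropped next to the app:
CircularMB=256
```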
@jorive Sounds promising! I'll try the configuration via environment variables tomorrow and post the results here in case anyone else runs into this trouble. I can see that the ticket for EventPipe documentation is still open, so it's difficult to figure this out from the start, but using EventPipe for basic monitoring will be a common case.
@Spaniol For 3.0 we added runtime event counters, which are far lower overhead for purposes like simple APMs or dashboards in "basic monitoring" scenarios. Unfortunately, backporting that to 2.2 is far more costly than we can afford, but I just wanted to let you know that better alternatives do exist in 3.0 :)
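For anyone landing here later, here is a minimal sketch of consuming those 3.0 runtime counters in-process with an EventListener. The provider name System.Runtime and the EventCounterIntervalSec argument are part of the 3.0 counters feature; the class and output format below are illustrative.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.Tracing;
using System.Linq;

// Sketch only: subscribe to the .NET Core 3.0 "System.Runtime" counter source
// from inside the process and print each counter snapshot once per second.
sealed class RuntimeCounterListener : EventListener
{
    protected override void OnEventSourceCreated(EventSource source)
    {
        if (source.Name == "System.Runtime")
        {
            // EventCounterIntervalSec asks the source to publish counter
            // snapshots at the given interval (in seconds).
            EnableEvents(source, EventLevel.Informational, EventKeywords.All,
                new Dictionary<string, string> { ["EventCounterIntervalSec"] = "1" });
        }
    }

    protected override void OnEventWritten(EventWrittenEventArgs e)
    {
        // Counter snapshots arrive as "EventCounters" events whose first payload
        // item is a dictionary of fields (Name, Mean, Increment, ...).
        if (e.EventName == "EventCounters" && e.Payload != null && e.Payload.Count > 0 &&
            e.Payload[0] is IDictionary<string, object> counter)
        {
            Console.WriteLine(string.Join(", ", counter.Select(kv => $"{kv.Key}={kv.Value}")));
        }
    }
}
```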
> Is this memory not used when you disable EventPipe?
This memory is deallocated when the tracing session is disabled.
I created dotnet/coreclr#25413 to track this. FWIW, I think 256MB is probably too big of a default and we should look into changing it.
I believe 256MB is the default used by PerfView, but I agree we should consider lowering it.
Our microservices were growing at a rate of 40MB per hour (each) because of this issue. We have turned off the Prometheus.Net collectors that hook up to the event sources as a temporary workaround (https://github.com/djluck/prometheus-net.DotNetRuntime/issues/6).
This is definitely not using a circular buffer as I have an instance that's been running for 9 days and is now at an approximate 7GB commit size and still growing!
I have confirmed .Net Core 3 - Preview 8 does not experience the same issue. I have also confirmed that .Net Core 2.2.0 - 2.2.6 do experience the issue. I have the same stack trace for the allocations as mentioned in the first reply above.
Please back-port the fix to .Net Core 2.2.
cc @josalem
> This is definitely not using a circular buffer as I have an instance that's been running for 9 days and is now at an approximate 7GB commit size and still growing!
That definitely doesn't sound normal to me either.
> I have confirmed .Net Core 3 - Preview 8 does not experience the same issue.
This makes me less nervous. That being said, AFAIK @jorive already ported his fix (https://github.com/dotnet/coreclr/pull/23661) to 2.2, and if it's not addressing the issue, we might be looking at a different issue from the one that was already fixed.
Do you happen to have a memory dump and/or a small repro that I can use to repro the behavior?
Sorry, I do not think I can share the full memory dump, but here is some data from the memory dump and I can get you more data upon request.
0:000> !heap -stat -h 000002b837d30000
heap @ 000002b837d30000
group-by: TOTSIZE max-display: 20
size #blocks total ( %) (percent of total busy bytes)
19000 cbf9 - 13eb51000 (96.11)
32000 3ae - b7fc000 (3.47)
258000 2 - 4b0000 (0.09)
38 d482 - 2e7c70 (0.05)
30 eb62 - 2c2260 (0.05)
0:000> !heap -p -a 000001d85f2a5f80
address 000001d85f2a5f80 found in
_HEAP @ 1d3c2ff0000
HEAP_ENTRY Size Prev Flags UserPtr UserSize - state
000001d85f2a5f50 1903 0000 [00] 000001d85f2a5f80 19000 - (busy)
7ff9014cff83 ntdll!RtlpCallInterceptRoutine+0x000000000000003f
7ff901494423 ntdll!RtlpAllocateHeapInternal+0x0000000000078ba3
7ff88c8755b9 coreclr!EEHeapAllocInProcessHeap+0x0000000000000029
7ff88c875571 coreclr!operator new[]+0x0000000000000025
7ff88caebd12 coreclr!EventPipeBuffer::EventPipeBuffer+0x0000000000000016
7ff88ca7da1a coreclr!EventPipeBufferManager::AllocateBufferForThread+0x00000000000001e2
7ff88ca7e2be coreclr!EventPipeBufferManager::WriteEvent+0x00000000000000e6
7ff88ca0dd5e coreclr!EventPipe::WriteEventInternal+0x00000000000000be
7ff88c8e3e9c coreclr!EventPipe::WriteEvent+0x000000000000003c
7ff88c9643c5 coreclr!FireEtwThreadCreated+0x0000000000120415
7ff88c843fa3 coreclr!ETW::ThreadLog::FireThreadCreated+0x0000000000000043
7ff88c843ef7 coreclr!Thread::HasStarted+0x00000000000000ef
7ff88c83a42e coreclr!ThreadNative::KickOffThread+0x00000000000000be
7ff88c83a34a coreclr!Thread::intermediateThreadProc+0x000000000000008a
7ff8ffa37bd4 kernel32!BaseThreadInitThunk+0x0000000000000014
7ff90144ce71 ntdll!RtlUserThreadStart+0x0000000000000021
As you can see, at the time of that memory dump, there were 52,217 occurrences of these 102,400 byte objects that take up a total of 5,347,020,800 bytes (5.3 GB) of memory. I spot checked some of the same-size objects in a debug instance and they all had the same allocation stack trace listed here (sorry for the mix-up, my stack trace is different than the OP's stack trace).
The repro was pretty simple. I made an ASP.NET Core 2.2.4 targeted application that was self-hosted as a console app and self-contained with win7-x64 as the runtime target. Then I added a NuGet reference to the latest prometheus-net.DotNetRuntime and called DotNetRuntimeStatsBuilder.Default().StartCollecting(); in Startup.Configure.
I can create a project like that and attach a zip if you'd like.
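For reference, here is a sketch of the wiring just described (assuming the Prometheus.DotNetRuntime namespace from that package; this is not a complete project):

```csharp
using Microsoft.AspNetCore.Builder;
using Prometheus.DotNetRuntime;

public class Startup
{
    public void Configure(IApplicationBuilder app)
    {
        // Starts the library's EventListener-based runtime stats collection; on
        // .NET Core 2.2 the runtime events it subscribes to flow through EventPipe,
        // which matches the EventPipeBuffer allocation stack in the dump above.
        DotNetRuntimeStatsBuilder.Default().StartCollecting();

        // ... the rest of the usual ASP.NET Core pipeline goes here.
    }
}
```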
Attached is a simple app that reproduces my issue: EventPipeMemoryLeak.zip
I just let the app sit idle and the memory usage continuously increases, and the heap stats show a regular increase in the quantity of these 0x19000-size objects.
@sywhang and @josalem ^^
@tylerohlsen I'm running the repro app and trying to see how it ends up in that state. So far I'm not seeing anything that's not expected - I'm waiting to see if the memory usage grows beyond 1GB which is what EventListener sessions have as the max limit.
That being said, the 0x19000-byte allocations do correspond to the EventPipe buffer blocks that we allocate internally to store events. The repro app's been running for about ~1.5 hours now and it's at 380 MB. I will come back in a few hours, see how things have changed, and take it from there.
@tylerohlsen I moved the issue to dotnet/coreclr#26344 because this issue is different from the original issue. I'll continue posting things in that thread.
Closing this issue since the original issue was fixed in both master/2.2.
Reported originally at https://github.com/Microsoft/ApplicationInsights-dotnet/issues/1102
Repros on .NET Core 2.2, x86 and x64 on Windows.
Build the following application:
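The original code sample is not included in this excerpt. A hypothetical reconstruction of the kind of app described (an EventSource named "PSH-LeakTest", matching the provider name given below, whose self-describing events are written in a loop); the event and payload names are invented for illustration:

```csharp
using System;
using System.Diagnostics.Tracing;
using System.Threading;

[EventSource(Name = "PSH-LeakTest")]
sealed class LeakTestEventSource : EventSource
{
    public static readonly LeakTestEventSource Log = new LeakTestEventSource();
}

static class Program
{
    static void Main()
    {
        while (true)
        {
            // Self-describing Write<T>: with the broken metadata cache, every call
            // reaches NameInfo.GetOrCreateEventHandle -> DefineEvent and allocates
            // a fresh native metadata blob while a listener/session is attached.
            LeakTestEventSource.Log.Write("Ping", new { Value = 42, Time = DateTime.UtcNow });
            Thread.Sleep(1);
        }
    }
}
```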
Run it on Windows and watch the memory usage of dotnet.exe in Task Manager. It settles down on my machine around 10 MB.
Now enable the ETW event listener for the event source. The easiest way to do that is to run PerfView. Press Alt+C to start a collection and make it look like this:
The important thing is to add this in Additional Providers:
PSH-LeakTest::Informational
You can turn off all the other options. Now, when you start logging, watch the memory usage in dotnet.exe climb like crazy (over 200 MB in 5 seconds). When you stop logging, the memory isn't freed. If you create a dump of the dotnet.exe process, the GC heap is fine; most of the additional memory is in the native heap and it seems to consist of the event manifests repeated over and over again. This does not reproduce on .NET 4.7.2; I don't know about earlier .NET Core runtimes.
Please let me know if you need any more information, but I think this is a pretty minimal repro with clear instructions. One thing I could do is eliminate PerfView by writing a custom app based around TraceEvent or even a WPR profile.
To be clear about how this affects Application Insights customers: the Application Insights Profiler runs in Azure App Services. It captures ETW sessions for 2 minutes at a time, about once an hour. One of the providers it enables (listens to) is the Application Insights "Rich Payload Event Source" (see the original bug in the Application Insights repo). Customers are now encountering these memory leaks under load while profiling is running, leading to application restarts.