dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
14.82k stars · 4.61k forks

Long individual GC pauses: how to investigate? #102657

Closed lsoft closed 4 weeks ago

lsoft commented 2 months ago

We found that our system suffers from long GC pauses:

[image]

GC index 219 pauses the app for 20 506 msec. For comparison, the similar 1N GC index 216 pauses for only 36 msec.

219 GC on graph:

[image]

216 GC on graph:

[image]

This happens only with server GC; workstation GC does not have such long pauses.

The app targets net7.0 with the latest runtime installed; the OS is Windows Server 2012 on a dedicated server, no containers, and the app runs as a Windows service.

I'm a newbie in the field of GC issue investigation and a little confused by this picture, so I'm asking for tips: how do I investigate this further?

Thanks!

mangod9 commented 2 months ago

Can you try collecting another trace with CPU samples, as described here: https://devblogs.microsoft.com/dotnet/work-flow-of-diagnosing-memory-performance-issues-part-2/#continuing-the-analysis? That might provide some clues as to where the time in GC is being spent.

cshung commented 2 months ago

Can you share the raw XML view of that GC? You can access it by clicking the raw XML link on the GCStats page and searching for the specific GC id.

lsoft commented 2 months ago

@mangod9 sure. I will rerun tracing at the start of next week! thanks!

@cshung sure. here is the file:

https://www.dropbox.com/scl/fi/ouuxur890lrin9k5cvtil/PerfViewData_0524_1255-1400_kernel.etl_b0e08c4.gcStats.16400.xml?rlkey=vdp7i08xpqqceq89pcmrcpwaj&dl=0

cshung commented 2 months ago

> @mangod9 sure. I will rerun tracing at the start of next week! thanks!
>
> @cshung sure. here is the file:
>
> https://www.dropbox.com/scl/fi/ouuxur890lrin9k5cvtil/PerfViewData_0524_1255-1400_kernel.etl_b0e08c4.gcStats.16400.xml?rlkey=vdp7i08xpqqceq89pcmrcpwaj&dl=0

Something interesting is going on with your stacks at the time of GC 219.

This is what I found at GC 219, the problematic one.

<PerHeapHistory MarkStack ="8 109,383(0)" MarkFQ ="0,002(0)" MarkHandles ="0,039(520)" MarkOldGen ="0,007(0)">

This is what I found at GC 221, which is also a Gen1 GC

<PerHeapHistory MarkStack ="0,436(0)" MarkFQ ="0,002(0)" MarkHandles ="0,178(724260)" MarkOldGen ="1,595(4625295)">

Note the huge difference in the MarkStack times.

Any idea what your stacks look like at that time?
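For anyone scanning the gcStats XML for similar outliers, the PerHeapHistory timing attributes appear to follow the shape `<msec>(<promoted bytes>)`, with this trace's locale using a comma as the decimal separator and a space as the thousands separator. A small sketch for parsing such a value (the format is my assumption, inferred from the two values quoted above, not from a PerfView spec):

```python
import re

def parse_per_heap_timing(value):
    """Parse a PerHeapHistory timing like '8 109,383(0)'.

    Assumed format (inferred from the values quoted above):
    '<milliseconds>(<promoted bytes>)', where the milliseconds use a
    comma as the decimal separator and spaces (possibly non-breaking)
    as thousands separators. Returns (msec, promoted_bytes).
    """
    m = re.fullmatch(r"([\d\s\u00a0]+(?:,\d+)?)\((\d+)\)", value.strip())
    if m is None:
        raise ValueError(f"unrecognized timing value: {value!r}")
    msec_text = m.group(1).replace(" ", "").replace("\u00a0", "").replace(",", ".")
    return float(msec_text), int(m.group(2))

# The two MarkStack values quoted above:
print(parse_per_heap_timing("8 109,383(0)"))  # (8109.383, 0)
print(parse_per_heap_timing("0,436(0)"))      # (0.436, 0)
```

Looping this over every `PerHeapHistory` element and sorting by the MarkStack milliseconds is a quick way to find all the outlier GCs, not just the ones spotted by eye.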

lsoft commented 2 months ago

@cshung sorry, I don't follow. I'm using PerfView for the first time, and I'm seeing this XML for the first time too :)

Are you asking for the stacks of all running threads in the app at the moment of GC 219? The system is huge; I have no idea about the stacks.

For reasons I don't understand, the etl file contains CPU stacks only for the last minutes:

[image]

So, I guess, the stacks are not recoverable from this etl file, and I need to repeat the tracing with some different(???) command line...

lsoft commented 2 months ago

@cshung what is MarkStack? I would appreciate any tips on where and what to search for in the codebase!

cshung commented 2 months ago

> @cshung what is MarkStack? I would appreciate any tips on where and what to search for in the codebase!

The .NET GC is a tracing GC: before the GC can reclaim memory, it must know where all the live references are so that it doesn't collect an object currently in use.

MarkStack corresponds to the process of going through the stacks, figuring out which objects (and the objects they reference) are live, and marking them so that the collection process will not accidentally collect them.

Here is where the GC performs the mark stack operation

https://github.com/dotnet/runtime/blob/35e4aad602ee3c28330e94746cfd4e0d4569b66f/src/coreclr/gc/gc.cpp#L29485

Roughly speaking, that code in the GC calls into the runtime to perform thread and stack frame enumeration. For each thread and each frame, the runtime reports references to the GC through a callback, and the callback recursively marks the object as well as the objects it references.
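That recursive marking can be sketched in miniature (a toy model, not the actual gc.cpp logic; the real GC uses an explicit mark stack rather than deep recursion, and marks via object headers):

```python
class Obj:
    """Toy heap object: a list of outgoing references plus a mark bit."""
    def __init__(self, name):
        self.name = name
        self.refs = []
        self.marked = False

def mark(obj):
    """The callback's job: mark an object and, transitively, everything it references."""
    if obj.marked:
        return  # already visited, so shared objects are processed only once
    obj.marked = True
    for ref in obj.refs:
        mark(ref)

def mark_stacks(thread_stacks):
    """'MarkStack' in miniature: for each thread, for each frame,
    report every root reference to the marking callback."""
    for frames in thread_stacks:
        for frame_roots in frames:
            for root in frame_roots:
                mark(root)

# A tiny object graph: a -> b -> c, while d is unreachable garbage.
a, b, c, d = Obj("a"), Obj("b"), Obj("c"), Obj("d")
a.refs.append(b)
b.refs.append(c)

mark_stacks([[[a]]])  # one thread, one frame, one root
print([o.name for o in (a, b, c, d) if o.marked])  # ['a', 'b', 'c']
```

The cost of this phase scales with the number of threads, frames, roots, and reachable objects, which is why unusual stacks can blow up the MarkStack time.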

As far as I can tell from your XML, the GC is spending the majority of its time in that process.

From a user's perspective, you can't really control the GC, so it makes more sense to suspect its input (i.e. the threads, the frames, and the objects). Maybe you have an abnormally large number of threads, frames, or locals; we just don't know at the moment. That's why I am curious about what happens to the stacks at that point in time.

Is that long GC behavior something you can reliably reproduce? You might want to put in a poor man's profiler (just time the operation before and after the line I mentioned) and capture a dump right after that; then you can inspect the stack and the heap at that point to analyze.
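The "poor man's profiler" idea is just: timestamp before and after the suspect call, and trigger a dump when the elapsed time crosses a threshold. A sketch of the pattern (shown in Python for brevity; in gc.cpp it would be the same shape in C++, and `capture_dump` here is a hypothetical placeholder for whatever dump mechanism you use, e.g. createdump or procdump):

```python
import time

def capture_dump():
    # Hypothetical placeholder: in a real deployment this would invoke
    # a dump tool against the process instead of printing.
    print("dump captured")

def timed_call(operation, threshold_msec=1000.0):
    """Time a single operation and capture a dump if it ran too long."""
    start = time.perf_counter()
    result = operation()
    elapsed_msec = (time.perf_counter() - start) * 1000.0
    if elapsed_msec >= threshold_msec:
        capture_dump()
    return result, elapsed_msec

# Quick demonstration with a fast operation (no dump expected):
result, elapsed = timed_call(lambda: sum(range(1000)), threshold_msec=1000.0)
print(result)  # 499500
```

The point of the threshold is that the event is rare: you only pay the cost of a dump on the one-in-an-hour pathological case.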

lsoft commented 2 months ago

@cshung thank you for the info.

> abnormally large number of threads

the timings differ by 8000x. I know nothing about the GC implementation, but it looks like some lock... I don't know...

> reliably reproduce

no, it is a stochastic event, roughly one occurrence per hour.

I need to think about watching the MarkStack time and making a dump if the value is high. Thank you for your help and this idea!

cshung commented 2 months ago

> @cshung thank you for the info.
>
> > abnormally large number of threads
>
> the timings differ by 8000x. I know nothing about the GC implementation, but it looks like some lock... I don't know...
>
> > reliably reproduce
>
> no, it is a stochastic event, roughly one occurrence per hour.
>
> I need to think about watching the MarkStack time and making a dump if the value is high. Thank you for your help and this idea!

If you are thinking along the lines of capturing the event and making dumps, you might want to check this out.

https://cshung.github.io/posts/generation-aware-analysis/

and the associated PR

https://github.com/dotnet/runtime/pull/40332

Looking at the promoted bytes, it is unclear whether generation-aware analysis can help directly, but we might be able to extend it so that it thresholds on the mark stack time as well.

lsoft commented 2 months ago

@cshung could you give me a tip on how to interpret the event with GCNumber="283"?

it has PauseDurationMSec="17 793,178", but the maximum MarkStack is MarkStack ="1 013,295(2966)", and the other timings (MarkStack, MarkFQ, MarkHandles, MarkOldGen) in all heaps are less than 4 msec.

where did we spend the time in event 283?

cshung commented 2 months ago

> where did we spend the time in event 283?

If I remember correctly, MarkStack is probably the lowest level of granularity we have in a GCCollectOnly trace; we don't have any finer-grained events fired along that line. As a result, the trace can't tell you any more about what happened within MarkStack.

Context: if you recall, I said that when we mark the live references from the stack, the GC calls out to the runtime to enumerate the threads and stacks. Technically, that part is outside of the GC, which is why a GCCollectOnly trace doesn't cover anything inside it. There are routinely millions of objects being marked, so we can't trace inside the marking routines either.

We do have an aggregate: throughout the whole mark_phase, a number of objects are marked. For GC-internal purposes, we calculate a number called promoted_bytes, which is roughly the sum of the sizes of all marked objects, and we report it through the GCStats. That should give you an aggregate idea of the volume of work done. One caveat: the number includes objects marked from sources other than threads and stacks (e.g. handles, finalization queues, and perhaps others).

> could you give me a tip on how to interpret the event with GCNumber="283"?

Short answer: interpret the XML as is; trust what it says.

PauseDurationMSec means exactly what it says: the pause duration in milliseconds. If you trust the code and are okay with that abstraction level (i.e. you can intuitively accept the notion of the duration paused by the GC without going into deeper detail), you can stop reading now.

Long answer: here is how you figure it out.

Assuming that you either don't trust that this number is a pause duration, or you want to know precisely what pause duration means, you can go back to the code and backtrack what it is.

To start with, the XML is produced by GcStats.ToXml, the GCEvent tag is produced by GcStats.ToXmlAttribs, and the value of PauseDurationMSec comes from gc.PauseDurationMSec.

https://github.com/microsoft/perfview/blob/0a35b99ccf10559c4e85b55584a4f31ee375c217/src/PerfView/GcStats.cs#L446C28-L446C45

Next, the gc variable has type TraceGC, so we look into how that field gets populated. There are a lot of references to the field; fortunately, only 3 of them are assignments. The one you want to look into is this one.

https://github.com/microsoft/perfview/blob/0a35b99ccf10559c4e85b55584a4f31ee375c217/src/TraceEvent/Computers/TraceManagedProcess.cs#L658

Looking at the definition of TimeStampRelativeMSec, that value comes from the tracing infrastructure; it is basically the timestamp the tracer recorded when the event was emitted.

In this particular block of code, the data is GCRestartEEStop.

The _gc.PauseDuration is computed similarly here.

https://github.com/microsoft/perfview/blob/0a35b99ccf10559c4e85b55584a4f31ee375c217/src/TraceEvent/Computers/TraceManagedProcess.cs#L734

This time around the data is GCStart.

Therefore, the pause duration is precisely the time between the GCStart event and the GCRestartEEStop event.
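So, in spirit, PerfView's computation reduces to subtracting two event timestamps. A minimal sketch with invented timestamps (the event records here are illustrative; the real model is the C# TraceGC computer linked above):

```python
# Illustrative event stream; names match the ETW events discussed above,
# but the timestamps are invented for this example.
events = [
    {"name": "GCStart",         "gc_number": 219, "ts_msec": 100000.0},
    {"name": "GCRestartEEStop", "gc_number": 219, "ts_msec": 120506.0},
]

def pause_duration_msec(events, gc_number):
    """PauseDurationMSec as described above: the time from the GCStart
    event to the GCRestartEEStop event of the same GC."""
    start = next(e["ts_msec"] for e in events
                 if e["name"] == "GCStart" and e["gc_number"] == gc_number)
    stop = next(e["ts_msec"] for e in events
                if e["name"] == "GCRestartEEStop" and e["gc_number"] == gc_number)
    return stop - start

print(pause_duration_msec(events, 219))  # 20506.0
```

This also explains why PauseDurationMSec can exceed the sum of the per-heap mark timings: everything between those two events counts, including work that no finer-grained event covers.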

With that, we can go into the GC code to find exactly where this happens. As you probably know by now, the runtime code is a lot harder to work with. Code reading only takes you so far: the runtime uses a combination of Python code generation and C macro magic, so your best bet is to guess which function is going to be called and use a debugger. With that, I found these:

Here is a stack showing how the GCStart event is fired:

clrgcexp!GCEventFireGCStart_V2<unsigned int,unsigned int,unsigned int,unsigned int>+0x87 [C:\Dev\runtime\src\coreclr\gc\gcevents.h @ 11] 
clrgcexp!SVR::GCHeap::UpdatePreGCCounters+0x9e [C:\Dev\runtime\src\coreclr\gc\gcee.cpp @ 51] 
clrgcexp!SVR::gc_heap::do_pre_gc+0x2a4 [C:\Dev\runtime\src\coreclr\gc\gc.cpp @ 49850] 

Unlike GCStart, where there is an obvious call in the runtime with exactly the same name, if you search for GCRestartEEStop in the code base, there is none in the native runtime.

To search for this one, you look at the definition of GCRestartEEStop in PerfView here.

https://github.com/microsoft/perfview/blob/0a35b99ccf10559c4e85b55584a4f31ee375c217/src/TraceEvent/Parsers/ClrTraceEventParser.cs#L340

Note the event id 132 and correlate it with the event manifest; we find it here.

https://github.com/dotnet/runtime/blob/0588f245dba6418349c22894b40337afc34f1cfc/src/coreclr/vm/ClrEtwAll.man#L103

So the runtime calls that event GCRestartEEEnd instead of GCRestartEEStop; now we can find the call site.

coreclr!FireEtwGCRestartEEEnd_V1 [C:\Dev\runtime\artifacts\obj\coreclr\windows.x64.Debug\inc\clretwallmain.h @ 131] 
coreclr!ThreadSuspend::RestartEE+0x17f [C:\Dev\runtime\src\coreclr\vm\threadsuspend.cpp @ 5484] 
coreclr!GCToEEInterface::RestartEE+0x42 [C:\Dev\runtime\src\coreclr\vm\gcenv.ee.cpp @ 60] 
coreclr!standalone::GCToEEInterface::RestartEE+0x18 [C:\Dev\runtime\src\coreclr\gc\env\gctoeeinterface.standalone.inl @ 19] 
clrgcexp!GCToEEInterface::RestartEE+0x58 [C:\Dev\runtime\src\coreclr\gc\gcenv.ee.standalone.inl @ 36] 
clrgcexp!SVR::gc_heap::gc_thread_function+0xa63 [C:\Dev\runtime\src\coreclr\gc\gc.cpp @ 7230] 

That should completely answer the question of what PauseDurationMSec means. It is the time spent in GC between line 49850 and line 7230, with the caveat that it includes some overhead from the stack frames shown above.

If you think this is too tedious, I feel the same; unfortunately, this is what it takes right now if you need that precision.

lsoft commented 2 months ago

@cshung oh, thanks a lot for this detailed answer! 👍

So, my plan is to 1) rule out interference from other processes, 2) get a dump right after a long MarkStack event but before that GC completes, and then try to analyze the dump.

cshung commented 2 months ago

@lsoft This particular section in the mem-doc might help; there is a PerfView option that lets you get finer-grained info when you hit the random long GC.

https://github.com/Maoni0/mem-doc/blob/master/doc/.NETMemoryPerformanceAnalysis.md#debugging-a-random-long-gc

lsoft commented 2 months ago

@mangod9

> Can you try collecting another trace with CPU samples as described here

which is better to capture: /KernelEvents=default+Memory+VirtualAlloc or /KernelEvents=ThreadTime+Memory+VirtualAlloc, in addition to /ClrEvents:GC+Stack?

Anyway, I have both etl traces. In the first I see a 9000 msec gen1 pause with low MarkStack timings; in the second I see a 14000 msec pause with MarkStack around 5000 msec for each thread.

I tried to follow the workflow (https://devblogs.microsoft.com/dotnet/work-flow-of-diagnosing-memory-performance-issues-part-2/#continuing-the-analysis) but with no success. I cannot reproduce that magic :)

What are my next steps?

mangod9 commented 2 months ago

Are you able to share some of your traces?

dotnet-policy-service[bot] commented 1 month ago

This issue has been marked needs-author-action and may be missing some important information.

dotnet-policy-service[bot] commented 1 month ago

This issue has been automatically marked no-recent-activity because it has not had any activity for 14 days. It will be closed if no further activity occurs within 14 more days. Any new comment (by anyone, not necessarily the author) will remove no-recent-activity.

dotnet-policy-service[bot] commented 4 weeks ago

This issue will now be closed since it had been marked no-recent-activity but received no further activity in the past 14 days. It is still possible to reopen or comment on the issue, but please note that the issue will be locked if it remains inactive for another 30 days.