Open iliamosko opened 4 years ago
However, I did notice that whenever you click on the CMD/Terminal window while the collect command is running, the window seems to "Freeze" and the collect command will always go over the duration that is specified.
This is an expected behavior of conhost if you are talking about Windows.
@sywhang @josalem Isn't it possible that this is just a very lengthy case of rundown? Or is it more likely that there's so much load that it took a while for the timer in the cancellation token to fire and then rundown just lags because of the CPU constraints?
Could be either but I'm guessing it's the rundown that's taking the majority of the time. Stopping the trace on dotnet-trace sends a rundown command to the target app which can take a while - we've seen it take minutes on certain apps with a lot of type definitions to resolve. If the app is under CPU pressure I wouldn't be surprised if it takes tens of minutes.
Also, when trying to open the large trace file (7GB), it would take a substantially long time for PerfView to parse the trace.
This is normal too, although I'm not sure what order of magnitude of time we're talking about by "substantially long". Parsing a trace file as big as 7GB is not something I'd expect PerfView to do in a matter of seconds - my guess is it would take at least several minutes. It needs to first write out the ETLX format file, then parse that again to form a TraceLog object to be able to display it fully.
@hoyosjs
This is an expected behavior of conhost if you are talking about Windows.
This issue also happens when running dotnet-trace under WSL2. Not sure if that is also the expected behavior or not.
@sywhang The problem so far is that when collecting a trace with a specified duration while the machine is under high load, the trace always goes over the specified duration. When we see the trace taking longer than it should, CTRL+C is pressed and the trace "zips" within a reasonable amount of time (~30-60 secs), as expected.
When the trace goes past the specified duration, there is no indication that the trace is being stopped or that it is taking time to compile the logs. The timer and the size of the trace keep increasing until CTRL+C is pressed; only then does the message pop up and the trace complete.
@hoyosjs This issue also happens when running dotnet-trace under WSL2. Not sure if that is also the expected behavior or not.
It's not really about the shell you are using (WSL2, Bash, PowerShell, CMD), but rather the console that's hosting it. The default console host on Windows has that behavior: with QuickEdit mode enabled (the default), clicking in the window starts a selection that pauses console output until it is dismissed (e.g., with Esc or Enter).
@iliamosko > CTRL+C is pressed and the trace "zips" within a reasonable amount of time (~30-60 secs), as expected.
@hoyosjs @sywhang Is the CTRL+C combination a correct way to stop the trace? The application checks for ConsoleKey.Enter.
The file below was produced when collecting a .NET trace with a 10-second duration parameter from inside a Docker container containing the .NET Core process under Linux.
It could not be opened in PerfView on Windows.
The error log:
Started with command line: "C:\Program Files\PerfView.exe" "C:\Users\...\Downloads\example.nettrace"
PerfView Version: 2.0.46 BuildDate: Thu, 24 Oct 2019 10:54:49 GMT
PerfView Start Time 11/11/2020 3:15:46 PM
Started: View
Warning: PdbScope not found at C:\Program Files\PerfViewExtensions\PdbScope.exe
Disabling the Image Size Menu Item.
Warning: ILSize not found at C:\Program Files\PerfViewExtensions\ILSize.dll
Disabling the IL Size Menu Item.
Completed: View (Elapsed Time: 0.114 sec)
Started: Opening example.nettrace
Creating ETLX file C:\Users\sergepavlov\AppData\Local\Temp\PerfView\example_9c22195.etlx from C:\Users\sergepavlov\Downloads\example.nettrace
Collecting a maximum of 20,000,000 events.
[Opening a log file of size 0 MB.]
[ERROR: processing events ****]
***** Consider using /ContinueOnError to ignore the bad part of the trace. *****
Error: Exception EventPipe conversion: System.NotSupportedException: Streams larger than 4 GB. You need to use /MaxEventCount to limit the size.
at FastSerialization.IOStreamStreamWriter.GetLabel()
at Microsoft.Diagnostics.Tracing.Etlx.TraceLog.<>c__DisplayClass83_0.<CopyRawEvents>b__1(TraceEvent data)
at Microsoft.Diagnostics.Tracing.TraceEventDispatcher.DoDispatch(TraceEvent anEvent)
at Microsoft.Diagnostics.Tracing.TraceEventDispatcher.Dispatch(TraceEvent anEvent)
at Microsoft.Diagnostics.Tracing.EventPipeEventSource.DispatchEventRecord(EVENT_RECORD* eventRecord)
at Microsoft.Diagnostics.Tracing.EventPipeEventSource.EventCache_OnEvent(EventPipeEventHeader& header)
at Microsoft.Diagnostics.Tracing.EventPipe.EventCache.SortAndDispatch(Int64 stopTimestamp)
at Microsoft.Diagnostics.Tracing.EventPipe.EventCache.ProcessSequencePointBlock(Byte[] sequencePointBytes)
at Microsoft.Diagnostics.Tracing.EventPipeSequencePointBlock.ReadBlockContents(PinnedStreamReader reader)
at Microsoft.Diagnostics.Tracing.EventPipeBlock.FromStream(Deserializer deserializer)
at FastSerialization.Deserializer.ReadObjectDefinition(Tags tag, StreamLabel objectLabel)
at FastSerialization.Deserializer.ReadObject()
at Microsoft.Diagnostics.Tracing.EventPipeEventSource.Process()
at Microsoft.Diagnostics.Tracing.Etlx.TraceLog.CopyRawEvents(TraceEventDispatcher rawEvents, IStreamWriter writer)
at Microsoft.Diagnostics.Tracing.Etlx.TraceLog.<>c__DisplayClass118_0.<FastSerialization.IFastSerializable.ToStream>b__0()
at FastSerialization.DeferedRegion.Write(Serializer serializer, Action toStream)
at Microsoft.Diagnostics.Tracing.Etlx.TraceLog.FastSerialization.IFastSerializable.ToStream(Serializer serializer)
at FastSerialization.Serializer.WriteObjectData(IFastSerializable obj, Tags beginTag)
at FastSerialization.Serializer.WriteObjectRef(IFastSerializable obj, Boolean defered)
at FastSerialization.Serializer..ctor(IStreamWriter writer, IFastSerializable entryObject)
at Microsoft.Diagnostics.Tracing.Etlx.TraceLog.CreateFromEventPipeEventSources(TraceEventDispatcher source, String etlxFilePath, TraceLogOptions options)
at Microsoft.Diagnostics.Tracing.Etlx.TraceLog.CreateFromEventPipeDataFile(String filePath, String etlxFilePath, TraceLogOptions options)
at PerfView.EventPipePerfViewData.GetTraceLog(TextWriter log)
[Error: exception while opening EventPipe data.]
Completed: Opening example.nettrace (Elapsed Time: 62.843 sec)
Could this behavior be related to the runtime environment?
dotnet-trace will stop tracing on the following conditions: ctrl-c sets the cancellation token, which has been registered to set the shouldExit ManualResetEvent:
https://github.com/dotnet/diagnostics/blob/fb65a2c90acf744a7ea5051f313875f3425fba2e/src/Tools/dotnet-trace/CommandLine/Commands/CollectCommand.cs#L160
That same MRE is being set by the duration timer if a timeout is specified at the command line: https://github.com/dotnet/diagnostics/blob/fb65a2c90acf744a7ea5051f313875f3425fba2e/src/Tools/dotnet-trace/CommandLine/Commands/CollectCommand.cs#L184-L189
The console output says you can stop with either Enter or ctrl-c (more specifically, a SIGINT):
https://github.com/dotnet/diagnostics/blob/fb65a2c90acf744a7ea5051f313875f3425fba2e/src/Tools/dotnet-trace/CommandLine/Commands/CollectCommand.cs#L221
Note that "stopping" the trace just sends the StopTracing
command to the runtime. In order to have symbol information, a rundown of data is then sent. We display a message when this has started:
https://github.com/dotnet/diagnostics/blob/fb65a2c90acf744a7ea5051f313875f3425fba2e/src/Tools/dotnet-trace/CommandLine/Commands/CollectCommand.cs#L224-L225
For sufficiently large applications (many, many type/method definitions), this can take a non-trivial amount of time.
The file below was produced when collecting a .NET trace with a 10-second duration parameter from inside a Docker container containing the .NET Core process under Linux.
How long did the trace collect in order to grow to ~10 GiB? I'm assuming this was longer than 10 seconds.
When the timer goes past the requested duration, does the text output show the message above about "stopping the trace"?
The exception you're seeing in the PerfView log is due to the file being so large, not due to the runtime environment.
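If you do need to open that particular file, the two switches the log itself mentions should let the conversion get past the 4 GB ETLX limit. A sketch (the event cap is just an illustrative number):
PerfView.exe /MaxEventCount:5000000 /ContinueOnError "C:\Users\...\Downloads\example.nettrace"
/MaxEventCount limits how many events are converted and /ContinueOnError ignores the bad part of the trace, so the resulting view will be partial.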
@josalem Thanks for explaining how stopping is performed. We will get back after collecting the necessary screenshots and trying to run the tool with different argument sets.
One question: is there a parameter (possibly undocumented) that, like CPUSampleMSec in PerfView, limits the number of collected CPU data points?
Update: Could the --clreventlevel argument control that? There is not much information about that parameter. What is the recommended value for it?
Update 1: Here is how --clreventlevel is processed: private static EventLevel GetEventLevel(string token), where EventLevel is defined at https://docs.microsoft.com/en-us/dotnet/api/system.diagnostics.tracing.eventlevel. Level "5" would be "Verbose". We should try "4" ("Informational") and see what happens.
One question: is there a parameter (possibly undocumented) that, like CPUSampleMSec in PerfView, limits the number of collected CPU data points?
The frequency of CPU sampling isn't variable currently.
Level "5" would be "Verbose". We should try "4" ("Informational") and see what happens.
That adjusts the event level for a provider, which doesn't affect CPU sample frequency.
When the specified duration in your scenario runs out, do you see the "stopping the trace" text in the output?
@josalem When the timer goes past the specified duration, there is no sign of "stopping the trace" text. Usually the session window will not update or allow any interaction until the trace is finished after pressing "CTRL+C" or "ENTER".
Usually the session window will not update
Do the trace size and timer in the stdout output increase over time? If those aren't increasing either, it sounds like you might somehow be pausing execution of dotnet-trace, as @hoyosjs suggested above (by clicking in the console window in a conhost-powered terminal on Windows).
@sywhang and I tried to reproduce the issue and couldn't.
The unresponsive trace app could be a victim of being overwhelmed by the sheer volume of memory-allocation-related data (--clreventlevel 5), combined with the lack of any known support for "sampled" collection of the memory events (the likes of DotNetAllocSampled in PerfView). This is also mentioned in https://github.com/dotnet/diagnostics/issues/1744.
Once the memory-based providers are turned off or clreventlevel is lowered to 4, everything is back to normal: the file size becomes reasonably small and the application honours the duration setting. With level 4 (Informational), though, some of the useful memory allocation data is missing from the trace, which is unacceptable for our purpose.
@josalem and @sywhang To reproduce, you should stress test by tracing a .NET Core application with a very large heap and a high allocation rate, using clreventlevel 5 and a buffersize large enough to ensure that no events are skipped due to the buffer overflowing. And please take a look at https://github.com/dotnet/diagnostics/issues/1744.
Thank you.
Could we raise the priority of this issue, and of .NET Core tooling in general, especially for running large-scale production software in a Linux environment?
The tool works, but only up to a certain point, and then it doesn't once the system becomes too CPU/memory intensive. To put it into context, we are talking about apps using hundreds of GB of RAM and running on 64 or more CPU cores at 80% utilization. At that level it is just not possible to get the process events unless dotnet-trace is able to sample the data for statistical purposes at a slower pace, the way PerfView's DotNetAllocSampled and CPUSampleMSec parameters do. The only workaround today is setting a small buffersize, which causes events to be lost, but that leads to statistically incorrect results and, in our understanding, is not a recommended approach.
I am wondering if these are the scenarios that the team has planned the tools for, or if we are outliers. If so, we will be happy to help.
Thanks for the details. I'd like to dig in a little more and get a better sense of your situation.
Eventing is never free. There will always be some amount of overhead, no matter how slim the system designers attempt to make it. In your case, you happen to have a system that is being pushed to its limits during regular execution, and then you're turning on some of the most expensive providers the runtime has (--clreventlevel 5, which includes very verbose GC and Alloc events) at the same time as CPU sampling. This is going to use a dedicated thread in your process that egresses events, and by using dotnet-trace you're adding a few more threads externally. The GC and Alloc events at the level you have chosen are also quite expensive when enabled. This could easily result in data throughput that requires more resources to handle than the system has left while your application is running. These things combined could saturate a system that is "just the right size" for the target application alone. That being said, there have been some serious strides in the performance of EventPipe under high load in many-core environments in .NET 6.
Have you looked at using event counters instead of collecting the GC and Alloc events directly? You can monitor these characteristics of your application at a far, far lower cost than turning on the events directly by using counters. For reference, ASP.NET runs their benchmarking suites with counters turned on.
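As a rough sketch (the refresh interval is just an illustrative value), the built-in System.Runtime counters expose GC heap size, allocation rate, and GC counts without turning on the verbose GC/Alloc events:
dotnet-counters monitor --process-id <PID> --refresh-interval 5 System.Runtime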
We don't currently have a lever for adjusting the sample rate of the CPU sampler, but that is something we are looking at adding/streamlining in .NET 7.
I am wondering if these are the scenarios that the team has planned the tools for, or if we are outliers. If so, we will be happy to help.
I would say we planned for these scenarios (high thread/memory applications), but not necessarily in the manner that you are using the tools. More specifically, we intended for event counters to be the lightweight way to monitor things like GC and Alloc metrics over long periods of time to explicitly avoid turning on GC and Alloc events at the granularity needed to calculate the same values.
Could you expand a little on your use case to help us get a better picture? e.g., are you trying to collect "always on" metrics, bursts of metrics in response to something, etc.?
dotnet-counters is useful in some scenarios, but as soon as we detect a higher-than-normal allocation rate or GC activity, we need to drill deeper to see what exactly is happening and where. This is where dotnet-trace holds a promise that has so far failed to materialize, unfortunately.
We mostly use it as a troubleshooting tool: when there is a problem with the application, support simply takes a relatively short (10 s) snapshot of the available .NET Core process events for developers to dig into and understand the state of the process during the outage. We used to use PerfView for the same purpose quite successfully when running production in a Windows environment.
For one reason or another, that does not work as well as it used to with PerfView. Not only are processes outside dotnet invisible (we understand that is by design), but there is also much more overhead in the process of collecting the data. Allocation events are the most difficult to collect. We have to use the detailed (Verbose) level because Informational-level collection does not provide much info, if any. The tool cannot handle the time limits and gets "carried away" with collecting more than it was asked to. There is no way we can process gigabytes' worth of collected data even when we do eventually get to that point.
To me this could be low-hanging fruit: supporting the same kinds of arguments as PerfView does, allowing the event volume to be limited while preserving a statistically true picture of the process state.
The tool cannot handle the time limits and gets "carried away" with collecting more than it was asked to. There is no way we can process gigabytes' worth of collected data even when we do eventually get to that point.
I'm betting what's going on is that the amount of data being pushed through + the work your app is already doing is causing the Timer mechanisms to not fire in a timely manner because they aren't getting enough CPU quanta from the task scheduler.
That aside, I reread your comment and I think I finally pieced together what you're talking about when you say DotNetAllocSampled. There is a keyword on the GC provider for getting sampled (high and low) allocation events:
...
<keyword name="GCSampledObjectAllocationHighKeyword" mask="0x200000"
         message="$(string.RuntimePublisher.GCSampledObjectAllocationHighKeywordMessage)" symbol="CLR_GCHEAPALLOCHIGH_KEYWORD"/>
...
<keyword name="GCSampledObjectAllocationLowKeyword" mask="0x2000000"
         message="$(string.RuntimePublisher.GCSampledObjectAllocationLowKeywordMessage)" symbol="CLR_GCHEAPALLOCLOW_KEYWORD"/>
...
I looked through the docs and it looks like those particular keywords are unfortunately not actually documented at https://docs.microsoft.com/en-us/dotnet/framework/performance/clr-etw-keywords-and-levels.
Note that these keywords require that you have them turned on at the start of the process; otherwise, you won't get allocation events if you turn them on later. The same is true for ETW. You can do that with dotnet-trace by following the docs: https://docs.microsoft.com/en-us/dotnet/core/diagnostics/dotnet-trace#use-diagnostic-port-to-collect-a-trace-from-app-startup.
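As a sketch of that startup flow (assuming a runtime and dotnet-trace version new enough to support diagnostic ports; the socket path and MyApp.dll are placeholders, and 0x200000 is the GCSampledObjectAllocationHigh mask from the snippet above):
# start the collector first, listening on a diagnostic port
dotnet-trace collect --diagnostic-port /tmp/myapp.sock --providers Microsoft-Windows-DotNETRuntime:0x200000:5 --duration 00:00:00:10
# then launch the app pointed at that port so the keyword is enabled from startup
DOTNET_DiagnosticPorts=/tmp/myapp.sock dotnet MyApp.dll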
Do you have a list of the information you want to collect for these 10s bursts?
These are the profiles we tried to settle on.
dotnet-trace collect -p <PID> --buffersize 1024 --duration 00:00:00:10
Information: This command collects information about CPU and its operations. It also collects very basic information about GC operations.
dotnet-trace collect -p <PID> --profile gc-collect --buffersize 1024 --duration 00:00:00:10
Information: Collects very simple information about GC and the exceptions thrown.
dotnet-trace collect -p <PID> --profile gc-verbose --buffersize 1024 --duration 00:00:00:10
CLR keywords: GC
Information: This profile adds on to the GC-collect profile giving slightly more information about GC operations.
dotnet-trace collect -p <PID> --clrevents GC+GCHeapSurvivalAndMovement+Stack+Loader+GCHandle+Type+GCHeapDump+GCHeapCollect+GCHeapAndTypeNames+Exception --clreventlevel 5 --buffersize 1024 --duration 00:00:00:10
Information: This command collects everything about the GC. Compared to gc-collect and gc-verbose, the trace size will be larger, but in return there will be more information about the GC.
CPU only
dotnet-trace collect -p <PID> --providers Microsoft-DotNETCore-SampleProfiler --buffersize 1024 --duration 00:00:00:10
Information: This command modifies the cpu-sampling profile to just collect information about the CPU by using only one provider.
CPU + GC
dotnet-trace collect -p <PID> --providers Microsoft-DotNETCore-SampleProfiler --clrevents GC+GCHeapSurvivalAndMovement+Stack+Loader+GCHandle+Type+GCHeapDump+GCHeapCollect+GCHeapAndTypeNames+Exception --clreventlevel 5 --buffersize 1024 --duration 00:00:00:10
Information: This is a combination of CPU only and GC collection commands.
dotnet-trace collect -p <PID> --providers '*' --buffersize 1024 --duration 00:00:00:10
Configuration
.NET Version: 3.1.403
dotnet-trace version: 3.1.141901+3f460984c62efa118b6bd87fa4c5f07b11074b34
OS: Linux
Dockerized? The app is running in a Docker container and the trace is taken within the container.
Machine Load: High. Traces are taken during high-load scenarios.
Description
Commands used:
1: ./dotnet-trace collect --process-id 1 --duration 00:00:00:30 --buffersize 1024 -o /home/application/app-trace/TRACE.nettrace
Size of trace: 101MB.
2: ./dotnet-trace collect --process-id 1 --providers Microsoft-DotNETCore-SampleProfiler --clrevents GC+GCHeapSurvivalAndMovement+Stack+Loader+GCHandle+Type+GCHeapDump+GCHeapCollect+GCHeapAndTypeNames+Exception --clreventlevel 5 --duration 00:00:00:10 --buffersize 1024 -o /home/application/app-trace/TRACE.nettrace
Size of trace: 7GB.
While running these commands, the duration parameter was not honored, and the collect command kept running until CTRL+C was pressed to cancel the operation. There were some instances where the first command would finish as it should, but there would be no information in the trace log.
This seems like a problem with collecting a trace while the machine is under high load. Also, when trying to open the large trace file (7GB), it would take a substantially long time for PerfView to parse the trace. This could also be a factor of collecting a trace under high machine load. I have tested these commands under normal machine load and everything works fine. However, I did notice that whenever you click on the CMD/Terminal window while the collect command is running, the window seems to "Freeze" and the collect command will always go over the duration that is specified.