dotnet/diagnostics

This repository contains the source code for various .NET Core runtime diagnostic tools and documents.

TimeoutException with `dotnet-trace` against misbehaving .NET Core process #1261

Open · davidni opened 4 years ago

davidni commented 4 years ago

Context

NOTE: This is related to https://github.com/dotnet/diagnostics/issues/1260. Both issues repro'd consistently against the same PID.

I encountered this while analyzing perf of a .NET Core 3.1 process (an ASP.NET Core 3.1 web app). The target was built with .NET Core SDK 3.1.301 and is deployed as a self-contained exe. The issue happened on a Windows Server 2016 Datacenter machine, OS version 1607, OS build 14393.3750.

Setup

I collected a few traces using dotnet-trace collect, but the last trace I attempted to collect broke during teardown -- dotnet-trace reported an exception after I pressed Ctrl+C. Unfortunately, I did not capture that exception, because other commands I ran afterwards cleared my console window (dotnet-trace, dotnet-counters, I am looking at you....).

Issue

It seems like my .NET Core process entered a bad state after my botched dotnet-trace collect session. Although my process continued to work without issues in that state (still serving HTTP requests), dotnet-trace and dotnet-counters could no longer connect to it.

See console output below for dotnet-trace. This error repro'd consistently after my process entered this weird state.

> dotnet trace collect -p 5584 --buffersize 512 -o 2_default.nettrace
No profile or providers specified, defaulting to trace profile 'cpu-sampling'

Provider Name                           Keywords            Level               Enabled By
Microsoft-DotNETCore-SampleProfiler     0x0000000000000000  Informational(4)    --profile
Microsoft-Windows-DotNETRuntime         0x00000014C14FCCBD  Informational(4)    --profile

[ERROR] System.TimeoutException: The operation has timed out.
   at System.IO.Pipes.NamedPipeClientStream.ConnectInternal(Int32 timeout, CancellationToken cancellationToken, Int32 startTime)
   at System.IO.Pipes.NamedPipeClientStream.Connect(Int32 timeout)
   at Microsoft.Diagnostics.NETCore.Client.IpcClient.GetTransport(Int32 processId) in /_/src/Microsoft.Diagnostics.NETCore.Client/DiagnosticsIpc/IpcClient.cs:line 51
   at Microsoft.Diagnostics.NETCore.Client.IpcClient.SendMessage(Int32 processId, IpcMessage message, IpcMessage& response) in /_/src/Microsoft.Diagnostics.NETCore.Client/DiagnosticsIpc/IpcClient.cs:line 104
   at Microsoft.Diagnostics.NETCore.Client.EventPipeSession..ctor(Int32 processId, IEnumerable`1 providers, Boolean requestRundown, Int32 circularBufferMB) in /_/src/Microsoft.Diagnostics.NETCore.Client/DiagnosticsClient/EventPipeSession.cs:line 30
   at Microsoft.Diagnostics.Tools.Trace.CollectCommandHandler.Collect(CancellationToken ct, IConsole console, Int32 processId, FileInfo output, UInt32 buffersize, String providers, String profile, TraceFileFormat format, TimeSpan duration, String clrevents, String clreventlevel) in /_/src/Tools/dotnet-trace/CommandLine/Commands/CollectCommand.cs:line 130
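
For what it's worth, the failure is in the very first step: connecting to the runtime's diagnostics IPC transport, which on Windows is a named pipe called `dotnet-diagnostic-<pid>` (per the IPC protocol docs in this repo). A minimal probe, assuming the PID 5584 from the session above and an arbitrary 3-second timeout, that checks whether the pipe still accepts connections at all:

```csharp
using System;
using System.IO.Pipes;

class PipeProbe
{
    static void Main()
    {
        int pid = 5584; // placeholder: PID of the wedged target process

        // On Windows the runtime exposes its diagnostics IPC endpoint as a
        // named pipe called "dotnet-diagnostic-<pid>" (see the IPC protocol
        // docs in this repo).
        var pipe = new NamedPipeClientStream(
            ".", $"dotnet-diagnostic-{pid}", PipeDirection.InOut, PipeOptions.None);
        try
        {
            // Same call that times out in the stack trace above;
            // the 3000 ms timeout is an arbitrary choice.
            pipe.Connect(3000);
            Console.WriteLine("Pipe accepted the connection.");
        }
        catch (TimeoutException)
        {
            // The connection was never accepted within the window --
            // consistent with the diagnostics server no longer servicing
            // the pipe even though the process is otherwise healthy.
            Console.WriteLine("Pipe connect timed out.");
        }
        finally
        {
            pipe.Dispose();
        }
    }
}
```

If this also times out while the process keeps serving traffic, that points at the runtime's diagnostics server rather than at the tools.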

My tools manifest:

{
  "version": 1,
  "isRoot": true,
  "tools": {
    "dotnet-trace": {
      "version": "3.1.120604",
      "commands": [
        "dotnet-trace"
      ]
    },
    "dotnet-counters": {
      "version": "3.1.120604",
      "commands": [
        "dotnet-counters"
      ]
    }
  }
}
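
For anyone who wants to reproduce this without the CLI tools, the same path can be driven directly from the Microsoft.Diagnostics.NETCore.Client library that dotnet-trace is built on. A rough sketch, assuming the same placeholder PID and mirroring the cpu-sampling providers printed above (the keyword value is copied from that output):

```csharp
using System;
using System.Diagnostics.Tracing;
using Microsoft.Diagnostics.NETCore.Client;

class SessionProbe
{
    static void Main()
    {
        int pid = 5584; // placeholder: the wedged target process

        // Roughly what the 'cpu-sampling' profile enables.
        var providers = new[]
        {
            new EventPipeProvider("Microsoft-DotNETCore-SampleProfiler",
                                  EventLevel.Informational),
            new EventPipeProvider("Microsoft-Windows-DotNETRuntime",
                                  EventLevel.Informational,
                                  keywords: 0x14C14FCCBD),
        };

        var client = new DiagnosticsClient(pid);

        // This goes through the same IpcClient/named-pipe path as the
        // stack trace above, so against the wedged process it should
        // surface the same TimeoutException.
        using EventPipeSession session = client.StartEventPipeSession(
            providers, requestRundown: true, circularBufferMB: 512);

        Console.WriteLine("Session started; copy session.EventStream to a file to capture the trace.");
    }
}
```

Against the wedged process this should fail the same way, since it exercises the same IpcClient.SendMessage path shown in the stack trace.
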
davidni commented 4 years ago

Possibly relevant bit of info: the target process was heavily throttled during the data collection session. It had a Job Object assigned, limiting its CPU use to approximately 0.25 cores, which may have played a role in triggering the issue. I also observed dropped events when using PerfView to open one of the (successful) traces that I collected before the issue started happening.

@noahfalk (as suggested by @karelz) Thanks!

sywhang commented 4 years ago

@davidni Thanks for the detailed report!

Do you happen to know if this issue happens when you remove the Job Object or if you increase the CPU usage limit to something like 1 core?

davidni commented 4 years ago

@sywhang yes, it is very likely that the job limit was bumped up during the data collection that broke. I don't have logs to confirm, but I should be able to trigger the same conditions again.
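
For reference, the throttled condition itself is straightforward to recreate: a Job Object with a hard CPU-rate cap. A hedged sketch of that setup (the target executable name is a placeholder, and the 0.25-core math assumes CpuRate is expressed in 1/100 of a percent of total CPU, per the Win32 docs for JOBOBJECT_CPU_RATE_CONTROL_INFORMATION):

```csharp
using System;
using System.Diagnostics;
using System.Runtime.InteropServices;

static class JobThrottle
{
    const uint JOB_OBJECT_CPU_RATE_CONTROL_ENABLE   = 0x1;
    const uint JOB_OBJECT_CPU_RATE_CONTROL_HARD_CAP = 0x4;
    const int  JobObjectCpuRateControlInformation   = 15;

    [StructLayout(LayoutKind.Sequential)]
    struct JOBOBJECT_CPU_RATE_CONTROL_INFORMATION
    {
        public uint ControlFlags;
        public uint CpuRate; // units of 1/100 of a percent of total CPU
    }

    [DllImport("kernel32.dll", SetLastError = true)]
    static extern IntPtr CreateJobObject(IntPtr attrs, string name);

    [DllImport("kernel32.dll", SetLastError = true)]
    static extern bool SetInformationJobObject(IntPtr job, int infoClass,
        ref JOBOBJECT_CPU_RATE_CONTROL_INFORMATION info, int size);

    [DllImport("kernel32.dll", SetLastError = true)]
    static extern bool AssignProcessToJobObject(IntPtr job, IntPtr process);

    static void Main()
    {
        // Placeholder target; it runs unthrottled for a brief moment
        // before the job assignment below takes effect.
        using var target = Process.Start("MyWebApp.exe");

        IntPtr job = CreateJobObject(IntPtr.Zero, null);

        // CpuRate is in 1/100 of a percent of *total* CPU, so 0.25 cores
        // on an N-core machine is 0.25 / N * 100 * 100 = 2500 / N.
        var info = new JOBOBJECT_CPU_RATE_CONTROL_INFORMATION
        {
            ControlFlags = JOB_OBJECT_CPU_RATE_CONTROL_ENABLE
                         | JOB_OBJECT_CPU_RATE_CONTROL_HARD_CAP,
            CpuRate = (uint)(2500 / Environment.ProcessorCount),
        };

        if (!SetInformationJobObject(job, JobObjectCpuRateControlInformation,
                ref info, Marshal.SizeOf<JOBOBJECT_CPU_RATE_CONTROL_INFORMATION>())
            || !AssignProcessToJobObject(job, target.Handle))
        {
            throw new System.ComponentModel.Win32Exception();
        }

        Console.WriteLine($"Hard-capped PID {target.Id} at ~0.25 cores.");
        target.WaitForExit();
    }
}
```
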

ghost commented 4 years ago

FYI, there's no "remove the job object" for a running Windows process, at least not from user space. I'm guessing you mean re-running without attaching a job object?

Note that David and I normally run this code on Service Fabric, where there will always be a job object attached (though we can run it with no CPU limit).