dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

How to get stack traces of unresponsive threads? #31508

Closed: weltkante closed this issue 4 years ago

weltkante commented 4 years ago

I've been porting a game server from Desktop/Mono to .NET Core. It mostly works great, but there is one component which no longer works. The game server has a watchdog timer thread which will kill the process if one of the event threads takes too long to process a game tick. This is to prevent hanging servers; we'd rather have the server lose the current game transactions and auto-restart than hang potentially indefinitely.

On Desktop/Mono the timer thread used Thread.Abort and/or StackTrace(Thread) to make the unresponsive thread's stack trace appear in the log for diagnosis.

From what it looks like, .NET Core supports neither Thread.Abort nor the StackTrace(Thread) constructor. Just having a log entry "terminated due to hanging thread" in the log is obviously a no-go as far as error diagnosis goes, so what is the recommended way to diagnose hanging threads on a live system where you don't have a debugger attached? Note that the process doesn't need to recover; I just want to log the stack of the unresponsive thread before exiting the process.

(Server is running on Windows and Linux in case it matters.)

danmoseley commented 4 years ago

It's not what you're asking for, but there is a convenient debugger API to get stack traces from outside the process

https://github.com/microsoft/clrmd

Eg https://github.com/dotnet/arcade/blob/2c6db6ee8d8adeb2e8ccc1485e6780635890e419/src/Microsoft.DotNet.RemoteExecutor/src/RemoteInvokeHandle.cs#L205

danmoseley commented 4 years ago

@jkotas I guess there is no reasonable way to do this inproc using managed code anymore?

weltkante commented 4 years ago

Thanks for the suggestion and sorry for the late response. In-process would certainly be nicer for the reduced complexity, but if that's not possible I think I can start a secondary process from my watchdog timer and pass the thread id of the hanging thread on the command line.
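A minimal sketch of the helper-process idea, using only System.Diagnostics. The helper path, the class and method names, and the "<pid> <tid>" argument order are assumptions for illustration; how the watchdog learns the OS thread id of the hanging thread is platform-specific and left out here.

```csharp
using System;
using System.Diagnostics;
using System.Globalization;

public static class WatchdogHelper
{
    // Builds the start info for a hypothetical helper executable that
    // expects "<pid> <tid>" on its command line.
    public static ProcessStartInfo BuildStartInfo(string helperPath, int pid, long osThreadId)
    {
        var ic = CultureInfo.InvariantCulture;
        return new ProcessStartInfo
        {
            FileName = helperPath,
            Arguments = pid.ToString(ic) + " " + osThreadId.ToString(ic),
            UseShellExecute = false,       // required for stream redirection
            RedirectStandardError = true,  // the helper is assumed to write the trace to stderr
        };
    }

    // Runs the helper and returns what it wrote to stderr,
    // or null if it did not finish within the timeout.
    public static string RunHelper(ProcessStartInfo startInfo, int timeoutMs)
    {
        using (var helper = Process.Start(startInfo))
        {
            // Read asynchronously so a stuck helper cannot block the watchdog.
            var stderr = helper.StandardError.ReadToEndAsync();
            if (!helper.WaitForExit(timeoutMs))
            {
                try { helper.Kill(); }
                catch (InvalidOperationException) { /* helper exited in the meantime */ }
                return null;
            }
            return stderr.Result;
        }
    }
}
```

The watchdog would call something like `WatchdogHelper.RunHelper(WatchdogHelper.BuildStartInfo("./guard", Process.GetCurrentProcess().Id, tid), 5000)`, where `./guard` and `tid` are placeholders, and write the result to its log before exiting the process.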

danmoseley commented 4 years ago

@weltkante that is probably the best approach and it's more flexible too as you can potentially do more things in the future such as examine the heap, create a dump file, get native stacks, etc. The links I pointed to give information about those things, plus this may be useful for createdump: https://github.com/dotnet/coreclr/blob/master/Documentation/botr/xplat-minidump-generation.md

I'll close this, but please post back how you get on, especially if you're willing to share sample code for what you do.

weltkante commented 4 years ago

@danmosemsft I'm trying to implement your suggestion but it doesn't seem to work; I'm just getting an exception:

Unhandled Exception: System.Runtime.InteropServices.MarshalDirectiveException: Cannot marshal 'parameter #2': Invalid managed/unmanaged type combination (Marshaling to and from COM interface pointers isn't supported).
   at Microsoft.Diagnostics.Runtime.DbgEngDataReader.DebugCreate(Guid& InterfaceId, Object& Interface)
   at Microsoft.Diagnostics.Runtime.DbgEngDataReader.CreateIDebugClient()
   at Microsoft.Diagnostics.Runtime.DbgEngDataReader..ctor(Int32 pid, AttachFlag flags, UInt32 msecTimeout)
   at Microsoft.Diagnostics.Runtime.DataTarget.AttachToProcess(Int32 pid, UInt32 msecTimeout, AttachFlag attachFlag)
   at Program.Main(String[] args) in /home/xxx/private/build/guard/Program.cs:line 21

Here's the test program I've written; I'm passing the process and thread id manually for now:

private static void Main(string[] args)
{
    var pid = Int32.Parse(args[0], NumberStyles.None, CultureInfo.InvariantCulture);
    var tid = Int32.Parse(args[1], NumberStyles.None, CultureInfo.InvariantCulture);

    using (var dt = DataTarget.AttachToProcess(pid, 1000))
    {
        var rt = dt.ClrVersions.Single().CreateRuntime();
        var thread = rt.Threads.Single(x => x.OSThreadId == tid);
        var sb = new StringBuilder();
        foreach (var frame in thread.StackTrace)
            sb.AppendLine($"\t{frame}");
        Console.Error.Write(sb.ToString());
    }
}

Are you sure this is a supported scenario for this library? The linked code is explicitly checking for Windows.

danmoseley commented 4 years ago

@leculver, clrmd works on Linux, right? Or only partially?

leculver commented 4 years ago

@danmosemsft It does support Linux, but only certain flags. This should be throwing a better exception...

In this case you need to pass a specific flag because ClrMD doesn't (yet) support a regular debugger attach on Linux: using (var dt = DataTarget.AttachToProcess(pid, 1000, AttachFlag.Passive)). This flag means that we will not pause the process or connect a real debugger to it; we only open the process to inspect memory. For a hang this should work just fine.

Alternatively you can use dotnet-dump to create a core dump, then instead of DataTarget.AttachToProcess you use DataTarget.LoadCoreDump. This will be the best to ensure the process isn't changing while you are inspecting it, but a little more complicated to get set up.
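One way the watchdog could trigger the dump, sketched with System.Diagnostics only. It assumes the dotnet-dump global tool is installed and on PATH and uses its `collect -p <pid> -o <file>` form; the class name, method names, and timeout handling are hypothetical. The resulting file would then be opened with DataTarget.LoadCoreDump as described above.

```csharp
using System;
using System.Diagnostics;
using System.Globalization;
using System.IO;

public static class DumpCollector
{
    // Builds the "dotnet-dump collect" invocation for the given process id.
    public static ProcessStartInfo BuildCollectCommand(int pid, string dumpPath)
    {
        return new ProcessStartInfo
        {
            FileName = "dotnet-dump", // assumption: the tool is resolvable via PATH
            Arguments = string.Format(
                CultureInfo.InvariantCulture,
                "collect -p {0} -o \"{1}\"", pid, dumpPath),
            UseShellExecute = false,
        };
    }

    // Runs the tool and reports whether a dump file was produced in time.
    // Process.Start throws if the tool cannot be found.
    public static bool TryCollect(int pid, string dumpPath, int timeoutMs)
    {
        using (var tool = Process.Start(BuildCollectCommand(pid, dumpPath)))
        {
            if (!tool.WaitForExit(timeoutMs))
            {
                try { tool.Kill(); }
                catch (InvalidOperationException) { /* tool exited in the meantime */ }
                return false;
            }
            return tool.ExitCode == 0 && File.Exists(dumpPath);
        }
    }
}
```

Because the dump is a snapshot on disk, the stack walk no longer races with the running thread, at the cost of having to manage and clean up the dump files.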

I will fix this to throw a more helpful exception later today.

weltkante commented 4 years ago

Ah, using that flag I'm getting further, now having to sort out security/access rights.

> For a hang this should work just fine.

@leculver It's not necessarily a hang in the sense of a deadlock though, it can be a "livelock" as well, where the thread is doing useless stuff and never returning to the main event loop.

Is it safe to iterate over the stack frames as shown above (will it snapshot), or can that produce a torn stack trace? I'd rather not risk having torn stack traces in my log, because deadlocks are the exception and a thread stuck in a bad loop is the more common thing for us. Unfortunately the whole Thread.Suspend story also seems to be deprecated; otherwise I would've just suspended the unresponsive thread before starting the helper program that dumps the stack trace.

leculver commented 4 years ago

A livelock may produce an odd/torn stack trace. It's still on my long-term todo list to get real process attach working, but I won't get to it this month. Your best bet is to snap a core dump using dotnet-dump and inspect that core dump.

weltkante commented 4 years ago

Thanks for the advice. I guess I'll have to try making a dump then; I was hoping I could avoid file management. This issue can probably be closed, and I'll ask over at the clrmd repo if I run into problems.

> It's still on my long-term todo list to get real process attach working

Is there a tracking issue I can subscribe to, or do you mean it's just on your roadmap?

leculver commented 4 years ago

It's just on my roadmap. Feel free to create an issue in the clrmd GitHub repo though if you'd like to track it.