ICorProfilerCallback:ThreadDestroyed race condition on shutdown

kevingosse commented 2 years ago

Description

We noticed some crashes with our profiler during calls to DoStackSnapshot, with an error code 0x8013135E (CORPROF_E_STACKSNAPSHOT_INVALID_TGT_THREAD). Upon further inspection, it seems to happen when the thread dies before DoStackSnapshot has time to suspend it.

This article explains that we should block the calls to ICorProfilerCallback::ThreadDestroyed while walking the thread to prevent that from happening. Unfortunately, that's already what we're going.

The issue is that the EE shutdown sequence starts by setting g_fEEShutDown to ShutDown_Start, then run a bunch of step, before calling ICorProfilerInfo::Shutdown, notifying the profiler that is should stop trying to walk threads. On the other hand, when a thread dies, it checks if g_fEEShutDown != 0, and calls ThreadDestroyed only if that's not the case.

It means that there is a window of time, between the moment when g_fEEShutDown is set to ShutDown_Start and the moment when ICorProfilerInfo::Shutdown is called, during which the profiler isn't notified when a thread dies.

The issue is not critical because the crash happens at shutdown, but it's still an inconvenience. I don't know what's the best way to fix it, but I believe ThreadDestroyed should keep being called until g_fEEShutDown is set to ShutDown_Profiler. Or DoStackSnapshot shouldn't tear down the process when it receives an invalid thread handle: https://github.com/dotnet/runtime/blob/05473449c1db9edbbbc565b39d73a59bf517de96/src/coreclr/vm/proftoeeinterfaceimpl.cpp#L8493

Reproduction Steps

Hard to reproduce since it's a race condition, but basically have a profiler continuously call DoStackSnapshot during shutdown.

Expected behavior

ICorProfilerInfo::ThreadDestroyed should be called for the dying thread, or DoStackSnapshot shouldn't tear down the process.

Actual behavior

DoStackSnapshot tears down the process with an error code CORPROF_E_STACKSNAPSHOT_INVALID_TGT_THREAD

Regression?

No response

Known Workarounds

No response

Configuration

No response

Other information

No response

ghost commented 2 years ago

Tagging subscribers to this area: @tommcdon See info in area-owners.md if you want to be subscribed.

Issue Details

### Description We noticed some crashes with our profiler during calls to `DoStackSnapshot`, with an error code `0x8013135E` (`CORPROF_E_STACKSNAPSHOT_INVALID_TGT_THREAD`). Upon further inspection, it seems to happen when the thread dies before `DoStackSnapshot` has time to suspend it. [This article](https://github.com/dotnet/runtime/blob/05473449c1db9edbbbc565b39d73a59bf517de96/docs/design/coreclr/profiling/davbr-blog-archive/Profiler%20stack%20walking%20Basics%20and%20beyond.md) explains that we should block the calls to `ICorProfilerCallback::ThreadDestroyed` while walking the thread to prevent that from happening. Unfortunately, that's already what we're going. The issue is that the EE shutdown sequence [starts by setting `g_fEEShutDown` to `ShutDown_Start`,](https://github.com/dotnet/runtime/blob/05473449c1db9edbbbc565b39d73a59bf517de96/src/coreclr/vm/ceemain.cpp#L1214) then run a bunch of step, before [calling `ICorProfilerInfo::Shutdown`](https://github.com/dotnet/runtime/blob/05473449c1db9edbbbc565b39d73a59bf517de96/src/coreclr/vm/ceemain.cpp#L1290), notifying the profiler that is should stop trying to walk threads. On the other hand, when a thread dies, [it checks if `g_fEEShutDown != 0`](https://github.com/dotnet/runtime/blob/2d41986530818cc494888905dd2b6dcfedfbcc18/src/coreclr/vm/threads.cpp#L2975), and [calls `ThreadDestroyed` ](https://github.com/dotnet/runtime/blob/2d41986530818cc494888905dd2b6dcfedfbcc18/src/coreclr/vm/threads.cpp#L3058)only if that's not the case. It means that there is a window of time, between the moment when `g_fEEShutDown` is set to `ShutDown_Start` and the moment when `ICorProfilerInfo::Shutdown` is called, during which the profiler isn't notified when a thread dies. The issue is not critical because the crash happens at shutdown, but it's still an inconvenience. I don't know what's the best way to fix it, but I believe `ThreadDestroyed` should keep being called until `g_fEEShutDown` is set to `ShutDown_Profiler`. Or `DoStackSnapshot` shouldn't tear down the process when it receives an invalid thread handle (https://github.com/dotnet/runtime/blob/05473449c1db9edbbbc565b39d73a59bf517de96/src/coreclr/vm/proftoeeinterfaceimpl.cpp#L8493). ### Reproduction Steps Hard to reproduce since it's a race condition, but basically have a profiler continuously call `DoStackSnapshot` during shutdown. ### Expected behavior `ICorProfilerInfo::ThreadDestroyed` should be called for the dying thread, **or** `DoStackSnapshot` shouldn't tear down the process. ### Actual behavior `DoStackSnapshot` tears down the process with an error code `CORPROF_E_STACKSNAPSHOT_INVALID_TGT_THREAD` ### Regression? _No response_ ### Known Workarounds _No response_ ### Configuration _No response_ ### Other information _No response_

Author:	kevingosse
Assignees:	-
Labels:	`area-Diagnostics-coreclr`
Milestone:	-

teo-tsirpanis commented 2 years ago

@davmason

davmason commented 2 years ago

Thanks for reporting this, Kevin. I took a look and I don't think this behavior was intentional - there shouldn't be anything in the profiler code that prevents callbacks even after g_fEEShutDown is set, but we bail from Thread::OnThreadTerminate before we call the profiler back.

I can't think of a clear right thing to do here, if we move the callback to before that check then I can easily imagine some other profiler would start crashing because they implicitly rely on the callback not happening when the runtime is shutting down.

My preference would be to not fix this unless there is a clear and specific need. If we do want to fix it, we definitely shouldn't do it in 7 - it is far too late and we risk destabilizing right before shipping.

Moving to future for now, but happy to continue the conversation about whether or not we fix it in 8

kevingosse commented 2 years ago

I can't think of a clear right thing to do here, if we move the callback to before that check then I can easily imagine some other profiler would start crashing because they implicitly rely on the callback not happening when the runtime is shutting down.

I doubt so. As far as I can tell, there is no synchronization, so there is always a chance for callbacks being executed slightly after shutdown. I expect most profilers to be resilient to that (at least, we had to make sure ours was).

My preference would be to not fix this unless there is a clear and specific need.

As a workaround I'm now checking that the thread handle is still valid right before calling DoStackSnapshot. It doesn't entirely remove the race condition, but I'm hoping it makes it unlikely enough. We'll see 🙂 Honestly, I think the main issue isn't that ThreadDestroyed isn't called, but rather that DoStackSnapshot tears down the process if the handle is invalid.

dotnet / runtime