Open kevingosse opened 2 years ago
Tagging subscribers to this area: @tommcdon See info in area-owners.md if you want to be subscribed.
Author: | kevingosse |
---|---|
Assignees: | - |
Labels: | `area-Diagnostics-coreclr` |
Milestone: | - |
@davmason
Thanks for reporting this, Kevin. I took a look and I don't think this behavior was intentional - there shouldn't be anything in the profiler code that prevents callbacks even after g_fEEShutDown is set, but we bail from Thread::OnThreadTerminate
before we call the profiler back.
I can't think of a clear right thing to do here, if we move the callback to before that check then I can easily imagine some other profiler would start crashing because they implicitly rely on the callback not happening when the runtime is shutting down.
My preference would be to not fix this unless there is a clear and specific need. If we do want to fix it, we definitely shouldn't do it in 7 - it is far too late and we risk destabilizing right before shipping.
Moving to future for now, but happy to continue the conversation about whether or not we fix it in 8
I can't think of a clear right thing to do here, if we move the callback to before that check then I can easily imagine some other profiler would start crashing because they implicitly rely on the callback not happening when the runtime is shutting down.
I doubt so. As far as I can tell, there is no synchronization, so there is always a chance for callbacks being executed slightly after shutdown. I expect most profilers to be resilient to that (at least, we had to make sure ours was).
My preference would be to not fix this unless there is a clear and specific need.
As a workaround I'm now checking that the thread handle is still valid right before calling DoStackSnapshot
. It doesn't entirely remove the race condition, but I'm hoping it makes it unlikely enough. We'll see 🙂 Honestly, I think the main issue isn't that ThreadDestroyed
isn't called, but rather that DoStackSnapshot
tears down the process if the handle is invalid.
Description
We noticed some crashes with our profiler during calls to
DoStackSnapshot
, with an error code0x8013135E
(CORPROF_E_STACKSNAPSHOT_INVALID_TGT_THREAD
). Upon further inspection, it seems to happen when the thread dies beforeDoStackSnapshot
has time to suspend it.This article explains that we should block the calls to
ICorProfilerCallback::ThreadDestroyed
while walking the thread to prevent that from happening. Unfortunately, that's already what we're going.The issue is that the EE shutdown sequence starts by setting
g_fEEShutDown
toShutDown_Start
, then run a bunch of step, before callingICorProfilerInfo::Shutdown
, notifying the profiler that is should stop trying to walk threads. On the other hand, when a thread dies, it checks ifg_fEEShutDown != 0
, and callsThreadDestroyed
only if that's not the case.It means that there is a window of time, between the moment when
g_fEEShutDown
is set toShutDown_Start
and the moment whenICorProfilerInfo::Shutdown
is called, during which the profiler isn't notified when a thread dies.The issue is not critical because the crash happens at shutdown, but it's still an inconvenience. I don't know what's the best way to fix it, but I believe
ThreadDestroyed
should keep being called untilg_fEEShutDown
is set toShutDown_Profiler
. OrDoStackSnapshot
shouldn't tear down the process when it receives an invalid thread handle: https://github.com/dotnet/runtime/blob/05473449c1db9edbbbc565b39d73a59bf517de96/src/coreclr/vm/proftoeeinterfaceimpl.cpp#L8493Reproduction Steps
Hard to reproduce since it's a race condition, but basically have a profiler continuously call
DoStackSnapshot
during shutdown.Expected behavior
ICorProfilerInfo::ThreadDestroyed
should be called for the dying thread, orDoStackSnapshot
shouldn't tear down the process.Actual behavior
DoStackSnapshot
tears down the process with an error codeCORPROF_E_STACKSNAPSHOT_INVALID_TGT_THREAD
Regression?
No response
Known Workarounds
No response
Configuration
No response
Other information
No response