| Author: | nike4613 |
|---|---|
| Assignees: | - |
| Labels: | `area-Diagnostics-coreclr`, `untriaged` |
| Milestone: | - |
Open nike4613 opened 1 year ago
The presence of `PAL_DispatchExceptionWrapper` on the stack means that there was a hardware exception at the managed frame below it. In this case, based on the presence of `DebuggerController::DispatchPatchOrSingleStep` on the stack, it was a breakpoint coming from single-stepping in the debugger.
However, it is quite strange that we would hang waiting for thread suspension (GC) when there is no other managed thread than the finalizer thread.
@nike4613 when you say VS's remote debugger, do you mean VS on Windows or VS on macOS?
Tagging subscribers to this area: @tommcdon. See info in area-owners.md if you want to be subscribed.
> However, it is quite strange that we would hang waiting for thread suspension (GC) when there is no other managed thread than the finalizer thread.
Note that the program hacks into the runtime by inserting a detour, implemented in managed code, between the VM and the JIT. My guess is that this detour corrupts VM state, and that leads to the hangs and crashes. We have special transitions (the `JIT_TO_EE_TRANSITION` macros and friends) to transition between the JIT and the VM. The detour will re-enter the VM without a proper `JIT_TO_EE_TRANSITION` transition.
@nike4613 A detour like the one you install is not something we support. You may want to build a checked flavor of the runtime; it may give you more clues about what went wrong.
> @nike4613 when you say VS's remote debugger, do you mean VS on Windows or VS on macOS?
I mean VS on Windows.
> Note that the program hacks into the runtime by inserting a detour implemented in managed code between the VM and the JIT. My guess is that this detour corrupts VM state and that leads to hangs and crashes. We have special transitions (`JIT_TO_EE_TRANSITION` macros and friends) to transition between the JIT and the VM. The detour will re-enter the VM without a proper `JIT_TO_EE_TRANSITION` transition.
What is special about the JIT<->EE transitions? Is there any way I could force the correct transition to happen?
> @nike4613 A detour like the one you install is not something we support.
I am largely aware, though as I said, I've never seen this issue before, and haven't been able to reproduce since; this JIT hook works in production across multiple runtime versions right now. As far as I'm aware, there isn't another option to make this functionality work, because we need to know when a method gets recompiled, and be able to install our detour to it before the new method is published.
> You may want to build a checked flavor of the runtime - it may give you more clues about what went wrong.
I'll do that, but I don't know what, if anything, I'd find, given that I can't reproduce this issue in the first place.
> Is there any way I could force the correct transition to happen?

No. These transitions are an internal implementation detail of the runtime. There is no way to perform them from outside the runtime.
> As far as I'm aware, there isn't another option to make this functionality work, because we need to know when a method gets recompiled
Profiler APIs (https://learn.microsoft.com/en-us/dotnet/framework/unmanaged-api/profiling/icorprofilerinfo-setilfunctionbody-method and friends) are the supported way to instrument methods.
> > Is there any way I could force the correct transition to happen?
>
> No. These transitions are an internal implementation detail of the runtime. There is no way to do them outside the runtime.
Good thing we already rely on several implementation details, according to the detected runtime version.
> > As far as I'm aware, there isn't another option to make this functionality work, because we need to know when a method gets recompiled
>
> Profiler APIs (https://learn.microsoft.com/en-us/dotnet/framework/unmanaged-api/profiling/icorprofilerinfo-setilfunctionbody-method and friends) are the supported way to instrument methods.
Are the profiler APIs available on all platforms? How would one attach to a process on, say, Linux? And more than that, how could we do that in-process, from managed code?
The most common use of MonoMod is game modding, targeting either Unity Mono, whatever version of CoreCLR BepInEx loads for IL2CPP, or whatever runtime an XNA/FNA/MonoGame game is sitting on top of. We already have a great deal of logic to select low-level implementations based on the current runtime, and we abuse implementation details of each of them. MonoMod is also used in several places where it must be loadable at an arbitrary point after process start and still function, and where the same binary must work across different runtimes, architectures, and operating systems. It also shouldn't interfere with external tooling, though some interference is acceptable as long as it doesn't affect debuggers.
Anything that doesn't meet these requirements is a non-starter for us, which is why we use the approach we do. If you want, we can discuss this question more on the C# discord, but that's largely irrelevant to this issue (even if our approach is potentially related).
Profiler APIs are available on all platforms. The profilers have to be written in native code. It is not possible to write fully managed profilers.
There is no supported way to do what you are trying to do today.
Also, for these types of issues we'd likely need a repro; half of the state under the debugger is on the debugger side. I can only see that you hit a hardware exception that the debugger needs to handle, and that most threads are trying to toggle thread state. However, since you introduced a managed method between the interfaces as a detour, all the toggles are potentially off... it becomes a timing game. Anything that sits between these two layers has no way to signal its changes to other components and is unsupported. Even profiling can mess with the debugger, but something as low-level as messing with the thread state is even more likely to cause runtime issues. Also, since profilers are native, you can't use the same DLL.
> Also, for these types of issues we'd likely need a repro
That's understandable. I'll see if I can find a reliable way to repro.
We'd also be OK with somehow disabling step-in to the JIT hook, though I don't know of a good way to do that.
You'd still need a step-through, which will place breakpoints there even if you don't notice them. Placing a breakpoint on the other side is the only non-stepping solution. However, if the hook does anything the debugger triggers on (thread creation for some background processing, type loading, evaluation of properties), you'll likely end up in the same place. Running managed code in suspended states is an area where, if the runtime is not cooperating, you'll likely end up in states whose correctness we can't guarantee.
Quick update-- I am able to successfully run under a checked runtime (release/6.0) with no issues, JIT hook and all. Now to try to reproduce this issue...
I've figured out how to reproduce the deadlock under a checked build, though it seems to be happening in a slightly different place now. I have a full process dump of the application in this state, taken before it died due to my asking LLDB for a thread backtrace. I'm also not able to get a managed backtrace from SOS while debugging the dump, but maybe I'm just doing something wrong there. I had the CLR log enabled with `DOTNET_LogFacility=20010000` as well, if that may be helpful.
I'll try to upload the dump somewhere in a minute; it's about 3.6GB uncompressed.
E: My reproduction steps are as follows:
`src/MonoMod.Core/Platforms/Architectures/AltEntryFactories/IcedAltEntryFactory.cs`
Oh, not really related, but I did notice when I tried to step-in once, I hit a CLR assert in debugging support code: https://github.com/dotnet/runtime/blob/19fde2f5b9dd7c8b5f37e9f02688ff9b708b24b5/src/coreclr/debug/ee/controller.cpp#L3917
Dump is here. It is zstd compressed to get the size down.
It is a checked build of 19fde2f5b9dd7c8b5f37e9f02688ff9b708b24b5 on MacOS Monterey x64, built with `./build.sh clr+libs -rc Checked -lc Release` and run using `corerun`. Let me know if you need the DAC build for it.
Description
While using VS's remote debugger to debug MonoMod on .NET 6 on MacOS x64, I ran into a deadlock while trying to step into a native function which I had hooked back to managed code (using `Marshal.GetFunctionPointerForDelegate` to create the target of the detour). I have not been able to reproduce this; however, I did poke around in LLDB, so I have some information.

MonoMod installs a JIT hook in order to track method recompilations, and this has worked very well for us so far. This is the first time we have seen this issue, despite having used this JIT hook, mostly unchanged, for at least a year.
The application in which I saw this happen was `MonoMod.FrameworkTests`, which locked up when trying to step into the call to `msvcrand()` on line 67.

Answers to some things Tanner asked when I asked in the C# discord:
- `SuppressGCTransition` isn't used anywhere in this project
- `Marshal.GetFunctionPointerForDelegate` or its inverse

Reproduction Steps
- Run `MonoMod.FrameworkTests` on MacOS .NET 6
- Try to step into the call to `msvcrand()` in `Program.cs`
Expected behavior
No deadlock.
Actual behavior
A deadlock occurs.
At the time of the deadlock, there are 11 threads, though only 3 of them are managed threads known to the runtime.
I believe most of the CLR-unknown threads are OS-created threads handling Mach IPC, debugger-related work, or other similar tasks.
Main thread native stack:
Main thread managed stack:
Finalizer thread stack:
Tiered Compilation background thread native stack:
Tiered Compilation background thread managed stack:
Regression?
Unknown; I can't reproduce even on the same runtime.
Known Workarounds
No response
Configuration
Other information
Based on the stacks of the two managed threads, I can guess that there's a race somewhere in `Thread::RareEnablePreemptiveGC` or similar, though I am not at all familiar with that method or anything around it. I am also guessing that `PAL_DispatchExceptionWrapper` is somehow called for the debugger to break when it enters the JIT hook method, but again, I am not at all familiar.