dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
15.07k stars 4.69k forks source link

Potential deadlock during GC transition #82283

Open nike4613 opened 1 year ago

nike4613 commented 1 year ago

Description

While using VS's remote debugger to debug MonoMod on .NET 6 on MacOS x64, I ran into a deadlock while trying to step into a native function which I had hooked back to managed code (using Marshal.GetFunctionPointerForDelegate to create the target of the detour). I have not been able to reproduce this, however I did poke around in LLDB, so have some information.

MonoMod installs a JIT hook in order to track method recompilations, and this has worked very well for us so far. This is the first time we have seen this, despite having used this JIT hook, mostly unchanged, for at least a year.

The application which I saw this happen on was MonoMod.FrameworkTests, which locked when trying to step into the call to msvcrand() on line 67.

Answers to some things Tanner asked when I asked in the C# discord:

Reproduction Steps

  1. Start MonoMod.FrameworkTests on MacOS .NET 6
  2. Attach VS remote debugger when prompted
  3. Set a breakpoint on line 67 of Program.cs
  4. Resume the application by pressing enter
  5. Attempt to step in to msvcrand()
  6. Program hangs, or rather, did once.

Expected behavior

No deadlock.

Actual behavior

A deadlock occurs.

At the time of the deadlock,. there are 11 threads, though only 3 of them are managed threads known by the runtime.

(lldb) thread list
Process 24949 stopped
* thread #1: tid = 0xa03c3, 0x00007ff80366f3ea libsystem_kernel.dylib`__psynch_cvwait + 10, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
  thread #2: tid = 0xa03cb, 0x00007ff80366c97a libsystem_kernel.dylib`mach_msg_trap + 10
  thread #3: tid = 0xa03cc, 0x00007ff80367134e libsystem_kernel.dylib`kevent + 10
  thread #4: tid = 0xa03cd, 0x00007ff8036730aa libsystem_kernel.dylib`poll + 10
  thread #5: tid = 0xa03ce, 0x00007ff80366d3ba libsystem_kernel.dylib`read + 10
  thread #6: tid = 0xa03cf, 0x00007ff80366f3ea libsystem_kernel.dylib`__psynch_cvwait + 10
  thread #7: tid = 0xa03d0, 0x00007ff80366f3ea libsystem_kernel.dylib`__psynch_cvwait + 10
  thread #8: tid = 0xa03d3, 0x00007ff80366e05a libsystem_kernel.dylib`__workq_kernreturn + 10
  thread #9: tid = 0xa03d5, 0x00007ff80366d3ba libsystem_kernel.dylib`read + 10
  thread #10: tid = 0xa04c7, 0x00007ff80366f3ea libsystem_kernel.dylib`__psynch_cvwait + 10
  thread #11: tid = 0xa04e5, 0x00007ff80366f3ea libsystem_kernel.dylib`__psynch_cvwait + 10
(lldb) clrthreads
ThreadCount:      3
UnstartedThread:  0
BackgroundThread: 2
PendingThread:    0
DeadThread:       0
Hosted Runtime:   no
                                                                                                            Lock
 DBG   ID     OSID ThreadOBJ           State GC Mode     GC Alloc Context                  Domain           Count Apt Exception
   1    1    a03c3 00007FF30F813E00    a0028 Preemptive  0000000186B0E8C0:0000000186B0FFD0 00007FF31080C200 -00001 Ukn
   7    2    a03d0 00007FF30F81A800    21228 Preemptive  00000001869AAE78:00000001869ACAA0 00007FF31080C200 -00001 Ukn (Finalizer)
  11    3    a04e5 00007FF310830000    a1228 Preemptive  0000000186B17E60:0000000186B17FD0 00007FF31080C200 -00001 Ukn

I believe most of the CLR-unknown threads are OS-created threads to handle Mach IPC, debugger related, or other similar threads.

Main thread native stack:

(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
  * frame #0: 0x00007ff80366f3ea libsystem_kernel.dylib`__psynch_cvwait + 10
    frame #1: 0x00007ff8036a9a6f libsystem_pthread.dylib`_pthread_cond_wait + 1249
    frame #2: 0x00000001068f0162 libcoreclr.dylib`CorUnix::CPalSynchronizationManager::ThreadNativeWait(CorUnix::_ThreadNativeWaitData*, unsigned int, CorUnix::ThreadWakeupReason*, unsigned int*) + 322
    frame #3: 0x00000001068efd8a libcoreclr.dylib`CorUnix::CPalSynchronizationManager::BlockThread(CorUnix::CPalThread*, unsigned int, bool, bool, CorUnix::ThreadWakeupReason*, unsigned int*) + 362
    frame #4: 0x00000001068f4609 libcoreclr.dylib`CorUnix::InternalWaitForMultipleObjectsEx(CorUnix::CPalThread*, unsigned int, void* const*, int, unsigned int, int, int) + 2153
    frame #5: 0x00000001068f47dd libcoreclr.dylib`WaitForSingleObjectEx + 77
    frame #6: 0x0000000106b0f1bf libcoreclr.dylib`CLREventBase::WaitEx(unsigned int, WaitMode, PendingSync*) + 191
    frame #7: 0x0000000106b158d0 libcoreclr.dylib`Thread::WaitSuspendEventsHelper() + 192
    frame #8: 0x0000000106b142c8 libcoreclr.dylib`Thread::RareEnablePreemptiveGC() + 296
    frame #9: 0x0000000106b13c35 libcoreclr.dylib`Thread::RareDisablePreemptiveGC() + 277
    frame #10: 0x0000000106c1f4d1 libcoreclr.dylib`DebuggerController::DispatchPatchOrSingleStep(Thread*, _CONTEXT*, unsigned char const*, SCAN_TRIGGER) + 913
    frame #11: 0x0000000106c2064f libcoreclr.dylib`DebuggerController::DispatchNativeException(_EXCEPTION_RECORD*, _CONTEXT*, unsigned int, Thread*) + 511
    frame #12: 0x0000000106c2da69 libcoreclr.dylib`Debugger::FirstChanceNativeException(_EXCEPTION_RECORD*, _CONTEXT*, unsigned int, Thread*) + 89
    frame #13: 0x0000000106b3e0c6 libcoreclr.dylib`HandleHardwareException(PAL_SEHException*) + 454
    frame #14: 0x00000001068c4a9b libcoreclr.dylib`SEHProcessException(PAL_SEHException*) + 315
    frame #15: 0x00000001068fe719 libcoreclr.dylib`PAL_DispatchException + 137
    frame #16: 0x00000001068fe3a3 libcoreclr.dylib`PAL_DispatchExceptionWrapper + 10
    frame #17: 0x0000000115fe13e6
    frame #18: 0x0000000115fe12e5
    frame #19: 0x0000000106999222 libcoreclr.dylib`invokeCompileMethodHelper(EEJitManager*, CEEInfo*, CORINFO_METHOD_INFO*, CORJIT_FLAGS, unsigned char**, unsigned int*) + 354
    frame #20: 0x00000001069992f2 libcoreclr.dylib`invokeCompileMethod(EEJitManager*, CEEInfo*, CORINFO_METHOD_INFO*, CORJIT_FLAGS, unsigned char**, unsigned int*) + 130
    frame #21: 0x0000000106999cf5 libcoreclr.dylib`UnsafeJitFunction(PrepareCodeConfig*, COR_ILMETHOD_DECODER*, CORJIT_FLAGS, unsigned int*) + 1317
    frame #22: 0x00000001069d1fc0 libcoreclr.dylib`MethodDesc::JitCompileCodeLocked(PrepareCodeConfig*, ListLockEntryBase<NativeCodeVersion>*, unsigned int*, CORJIT_FLAGS*) + 288
    frame #23: 0x00000001069d1ac5 libcoreclr.dylib`MethodDesc::JitCompileCodeLockedEventWrapper(PrepareCodeConfig*, ListLockEntryBase<NativeCodeVersion>*) + 613
    frame #24: 0x00000001069d122f libcoreclr.dylib`MethodDesc::JitCompileCode(PrepareCodeConfig*) + 479
    frame #25: 0x00000001069d0d9a libcoreclr.dylib`MethodDesc::PrepareILBasedCode(PrepareCodeConfig*) + 634
    frame #26: 0x00000001069d3ed1 libcoreclr.dylib`MethodDesc::DoPrestub(MethodTable*, CallerGCMode) + 545
    frame #27: 0x00000001069d3857 libcoreclr.dylib`PreStubWorker + 535
    frame #28: 0x0000000106be721b libcoreclr.dylib`ThePreStub + 92
    frame #29: 0x000000011601cd53
    frame #30: 0x000000011601ccaf
    frame #31: 0x0000000115861eb7
    frame #32: 0x0000000106be65e9 libcoreclr.dylib`CallDescrWorkerInternal + 124
    frame #33: 0x0000000106a329ea libcoreclr.dylib`MethodDescCallSite::CallTargetWorker(unsigned long const*, unsigned long*, int) + 1530
    frame #34: 0x000000010691893e libcoreclr.dylib`RunMain(MethodDesc*, short, int*, PtrArray**) + 734
    frame #35: 0x0000000106918c5b libcoreclr.dylib`Assembly::ExecuteMainMethod(PtrArray**, int) + 395
    frame #36: 0x000000010694a792 libcoreclr.dylib`CorHost2::ExecuteAssembly(unsigned int, char16_t const*, int, char16_t const**, unsigned int*) + 498
    frame #37: 0x00000001069024e4 libcoreclr.dylib`coreclr_execute_assembly + 180
    frame #38: 0x000000010618aafd libhostpolicy.dylib`run_app_for_context(hostpolicy_context_t const&, int, char const**) + 1293
    frame #39: 0x000000010618b961 libhostpolicy.dylib`corehost_main + 241
    frame #40: 0x0000000106113f15 libhostfxr.dylib`fx_muxer_t::handle_exec_host_command(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, host_startup_info_t const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::unordered_map<known_options, std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > >, known_options_hash, std::__1::equal_to<known_options>, std::__1::allocator<std::__1::pair<known_options const, std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > > > > const&, int, char const**, int, host_mode_t, bool, char*, int, int*) + 1621
    frame #41: 0x0000000106112d1d libhostfxr.dylib`fx_muxer_t::execute(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, int, char const**, host_startup_info_t const&, char*, int, int*) + 621
    frame #42: 0x000000010610f6b8 libhostfxr.dylib`hostfxr_main_startupinfo + 152
    frame #43: 0x0000000105f5b854 MonoMod.FrameworkTests`exe_start(int, char const**) + 1508
    frame #44: 0x0000000105f5ba60 MonoMod.FrameworkTests`main + 144
    frame #45: 0x000000010e4d052e dyld`start + 462

Main thread managed stack:

(lldb) clrstack
OS Thread Id: 0xa03c3 (1)
        Child SP               IP Call Site
00007FF7B9FAF9E0 0000000115FE13E5 MonoMod.Core.Platforms.Runtimes.Core21Runtime+JitHookDelegateHolder.CompileMethodHook(IntPtr, IntPtr, CORINFO_METHOD_INFO ByRef, UInt32, Byte* ByRef, UInt32 ByRef) [/Users/aaron/Documents/MonoMod/src/MonoMod.Core/Platforms/Runtimes/Core21Runtime.cs @ 154]
00007FF7B9FAFC80 0000000115FE12E5 ILStubClass.IL_STUB_ReversePInvoke(Int64, Int64, IntPtr, UInt32, IntPtr, IntPtr)
00007FF7B9FB0460 0000000106999222 [PrestubMethodFrame: 00007ff7b9fb0460] DynamicClass.SyncProxy<native->managed Get1Delegate>(NativeDetourSyncInfo)
00007FF7B9FB05F0 000000011601CD53 ILStubClass.IL_STUB_ReversePInvoke()
00007FF7B9FB0640 000000011601ccaf [InlinedCallFrame: 00007ff7b9fb0640]
00007FF7B9FB0630 000000011601CCAF ILStubClass.IL_STUB_PInvoke()
00007FF7B9FB06C0 0000000115861EB7 Program.<Main>$(System.String[]) [/Users/aaron/Documents/MonoMod/src/MonoMod.FrameworkTests/Program.cs @ 67]

Finalizer thread stack:

(lldb) bt
* thread #7
  * frame #0: 0x00007ff80366f3ea libsystem_kernel.dylib`__psynch_cvwait + 10
    frame #1: 0x00007ff8036a9a6f libsystem_pthread.dylib`_pthread_cond_wait + 1249
    frame #2: 0x00000001068f0162 libcoreclr.dylib`CorUnix::CPalSynchronizationManager::ThreadNativeWait(CorUnix::_ThreadNativeWaitData*, unsigned int, CorUnix::ThreadWakeupReason*, unsigned int*) + 322
    frame #3: 0x00000001068efd8a libcoreclr.dylib`CorUnix::CPalSynchronizationManager::BlockThread(CorUnix::CPalThread*, unsigned int, bool, bool, CorUnix::ThreadWakeupReason*, unsigned int*) + 362
    frame #4: 0x00000001068f4609 libcoreclr.dylib`CorUnix::InternalWaitForMultipleObjectsEx(CorUnix::CPalThread*, unsigned int, void* const*, int, unsigned int, int, int) + 2153
    frame #5: 0x00000001068f48a2 libcoreclr.dylib`WaitForMultipleObjectsEx + 82
    frame #6: 0x0000000106a77548 libcoreclr.dylib`FinalizerThread::WaitForFinalizerEvent(CLREvent*) + 200
    frame #7: 0x0000000106a77664 libcoreclr.dylib`FinalizerThread::FinalizerThreadWorker(void*) + 180
    frame #8: 0x00000001069fa878 libcoreclr.dylib`ManagedThreadBase_DispatchOuter(ManagedThreadCallState*) + 296
    frame #9: 0x00000001069fae40 libcoreclr.dylib`ManagedThreadBase::FinalizerBase(void (*)(void*)) + 32
    frame #10: 0x0000000106a777d8 libcoreclr.dylib`FinalizerThread::FinalizerThreadStart(void*) + 88
    frame #11: 0x00000001068fc4e7 libcoreclr.dylib`CorUnix::CPalThread::ThreadEntry(void*) + 407
    frame #12: 0x00007ff8036a94e1 libsystem_pthread.dylib`_pthread_start + 125
    frame #13: 0x00007ff8036a4f6b libsystem_pthread.dylib`thread_start + 15

Tiered Compilation background thread native stack:

(lldb) bt
* thread #11
  * frame #0: 0x00007ff80366f3ea libsystem_kernel.dylib`__psynch_cvwait + 10
    frame #1: 0x00007ff8036a9a6f libsystem_pthread.dylib`_pthread_cond_wait + 1249
    frame #2: 0x00000001068f0162 libcoreclr.dylib`CorUnix::CPalSynchronizationManager::ThreadNativeWait(CorUnix::_ThreadNativeWaitData*, unsigned int, CorUnix::ThreadWakeupReason*, unsigned int*) + 322
    frame #3: 0x00000001068efd8a libcoreclr.dylib`CorUnix::CPalSynchronizationManager::BlockThread(CorUnix::CPalThread*, unsigned int, bool, bool, CorUnix::ThreadWakeupReason*, unsigned int*) + 362
    frame #4: 0x00000001068f4609 libcoreclr.dylib`CorUnix::InternalWaitForMultipleObjectsEx(CorUnix::CPalThread*, unsigned int, void* const*, int, unsigned int, int, int) + 2153
    frame #5: 0x00000001068f47dd libcoreclr.dylib`WaitForSingleObjectEx + 77
    frame #6: 0x0000000106b0f1bf libcoreclr.dylib`CLREventBase::WaitEx(unsigned int, WaitMode, PendingSync*) + 191
    frame #7: 0x0000000106b158d0 libcoreclr.dylib`Thread::WaitSuspendEventsHelper() + 192
    frame #8: 0x0000000106b142c8 libcoreclr.dylib`Thread::RareEnablePreemptiveGC() + 296
    frame #9: 0x0000000106b13c35 libcoreclr.dylib`Thread::RareDisablePreemptiveGC() + 277
    frame #10: 0x0000000106982038 libcoreclr.dylib`JITInlineTrackingMap::AddInliningDontTakeLock(MethodDesc*, MethodDesc*) + 72
    frame #11: 0x0000000106981fb1 libcoreclr.dylib`JITInlineTrackingMap::AddInlining(MethodDesc*, MethodDesc*) + 49
    frame #12: 0x0000000106992e91 libcoreclr.dylib`CEEInfo::reportInliningDecision(CORINFO_METHOD_STRUCT_*, CORINFO_METHOD_STRUCT_*, CorInfoInline, char const*) + 1761
    frame #13: 0x00000001072a3dcb libclrjit.dylib`Compiler::fgInline() + 395
    frame #14: 0x00000001073837a8 libclrjit.dylib`Phase::Run() + 24
    frame #15: 0x00000001072868ef libclrjit.dylib`Compiler::compCompile(void**, unsigned int*, JitFlags*) + 863
    frame #16: 0x000000010728828c libclrjit.dylib`Compiler::compCompileHelper(CORINFO_MODULE_STRUCT_*, ICorJitInfo*, CORINFO_METHOD_INFO*, void**, unsigned int*, JitFlags*) + 796
    frame #17: 0x0000000107287bd4 libclrjit.dylib`Compiler::compCompile(CORINFO_MODULE_STRUCT_*, void**, unsigned int*, JitFlags*) + 900
    frame #18: 0x0000000107288e9b libclrjit.dylib`jitNativeCode(CORINFO_METHOD_STRUCT_*, CORINFO_MODULE_STRUCT_*, ICorJitInfo*, CORINFO_METHOD_INFO*, void**, unsigned int*, JitFlags*, void*) + 539
    frame #19: 0x000000010728d139 libclrjit.dylib`CILJit::compileMethod(ICorJitInfo*, CORINFO_METHOD_INFO*, unsigned int, unsigned char**, unsigned int*) + 105
    frame #20: 0x0000000115fe120c
    frame #21: 0x0000000115fe0616
    frame #22: 0x0000000115fe14d6
    frame #23: 0x0000000115fe12e5
    frame #24: 0x0000000106999222 libcoreclr.dylib`invokeCompileMethodHelper(EEJitManager*, CEEInfo*, CORINFO_METHOD_INFO*, CORJIT_FLAGS, unsigned char**, unsigned int*) + 354
    frame #25: 0x00000001069992f2 libcoreclr.dylib`invokeCompileMethod(EEJitManager*, CEEInfo*, CORINFO_METHOD_INFO*, CORJIT_FLAGS, unsigned char**, unsigned int*) + 130
    frame #26: 0x0000000106999cf5 libcoreclr.dylib`UnsafeJitFunction(PrepareCodeConfig*, COR_ILMETHOD_DECODER*, CORJIT_FLAGS, unsigned int*) + 1317
    frame #27: 0x00000001069d1fc0 libcoreclr.dylib`MethodDesc::JitCompileCodeLocked(PrepareCodeConfig*, ListLockEntryBase<NativeCodeVersion>*, unsigned int*, CORJIT_FLAGS*) + 288
    frame #28: 0x00000001069d1ac5 libcoreclr.dylib`MethodDesc::JitCompileCodeLockedEventWrapper(PrepareCodeConfig*, ListLockEntryBase<NativeCodeVersion>*) + 613
    frame #29: 0x00000001069d122f libcoreclr.dylib`MethodDesc::JitCompileCode(PrepareCodeConfig*) + 479
    frame #30: 0x00000001069d0d9a libcoreclr.dylib`MethodDesc::PrepareILBasedCode(PrepareCodeConfig*) + 634
    frame #31: 0x00000001069feca5 libcoreclr.dylib`TieredCompilationManager::CompileCodeVersion(NativeCodeVersion) + 261
    frame #32: 0x00000001069fe13f libcoreclr.dylib`TieredCompilationManager::DoBackgroundWork(unsigned long*, unsigned long, unsigned long) + 335
    frame #33: 0x00000001069fd944 libcoreclr.dylib`TieredCompilationManager::BackgroundWorkerStart() + 260
    frame #34: 0x00000001069fd778 libcoreclr.dylib`TieredCompilationManager::BackgroundWorkerBootstrapper1(void*) + 88
    frame #35: 0x00000001069fa878 libcoreclr.dylib`ManagedThreadBase_DispatchOuter(ManagedThreadCallState*) + 296
    frame #36: 0x00000001069fade0 libcoreclr.dylib`ManagedThreadBase::KickOff(void (*)(void*), void*) + 32
    frame #37: 0x00000001069fd6b0 libcoreclr.dylib`TieredCompilationManager::BackgroundWorkerBootstrapper0(void*) + 32
    frame #38: 0x00000001068fc4e7 libcoreclr.dylib`CorUnix::CPalThread::ThreadEntry(void*) + 407
    frame #39: 0x00007ff8036a94e1 libsystem_pthread.dylib`_pthread_start + 125
    frame #40: 0x00007ff8036a4f6b libsystem_pthread.dylib`thread_start + 15

Tiered Compilation background thread managed stack:

(lldb) clrstack
OS Thread Id: 0xa04e5 (11)
        Child SP               IP Call Site
0000700010147108 00007ff80366f3ea [InlinedCallFrame: 0000700010147108]
0000700010147108 0000000115fe120c [InlinedCallFrame: 0000700010147108]
0000700010147100 0000000115FE120C ILStubClass.IL_STUB_PInvoke(IntPtr, IntPtr, CORINFO_METHOD_INFO ByRef, UInt32, Byte* ByRef, UInt32 ByRef)
00007000101471B0 0000000115FE0616 MonoMod.Core.Interop.CoreCLR+V60.InvokeCompileMethod(IntPtr, IntPtr, IntPtr, CORINFO_METHOD_INFO ByRef, UInt32, Byte* ByRef, UInt32 ByRef) [/Users/aaron/Documents/MonoMod/src/MonoMod.Core/Interop/CoreCLR.V60.cs @ 50]
0000700010147220 0000000115FE14D6 MonoMod.Core.Platforms.Runtimes.Core21Runtime+JitHookDelegateHolder.CompileMethodHook(IntPtr, IntPtr, CORINFO_METHOD_INFO ByRef, UInt32, Byte* ByRef, UInt32 ByRef) [/Users/aaron/Documents/MonoMod/src/MonoMod.Core/Platforms/Runtimes/Core21Runtime.cs @ 172]
00007000101474C0 0000000115FE12E5 ILStubClass.IL_STUB_ReversePInvoke(Int64, Int64, IntPtr, UInt32, IntPtr, IntPtr)
0000700010147EB8 0000000106999222 [DebuggerU2MCatchHandlerFrame: 0000700010147eb8]

Regression?

Unknown; I can't reproduce even on the same runtime.

Known Workarounds

No response

Configuration

Other information

Based on the stack of the two managed threads, I can guess that there's a race somewhere in Thread::RareEnablePreemptiveGC or similar, though I am not at all familiar with that method or anything around it. I am also guessing that the PAL_DispatchExceptionWrapper is called somehow for the debugger to break when it enters the JIT hook method, but again, I am not at all familiar.

janvorli commented 1 year ago

The PAL_DispatchExceptionWrapper presence on the stack means that there was a hardware exception at the managed frame below that. In this case, based on the presence of DebuggerController::DispatchPatchOrSingleStep on the stack, it was a breakpoint coming from the single stepping in the debugger.

However, it is quite strange that we would hang waiting for thread suspension (GC) when there is no other managed thread than the finalizer thread.

@nike4613 when you say VS's remote debugger, do you mean VS on Windows or VS on macOS?

ghost commented 1 year ago

Tagging subscribers to this area: @tommcdon See info in area-owners.md if you want to be subscribed.

Issue Details
### Description While using VS's remote debugger to debug [MonoMod](https://github.com/MonoMod/MonoMod/tree/reorganize/) on .NET 6 on MacOS x64, I ran into a deadlock while trying to step into a native function which I had hooked back to managed code (using `Marshal.GetFunctionPointerForDelegate` to create the target of the detour). I have not been able to reproduce this, however I did poke around in LLDB, so have some information. MonoMod installs a JIT hook in order to track method recompilations, and this has worked very well for us so far. This is the first time we have seen this, despite having used this JIT hook, mostly unchanged, for at least a year. The application which I saw this happen on was `MonoMod.FrameworkTests`, which locked when trying to step into the call to `msvcrand()` on [line 67](https://github.com/MonoMod/MonoMod/blob/0ab2d0f9b90c08034019b7fa959252c50c467bf0/src/MonoMod.FrameworkTests/Program.cs#L67). Answers to some things Tanner asked when I asked in the C# discord: - `SuppressGCTransition` isn't used anywhere in this project - All of the native to managed thunks are created with `Marshal.GetFunctionPointerForDelegate` or its inverse - The application is entirely single-threaded, and does not call out into any native code which would invoke managed code on a thread the runtime doesn't know about. ### Reproduction Steps 1. Start `MonoMod.FrameworkTests` on MacOS .NET 6 2. Attach VS remote debugger when prompted 3. Set a breakpoint on line 67 of `Program.cs` 4. Resume the application by pressing enter 5. Attempt to step in to `msvcrand()` 6. Program hangs, or rather, did once. ### Expected behavior No deadlock. ### Actual behavior A deadlock occurs. At the time of the deadlock,. there are 11 threads, though only 3 of them are managed threads known by the runtime. ``` (lldb) thread list Process 24949 stopped * thread #1: tid = 0xa03c3, 0x00007ff80366f3ea libsystem_kernel.dylib`__psynch_cvwait + 10, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP thread #2: tid = 0xa03cb, 0x00007ff80366c97a libsystem_kernel.dylib`mach_msg_trap + 10 thread #3: tid = 0xa03cc, 0x00007ff80367134e libsystem_kernel.dylib`kevent + 10 thread #4: tid = 0xa03cd, 0x00007ff8036730aa libsystem_kernel.dylib`poll + 10 thread #5: tid = 0xa03ce, 0x00007ff80366d3ba libsystem_kernel.dylib`read + 10 thread #6: tid = 0xa03cf, 0x00007ff80366f3ea libsystem_kernel.dylib`__psynch_cvwait + 10 thread #7: tid = 0xa03d0, 0x00007ff80366f3ea libsystem_kernel.dylib`__psynch_cvwait + 10 thread #8: tid = 0xa03d3, 0x00007ff80366e05a libsystem_kernel.dylib`__workq_kernreturn + 10 thread #9: tid = 0xa03d5, 0x00007ff80366d3ba libsystem_kernel.dylib`read + 10 thread #10: tid = 0xa04c7, 0x00007ff80366f3ea libsystem_kernel.dylib`__psynch_cvwait + 10 thread #11: tid = 0xa04e5, 0x00007ff80366f3ea libsystem_kernel.dylib`__psynch_cvwait + 10 ``` ``` (lldb) clrthreads ThreadCount: 3 UnstartedThread: 0 BackgroundThread: 2 PendingThread: 0 DeadThread: 0 Hosted Runtime: no Lock DBG ID OSID ThreadOBJ State GC Mode GC Alloc Context Domain Count Apt Exception 1 1 a03c3 00007FF30F813E00 a0028 Preemptive 0000000186B0E8C0:0000000186B0FFD0 00007FF31080C200 -00001 Ukn 7 2 a03d0 00007FF30F81A800 21228 Preemptive 00000001869AAE78:00000001869ACAA0 00007FF31080C200 -00001 Ukn (Finalizer) 11 3 a04e5 00007FF310830000 a1228 Preemptive 0000000186B17E60:0000000186B17FD0 00007FF31080C200 -00001 Ukn ``` I believe most of the CLR-unknown threads are OS-created threads to handle Mach IPC, debugger related, or other similar threads. Main thread native stack: ``` (lldb) bt * thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP * frame #0: 0x00007ff80366f3ea libsystem_kernel.dylib`__psynch_cvwait + 10 frame #1: 0x00007ff8036a9a6f libsystem_pthread.dylib`_pthread_cond_wait + 1249 frame #2: 0x00000001068f0162 libcoreclr.dylib`CorUnix::CPalSynchronizationManager::ThreadNativeWait(CorUnix::_ThreadNativeWaitData*, unsigned int, CorUnix::ThreadWakeupReason*, unsigned int*) + 322 frame #3: 0x00000001068efd8a libcoreclr.dylib`CorUnix::CPalSynchronizationManager::BlockThread(CorUnix::CPalThread*, unsigned int, bool, bool, CorUnix::ThreadWakeupReason*, unsigned int*) + 362 frame #4: 0x00000001068f4609 libcoreclr.dylib`CorUnix::InternalWaitForMultipleObjectsEx(CorUnix::CPalThread*, unsigned int, void* const*, int, unsigned int, int, int) + 2153 frame #5: 0x00000001068f47dd libcoreclr.dylib`WaitForSingleObjectEx + 77 frame #6: 0x0000000106b0f1bf libcoreclr.dylib`CLREventBase::WaitEx(unsigned int, WaitMode, PendingSync*) + 191 frame #7: 0x0000000106b158d0 libcoreclr.dylib`Thread::WaitSuspendEventsHelper() + 192 frame #8: 0x0000000106b142c8 libcoreclr.dylib`Thread::RareEnablePreemptiveGC() + 296 frame #9: 0x0000000106b13c35 libcoreclr.dylib`Thread::RareDisablePreemptiveGC() + 277 frame #10: 0x0000000106c1f4d1 libcoreclr.dylib`DebuggerController::DispatchPatchOrSingleStep(Thread*, _CONTEXT*, unsigned char const*, SCAN_TRIGGER) + 913 frame #11: 0x0000000106c2064f libcoreclr.dylib`DebuggerController::DispatchNativeException(_EXCEPTION_RECORD*, _CONTEXT*, unsigned int, Thread*) + 511 frame #12: 0x0000000106c2da69 libcoreclr.dylib`Debugger::FirstChanceNativeException(_EXCEPTION_RECORD*, _CONTEXT*, unsigned int, Thread*) + 89 frame #13: 0x0000000106b3e0c6 libcoreclr.dylib`HandleHardwareException(PAL_SEHException*) + 454 frame #14: 0x00000001068c4a9b libcoreclr.dylib`SEHProcessException(PAL_SEHException*) + 315 frame #15: 0x00000001068fe719 libcoreclr.dylib`PAL_DispatchException + 137 frame #16: 0x00000001068fe3a3 libcoreclr.dylib`PAL_DispatchExceptionWrapper + 10 frame #17: 0x0000000115fe13e6 frame #18: 0x0000000115fe12e5 frame #19: 0x0000000106999222 libcoreclr.dylib`invokeCompileMethodHelper(EEJitManager*, CEEInfo*, CORINFO_METHOD_INFO*, CORJIT_FLAGS, unsigned char**, unsigned int*) + 354 frame #20: 0x00000001069992f2 libcoreclr.dylib`invokeCompileMethod(EEJitManager*, CEEInfo*, CORINFO_METHOD_INFO*, CORJIT_FLAGS, unsigned char**, unsigned int*) + 130 frame #21: 0x0000000106999cf5 libcoreclr.dylib`UnsafeJitFunction(PrepareCodeConfig*, COR_ILMETHOD_DECODER*, CORJIT_FLAGS, unsigned int*) + 1317 frame #22: 0x00000001069d1fc0 libcoreclr.dylib`MethodDesc::JitCompileCodeLocked(PrepareCodeConfig*, ListLockEntryBase*, unsigned int*, CORJIT_FLAGS*) + 288 frame #23: 0x00000001069d1ac5 libcoreclr.dylib`MethodDesc::JitCompileCodeLockedEventWrapper(PrepareCodeConfig*, ListLockEntryBase*) + 613 frame #24: 0x00000001069d122f libcoreclr.dylib`MethodDesc::JitCompileCode(PrepareCodeConfig*) + 479 frame #25: 0x00000001069d0d9a libcoreclr.dylib`MethodDesc::PrepareILBasedCode(PrepareCodeConfig*) + 634 frame #26: 0x00000001069d3ed1 libcoreclr.dylib`MethodDesc::DoPrestub(MethodTable*, CallerGCMode) + 545 frame #27: 0x00000001069d3857 libcoreclr.dylib`PreStubWorker + 535 frame #28: 0x0000000106be721b libcoreclr.dylib`ThePreStub + 92 frame #29: 0x000000011601cd53 frame #30: 0x000000011601ccaf frame #31: 0x0000000115861eb7 frame #32: 0x0000000106be65e9 libcoreclr.dylib`CallDescrWorkerInternal + 124 frame #33: 0x0000000106a329ea libcoreclr.dylib`MethodDescCallSite::CallTargetWorker(unsigned long const*, unsigned long*, int) + 1530 frame #34: 0x000000010691893e libcoreclr.dylib`RunMain(MethodDesc*, short, int*, PtrArray**) + 734 frame #35: 0x0000000106918c5b libcoreclr.dylib`Assembly::ExecuteMainMethod(PtrArray**, int) + 395 frame #36: 0x000000010694a792 libcoreclr.dylib`CorHost2::ExecuteAssembly(unsigned int, char16_t const*, int, char16_t const**, unsigned int*) + 498 frame #37: 0x00000001069024e4 libcoreclr.dylib`coreclr_execute_assembly + 180 frame #38: 0x000000010618aafd libhostpolicy.dylib`run_app_for_context(hostpolicy_context_t const&, int, char const**) + 1293 frame #39: 0x000000010618b961 libhostpolicy.dylib`corehost_main + 241 frame #40: 0x0000000106113f15 libhostfxr.dylib`fx_muxer_t::handle_exec_host_command(std::__1::basic_string, std::__1::allocator > const&, host_startup_info_t const&, std::__1::basic_string, std::__1::allocator > const&, std::__1::unordered_map, std::__1::allocator >, std::__1::allocator, std::__1::allocator > > >, known_options_hash, std::__1::equal_to, std::__1::allocator, std::__1::allocator >, std::__1::allocator, std::__1::allocator > > > > > > const&, int, char const**, int, host_mode_t, bool, char*, int, int*) + 1621 frame #41: 0x0000000106112d1d libhostfxr.dylib`fx_muxer_t::execute(std::__1::basic_string, std::__1::allocator >, int, char const**, host_startup_info_t const&, char*, int, int*) + 621 frame #42: 0x000000010610f6b8 libhostfxr.dylib`hostfxr_main_startupinfo + 152 frame #43: 0x0000000105f5b854 MonoMod.FrameworkTests`exe_start(int, char const**) + 1508 frame #44: 0x0000000105f5ba60 MonoMod.FrameworkTests`main + 144 frame #45: 0x000000010e4d052e dyld`start + 462 ``` Main thread managed stack: ``` (lldb) clrstack OS Thread Id: 0xa03c3 (1) Child SP IP Call Site 00007FF7B9FAF9E0 0000000115FE13E5 MonoMod.Core.Platforms.Runtimes.Core21Runtime+JitHookDelegateHolder.CompileMethodHook(IntPtr, IntPtr, CORINFO_METHOD_INFO ByRef, UInt32, Byte* ByRef, UInt32 ByRef) [/Users/aaron/Documents/MonoMod/src/MonoMod.Core/Platforms/Runtimes/Core21Runtime.cs @ 154] 00007FF7B9FAFC80 0000000115FE12E5 ILStubClass.IL_STUB_ReversePInvoke(Int64, Int64, IntPtr, UInt32, IntPtr, IntPtr) 00007FF7B9FB0460 0000000106999222 [PrestubMethodFrame: 00007ff7b9fb0460] DynamicClass.SyncProxymanaged Get1Delegate>(NativeDetourSyncInfo) 00007FF7B9FB05F0 000000011601CD53 ILStubClass.IL_STUB_ReversePInvoke() 00007FF7B9FB0640 000000011601ccaf [InlinedCallFrame: 00007ff7b9fb0640] 00007FF7B9FB0630 000000011601CCAF ILStubClass.IL_STUB_PInvoke() 00007FF7B9FB06C0 0000000115861EB7 Program.
$(System.String[]) [/Users/aaron/Documents/MonoMod/src/MonoMod.FrameworkTests/Program.cs @ 67] ``` Finalizer thread stack: ``` (lldb) bt * thread #7 * frame #0: 0x00007ff80366f3ea libsystem_kernel.dylib`__psynch_cvwait + 10 frame #1: 0x00007ff8036a9a6f libsystem_pthread.dylib`_pthread_cond_wait + 1249 frame #2: 0x00000001068f0162 libcoreclr.dylib`CorUnix::CPalSynchronizationManager::ThreadNativeWait(CorUnix::_ThreadNativeWaitData*, unsigned int, CorUnix::ThreadWakeupReason*, unsigned int*) + 322 frame #3: 0x00000001068efd8a libcoreclr.dylib`CorUnix::CPalSynchronizationManager::BlockThread(CorUnix::CPalThread*, unsigned int, bool, bool, CorUnix::ThreadWakeupReason*, unsigned int*) + 362 frame #4: 0x00000001068f4609 libcoreclr.dylib`CorUnix::InternalWaitForMultipleObjectsEx(CorUnix::CPalThread*, unsigned int, void* const*, int, unsigned int, int, int) + 2153 frame #5: 0x00000001068f48a2 libcoreclr.dylib`WaitForMultipleObjectsEx + 82 frame #6: 0x0000000106a77548 libcoreclr.dylib`FinalizerThread::WaitForFinalizerEvent(CLREvent*) + 200 frame #7: 0x0000000106a77664 libcoreclr.dylib`FinalizerThread::FinalizerThreadWorker(void*) + 180 frame #8: 0x00000001069fa878 libcoreclr.dylib`ManagedThreadBase_DispatchOuter(ManagedThreadCallState*) + 296 frame #9: 0x00000001069fae40 libcoreclr.dylib`ManagedThreadBase::FinalizerBase(void (*)(void*)) + 32 frame #10: 0x0000000106a777d8 libcoreclr.dylib`FinalizerThread::FinalizerThreadStart(void*) + 88 frame #11: 0x00000001068fc4e7 libcoreclr.dylib`CorUnix::CPalThread::ThreadEntry(void*) + 407 frame #12: 0x00007ff8036a94e1 libsystem_pthread.dylib`_pthread_start + 125 frame #13: 0x00007ff8036a4f6b libsystem_pthread.dylib`thread_start + 15 ``` Tiered Compilation background thread native stack: ``` (lldb) bt * thread #11 * frame #0: 0x00007ff80366f3ea libsystem_kernel.dylib`__psynch_cvwait + 10 frame #1: 0x00007ff8036a9a6f libsystem_pthread.dylib`_pthread_cond_wait + 1249 frame #2: 0x00000001068f0162 libcoreclr.dylib`CorUnix::CPalSynchronizationManager::ThreadNativeWait(CorUnix::_ThreadNativeWaitData*, unsigned int, CorUnix::ThreadWakeupReason*, unsigned int*) + 322 frame #3: 0x00000001068efd8a libcoreclr.dylib`CorUnix::CPalSynchronizationManager::BlockThread(CorUnix::CPalThread*, unsigned int, bool, bool, CorUnix::ThreadWakeupReason*, unsigned int*) + 362 frame #4: 0x00000001068f4609 libcoreclr.dylib`CorUnix::InternalWaitForMultipleObjectsEx(CorUnix::CPalThread*, unsigned int, void* const*, int, unsigned int, int, int) + 2153 frame #5: 0x00000001068f47dd libcoreclr.dylib`WaitForSingleObjectEx + 77 frame #6: 0x0000000106b0f1bf libcoreclr.dylib`CLREventBase::WaitEx(unsigned int, WaitMode, PendingSync*) + 191 frame #7: 0x0000000106b158d0 libcoreclr.dylib`Thread::WaitSuspendEventsHelper() + 192 frame #8: 0x0000000106b142c8 libcoreclr.dylib`Thread::RareEnablePreemptiveGC() + 296 frame #9: 0x0000000106b13c35 libcoreclr.dylib`Thread::RareDisablePreemptiveGC() + 277 frame #10: 0x0000000106982038 libcoreclr.dylib`JITInlineTrackingMap::AddInliningDontTakeLock(MethodDesc*, MethodDesc*) + 72 frame #11: 0x0000000106981fb1 libcoreclr.dylib`JITInlineTrackingMap::AddInlining(MethodDesc*, MethodDesc*) + 49 frame #12: 0x0000000106992e91 libcoreclr.dylib`CEEInfo::reportInliningDecision(CORINFO_METHOD_STRUCT_*, CORINFO_METHOD_STRUCT_*, CorInfoInline, char const*) + 1761 frame #13: 0x00000001072a3dcb libclrjit.dylib`Compiler::fgInline() + 395 frame #14: 0x00000001073837a8 libclrjit.dylib`Phase::Run() + 24 frame #15: 0x00000001072868ef libclrjit.dylib`Compiler::compCompile(void**, unsigned int*, JitFlags*) + 863 frame #16: 0x000000010728828c libclrjit.dylib`Compiler::compCompileHelper(CORINFO_MODULE_STRUCT_*, ICorJitInfo*, CORINFO_METHOD_INFO*, void**, unsigned int*, JitFlags*) + 796 frame #17: 0x0000000107287bd4 libclrjit.dylib`Compiler::compCompile(CORINFO_MODULE_STRUCT_*, void**, unsigned int*, JitFlags*) + 900 frame #18: 0x0000000107288e9b libclrjit.dylib`jitNativeCode(CORINFO_METHOD_STRUCT_*, CORINFO_MODULE_STRUCT_*, ICorJitInfo*, CORINFO_METHOD_INFO*, void**, unsigned int*, JitFlags*, void*) + 539 frame #19: 0x000000010728d139 libclrjit.dylib`CILJit::compileMethod(ICorJitInfo*, CORINFO_METHOD_INFO*, unsigned int, unsigned char**, unsigned int*) + 105 frame #20: 0x0000000115fe120c frame #21: 0x0000000115fe0616 frame #22: 0x0000000115fe14d6 frame #23: 0x0000000115fe12e5 frame #24: 0x0000000106999222 libcoreclr.dylib`invokeCompileMethodHelper(EEJitManager*, CEEInfo*, CORINFO_METHOD_INFO*, CORJIT_FLAGS, unsigned char**, unsigned int*) + 354 frame #25: 0x00000001069992f2 libcoreclr.dylib`invokeCompileMethod(EEJitManager*, CEEInfo*, CORINFO_METHOD_INFO*, CORJIT_FLAGS, unsigned char**, unsigned int*) + 130 frame #26: 0x0000000106999cf5 libcoreclr.dylib`UnsafeJitFunction(PrepareCodeConfig*, COR_ILMETHOD_DECODER*, CORJIT_FLAGS, unsigned int*) + 1317 frame #27: 0x00000001069d1fc0 libcoreclr.dylib`MethodDesc::JitCompileCodeLocked(PrepareCodeConfig*, ListLockEntryBase*, unsigned int*, CORJIT_FLAGS*) + 288 frame #28: 0x00000001069d1ac5 libcoreclr.dylib`MethodDesc::JitCompileCodeLockedEventWrapper(PrepareCodeConfig*, ListLockEntryBase*) + 613 frame #29: 0x00000001069d122f libcoreclr.dylib`MethodDesc::JitCompileCode(PrepareCodeConfig*) + 479 frame #30: 0x00000001069d0d9a libcoreclr.dylib`MethodDesc::PrepareILBasedCode(PrepareCodeConfig*) + 634 frame #31: 0x00000001069feca5 libcoreclr.dylib`TieredCompilationManager::CompileCodeVersion(NativeCodeVersion) + 261 frame #32: 0x00000001069fe13f libcoreclr.dylib`TieredCompilationManager::DoBackgroundWork(unsigned long*, unsigned long, unsigned long) + 335 frame #33: 0x00000001069fd944 libcoreclr.dylib`TieredCompilationManager::BackgroundWorkerStart() + 260 frame #34: 0x00000001069fd778 libcoreclr.dylib`TieredCompilationManager::BackgroundWorkerBootstrapper1(void*) + 88 frame #35: 0x00000001069fa878 libcoreclr.dylib`ManagedThreadBase_DispatchOuter(ManagedThreadCallState*) + 296 frame #36: 0x00000001069fade0 libcoreclr.dylib`ManagedThreadBase::KickOff(void (*)(void*), void*) + 32 frame #37: 0x00000001069fd6b0 libcoreclr.dylib`TieredCompilationManager::BackgroundWorkerBootstrapper0(void*) + 32 frame #38: 0x00000001068fc4e7 libcoreclr.dylib`CorUnix::CPalThread::ThreadEntry(void*) + 407 frame #39: 0x00007ff8036a94e1 libsystem_pthread.dylib`_pthread_start + 125 frame #40: 0x00007ff8036a4f6b libsystem_pthread.dylib`thread_start + 15 ``` Tiered Compilation background thread managed stack: ``` (lldb) clrstack OS Thread Id: 0xa04e5 (11) Child SP IP Call Site 0000700010147108 00007ff80366f3ea [InlinedCallFrame: 0000700010147108] 0000700010147108 0000000115fe120c [InlinedCallFrame: 0000700010147108] 0000700010147100 0000000115FE120C ILStubClass.IL_STUB_PInvoke(IntPtr, IntPtr, CORINFO_METHOD_INFO ByRef, UInt32, Byte* ByRef, UInt32 ByRef) 00007000101471B0 0000000115FE0616 MonoMod.Core.Interop.CoreCLR+V60.InvokeCompileMethod(IntPtr, IntPtr, IntPtr, CORINFO_METHOD_INFO ByRef, UInt32, Byte* ByRef, UInt32 ByRef) [/Users/aaron/Documents/MonoMod/src/MonoMod.Core/Interop/CoreCLR.V60.cs @ 50] 0000700010147220 0000000115FE14D6 MonoMod.Core.Platforms.Runtimes.Core21Runtime+JitHookDelegateHolder.CompileMethodHook(IntPtr, IntPtr, CORINFO_METHOD_INFO ByRef, UInt32, Byte* ByRef, UInt32 ByRef) [/Users/aaron/Documents/MonoMod/src/MonoMod.Core/Platforms/Runtimes/Core21Runtime.cs @ 172] 00007000101474C0 0000000115FE12E5 ILStubClass.IL_STUB_ReversePInvoke(Int64, Int64, IntPtr, UInt32, IntPtr, IntPtr) 0000700010147EB8 0000000106999222 [DebuggerU2MCatchHandlerFrame: 0000700010147eb8] ``` ### Regression? Unknown; I can't reproduce even on the same runtime. ### Known Workarounds _No response_ ### Configuration - MacOS x64 (Darwin 21.6.0) - .NET 6.0.13 - Visual Studio SSH Remote Debugger ### Other information Based on the stack of the two managed threads, I can *guess* that there's a race somewhere in `Thread::RareEnablePreemptiveGC` or similar, though I am not at all familiar with that method or anything around it. I am also guessing that the `PAL_DispatchExceptionWrapper` is called somehow for the debugger to break when it enters the JIT hook method, but again, I am not at all familiar.
Author: nike4613
Assignees: -
Labels: `area-Diagnostics-coreclr`, `untriaged`
Milestone: -
jkotas commented 1 year ago

However, it is quite strange that we would hang waiting for thread suspension (GC) when there is no other managed thread than the finalizer thread.

Note that the program hacks into the runtime by inserting detour implemented in managed code between VM and JIT. My guess is that this detour corrupts VM state and that leads to hangs and crashes. We have special transitions (JIT_TO_EE_TRANSITION macros and friends) to transition between JIT and VM. The detour will re-enter the VM without proper JIT_TO_EE_TRANSITION transition.

@nike4613 The detour like what you install is not something we support. You may want to build a checked flavor of the runtime - it may give you more clues about what went wrong.

nike4613 commented 1 year ago

@nike4613 when you say VS's remote debugger, do you mean VS on Windows or VS on macOS?

I mean VS on Windows.

Note that the program hacks into the runtime by inserting detour implemented in managed code between VM and JIT. My guess is that this detour corrupts VM state and that leads to hangs and crashes. We have special transitions (JIT_TO_EE_TRANSITION macros and friends) to transition between JIT and VM. The detour will re-enter the VM without proper JIT_TO_EE_TRANSITION transition.

What is special about the JIT<->EE transitions? Is there any way I could force the correct transition to happen?

@nike4613 The detour like what you install is not something we support.

I am largely aware, though as I said, I've never seen this issue before, and haven't been able to reproduce since; this JIT hook works in production across multiple runtime versions right now. As far as I'm aware, there isn't another option to make this functionality work, because we need to know when a method gets recompiled, and be able to install our detour to it before the new method is published.

You may want to build a checked flavor of the runtime - it may give you more clues about what went wrong.

I'll do that, but I don't know what, if anything, I'd find, given that I can't reproduce this issue in the first place.

jkotas commented 1 year ago

Is there any way I could force the correct transition to happen?

No. These transitions are internal implementation detail of the runtime. There is no way to do them outside the runtime.

As far as I'm aware, there isn't another option to make this functionality work, because we need to know when a method gets recompiled

Profiler APIs (https://learn.microsoft.com/en-us/dotnet/framework/unmanaged-api/profiling/icorprofilerinfo-setilfunctionbody-method and friends) are the supported way to instrument methods.

nike4613 commented 1 year ago

Is there any way I could force the correct transition to happen?

No. These transitions are internal implementation detail of the runtime. There is no way to do them outside the runtime.

Good thing we already rely on several implementation details, according to the detected runtime version.

As far as I'm aware, there isn't another option to make this functionality work, because we need to know when a method gets recompiled

Profiler APIs (https://learn.microsoft.com/en-us/dotnet/framework/unmanaged-api/profiling/icorprofilerinfo-setilfunctionbody-method and friends) are the supported way to instrument methods.

Are the profiler APIs available on all platforms? How would one attach to a process on, say Linux? And more than that, how could we do that in-process from managed?

The most common use of MonoMod is game modding, targeting either Unity Mono, whatever version of CoreCLR BepInEx loads for IL2CPP, or whatever runtime an XNA/FNA/MonoGame game is sitting on top of. We already have a great deal of logic to select low-level implementations based on the current runtime, and abuse implementation details for each of them. MonoMod is also used in several places where it must be able to be loaded at some arbitrary point after process start and function, and must be able to use the same binary for different runtimes/architectures/operating systems. It also shouldn't interfere with external tooling, though that's more acceptable as long as it doesn't interfere with debuggers.

Anything that doesn't meet these requirements are non-starters for us, which is why we use the approach we do. If you want, we can discuss this question more on the C# discord, but that's largely irrelevant to this issue (even if our approach is potentially related).

jkotas commented 1 year ago

Profiler APIs are available on all platforms. The profilers have to be written in native code. It is not possible to write fully managed profilers.

There is no supported way to do what you are trying to do today.

hoyosjs commented 1 year ago

Also, for these type of issues we'd likely need a repro - half of the state under the debugger is on the debugger side. I can only see that you hit a hardware exception that the debugger needs to handle and most threads are trying to toggle thread state. However, since you introduced a managed method in between the interfaces as a detour all toggles are potentially off... It becomes a timing game. Anything that sits in between these two layers doesn't have a way to signal components of changes and is unsupported. Even profiling can mess with the debugger, but something as low as messing with the thread state is even more likely to cause runtime issues. Also, since profilers are native you can't use the same DLL.

nike4613 commented 1 year ago

Also, for these type of issues we'd likely need a repro

That's understandable. I'll see if I can find a reliable way to repro.

We'd also be OK with somehow disabling step-in to the JIT hook, though I don't know of a good way to do that.

hoyosjs commented 1 year ago

You'd still need a step through which will place breakpoints there even if you don't notice them. Placing a breakpoint at the other side is the only non-stepping solution. However, if the hook does anything the debugger triggers on (thread creation for some bg processing, some type loading, evaluation of properties) you'll likely end up in the same place. Running managed code in suspended states is a place where if the runtime is not cooperating, you'll likely end up in states that we can't guarantee the correctness of.

nike4613 commented 1 year ago

Quick update-- I am able to successfully run under a checked runtime (release/6.0) with no issues, JIT hook and all. Now to try to reproduce this issue...

nike4613 commented 1 year ago

I've figured out how to reproduce the deadlock under a checked build, though it seems to be happening in a slightly different place now. I have a full process dump of the application in this state, before it died due to asking LLDB for a thread backtrace... I'm also not able to get a managed backtrace from SOS while debugging the dump, but maybe I'm just doing something wrong there. I had enabled the CLR log with DOTNET_LogFacility=20010000 as well, if that may be helpful.

I'll try to upload the dump somewhere in a minute; it's about 3.6GB uncompressed.


E: My reproduction steps are as follows:

  1. Set a breakpoint at line 161 of src/MonoMod.Core/Platforms/Architectures/AltEntryFactories/IcedAltEntryFactory.cs
  2. Start the application
  3. Attach debugger when prompted
  4. Press enter in terminal
  5. F5/Continue until the breakpoint previously mentioned is hit
  6. Step-over until it locks up after stepping over line 190
nike4613 commented 1 year ago

Oh, not really related, but I did notice when I tried to step-in once, I hit a CLR assert in debugging support code: https://github.com/dotnet/runtime/blob/19fde2f5b9dd7c8b5f37e9f02688ff9b708b24b5/src/coreclr/debug/ee/controller.cpp#L3917

nike4613 commented 1 year ago

Dump is here. It is zstd compressed to get the size down. It is a checked build of 19fde2f5b9dd7c8b5f37e9f02688ff9b708b24b5 on MacOS Monterey x64, built with ./build.sh clr+libs -rc Checked -lc Release and run using corerun. Let me know if you need the DAC build for it.