dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
14.8k stars 4.61k forks source link

Avalonia calling into the runtime after shut down #90079

Open filipnavara opened 1 year ago

filipnavara commented 1 year ago

Looks like a bad interaction of diagnostsic bits -- we are trying to notify the debugger of an exception and of a class oad (triggered by a jit request for PGO dat).

Stack traces: Sample of Avalonia Application.txt

Repro on macOS:

Happens with .NET 6 and .NET 7, didn't test it on newer runtime.

ghost commented 1 year ago

Tagging subscribers to this area: @tommcdon See info in area-owners.md if you want to be subscribed.

Issue Details
> Looks like a bad interaction of diagnostsic bits -- we are trying to notify the debugger of an exception and of a class oad (triggered by a jit request for PGO dat). Stack traces: [Sample of Avalonia Application.txt](https://github.com/dotnet/runtime/files/12270170/Sample.of.Avalonia.Application.txt) Repro on macOS: - Sample project: https://github.com/filipnavara/emptyavalonia (empty Avalonia app) - Run from VS Code with F5 - Press Cmd+Q Happens with .NET 6 and .NET 7, didn't test it on newer runtime.
Author: filipnavara
Assignees: -
Labels: `area-Diagnostics-coreclr`, `untriaged`
Milestone: -
filipnavara commented 1 year ago

cc @AndyAyersMS

AndyAyersMS commented 1 year ago

cc @AndyAyersMS

Thanks for opening an issue.

tommcdon commented 1 year ago

@filipnavara Thanks for reporting the issue. Does this reproduce on ARM64 (M1/M2) or X64 macOS (or both)?

filipnavara commented 1 year ago

Tested and reproduced on M1. The underlying mechanism of the deadlock doesn't seem platform specific though.

tommcdon commented 1 year ago

@filipnavara thanks!

mikelle-rogers commented 1 month ago

I tried reproducing it with .NET 6 on Windows and wasn't able to, will continue once I have access to M1.

mikelle-rogers commented 1 month ago

I also tried reproducing it with .NET 7 on Mac x64, but wasn't able to because there was a NullReferenceException in the Avalonia code.

mikelle-rogers commented 1 month ago

Here is the stack of the avalonia app (was able to reproduce on M1)

* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
  * frame #0: 0x0000000100d8fbe8 libcoreclr.dylib`CorUnix::InternalWaitForMultipleObjectsEx(CorUnix::CPalThread*, unsigned int, void* const*, int, unsigned int, int, int) + 1760
    frame #1: 0x0000000100e74c04 libcoreclr.dylib`Thread::DoAppropriateWaitWorker(int, void**, int, unsigned int, WaitMode) + 1040
    frame #2: 0x0000000100e6ffa4 libcoreclr.dylib`Thread::DoAppropriateWait(int, void**, int, unsigned int, WaitMode, PendingSync*) + 68
    frame #3: 0x0000000101157320 libcoreclr.dylib`WaitForEndOfShutdown() + 72
    frame #4: 0x0000000100dd7c90 libcoreclr.dylib`CrstBase::ReleaseAndBlockForShutdownIfNotSpecialThread() + 120
    frame #5: 0x0000000101070ec8 libcoreclr.dylib`Debugger::LockForEventSending(Holder<Debugger*, &(Debugger::AcquireDebuggerLock(Debugger*)), &(Debugger::ReleaseDebuggerLock(Debugger*)), 0ul, &(int CompareDefault<Debugger*>(Debugger*, Debugger*)), true>*) + 48
    frame #6: 0x000000010106881c libcoreclr.dylib`DebuggerController::DispatchPatchOrSingleStep(Thread*, _CONTEXT*, unsigned char const*, SCAN_TRIGGER) + 512
    frame #7: 0x0000000101069b28 libcoreclr.dylib`DebuggerController::DispatchNativeException(_EXCEPTION_RECORD*, _CONTEXT*, unsigned int, Thread*) + 596
    frame #8: 0x000000010107621c libcoreclr.dylib`Debugger::FirstChanceNativeException(_EXCEPTION_RECORD*, _CONTEXT*, unsigned int, Thread*, int) + 124
    frame #9: 0x0000000100fa4044 libcoreclr.dylib`HandleHardwareException(PAL_SEHException*) + 396
    frame #10: 0x0000000100d65860 libcoreclr.dylib`SEHProcessException(PAL_SEHException*) + 348
    frame #11: 0x0000000100d98888 libcoreclr.dylib`PAL_DispatchException + 136
    frame #12: 0x0000000100d98518 libcoreclr.dylib`PAL_DispatchExceptionWrapper + 16
    frame #13: 0x0000000104cbea68
    frame #14: 0x0000000102cf6240 libAvaloniaNative.dylib`ComPtr<IAvnDispatcher>::~ComPtr() + 36
    frame #15: 0x000000019f9482e8 libsystem_c.dylib`__cxa_finalize_ranges + 476
    frame #16: 0x000000019f948070 libsystem_c.dylib`exit + 44
    frame #17: 0x000000019faa38d0 libdyld.dylib`dyld4::LibSystemHelpers::exit(int) const + 20
    frame #18: 0x000000019f70212c dyld`start + 2436
mikelle-rogers commented 1 month ago

After further investigation, we found that the Runtime was shut down and Avalonia is calling into the Runtime after it is shut down. This is unsupported behavior, so we have opened an issue in Avalonia here. We will close this issue.

We know the Runtime is shut down because

g_fEEShutDown = 0x00000027 = ShutDown_Start|ShutDown_Finalize1|ShutDown_Finalize2|ShutDown_SyncBlock`. 

The call stack of the app when it is hanging is: 
        Child SP               IP Call Site
000000016BD73D00 0000000195CA334C libsystem_platform.dylib!_platform_memmove + 428
000000016BD73D00 0000000106AC1BD8 libcoreclr.dylib!Thread::DoAppropriateWaitWorker(int, void**, int, unsigned int, WaitMode, void*) + 924
000000016BD73F60 0000000106AC11BC libcoreclr.dylib!Thread::DoAppropriateWait(int, void**, int, unsigned int, WaitMode, PendingSync*)::$_1::operator()(Thread::DoAppropriateWait(int, void**, int, unsigned int, WaitMode, PendingSync*)::__EEParam*) const + 92
000000016BD73F90 0000000106AB98D0 libcoreclr.dylib!Thread::DoAppropriateWait(int, void**, int, unsigned int, WaitMode, PendingSync*) + 352
000000016BD740A0 0000000106CD3F80 libcoreclr.dylib!CLREventBase::WaitEx(unsigned int, WaitMode, PendingSync*) + 328
000000016BD740F0 0000000106CD3E20 libcoreclr.dylib!CLREventBase::Wait(unsigned int, int, PendingSync*) + 84
000000016BD74130 00000001070D1468 libcoreclr.dylib!WaitForEndOfShutdown() + 132
000000016BD74150 0000000106947A94 libcoreclr.dylib!CrstBase::ReleaseAndBlockForShutdownIfNotSpecialThread() + 68
000000016BD74190 0000000106F672C8 libcoreclr.dylib!Debugger::DoNotCallDirectlyPrivateLock() + 400
000000016BD741E0 0000000106F54D34 libcoreclr.dylib!Debugger::AcquireDebuggerLock(Debugger*) + 24
000000016BD74200 0000000106F65E48 libcoreclr.dylib!FunctionBase<Debugger*, &(Debugger::AcquireDebuggerLock(Debugger*)), &(Debugger::ReleaseDebuggerLock(Debugger*))>::DoAcquire() + 28
000000016BD74220 0000000106F65DE8 libcoreclr.dylib!BaseHolder<Debugger*, FunctionBase<Debugger*, &(Debugger::AcquireDebuggerLock(Debugger*)), &(Debugger::ReleaseDebuggerLock(Debugger*))>, 0ul, &(int CompareDefault<Debugger*>(Debugger*, Debugger*))>::Acquire() + 100
000000016BD74240 0000000106F6C4F8 libcoreclr.dylib!Debugger::LockForEventSending(Holder<Debugger*, &(Debugger::AcquireDebuggerLock(Debugger*)), &(Debugger::ReleaseDebuggerLock(Debugger*)), 0ul, &(int CompareDefault<Debugger*>(Debugger*, Debugger*)), true>*) + 100
000000016BD74270 0000000106F54708 libcoreclr.dylib!DebuggerController::DispatchPatchOrSingleStep(Thread*, _CONTEXT*, unsigned char const*, SCAN_TRIGGER) + 1204
000000016BD74460 0000000106F57A1C libcoreclr.dylib!DebuggerController::DispatchNativeException(_EXCEPTION_RECORD*, _CONTEXT*, unsigned int, Thread*) + 1380
000000016BD74510 0000000106F781D0 libcoreclr.dylib!Debugger::FirstChanceNativeException(_EXCEPTION_RECORD*, _CONTEXT*, unsigned int, Thread*, int) + 168
000000016BD74590 0000000106D2D98C libcoreclr.dylib!HandleHardwareException(PAL_SEHException*) + 1940
000000016BD75EB0 0000000106809C58 libcoreclr.dylib!SEHProcessException(PAL_SEHException*) + 256
000000016BD75EE0 000000010689C744 libcoreclr.dylib!PAL_DispatchExceptionInner(_CONTEXT*, _EXCEPTION_RECORD*) + 308
000000016BD75F90 000000010689C5DC libcoreclr.dylib!PAL_DispatchException + 52
000000016BD765D0 000000010689BB30 libcoreclr.dylib!PAL_DispatchExceptionWrapper + 16
000000016BD76FB0 0000000301A34B90 
FFFFFFFFFFFFFFFF 0000000107616240 libAvaloniaNative.dylib!ComPtr<IAvnDispatcher>::~ComPtr() + 36
000000016BD77000 0000000195B302E8 libsystem_c.dylib!__cxa_finalize_ranges + 476
000000016BD77060 0000000195B30070 libsystem_c.dylib!exit + 44
000000016BD77080 0000000195C8B8D0 libdyld.dylib!dyld4::LibSystemHelpers::getenv(char const*) const
000000016BD77090 00000001958EA12C dyld!start + 2436

The frame without a symbol is MicroCom.Runtime.MicroComVtblBase.Release(MicroCom.Runtime.Ccw*).

jkotas commented 1 month ago

Avalonia is calling into the Runtime after it is shut down. This is unsupported behavior.

That's not correct. This is expected to work and not hang. We have fixed number of similar bugs. For example https://github.com/dotnet/runtime/pull/100293 was fixing a very similar hang/crash where some networking code was calling into the runtime after the shutdown.

Runtime shutdown is not really a thing. We are really only notifying the diagnostic stack about the shutdown. However, the runtime is expected to work until the OS physically kills the process and the diagnostic stack is expected to stay out of the way once it was notified about shutdown.

It is fine for the debugger experience to be degraded after the diagnostic stack was notified about the shutdown. However, it is not expected to hang.

jkotas commented 1 month ago

we have opened an issue in Avalonia https://github.com/AvaloniaUI/Avalonia/issues/16368

I agree that Avalonia may want to avoid calling into the runtime during shutdown. Minimizing work done during shutdown is general goodness. But it is not changing the fact that we have a bug in the runtime shutdown logic.

mikelle-rogers commented 4 weeks ago

This issue no longer repros when using Avalonia 11.0.11.

filipnavara commented 4 weeks ago

This issue no longer repros when using Avalonia 11.0.11.

Well, yeah, because Avalonia fixed the issue that triggered it... which still doesn't fix the debugger/runtime behavior even if the Avalonia users are unblocked.

tommcdon commented 4 weeks ago

@jkotas Since Avalonia has already fixed the issue, shall we close this issue?

filipnavara commented 4 weeks ago

@jkotas Since Avalonia has already fixed the issue, shall we close this issue?

I am fine with deprioritizing any work on this, but it's NOT fixed. The deadlock can still be triggered with the latest runtime. It just happened that Avalonia 11.0.0 was a high profile library that triggered it, and that particular version continues to trigger it today. The same problem can be triggered by other code, and I have seen it happening. This was just convenient way to reproduce it.

jkotas commented 4 weeks ago

I agree with @filipnavara. This bug is not fixed.

We were lucky that the Avalonia repro was reliable enough to get to the bottom of it. The most likely manifestation of this bug is intermittent hang at the end of debugging session. My guess is that it is hit intermittently by quite a few users out there in number of workloads that we just do not know about.