Sergio0694 opened this issue 2 weeks ago
Tagging subscribers to this area: @agocke, @MichalStrehovsky, @jkotas. See info in area-owners.md if you want to be subscribed.
cc @AaronRobinsonMSFT @jkoritzinsky, perhaps you might also have thoughts on this or some knowledge to share?
Also + @VSadov in case this could be related to suspension
It's not related to suspension, but definitely an area where expertise from @VSadov would help. Looks like native AOT will just wait on an event using an OS API and that's all there is to it. CoreCLR does a lot more to figure out how exactly to wait (Does the thread have a SynchronizationContext? Do we need to pump window messages with MsgWaitForMultipleObjects? etc.). As usual, this is a place that is a lot more complicated in the CoreCLR VM, and it's hard to say which part is relevant and how much of it we need to implement.
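For context, here's a rough standalone sketch of the kind of decision that wait path makes; this is not the actual CoreCLR code, and `WaitPossiblyPumping` plus its parameters are invented for illustration. The idea is that most threads can simply block (optionally alertably), while an STA-like thread has to keep dispatching window messages so that calls into its apartment can still be serviced.

```cpp
// Illustrative sketch only; not CoreCLR's implementation.
#include <windows.h>

DWORD WaitPossiblyPumping(HANDLE handle, bool staLikeThread, bool alertable)
{
    if (!staLikeThread)
    {
        // Plain blocking wait; an alertable wait additionally lets queued APCs run.
        return WaitForSingleObjectEx(handle, INFINITE, alertable ? TRUE : FALSE);
    }

    for (;;)
    {
        // Wake when the handle is signaled *or* when window messages arrive.
        DWORD result = MsgWaitForMultipleObjectsEx(
            1, &handle, INFINITE, QS_ALLINPUT, alertable ? MWMO_ALERTABLE : 0);
        if (result != WAIT_OBJECT_0 + 1)
            return result; // handle signaled, APC ran, or the wait failed

        // Drain and dispatch pending messages, then go back to waiting.
        MSG msg;
        while (PeekMessageW(&msg, nullptr, 0, 0, PM_REMOVE))
        {
            TranslateMessage(&msg);
            DispatchMessageW(&msg);
        }
    }
}
```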
I think waiting with "alertable" is to allow a sleeping/waiting thread to react to `Thread.Interrupt()`. However, `Thread.Interrupt()` is not supported on NativeAOT.
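For reference, here's a minimal standalone Win32 sketch (not runtime code; all names are made up) of what an alertable wait buys: a user APC queued to the waiting thread, which is roughly the mechanism `Thread.Interrupt()` relies on in CoreCLR, wakes the wait with `WAIT_IO_COMPLETION`, whereas a non-alertable wait would simply keep blocking until the event is signaled.

```cpp
// Illustrative sketch only.
#include <windows.h>
#include <cstdio>

static void CALLBACK InterruptApc(ULONG_PTR /*context*/)
{
    // Runs on the waiting thread, but only once that thread enters an alertable state.
    std::printf("APC delivered\n");
}

static DWORD WINAPI Waiter(void* eventHandle)
{
    // With bAlertable = TRUE this returns WAIT_IO_COMPLETION when the APC runs;
    // with FALSE it would stay blocked until the event is signaled.
    DWORD result = WaitForSingleObjectEx(static_cast<HANDLE>(eventHandle), INFINITE, TRUE);
    std::printf("wait returned %lu\n", result);
    return 0;
}

int main()
{
    HANDLE event = CreateEventW(nullptr, TRUE, FALSE, nullptr);
    HANDLE thread = CreateThread(nullptr, 0, Waiter, event, 0, nullptr);

    Sleep(100);                             // let the waiter block
    QueueUserAPC(InterruptApc, thread, 0);  // "interrupt" the alertable wait

    WaitForSingleObject(thread, INFINITE);
    CloseHandle(thread);
    CloseHandle(event);
    return 0;
}
```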
I suspect the culprit is something with COM and message pumping, but it looks like we call `CoWaitForMultipleHandles`, which should do that.
CoreCLR may be doing the wait via a user-installed synchronization context. It would be useful to find out the callstack of the wait in CoreCLR.
I did have a chat with some COM folks. On ASTA threads such as this one, `CoWaitForMultipleHandles` doesn't pump COM messages unless `COWAIT_DISPATCH_CALLS` is passed as the flag. But that isn't passed in the CoreCLR case here either. So that might explain why it is waiting, but it doesn't explain why CoreCLR doesn't hit this issue.
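To make the flag combinations concrete, here's a minimal sketch of the three waits being discussed, assuming the finalizer wait boils down to a single event handle (the helper names are hypothetical, not what either runtime actually has): a default wait, an alertable wait, and a wait that passes `COWAIT_DISPATCH_CALLS` so an ASTA keeps servicing incoming cross-apartment calls.

```cpp
// Illustrative sketch only; neither runtime is structured exactly like this.
#include <windows.h>
#include <combaseapi.h>

HRESULT WaitNoDispatch(HANDLE finalizerDoneEvent)
{
    DWORD index = 0;
    // COWAIT_DEFAULT: no APCs delivered, and on an ASTA no incoming calls are dispatched.
    return CoWaitForMultipleHandles(COWAIT_DEFAULT, INFINITE, 1, &finalizerDoneEvent, &index);
}

HRESULT WaitAlertableOnly(HANDLE finalizerDoneEvent)
{
    DWORD index = 0;
    // COWAIT_ALERTABLE: queued APCs can run, but incoming calls are still not dispatched on an ASTA.
    return CoWaitForMultipleHandles(COWAIT_ALERTABLE, INFINITE, 1, &finalizerDoneEvent, &index);
}

HRESULT WaitDispatchingCalls(HANDLE finalizerDoneEvent)
{
    DWORD index = 0;
    // COWAIT_DISPATCH_CALLS: the ASTA keeps servicing cross-apartment calls while waiting,
    // so a finalizer thread calling back into this apartment could make progress.
    return CoWaitForMultipleHandles(COWAIT_DISPATCH_CALLS, INFINITE, 1, &finalizerDoneEvent, &index);
}
```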
My initial guess was that the `CleanupWrappersInCurrentCtxThread` call was doing some cleanup for the ASTA scenarios before it waited, but after some debugging of the CoreCLR scenario, it seems it isn't: there is a conditional that only enables it in certain scenarios, and that isn't set here. And from my debugging, it seems we are doing the same wait on CoreCLR as on AOT, but with alertable set. Not sure if that is somehow making us get lucky and not hit this issue due to other APC calls happening, which is what that flag seems to control.
> Not sure if that is somehow making us get lucky and not hit this issue due to other APC calls happening
You can build a local native AOT package that changes this to alertable=true and try to repro it with that to prove or disprove this hypothesis.
If there's something you'd like to try, here are the accelerated steps:
Then publish your project with native AOT as usual (you might want to delete all of bin/obj first, since this is not incremental), but set the `IlcSdkPath` property like this: `<IlcSdkPath>{REPO_PATH}\artifacts\bin\coreclr\windows.x64.Release\aotsdk\</IlcSdkPath>`, replacing `{REPO_PATH}` with where you cloned the runtime repo. This will pick up your build of runtime/corelib. The nice thing is that once you do this, you can set breakpoints and debug within the code. You should also be able to pass `-c Debug` to the build.cmd invocation to build the debug version of the runtime (it will be dropped to a similar path under artifacts) and use that instead; it's easier to debug.
This is not native AOT specific.
I stepped through the CoreCLR VM version of this. We end up taking a path where we wait like this:
```
coreclr.dll!MsgWaitHelper(int numWaiters, void * * phEvent, int bWaitAll, unsigned long millis, int bAlertable) Line 3140 C++
coreclr.dll!Thread::DoAppropriateAptStateWait(int numWaiters, void * * pHandles, int bWaitAll, unsigned long timeout, WaitMode mode) Line 3178 C++
coreclr.dll!Thread::DoAppropriateWaitWorker(int countHandles, void * * handles, int waitAll, unsigned long millis, WaitMode mode, void * associatedObjectForMonitorWait) Line 3363 C++
coreclr.dll!Thread::DoAppropriateWait(int countHandles, void * * handles, int waitAll, unsigned long millis, WaitMode mode, PendingSync * syncState) Line 3032 C++
[Inline Frame] coreclr.dll!CLREventBase::WaitEx(unsigned long) Line 459 C++
coreclr.dll!CLREventBase::Wait(unsigned long dwMilliseconds, int alertable, PendingSync * syncState) Line 413 C++
coreclr.dll!FinalizerThread::FinalizerThreadWait() Line 599 C++
coreclr.dll!InteropLibImports::WaitForRuntimeFinalizerForExternal() Line 1148 C++
[Inline Frame] Windows.UI.Xaml.dll!DirectUI::ReferenceTrackerManager::TriggerFinalization() Line 350 C++
Windows.UI.Xaml.dll!DirectUI::DXamlCore::OnAfterAppSuspend() Line 4155 C++
Windows.UI.Xaml.dll!XAML::PLM::PLMHandler::InvokeAfterAppSuspendCallback() Line 431 C++
Windows.UI.Xaml.dll!XAML::PLM::PLMHandler::DecrementAppSuspendActivityCount() Line 226 C++
```
There's a difference between native AOT and CoreCLR: on CoreCLR we pass `COWAIT_ALERTABLE`, on native AOT we don't.
In the end it doesn't make a difference because they both deadlock.
Description
We're hitting a 100% consistent hang during application shutdown, only on Native AOT. It seems that the finalizer thread and the UI thread (ASTA) are possibly in a deadlock, resulting in the application process remaining alive after closing the main window. After a few seconds, Windows proceeds to kill the process, which shows up in WER as a hang (which is expected). Only repros with Native AOT, whereas CoreCLR seems to work fine.
Reproduction Steps
I don't have a minimal repro. Please ping me on Teams for instructions on how to deploy the Store locally to repro. Alternatively, I can also share an MSIX package for sideloading, with instructions on how to install it for testing (and how to restore the retail Store after that).
Here is a memory dump of the process during the hang (the process was paused from WinDbg on the presumed deadlock).
Expected behavior
The application should shut down correctly when closing the window.
Actual behavior
Here are the two relevant stack traces I see in WinDbg.
Finalizer thread (`!FinalizerStart`)
```
[0x0] ntdll!ZwWaitForMultipleObjects+0x14
[0x1] KERNELBASE!WaitForMultipleObjectsEx+0xe9
[0x2] combase!MTAThreadWaitForCall+0xfb
[0x3] combase!MTAThreadDispatchCrossApartmentCall+0x2bc
[0x4] combase!CSyncClientCall::SwitchAptAndDispatchCall+0x707 (Inline Function)
[0x5] combase!CSyncClientCall::SendReceive2+0x825
[0x6] combase!SyncClientCallRetryContext::SendReceiveWithRetry+0x2f (Inline Function)
[0x7] combase!CSyncClientCall::SendReceiveInRetryContext+0x2f (Inline Function)
[0x8] combase!DefaultSendReceive+0x6e
[0x9] combase!CSyncClientCall::SendReceive+0x300
[0xa] combase!CClientChannel::SendReceive+0x98
[0xb] combase!NdrExtpProxySendReceive+0x58
[0xc] RPCRT4!Ndr64pSendReceive+0x39 (Inline Function)
[0xd] RPCRT4!NdrpClientCall3+0x3de
[0xe] combase!ObjectStublessClient+0x14c
[0xf] combase!ObjectStubless+0x42
[0x10] combase!CObjectContext::InternalContextCallback+0x2fd
[0x11] combase!CObjectContext::ContextCallback+0x902
[0x12]
```
UI thread (`shcore!_WrapperThreadProc`, ApplicationView ASTA)
```
[0x0] win32u!ZwUserMsgWaitForMultipleObjectsEx+0x14
[...]
[0x5] combase!CoWaitForMultipleHandles+0xc2
[0x6]
```
It seems that:
- `ComWrappers::IReferenceTrackerHost::ReleaseDisconnectedReferenceSources` is called
- `GC::WaitForPendingFinalizers` is invoked from there, blocking the UI thread
- meanwhile, the finalizer thread is finalizing an `ObjectReferenceWithContext<T>` object
- `CallInContext` is called (here)
- `WaitForMultipleObjectsEx` then blocks, waiting on the (already blocked) UI thread
Some potentially relevant differences we noticed in the finalizer logic across CoreCLR (which works fine) and NativeAOT:
CoreCLR: https://github.com/dotnet/runtime/blob/302e0d4cf9d603fbc76e508b0b41e778c69f2186/src/coreclr/vm/finalizerthread.cpp#L493-L548
Native AOT: https://github.com/dotnet/runtime/blob/302e0d4cf9d603fbc76e508b0b41e778c69f2186/src/coreclr/nativeaot/Runtime/FinalizerHelpers.cpp#L115-L151
It seems that they're similar, however:
- CoreCLR waits with `alertable: TRUE`, whereas Native AOT waits with `alertable: FALSE`
- Native AOT also gets the `allowReentrantWait: TRUE` param, which makes it call `CoWaitForMultipleHandles` here

Not sure whether that's intentional (why?) and whether it's related to the issue, just something we noticed.
Regression?
No.
Known Workarounds
None, this is a blocker.
Configuration