dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
15.15k stars 4.71k forks source link

Apple Silicon JIT/jit64/opt/cse/VolatileTest_op_* test ThreadSuspend deadlocks in GCStress=c #50381

Closed sdmaclea closed 2 years ago

sdmaclea commented 3 years ago

The JIT/jit64/opt/cse/VolatileTestop* family of tests sometimes deadlock in ThreadSuspend::SuspendRuntime(...) on Apple Silicon when in GCStress=c mode.

Looking at a core dump.

Thread 1 -- TestOp() times out trying to suspend for GC. Thread 8 -- TieredCompilationManager::TryDeactivateTieringDelay() is constructing a GCCoop() and while waiting in native code for a pthread cond wait. I tlooks like it is trying to force Thread 1 pThread=0x000000015c82f210 into preemptive mode. Thread 10 -- Is running Flip(). It is running GC interruptible code.

Looks like the GC Mode for thread 10 is incorrect.

STRESS LOG:
    facilitiesToLog  = 0x80080041
    levelToLog       = 9
    MaxLogSizePerThread = 0x2000000 (33554432)
    MaxTotalLogSize = 0x40000000 (1073741824)
    CurrentTotalLogChunk = 1875
    ThreadsWithLogs  = 5
    Clock frequency  = 1.000 GHz
    Total elapsed time 40.820 sec

THREAD  TIMESTAMP     FACILITY                              MESSAGE
  ID  (sec from start)
--------------------------------------------------------------------------------------
ebef1  40.820272625 : `SYNC`                   Timed out waiting for rendezvous event 1 threads remaining
ebef1  40.819018083 : `SYNC`               Waiting for suspend event 1 threads remaining
ebef1  40.819018042 : `SYNC`               Spinning, 1 threads remaining
...
...
...
ebef1   0.820775708 : `SYNC`                   Timed out waiting for rendezvous event 1 threads remaining
ebef1   0.819490125 : `SYNC`               Waiting for suspend event 1 threads remaining
ebef1   0.819490083 : `SYNC`               Spinning, 1 threads remaining
ebef1   0.819489042 : `SYNC`                   Inspecting thread 0x5c83d010 ID 0x4 coop mode = 1
ebef1   0.819489000 : `SYNC`                   Inspecting thread 0x5c82f210 ID 0x3 coop mode = 0
ebef1   0.819488917 : `SYNC`                   Inspecting thread 0x5c829c10 ID 0x2 coop mode = 0
ebef1   0.819468125 : `SYNC`               Thread::SuspendRuntime(reason=0x1)
ebef1   0.819468042 : `SYNC`               Reset forbid suspend [0] for thread 0x15c820810.
ebef1   0.819467958 : `SYNC`               Set forbid suspend [0] for thread 0x15c820810.
ebef1   0.819418292 : `SYNC`               Thread::ResumeRuntime() - End
ebef1   0.819418167 : `SYNC`               Thread::ResumeRuntime(finishedGC=1, SuspendSucceeded=1) - Start
ebef1   0.819417792 : `SYNC`               Reset forbid suspend [0] for thread 0x15c820810.
ebef1   0.819417708 : `SYNC`               Set forbid suspend [0] for thread 0x15c820810.
ebef1   0.819404667 : `GC`GCALLOC`GCROOTS` ========== ENDGC 4877 (gen = 2, collect_classes = 1) ==========={
(lldb) clrthreads                                                                                                                                                                                                                             
ThreadCount:      4                                                                                                                                                                                                                           
UnstartedThread:  0                                                                                                                                           
BackgroundThread: 2                                                                                                                                                                                                                                                                                                                                                                                                                                                                         
PendingThread:    0                                                                                                                                                                                         
DeadThread:       0                                                                                                                                                                                                                           
Hosted Runtime:   no                                                                                                                                                                                                                                                                          
                                                                                                            Lock                                                                                                                                                                              
 DBG   ID     OSID ThreadOBJ           State GC Mode     GC Alloc Context                  Domain           Count Apt Exception                                                                                                                                                                                                                                                                                                                                                             
XXXX    1    ebef1 000000015C820810    20020 Preemptive  0000000000000000:0000000000000000 000000015C81EA10 -00001 Ukn (GC)                                   
XXXX    2    ebef8 000000015C829C10    21220 Preemptive  0000000000000000:0000000000000000 000000015C81EA10 -00001 Ukn (Finalizer)                                                                                                                                                            
XXXX    3    ebefa 000000015C82F210    21220 Preemptive  0000000000000000:0000000000000000 000000015C81EA10 -00001 Ukn                                                                                      
XXXX    4    ebf02 000000015C83D010    21022 Cooperative 0000000000000000:0000000000000000 000000015C81EA10 -00001 Ukn                                                                                                                        
(lldb) bt all                                                                                                                                                                                                                                 
  thread #1, stop reason = signal SIGSTOP                                                                                                                                                                                                                                                     
    frame #0: 0x0000000100cfd624 libcoreclr.dylib`DBG_DebugBreak at debugbreak.S:7                                                                                                                                                                                                                                                                                                                                                                                                          
    frame #1: 0x0000000100c91e04 libcoreclr.dylib`::DebugBreak() at debug.cpp:406:9 [opt]                                                                     
    frame #2: 0x0000000100feb754 libcoreclr.dylib`ThreadSuspend::SuspendRuntime(reason=<unavailable>) at threadsuspend.cpp:3541:17 [opt]                      
    frame #3: 0x0000000100fee4a0 libcoreclr.dylib`ThreadSuspend::SuspendEE(reason=SUSPEND_FOR_GC) at threadsuspend.cpp:5641:5 [opt]                                                                                                                                                           
    frame #4: 0x0000000100f153c8 libcoreclr.dylib`GCToEEInterface::SuspendEE(reason=SUSPEND_FOR_GC) at gcenv.ee.cpp:26:5 [opt]                                                                                                                                                                
    frame #5: 0x000000010108e830 libcoreclr.dylib`WKS::GCHeap::GarbageCollectGeneration(this=<unavailable>, gen=2, reason=reason_gcstress) at gc.cpp:42144:9 [opt]                                                                            
    frame #6: 0x00000001010b8d60 libcoreclr.dylib`WKS::GCHeap::GarbageCollect(int, bool, int) [inlined] WKS::GCHeap::GarbageCollectTry(this=<unavailable>, generation=<unavailable>, low_memory_p=<unavailable>, mode=<unavailable>) at gc.cpp:41435:12 [opt]                                 
    frame #7: 0x00000001010b8d54 libcoreclr.dylib`WKS::GCHeap::GarbageCollect(this=<unavailable>, generation=<unavailable>, low_memory_p=<unavailable>, mode=<unavailable>) at gc.cpp:41369 [opt]                                             
    frame #8: 0x00000001010b8598 libcoreclr.dylib`WKS::GCHeap::StressHeap(this=0x000000015c004ba0, context=<unavailable>) at gc.cpp:41004:9 [opt]                                                                                                                                             
    frame #9: 0x0000000100f15054 libcoreclr.dylib`DoGcStress(regs=0x000000015c83e000, nativeCodeVersion=<unavailable>) at gccover.cpp:1745:35 [opt]           
    frame #10: 0x0000000100f14c5c libcoreclr.dylib`OnGcCoverageInterrupt(regs=<unavailable>) at gccover.cpp:1366:5 [opt]                                      
    frame #11: 0x0000000100da7740 libcoreclr.dylib`IsGcMarker(pContext=0x000000015c83e000, pExceptionRecord=<unavailable>) at excep.cpp:6492:17 [opt]                             
    frame #12: 0x000000010101807c libcoreclr.dylib`HandleHardwareException(ex=0x000000016f5014e8) at exceptionhandling.cpp:5148:59 [opt]                                          
    frame #13: 0x0000000100c931dc libcoreclr.dylib`SEHProcessException(exception=0x000000016f5014e8) at seh.cpp:267:21 [opt]                                  
    frame #14: 0x0000000100cfdd90 libcoreclr.dylib`::PAL_DispatchException(pContext=0x000000016f502110, pExRecord=0x000000016f502078, pMachExceptionInfo=0x000000016f501b38) at machexception.cpp:388:29 [opt]                                                                                
    frame #15: 0x0000000100cfd6a8 libcoreclr.dylib`PAL_DispatchExceptionWrapper at dispatchexceptionwrapper.S:43                                                                                                                                                                              
    frame #16: 0x00000002812bdc28                          // Managed TestCSE.TestOp()                                                                                                                                                 
    frame #17: 0x00000002812bc52c                                                                                                                                                                                                                                                             
    frame #18: 0x0000000280d9b994                                                                                                                                                                                                                                                             
    frame #19: 0x00000001010c0d0c libcoreclr.dylib`CallDescrWorkerInternal at calldescrworkerarm64.S:71                                                                                                                                                                                       
    frame #20: 0x0000000100ec225c libcoreclr.dylib`CallDescrWorkerWithHandler(pCallDescrData=0x000000016f502770, fCriticalCall=<unavailable>) at callhelpers.cpp:71:5 [opt]                                                                                                                   
    frame #21: 0x0000000100ec2ba4 libcoreclr.dylib`MethodDescCallSite::CallTargetWorker(this=0x000000016f502958, pArguments=0x000000016f502920, pReturnValue=0x000000016f502a38, cbReturnValue=8) at callhelpers.cpp:547:9 [opt]
    frame #22: 0x0000000100d209b4 libcoreclr.dylib`RunMain(MethodDesc*, short, int*, REF<PtrArray>*) [inlined] MethodDescCallSite::Call_RetArgSlot(this=0x000000016f502958, pArguments=0x000000016f502920) at callhelpers.h:458:9 [opt]
    frame #23: 0x0000000100d20980 libcoreclr.dylib`RunMain(MethodDesc*, short, int*, REF<PtrArray>*) at assembly.cpp:1481 [opt]                               
    frame #24: 0x0000000100d207a0 libcoreclr.dylib`RunMain(MethodDesc*, short, int*, REF<PtrArray>*) [inlined] RunMain(this=<unavailable>, pParam=0x000000016f5028a8)::$_0::operator()(Param*) const::'lambda'(Param*)::operator()(Param*) const at assembly.cpp:1549 [opt]                                                                                                                                                                                                                 
    frame #25: 0x0000000100d207a0 libcoreclr.dylib`RunMain(MethodDesc*, short, int*, REF<PtrArray>*) at assembly.cpp:1551 [opt]                                                                             
    frame #26: 0x0000000100d206c4 libcoreclr.dylib`RunMain(pFD=<unavailable>, numSkipArgs=<unavailable>, piRetVal=<unavailable>, stringArgs=<unavailable>) at assembly.cpp:1551 [opt]                                                         
    frame #27: 0x0000000100d20d4c libcoreclr.dylib`Assembly::ExecuteMainMethod(this=<unavailable>, stringArgs=0x000000016f502df8, waitForOtherThreads=YES) at assembly.cpp:1661:18 [opt]                                                                                                      
    frame #28: 0x0000000100d75e2c libcoreclr.dylib`CorHost2::ExecuteAssembly(this=<unavailable>, dwAppDomainId=<unavailable>, pwzAssemblyPath=<unavailable>, argc=<unavailable>, argv=0x0000000000000000, pReturnValue=0x000000016f502fbc) at corhost.cpp:384:39 [opt]                        
    frame #29: 0x0000000100d02344 libcoreclr.dylib`::coreclr_execute_assembly(hostHandle=0x000000015c113c50, domainId=1, argc=0, argv=<unavailable>, managedAssemblyPath="/Users/stmaclea/git/runtime/artifacts/tests/coreclr/OSX.arm64.Checked/JIT/jit64/opt/cse/VolatileTest_op_sub/VolatileTest_op_sub.dll", exitCode=0x000000016f502fbc) at unixinterface.cpp:446:24 [opt]
    frame #30: 0x0000000100900230 corerun`main [inlined] run(config=0x000000016f502f70) at corerun.cpp:358:18 [opt]                                           
    frame #31: 0x0000000100900124 corerun`main(argc=<unavailable>, argv=<unavailable>) at corerun.cpp:509 [opt]                                                                                                                                                                               
    frame #32: 0x000000018e164420 libdyld.dylib`start + 4                                                                                                                                                   
  thread #2, stop reason = signal SIGSTOP                                                                                                                                                                                                     
    frame #0: 0x000000018e10ce54 libsystem_kernel.dylib`mach_msg_trap + 8                                                                                     
    frame #1: 0x000000018e10d204 libsystem_kernel.dylib`mach_msg + 76                                                                                                                                       
    frame #2: 0x0000000100cff2bc libcoreclr.dylib`MachMessage::Receive(this=0x000000016f589500, hPort=<unavailable>) at machmessage.cpp:43:15 [opt]                                                                                                                                           
    frame #3: 0x0000000100cfdf90 libcoreclr.dylib`SEHExceptionThread(args=<unavailable>) at machexception.cpp:1013:18 [opt]                                                                                 
    frame #4: 0x000000018e146fd4 libsystem_pthread.dylib`_pthread_start + 320                                                                                                                                                                                                                 
  thread #3, stop reason = signal SIGSTOP                                                                                                                                                                                                                                                     
    frame #0: 0x000000018e112d64 libsystem_kernel.dylib`kevent + 8                                                                                                                                                                                                                            
    frame #1: 0x0000000100cea0a8 libcoreclr.dylib`CorUnix::CPalSynchronizationManager::ReadBytesFromProcessPipe(this=0x000000015d014600, iTimeout=-1, pRecvBuf="", iBytes=1) at synchmanager.cpp:2195:28 [opt]                                
    frame #2: 0x0000000100ce9668 libcoreclr.dylib`CorUnix::CPalSynchronizationManager::ReadCmdFromProcessPipe(this=0x000000015d014600, iPollTimeout=-1, pswcWorkerCmd=0x000000016f616f2c, pshridMarshaledData=0x000000016f616f20, pdwData=0x000000016f616f1c) at synchmanager.cpp:2011:16 [opt]                                                                                                                                                                                             
    frame #3: 0x0000000100ce8a00 libcoreclr.dylib`CorUnix::CPalSynchronizationManager::WorkerThread(pArg=0x000000015d014600) at synchmanager.cpp:1714:37 [opt]                                              
    frame #4: 0x0000000100cf9d44 libcoreclr.dylib`CorUnix::CPalThread::ThreadEntry(pvParam=0x000000015d015200) at thread.cpp:1862:16 [opt]                                                                  
    frame #5: 0x000000018e146fd4 libsystem_pthread.dylib`_pthread_start + 320                                                                                                                                                                                                                 
  thread #4, stop reason = signal SIGSTOP                                                                                                                                                                                                                                                     
    frame #0: 0x000000018e114f90 libsystem_kernel.dylib`poll + 8                                                                                                                                            
    frame #1: 0x000000010114efe0 libcoreclr.dylib`ds_ipc_poll(_DiagnosticsIpcPollHandle*, unsigned long, unsigned int, void (*)(char const*, unsigned int)) [inlined] ipc_poll_fds(pollfd*, unsigned long, unsigned int) at ds-ipc-pal-socket.c:468:17 [opt]                                                                                                                                                                                                                                
    frame #2: 0x000000010114efc0 libcoreclr.dylib`ds_ipc_poll(poll_handles_data=0x000000016f6a2bc8, poll_handles_data_len=1, timeout_ms=4294967295, callback=(libcoreclr.dylib`server_warning_callback(char const*, unsigned int) at ds-server.c:109))(char const*, unsigned int)) at ds-ipc-pal-socket.c:1065 [opt]                                                                                                                                                                        
    frame #3: 0x000000010124fe50 libcoreclr.dylib`ds_ipc_stream_factory_get_next_available_stream(callback=(libcoreclr.dylib`server_warning_callback(char const*, unsigned int) at ds-server.c:109))(char const*, unsigned int)) at ds-ipc.c:391:14 [opt]                                     
    frame #4: 0x000000010124d8d8 libcoreclr.dylib`server_thread(data=<unavailable>) at ds-server.c:126:34 [opt]                                                                                                                                                                               
    frame #5: 0x0000000100cf9d44 libcoreclr.dylib`CorUnix::CPalThread::ThreadEntry(pvParam=0x000000015d01b800) at thread.cpp:1862:16 [opt]                                                                                                    
    frame #6: 0x000000018e146fd4 libsystem_pthread.dylib`_pthread_start + 320                                                                                                                                                                                                                 
  thread #5, stop reason = signal SIGSTOP                                                                                                                                                                   
    frame #0: 0x000000018e10d8d8 libsystem_kernel.dylib`__open + 8                                                                                                                                                                            
    frame #1: 0x000000018e117bb4 libsystem_kernel.dylib`open + 64                                                                                                                                                                                                                             
    frame #2: 0x000000010114fc30 libcoreclr.dylib`TwoWayPipe::WaitForConnection(this=0x000000015c819dd0) at twowaypipe.cpp:87:21 [opt]                                                                                                                                                        
    frame #3: 0x000000010114827c libcoreclr.dylib`DbgTransportSession::TransportWorker(this=<unavailable>) at dbgtransportsession.cpp:1319:25 [opt]                                                                                                                                           
    frame #4: 0x0000000101146e1c libcoreclr.dylib`DbgTransportSession::TransportWorkerStatic(pvContext=<unavailable>) at dbgtransportsession.cpp:1235:40 [opt]                                                                                
    frame #5: 0x0000000100cf9d44 libcoreclr.dylib`CorUnix::CPalThread::ThreadEntry(pvParam=0x000000015c81a200) at thread.cpp:1862:16 [opt]                                                                                                    
    frame #6: 0x000000018e146fd4 libsystem_pthread.dylib`_pthread_start + 320                                                                                                                                                                 
  thread #6, stop reason = signal SIGSTOP                                                                                                                                                                                                                                                     
    frame #0: 0x000000018e1105c8 libsystem_kernel.dylib`__psynch_cvwait + 8                                                                                                                                                                   
    frame #1: 0x000000018e147508 libsystem_pthread.dylib`_pthread_cond_wait + 1248                                                                                                                                                                                                            
    frame #2: 0x0000000100ce61f4 libcoreclr.dylib`CorUnix::CPalSynchronizationManager::ThreadNativeWait(ptnwdNativeWaitData=0x000000015d01c238, dwTimeout=4294967295, ptwrWakeupReason=0x000000016f7babdc, pdwSignaledObject=0x000000016f7babd8) at synchmanager.cpp:478:28 [opt]                                                                                                                                                                                                           
    frame #3: 0x0000000100ce59d4 libcoreclr.dylib`CorUnix::CPalSynchronizationManager::BlockThread(this=0x000000015d014600, pthrCurrent=0x000000015d01c000, dwTimeout=4294967295, fAlertable=false, fIsSleep=<unavailable>, ptwrWakeupReason=0x000000016f7bac78, pdwSignaledObject=0x000000016f7bac7c) at synchmanager.cpp:301:22 [opt]                                                                                                                                                     
    frame #4: 0x0000000100cee30c libcoreclr.dylib`CorUnix::InternalWaitForMultipleObjectsEx(pThread=0x000000015d01c000, nCount=<unavailable>, lpHandles=0x000000016f7baeb8, bWaitAll=<unavailable>, dwMilliseconds=<unavailable>, bAlertable=<unavailable>, bPrioritize=<unavailable>) at wait.cpp:637:45 [opt]                                                                                                                                                                             
    frame #5: 0x0000000100ceebd0 libcoreclr.dylib`::WaitForMultipleObjectsEx(nCount=3, lpHandles=0x000000016f7baeb8, bWaitAll=NO, dwMilliseconds=4294967295, bAlertable=NO) at wait.cpp:201:13 [opt]                                          
    frame #6: 0x0000000101143c70 libcoreclr.dylib`DebuggerRCThread::MainLoop(this=0x000000015c815810) at rcthread.cpp:970:30 [opt]                                                                                                                                                            
    frame #7: 0x0000000101143a7c libcoreclr.dylib`DebuggerRCThread::ThreadProc(this=0x000000015c815810) at rcthread.cpp:775:5 [opt]                                                                                                           
    frame #8: 0x000000010114360c libcoreclr.dylib`DebuggerRCThread::ThreadProcStatic((null)=<unavailable>) at rcthread.cpp:1359:8 [opt]                                                                                                                                                       
    frame #9: 0x0000000100cf9d44 libcoreclr.dylib`CorUnix::CPalThread::ThreadEntry(pvParam=0x000000015d01c000) at thread.cpp:1862:16 [opt]                                                                                                                                                    
    frame #10: 0x000000018e146fd4 libsystem_pthread.dylib`_pthread_start + 320                                                                                                                                                                                                                
  thread #7, stop reason = signal SIGSTOP                                                                                                                                                                                                                                                     
    frame #0: 0x000000018e1105c8 libsystem_kernel.dylib`__psynch_cvwait + 8                                                                                                                                                                                                                   
    frame #1: 0x000000018e147508 libsystem_pthread.dylib`_pthread_cond_wait + 1248                                                                                                                                                                                                            
    frame #2: 0x0000000100ce61f4 libcoreclr.dylib`CorUnix::CPalSynchronizationManager::ThreadNativeWait(ptnwdNativeWaitData=0x000000015c82ae38, dwTimeout=4294967295, ptwrWakeupReason=0x000000016f8469fc, pdwSignaledObject=0x000000016f8469f8) at synchmanager.cpp:478:28 [opt]             
    frame #3: 0x0000000100ce59d4 libcoreclr.dylib`CorUnix::CPalSynchronizationManager::BlockThread(this=0x000000015d014600, pthrCurrent=0x000000015c82ac00, dwTimeout=4294967295, fAlertable=false, fIsSleep=<unavailable>, ptwrWakeupReason=0x000000016f846a98, pdwSignaledObject=0x000000016f846a9c) at synchmanager.cpp:301:22 [opt]                                                                                                                                                     
    frame #4: 0x0000000100cee30c libcoreclr.dylib`CorUnix::InternalWaitForMultipleObjectsEx(pThread=0x000000015c82ac00, nCount=<unavailable>, lpHandles=0x00000001013ee3b0, bWaitAll=<unavailable>, dwMilliseconds=<unavailable>, bAlertable=<unavailable>, bPrioritize=<unavailable>) at wait.cpp:637:45 [opt]                                                                                                                                                                             
    frame #5: 0x0000000100ceebd0 libcoreclr.dylib`::WaitForMultipleObjectsEx(nCount=1, lpHandles=0x00000001013ee3b0, bWaitAll=NO, dwMilliseconds=4294967295, bAlertable=NO) at wait.cpp:201:13 [opt]                                                                                          
    frame #6: 0x0000000100f1344c libcoreclr.dylib`FinalizerThread::WaitForFinalizerEvent(event=0x000000015c00e010) at finalizerthread.cpp:173:17 [opt]                                                                                                                                        
    frame #7: 0x0000000100f136f4 libcoreclr.dylib`FinalizerThread::FinalizerThreadWorker(args=<unavailable>) at finalizerthread.cpp:258:9 [opt]                                                                                                                                               
    frame #8: 0x0000000100e773ec libcoreclr.dylib`ManagedThreadBase_DispatchOuter(ManagedThreadCallState*) [inlined] ManagedThreadBase_DispatchInner(pCallState=<unavailable>) at threads.cpp:7300:5 [opt]                                                                                    
    frame #9: 0x0000000100e773e4 libcoreclr.dylib`ManagedThreadBase_DispatchOuter(ManagedThreadCallState*) at threads.cpp:7344 [opt]                                                                                                                                                          
    frame #10: 0x0000000100e772dc libcoreclr.dylib`ManagedThreadBase_DispatchOuter(ManagedThreadCallState*) [inlined] ManagedThreadBase_DispatchOuter(this=<unavailable>, pParam=<unavailable>)::$_6::operator()(ManagedThreadBase_DispatchOuter(ManagedThreadCallState*)::TryArgs*) const::'lambda'(Param*)::operator()(Param*) const at threads.cpp:7502 [opt]                                                                                                                            
    frame #11: 0x0000000100e772d8 libcoreclr.dylib`ManagedThreadBase_DispatchOuter(ManagedThreadCallState*) at threads.cpp:7504 [opt]                                                                                                                                                         
    frame #12: 0x0000000100e7728c libcoreclr.dylib`ManagedThreadBase_DispatchOuter(pCallState=0x000000016f846ef8) at threads.cpp:7528 [opt]                                                                                                                                                   
    frame #13: 0x0000000100e77aa4 libcoreclr.dylib`ManagedThreadBase::FinalizerBase(void (*)(void*)) [inlined] ManagedThreadBase_NoADTransition(pTarget=<unavailable>, filterType=FinalizerThread)(void*), UnhandledExceptionLocation) at threads.cpp:7572:5 [opt]                            
    frame #14: 0x0000000100e77a90 libcoreclr.dylib`ManagedThreadBase::FinalizerBase(pTarget=<unavailable>)(void*)) at threads.cpp:7598 [opt]                                                                                                                                                  
    frame #15: 0x0000000100f13af4 libcoreclr.dylib`FinalizerThread::FinalizerThreadStart(args=<unavailable>) at finalizerthread.cpp:377:17 [opt]                                                                                                                                              
    frame #16: 0x0000000100cf9d44 libcoreclr.dylib`CorUnix::CPalThread::ThreadEntry(pvParam=0x000000015c82ac00) at thread.cpp:1862:16 [opt]                                                                                                                                                   
    frame #17: 0x000000018e146fd4 libsystem_pthread.dylib`_pthread_start + 320                                                                                                    
  thread #8, stop reason = signal SIGSTOP                                                                                                                                         
    frame #0: 0x000000018e1105c8 libsystem_kernel.dylib`__psynch_cvwait + 8                                                                                                       
    frame #1: 0x000000018e147508 libsystem_pthread.dylib`_pthread_cond_wait + 1248                                                                                                
    frame #2: 0x0000000100ce61f4 libcoreclr.dylib`CorUnix::CPalSynchronizationManager::ThreadNativeWait(ptnwdNativeWaitData=0x000000015c830438, dwTimeout=4294967295, ptwrWakeupReason=0x000000016f8d27ac, pdwSignaledObject=0x000000016f8d27a8) at synchmanager.cpp:478:28 [opt]                                                                                                                                                                                                           
    frame #3: 0x0000000100ce59d4 libcoreclr.dylib`CorUnix::CPalSynchronizationManager::BlockThread(this=0x000000015d014600, pthrCurrent=0x000000015c830200, dwTimeout=4294967295, fAlertable=false, fIsSleep=<unavailable>, ptwrWakeupReason=0x000000016f8d2848, pdwSignaledObject=0x000000016f8d284c) at synchmanager.cpp:301:22 [opt]                              
    frame #4: 0x0000000100cee30c libcoreclr.dylib`CorUnix::InternalWaitForMultipleObjectsEx(pThread=0x000000015c830200, nCount=<unavailable>, lpHandles=0x000000016f8d29e8, bWaitAll=<unavailable>, dwMilliseconds=<unavailable>, bAlertable=<unavailable>, bPrioritize=<unavailable>) at wait.cpp:637:45 [opt]                                                                                                                                                                             
    frame #5: 0x0000000100cee938 libcoreclr.dylib`::WaitForSingleObjectEx(hHandle=0x000000000000008c, dwMilliseconds=4294967295, bAlertable=NO) at wait.cpp:138:13 [opt]                                                                      
    frame #6: 0x0000000100fe453c libcoreclr.dylib`CLREventBase::WaitEx(unsigned int, WaitMode, PendingSync*) [inlined] CLREventWaitHelper2(handle=<unavailable>, dwMilliseconds=<unavailable>, alertable=<unavailable>) at synch.cpp:372:12 [opt]                                                                                                                                                                                                                                           
    frame #7: 0x0000000100fe4538 libcoreclr.dylib`CLREventBase::WaitEx(unsigned int, WaitMode, PendingSync*) [inlined] CLREventWaitHelper(this=<unavailable>, pParam=0x000000016f8d2a68)::$_1::operator()(CLREventWaitHelper(void*, unsigned int, int)::Param*) const at synch.cpp:397 [opt]                                                                                                                                                                                                
    frame #8: 0x0000000100fe4530 libcoreclr.dylib`CLREventBase::WaitEx(unsigned int, WaitMode, PendingSync*) at synch.cpp:399 [opt]                                               
    frame #9: 0x0000000100fe44d4 libcoreclr.dylib`CLREventBase::WaitEx(this=<unavailable>, dwMilliseconds=<unavailable>, mode=WaitMode_None, syncState=0x0000000000000000) at synch.cpp:466 [opt]                                             
    frame #10: 0x0000000100fea114 libcoreclr.dylib`Thread::RareDisablePreemptiveGC(this=0x000000015c82f210) at threadsuspend.cpp:2154:62 [opt]                                                                                                                                                                                                                       
    frame #11: 0x0000000100d14050 libcoreclr.dylib`GCHolderBase::EnterInternalCoop(this=0x000000016f8d2c78, pThread=0x000000015c82f210, conditional=true) at threads.h:5346:23 [opt]                                                                                                                                                                                 
    frame #12: 0x0000000100d13f98 libcoreclr.dylib`GCCoop::GCCoop(this=0x000000016f8d2c78) at threads.h:5466:15 [opt]                                                                                                                                                                                                                                                
    frame #13: 0x0000000100d03a0c libcoreclr.dylib`GCCoop::GCCoop(this=<unavailable>) at threads.h:5461:5 [opt] [artificial]                                                                                                                                                                                                                                         
    frame #14: 0x0000000100e7ca48 libcoreclr.dylib`TieredCompilationManager::TryDeactivateTieringDelay(this=0x000000015c81ff48) at tieredcompilation.cpp:599:9 [opt]              
    frame #15: 0x0000000100e7c73c libcoreclr.dylib`TieredCompilationManager::BackgroundWorkerStart(this=0x000000015c81ff48) at tieredcompilation.cpp:480:23 [opt]                 
    frame #16: 0x0000000100e7c5c0 libcoreclr.dylib`TieredCompilationManager::BackgroundWorkerBootstrapper1((null)=<unavailable>) at tieredcompilation.cpp:431:52 [opt]            
    frame #17: 0x0000000100e773ec libcoreclr.dylib`ManagedThreadBase_DispatchOuter(ManagedThreadCallState*) [inlined] ManagedThreadBase_DispatchInner(pCallState=<unavailable>) at threads.cpp:7300:5 [opt]                                   
    frame #18: 0x0000000100e773e4 libcoreclr.dylib`ManagedThreadBase_DispatchOuter(ManagedThreadCallState*) at threads.cpp:7344 [opt]                                             
    frame #19: 0x0000000100e772dc libcoreclr.dylib`ManagedThreadBase_DispatchOuter(ManagedThreadCallState*) [inlined] ManagedThreadBase_DispatchOuter(this=<unavailable>, pParam=<unavailable>)::$_6::operator()(ManagedThreadBase_DispatchOuter(ManagedThreadCallState*)::TryArgs*) const::'lambda'(Param*)::operator()(Param*) const at threads.cpp:7502 [opt]                                                                                                                            
    frame #20: 0x0000000100e772d8 libcoreclr.dylib`ManagedThreadBase_DispatchOuter(ManagedThreadCallState*) at threads.cpp:7504 [opt]                                                                                                                                                                                                                                
    frame #21: 0x0000000100e7728c libcoreclr.dylib`ManagedThreadBase_DispatchOuter(pCallState=0x000000016f8d2f38) at threads.cpp:7528 [opt]                                       
    frame #22: 0x0000000100e77a48 libcoreclr.dylib`ManagedThreadBase::KickOff(void (*)(void*), void*) [inlined] ManagedThreadBase_FullTransition(pTarget=<unavailable>, args=<unavailable>, filterType=ManagedThread)(void*), void*, UnhandledExceptionLocation) at threads.cpp:7548:5 [opt]                                                                                                                                                                                                
    frame #23: 0x0000000100e77a34 libcoreclr.dylib`ManagedThreadBase::KickOff(pTarget=<unavailable>, args=<unavailable>)(void*), void*) at threads.cpp:7583 [opt]                                                                             
    frame #24: 0x0000000100e7c534 libcoreclr.dylib`TieredCompilationManager::BackgroundWorkerBootstrapper0(args=0x000000015c82f210) at tieredcompilation.cpp:412:5 [opt]                                                                      
    frame #25: 0x0000000100cf9d44 libcoreclr.dylib`CorUnix::CPalThread::ThreadEntry(pvParam=0x000000015c830200) at thread.cpp:1862:16 [opt]                                       
    frame #26: 0x000000018e146fd4 libsystem_pthread.dylib`_pthread_start + 320                                                                                                    
  thread #9, stop reason = signal SIGSTOP                                                                                                                                                                                                                                                                                                                            
    frame #0: 0x000000018e10da3c libsystem_kernel.dylib`read + 8                                                                                                                                                                                                                                                                                                     
    frame #1: 0x0000000100a98eec libSystem.Native.dylib`SignalHandlerLoop + 96                                                                                                                                                                
    frame #2: 0x000000018e146fd4 libsystem_pthread.dylib`_pthread_start + 320                                                                                                                                                                 
* thread #10, stop reason = signal SIGSTOP                                                                                                                                                                                                    
  * frame #0: 0x00000002812bdb14                 // Managed TestCSE.Flip()                                                                                                                                                                                             
    frame #1: 0x00000002812bd9ec                                                                                                                                                                                                              
    frame #2: 0x00000002812bd918                                                                                                                                                                                                              
    frame #3: 0x00000002812bd870                                                                                                                                                                                                                                                                                                                                                                                                                                                            
    frame #4: 0x00000001010c0d0c libcoreclr.dylib`CallDescrWorkerInternal at calldescrworkerarm64.S:71                                                                                                                                        
    frame #5: 0x0000000100ec225c libcoreclr.dylib`CallDescrWorkerWithHandler(pCallDescrData=0x000000016f9eac30, fCriticalCall=<unavailable>) at callhelpers.cpp:71:5 [opt]                                                                    
    frame #6: 0x0000000100ec2544 libcoreclr.dylib`DispatchCallSimple(pSrc=0x000000016f9ead08, numStackSlotsToCopy=0, pTargetAddress=10751386376, dwDispatchCallSimpleFlags=0) at callhelpers.cpp:224:9 [opt]                                  
    frame #7: 0x0000000100ed9c34 libcoreclr.dylib`ThreadNative::KickOffThread_Worker(ptr=<unavailable>) at comsynchronizable.cpp:157:5 [opt]                                                                                                  
    frame #8: 0x0000000100e773ec libcoreclr.dylib`ManagedThreadBase_DispatchOuter(ManagedThreadCallState*) [inlined] ManagedThreadBase_DispatchInner(pCallState=<unavailable>) at threads.cpp:7300:5 [opt]                                    
    frame #9: 0x0000000100e773e4 libcoreclr.dylib`ManagedThreadBase_DispatchOuter(ManagedThreadCallState*) at threads.cpp:7344 [opt]                                                                                                          
    frame #10: 0x0000000100e772dc libcoreclr.dylib`ManagedThreadBase_DispatchOuter(ManagedThreadCallState*) [inlined] ManagedThreadBase_DispatchOuter(this=<unavailable>, pParam=<unavailable>)::$_6::operator()(ManagedThreadBase_DispatchOuter(ManagedThreadCallState*)::TryArgs*) const::'lambda'(Param*)::operator()(Param*) const at threads.cpp:7502 [opt]                                                                                                                            
    frame #11: 0x0000000100e772d8 libcoreclr.dylib`ManagedThreadBase_DispatchOuter(ManagedThreadCallState*) at threads.cpp:7504 [opt]                                                                                                         
    frame #12: 0x0000000100e7728c libcoreclr.dylib`ManagedThreadBase_DispatchOuter(pCallState=0x000000016f9eaef8) at threads.cpp:7528 [opt]                                                                                                   
    frame #13: 0x0000000100e77a48 libcoreclr.dylib`ManagedThreadBase::KickOff(void (*)(void*), void*) [inlined] ManagedThreadBase_FullTransition(pTarget=<unavailable>, args=<unavailable>, filterType=ManagedThread)(void*), void*, UnhandledExceptionLocation) at threads.cpp:7548:5 [opt]                                                                                                                                                                                                
    frame #14: 0x0000000100e77a34 libcoreclr.dylib`ManagedThreadBase::KickOff(pTarget=<unavailable>, args=<unavailable>)(void*), void*) at threads.cpp:7583 [opt]                                                                             
    frame #15: 0x0000000100ed9de8 libcoreclr.dylib`ThreadNative::KickOffThread(pass=0x000000015c83d010) at comsynchronizable.cpp:228:9 [opt]                                                                                                  
    frame #16: 0x0000000100cf9d44 libcoreclr.dylib`CorUnix::CPalThread::ThreadEntry(pvParam=0x000000015d020e00) at thread.cpp:1862:16 [opt]                                                                                                   
    frame #17: 0x000000018e146fd4 libsystem_pthread.dylib`_pthread_start + 320                                                                                                                                                                

/cc @janvorli

Also affects

JIT/Methodical/tailcall_v4/hijacking/hijacking.sh
JIT/Regression/CLR-x86-JIT/V1-M13-RTM/b99969/b99969/b99969.sh
sdmaclea commented 3 years ago

I will disable these in #50117

janvorli commented 3 years ago

@sdmaclea is the thread 10 the one with the OS id 0xebf02? The one that is marked as thread 4 in the managed thread list?

janvorli commented 3 years ago

Looking at the Flip method, I wonder if there is a possibility of missing memory barrier in the loop or some bad optimization that would not honor the fact that the s_timeUp is volatile. Can you please share the disass of that method?

sdmaclea commented 3 years ago
clru -gcinfo 0x00000002812bdb14                                                                       
Normal JIT generated code                                                                             
TestCSE.Flip()                                                                                        
ilAddr is 0000000100A64538 pImport is 0000000142E29BC0
Begin 00000002812BDAD8, size b8

/Users/stmaclea/git/runtime/src/tests/JIT/jit64/opt/cse/VolatileTest.cs @ 211:
Prolog size: 0
Security object: <none>
GS cookie: <none>
PSPSym: <none>
Generics inst context: <none>
PSP slot: <none>
GenericInst slot: <none>
Varargs: 0
Frame pointer: Fp
Has tailcalls: 0
Size of parameter area: 0
Return Kind: Scalar
Code size: b8
00000002812bdad8 fd7bbfa9             stp     x29, x30, [sp, #-0x10]!
00000002812bdadc fd030091             mov     x29, sp
00000008 interruptible
00000008 +X0
00000002812bdae0 01008052             mov     w1, #0x0
00000002812bdae4 62ce90d2             mov     x2, #0x8673
00000002812bdae8 421fb0f2             movk    x2, #0x80fa, lsl #16
00000002812bdaec 4200c0f2             movk    x2, #0x2, lsl #32
00000002812bdaf0 43fcdf08             ldarb   w3, [x2]
00000002812bdaf4 43040035             cbnz    w3, 0x2812bdb7c

/Users/stmaclea/git/runtime/src/tests/JIT/jit64/opt/cse/VolatileTest.cs @ 213:
00000002812bdaf8 039893d2             mov     x3, #0x9cc0
00000002812bdafc 8330a3f2             movk    x3, #0x1984, lsl #16
00000002812bdb00 2300c0f2             movk    x3, #0x1, lsl #32
00000002812bdb04 640040f9             ldr     x4, [x3]
00000030 +X4
00000002812bdb08 257c0153             lsr     w5, w1, #1
00000002812bdb0c a5781f53             lsl     w5, w5, #1
00000002812bdb10 2500054b             sub     w5, w1, w5
>>> 00000002812bdb14 860840b9             ldr     w6, [x4, #0x8]
00000002812bdb18 bf00066b             cmp     w5, w6
00000002812bdb1c 62030054             b.hs    0x2812bdb88
00000002812bdb20 a67c4093             sxtw    x6, w5
00000002812bdb24 c6f47ed3             lsl     x6, x6, #2
00000002812bdb28 070280d2             mov     x7, #0x10
00000002812bdb2c c600078b             add     x6, x6, x7
00000002812bdb30 846866b8             ldr     w4, [x4, x6]
0000005c -X4
00000002812bdb34 bf3b03d5             dmb     ish
00000002812bdb38 040800b9             str     w4, [x0, #0x8]

/Users/stmaclea/git/runtime/src/tests/JIT/jit64/opt/cse/VolatileTest.cs @ 215:
00000002812bdb3c 630040f9             ldr     x3, [x3]
00000068 +X3
00000002812bdb40 24008052             mov     w4, #0x1
00000002812bdb44 a400040b             add     w4, w5, w4
00000002812bdb48 650840b9             ldr     w5, [x3, #0x8]
00000002812bdb4c 9f00056b             cmp     w4, w5
00000002812bdb50 c2010054             b.hs    0x2812bdb88
00000002812bdb54 847c4093             sxtw    x4, w4
00000002812bdb58 84f47ed3             lsl     x4, x4, #2
00000002812bdb5c 8400078b             add     x4, x4, x7
00000002812bdb60 636864b8             ldr     w3, [x3, x4]
0000008c -X3
00000002812bdb64 bf3b03d5             dmb     ish
00000002812bdb68 030c00b9             str     w3, [x0, #0xc]

/Users/stmaclea/git/runtime/src/tests/JIT/jit64/opt/cse/VolatileTest.cs @ 211:
00000002812bdb6c 23008052             mov     w3, #0x1
00000002812bdb70 2100030b             add     w1, w1, w3
00000002812bdb74 43fcdf08             ldarb   w3, [x2]
00000002812bdb78 03fcff34             cbz     w3, 0x2812bdaf8

/Users/stmaclea/git/runtime/src/tests/JIT/jit64/opt/cse/VolatileTest.cs @ 217:
000000a4 not interruptible
000000a4 -X0
00000002812bdb7c fd7bc1a8             ldp     x29, x30, [sp], #0x10
00000002812bdb80 c0035fd6             ret     
000000ac interruptible
00000002812bdb84 e00ddcba             .long   0xbadc0de0
00000002812bdb88 e00ddcba             .long   0xbadc0de0
00000002812bdb8c e00ddcba             .long   0xbadc0de0
000000b8 not interruptible
sdmaclea commented 3 years ago

@sdmaclea is the thread 10 the one with the OS id 0xebf02? The one that is marked as thread 4 in the managed thread list?

That would be my guess. Per MikeM, that info is missing from the MachO core dumps.

sdmaclea commented 3 years ago
/Users/stmaclea/git/runtime/src/tests/JIT/jit64/opt/cse/VolatileTest.cs @ 68:                         
Prolog size: 0                                                                                        
Security object: <none>                                                                               
GS cookie: <none>                                                                                     
PSPSym: <none>                                                                                        
Generics inst context: <none>                                                                         
PSP slot: <none>                                                                                      
GenericInst slot: <none>                                                                              
Varargs: 0                                                                                            
Frame pointer: Fp                                                                                     
Has tailcalls: 0                                                                                      
Size of parameter area: 0                                                                             
Return Kind: Scalar                                                                                   
Code size: 294
00000002812bdba8 fd7bbba9             stp     x29, x30, [sp, #-0x50]!
00000002812bdbac f3d301a9             stp     x19, x20, [sp, #0x18]                                   
00000002812bdbb0 f5db02a9             stp     x21, x22, [sp, #0x28]
00000002812bdbb4 f7e303a9             stp     x23, x24, [sp, #0x38]                                   
00000002812bdbb8 f92700f9             str     x25, [sp, #0x48]                                        
00000002812bdbbc fd030091             mov     x29, sp   
00000002812bdbc0 f30300aa             mov     x19, x0                                                 
0000001c interruptible                                                                                
0000001c +X19                                                                                         
00000002812bdbc4 00008052             mov     w0, #0x0  
00000002812bdbc8 ae000094             bl      0x2812bde80                                             

/Users/stmaclea/git/runtime/src/tests/JIT/jit64/opt/cse/VolatileTest.cs @ 71:                         
00000002812bdbcc 80808052             mov     w0, #0x404
00000002812bdbd0 bf3b03d5             dmb     ish                                                     
00000002812bdbd4 600a00b9             str     w0, [x19, #0x8]                 

/Users/stmaclea/git/runtime/src/tests/JIT/jit64/opt/cse/VolatileTest.cs @ 72:
00000002812bdbd8 bf3b03d5             dmb     ish                                                     
00000002812bdbdc 600e00b9             str     w0, [x19, #0xc]                 

/Users/stmaclea/git/runtime/src/tests/JIT/jit64/opt/cse/VolatileTest.cs @ 73:                         
00000002812bdbe0 140080d2             mov     x20, #0x0 

/Users/stmaclea/git/runtime/src/tests/JIT/jit64/opt/cse/VolatileTest.cs @ 79:
00000002812bdbe4 600a40b9             ldr     w0, [x19, #0x8]
00000002812bdbe8 bf3903d5             dmb     ishld 
00000002812bdbec 610e40b9             ldr     w1, [x19, #0xc]
00000002812bdbf0 bf3903d5             dmb     ishld 
00000002812bdbf4 0000014b             sub     w0, w0, w1
00000002812bdbf8 610a40b9             ldr     w1, [x19, #0x8]
00000002812bdbfc bf3903d5             dmb     ishld 
00000002812bdc00 620e40b9             ldr     w2, [x19, #0xc]
00000002812bdc04 bf3903d5             dmb     ishld 
00000002812bdc08 2100024b             sub     w1, w1, w2
00000002812bdc0c 1f00016b             cmp     w0, w1
00000002812bdc10 21020054             b.ne    0x2812bdc54

/Users/stmaclea/git/runtime/src/tests/JIT/jit64/opt/cse/VolatileTest.cs @ 80:
00000002812bdc14 600a40b9             ldr     w0, [x19, #0x8]
00000002812bdc18 bf3903d5             dmb     ishld 
00000002812bdc1c 610e40b9             ldr     w1, [x19, #0xc]
00000002812bdc20 bf3903d5             dmb     ishld 
00000002812bdc24 0000014b             sub     w0, w0, w1
>>> 00000002812bdc28 e00ddcba             .long   0xbadc0de0
00000002812bdc2c e00ddcba             .long   0xbadc0de0
00000002812bdc30 e00ddcba             .long   0xbadc0de0
00000002812bdc34 e00ddcba             .long   0xbadc0de0
00000002812bdc38 e00ddcba             .long   0xbadc0de0
00000002812bdc3c e00ddcba             .long   0xbadc0de0
00000002812bdc40 e00ddcba             .long   0xbadc0de0
...
...
...
/Users/stmaclea/git/runtime/src/tests/JIT/jit64/opt/cse/VolatileTest.cs @ 205:                        
00000278 -X0                                                                                          
00000002812bde20 e00ddcba             .long   0xbadc0de0                                              
0000027c not interruptible                                                                            
00000002812bde24 f92740f9             ldr     x25, [sp, #0x48]                                        
00000002812bde28 f7e343a9             ldp     x23, x24, [sp, #0x38]                                   
00000002812bde2c f5db42a9             ldp     x21, x22, [sp, #0x28]
00000002812bde30 f3d341a9             ldp     x19, x20, [sp, #0x18]                                   
00000002812bde34 fd7bc5a8             ldp     x29, x30, [sp], #0x50                                   
00000002812bde38 c0035fd6             ret                                                             

In case it is helpful, this is Thread 1 too.

janvorli commented 3 years ago

Ah, I have missed that you've mentioned that the method is interruptible. Then it should not prevent GC suspension even if there was the issue I was thinking about. Then I wonder if there is a possible bug in the suspension of interruptible code. Either the g_safeActivationCheckFunction call returning false (the actual function is CheckActivationSafePoint) or the HandleGCSuspensionForInterruptedThread not working properly for interruptible code. The GC mode of the thread 10 is correct, as it is running managed code, so it needs to be cooperative.

Have you tried to repro the issue locally?

sdmaclea commented 3 years ago

I just managed to get the issue to repro in lldb. (It doesn't fail as often in lldb.)

I can confirm the thread 10 is the fourth managed thread

From the live process...

(lldb) clrthreads
ThreadCount:      4
UnstartedThread:  0
BackgroundThread: 2
PendingThread:    0
DeadThread:       0
Hosted Runtime:   no
                                                                                                            Lock  
 DBG   ID     OSID ThreadOBJ           State GC Mode     GC Alloc Context                  Domain           Count Apt Exception
   1    1    f86ee 0000000102818610    20020 Preemptive  0000000000000000:0000000000000000 0000000102808210 -00001 Ukn (GC) 
   7    2    f8712 0000000102821A10    21220 Preemptive  0000000000000000:0000000000000000 0000000102808210 -00001 Ukn (Finalizer) 
   8    3    f8714 0000000102827410    21220 Preemptive  0000000000000000:0000000000000000 0000000102808210 -00001 Ukn 
  10    4    f8718 000000010102DC10    21022 Cooperative 0000000000000000:0000000000000000 0000000102808210 -00001 Ukn 
(lldb) threads
*1 0xF86EE (1017582)
 2 0xF870D (1017613)
 3 0xF870E (1017614)
 4 0xF870F (1017615)
 5 0xF8710 (1017616)
 6 0xF8711 (1017617)
 7 0xF8712 (1017618)
 8 0xF8714 (1017620)
 9 0xF8717 (1017623)
 10 0xF8718 (1017624)
(lldb) thread select 10
* thread #10
    frame #0: 0x000000028132db48
->  0x28132db48: ldr    w5, [x3, #0x8]
    0x28132db4c: cmp    w4, w5
    0x28132db50: b.hs   0x28132db88
    0x28132db54: sxtw   x4, w4
(lldb) clrstack
OS Thread Id: 0xf8718 (10)
        Child SP               IP Call Site
00000001702E6AF0 000000028132DB48 TestCSE.Flip() [/Users/stmaclea/git/runtime/src/tests/JIT/jit64/opt/cse/VolatileTest.cs @ 215]
00000001702E6B00 000000028132D9EC System.Threading.Thread+StartHelper.RunWorker()
00000001702E6B50 000000028132D918 System.Threading.Thread+StartHelper.Run()
00000001702E6B80 000000028132D870 System.Threading.Thread.StartCallback()
00000001702E6DA0 0000000101c40d0c [DebuggerU2MCatchHandlerFrame: 00000001702e6da0] 
sdmaclea commented 3 years ago

I have spent some time debugging this. The root cause is that the SIGUSR1 signals are not being received reliably.

There are at least two issues.

I set a breakpoint here https://github.com/dotnet/runtime/blob/1adf9adab7f128becc9cd3550f4a604f87d9a7bc/src/coreclr/pal/src/exception/signal.cpp#L747

And it was hit.

Looking at the conditions for https://github.com/dotnet/runtime/blob/1adf9adab7f128becc9cd3550f4a604f87d9a7bc/src/coreclr/pal/src/exception/signal.cpp#L716-L722

The code shouldn't have ever reached 747. g_activationFunction was set appropriately and siginfo->si_pid == 0. I think there is a race between signal delivery and signaling.

If I modify the condition to ignore siginfo->si_pid the CheckActivationSafePoint and HandleGCSuspensionForInterruptedThread are sometimes hit.

However, the signal eventually stop being delivered. Apparently the kernel is not very tolerant of flooding of signals.

18b554  40.834417250 : `SYNC`               InjectGcSuspension InjectGcSuspension for thread 0x155038210
18b554  40.833156917 : `SYNC`               InjectGcSuspension InjectGcSuspension for thread 0x155038210
18b554  40.831898292 : `SYNC`               InjectGcSuspension InjectGcSuspension for thread 0x155038210
18b554  40.830643334 : `SYNC`               InjectGcSuspension InjectGcSuspension for thread 0x155038210
18b554  40.829385750 : `SYNC`               InjectGcSuspension InjectGcSuspension for thread 0x155038210
18b554  40.828128709 : `SYNC`               InjectGcSuspension InjectGcSuspension for thread 0x155038210
18b554  40.826851584 : `SYNC`               InjectGcSuspension InjectGcSuspension for thread 0x155038210
...
...
...
18b554   0.840398834 : `SYNC`               InjectGcSuspension InjectGcSuspension for thread 0x155038210
18b554   0.839140875 : `SYNC`               InjectGcSuspension InjectGcSuspension for thread 0x155038210
18b554   0.837882500 : `SYNC`               InjectGcSuspension InjectGcSuspension for thread 0x155038210
18b554   0.836622959 : `SYNC`               InjectGcSuspension InjectGcSuspension for thread 0x155038210
18b554   0.835364334 : `SYNC`               InjectGcSuspension InjectGcSuspension for thread 0x155038210
18b554   0.834083209 : `SYNC`               InjectGcSuspension InjectGcSuspension for thread 0x155038210
18b554   0.833838792 : `SYNC`               InjectGcSuspension InjectGcSuspension for thread 0x155038210
18b560   0.833715125 : `SYNC`               InjectGcSuspension InjectGcSuspension for thread 0x15600ac10
18b560   0.833520250 : `SYNC`               InjectGcSuspension InjectGcSuspension for thread 0x15600ac10
18b560   0.833319542 : `SYNC`               InjectGcSuspension InjectGcSuspension for thread 0x15600ac10
18b560   0.832895875 : `SYNC`               InjectGcSuspension InjectGcSuspension for thread 0x15600ac10
18b560   0.832688875 : `SYNC`               InjectGcSuspension InjectGcSuspension for thread 0x15600ac10
18b560   0.832467459 : `SYNC`               InjectGcSuspension InjectGcSuspension for thread 0x15600ac10
18b560   0.832268250 : `SYNC`               InjectGcSuspension InjectGcSuspension for thread 0x15600ac10
18b560   0.832044042 : `SYNC`               InjectGcSuspension InjectGcSuspension for thread 0x15600ac10
18b560   0.831842250 : `SYNC`               InjectGcSuspension InjectGcSuspension for thread 0x15600ac10
18b560   0.831638834 : `SYNC`               InjectGcSuspension InjectGcSuspension for thread 0x15600ac10
18b560   0.831427875 : `SYNC`               InjectGcSuspension InjectGcSuspension for thread 0x15600ac10
18b560   0.831233042 : `SYNC`               InjectGcSuspension InjectGcSuspension for thread 0x15600ac10
18b560   0.830820750 : `SYNC`               InjectGcSuspension InjectGcSuspension for thread 0x15600ac10
18b560   0.830587334 : `SYNC`               InjectGcSuspension InjectGcSuspension for thread 0x15600ac10
18b560   0.830365250 : `SYNC`               InjectGcSuspension InjectGcSuspension for thread 0x15600ac10
18b560   0.830125584 : `SYNC`               InjectGcSuspension InjectGcSuspension for thread 0x15600ac10
18b560   0.829898167 : `SYNC`               InjectGcSuspension InjectGcSuspension for thread 0x15600ac10
18b560   0.829700709 : `SYNC`               InjectGcSuspension InjectGcSuspension for thread 0x15600ac10
18b560   0.829482125 : `SYNC`               InjectGcSuspension InjectGcSuspension for thread 0x15600ac10
18b560   0.829261292 : `SYNC`               InjectGcSuspension InjectGcSuspension for thread 0x15600ac10
18b560   0.829037042 : `SYNC`               InjectGcSuspension InjectGcSuspension for thread 0x15600ac10
18b560   0.828849125 : `SYNC`               InjectGcSuspension InjectGcSuspension for thread 0x15600ac10
18b560   0.828648667 : `SYNC`               InjectGcSuspension InjectGcSuspension for thread 0x15600ac10
18b560   0.828414584 : `SYNC`               InjectGcSuspension InjectGcSuspension for thread 0x15600ac10
18b560   0.828209209 : `SYNC`               InjectGcSuspension InjectGcSuspension for thread 0x15600ac10
18b554   0.827932834 : `SYNC`               InjectGcSuspension InjectGcSuspension for thread 0x155038210
18b560   0.825339709 : `SYNC`               InjectGcSuspension InjectGcSuspension for thread 0x15600ac10
18b554   0.825135625 : `SYNC`               InjectGcSuspension InjectGcSuspension for thread 0x155038210
18b560   0.824734084 : `SYNC`               InjectGcSuspension InjectGcSuspension for thread 0x15600ac10
18b560   0.824184250 : `SYNC`               InjectGcSuspension InjectGcSuspension for thread 0x15600ac10
18b554   0.823864042 : `SYNC`               HandleGCSuspensionForInterruptedThread isAtSafePoint for thread 0x15600ac10
18b554   0.823863625 : `SYNC`               CheckActivationSafePoint ip=0x28126d5b8 current=1 thread=0x15600ac10 cfmc=1 r=1
18b554   0.823863417 : `SYNC`               CheckActivationSafePoint ip=0x28126d5b8 current=1 thread=0x15600ac10 cfmc=1 r=1
18b560   0.823855625 : `SYNC`               InjectGcSuspension InjectGcSuspension for thread 0x15600ac10
18b554   0.823281375 : `SYNC`               CheckActivationSafePoint ip=0x1013f1450 current=1 thread=0x15600ac10 cfmc=1 r=0
18b560   0.823259000 : `SYNC`               InjectGcSuspension InjectGcSuspension for thread 0x15600ac10
18b554   0.553450792 : `SYNC`               InjectGcSuspension InjectGcSuspension for thread 0x156014010
18b554   0.538785000 : `SYNC`               InjectGcSuspension InjectGcSuspension for thread 0x156014010
18b554   0.537529792 : `SYNC`               InjectGcSuspension InjectGcSuspension for thread 0x156014010
18b554   0.537142417 : `SYNC`               InjectGcSuspension InjectGcSuspension for thread 0x156014010
18b55b   0.536811375 : `SYNC`               HandleGCSuspensionForInterruptedThread isAtSafePoint for thread 0x156014010
18b55b   0.536810917 : `SYNC`               CheckActivationSafePoint ip=0x280d66a40 current=1 thread=0x156014010 cfmc=1 r=1
18b55b   0.536810834 : `SYNC`               CheckActivationSafePoint ip=0x280d66a40 current=1 thread=0x156014010 cfmc=1 r=1
18b554   0.536806375 : `SYNC`               InjectGcSuspension InjectGcSuspension for thread 0x156014010
18b554   0.536651500 : `SYNC`               HandleGCSuspensionForInterruptedThread isAtSafePoint for thread 0x15600ac10
18b554   0.536651042 : `SYNC`               CheckActivationSafePoint ip=0x280d656b8 current=1 thread=0x15600ac10 cfmc=1 r=1
18b554   0.536650917 : `SYNC`               CheckActivationSafePoint ip=0x280d656b8 current=1 thread=0x15600ac10 cfmc=1 r=1
18b55b   0.536644750 : `SYNC`               InjectGcSuspension InjectGcSuspension for thread 0x15600ac10
18b554   0.536438375 : `SYNC`               CheckActivationSafePoint ip=0x1013f1450 current=1 thread=0x15600ac10 cfmc=1 r=0
18b55b   0.536435917 : `SYNC`               InjectGcSuspension InjectGcSuspension for thread 0x15600ac10
18b554   0.536217667 : `SYNC`               CheckActivationSafePoint ip=0x18e159638 current=1 thread=0x15600ac10 cfmc=1 r=0
18b55b   0.536214459 : `SYNC`               InjectGcSuspension InjectGcSuspension for thread 0x15600ac10
18b554   0.535858250 : `SYNC`               CheckActivationSafePoint ip=0x1013e08c8 current=1 thread=0x15600ac10 cfmc=0 r=0
18b55b   0.535855959 : `SYNC`               InjectGcSuspension InjectGcSuspension for thread 0x15600ac10
18b554   0.535384167 : `SYNC`               CheckActivationSafePoint ip=0x1013f1450 current=1 thread=0x15600ac10 cfmc=1 r=0
18b55b   0.535376084 : `SYNC`               InjectGcSuspension InjectGcSuspension for thread 0x15600ac10
18b554   0.535165125 : `SYNC`               CheckActivationSafePoint ip=0x1013f1450 current=1 thread=0x15600ac10 cfmc=1 r=0
18b55b   0.535161000 : `SYNC`               InjectGcSuspension InjectGcSuspension for thread 0x15600ac10
18b55b   0.534950667 : `SYNC`               CheckActivationSafePoint ip=0x1017ab9b0 current=1 thread=0x156014010 cfmc=1 r=0
18b554   0.534948625 : `SYNC`               InjectGcSuspension InjectGcSuspension for thread 0x156014010
18b55b   0.534506709 : `SYNC`               CheckActivationSafePoint ip=0x1013f1450 current=1 thread=0x156014010 cfmc=1 r=0
18b554   0.534502584 : `SYNC`               InjectGcSuspension InjectGcSuspension for thread 0x156014010
18b554   0.534144209 : `SYNC`               HandleGCSuspensionForInterruptedThread isAtSafePoint for thread 0x15600ac10
18b554   0.534143875 : `SYNC`               CheckActivationSafePoint ip=0x280d6568c current=1 thread=0x15600ac10 cfmc=1 r=1
18b554   0.534143792 : `SYNC`               CheckActivationSafePoint ip=0x280d6568c current=1 thread=0x15600ac10 cfmc=1 r=1
18b55b   0.534139334 : `SYNC`               InjectGcSuspension InjectGcSuspension for thread 0x15600ac10
18b554   0.533924834 : `SYNC`               CheckActivationSafePoint ip=0x1013f1450 current=1 thread=0x15600ac10 cfmc=1 r=0
18b55b   0.533915375 : `SYNC`               InjectGcSuspension InjectGcSuspension for thread 0x15600ac10
janvorli commented 3 years ago

There is another strange thing - CheckActivationSafePoint being called repeatedly for the same IP at few places within few microseconds from each other. It looks as if the SIGUSR1 was not blocked for re-entering the handler, as it should. Unless I am missing something, by default, the signal being handled should be blocked until the signal handler returns or the signal is explicitly re-enabled using pthread_sigmask.

I guess it might be the core of the problem. I would suggest trying the following:

sdmaclea commented 3 years ago

So the second call to CheckActivationSafePoint occurs when its result is true and HandleGCSuspensionForInterruptedThread is called. Because of the _ASSERTE(CheckActivationSafePoint(ip, /* checkingCurrentThread */ TRUE)); in HandleGCSuspensionForInterruptedThread.

janvorli commented 3 years ago

Ah, ok, I've forgotten about that assert.

sdmaclea commented 3 years ago

I've incrementally tried a few more things.

I did not see any particular improvement.

I don't have any other ideas on how to work around/fix this issue.

I pushed the last debug code to https://github.com/sdmaclea/runtime/tree/debug50381.

We probably need to report this to Apple and/or revert #46657.

I think there are a few other tests failing with similar symptoms. I will look at these then start looking at other unrelated failures.

janvorli commented 3 years ago

Hmm, looking at the events you've logged again, it seems as if the thread has entered the suspension handler and stayed there, but the thread that runs the suspension didn't detect it and kept trying to send the signal. That would explain why the signal stops being delivered.

.
.
.
18b560   0.824184250 : `SYNC`               InjectGcSuspension InjectGcSuspension for thread 0x15600ac10
18b554   0.823864042 : `SYNC`               HandleGCSuspensionForInterruptedThread isAtSafePoint for thread 0x15600ac10

I would suggest adding a stress log in HandleGCSuspensionForInterruptedThread after the following:

pThread->PulseGCMode();

That would prove whether we are sitting in the HandleGCSuspensionForInterruptedThread which means properly suspended or not.

sdmaclea commented 3 years ago

In the core dumps the thread we are trying to interrupt shows up as still running managed code.

I created a feedback ticket to Apple.

k15tfu commented 3 years ago

@sdmaclea @janvorli Hi! Do you have any updates on this from Apple?

sdmaclea commented 3 years ago

I didn't get any feedback from Apple.

mangod9 commented 3 years ago

@janvorli, assume this is intermittent even with GCStress? If so should move to 7.

christianscheuer commented 2 years ago

Hi @janvorli,

I'm not sure if this is the same issue (I hope, if not please ignore), but our production code (450k lines app) runs into this on every single launch after upgrading to .NET6 from 3.1. Our code makes quite extensive use of threading, tasks etc. It appears that when the first GC is run the SuspendRuntime activation injection issue gets triggered (which for our app happens within the first few seconds of launching).

For release builds we build with latest NativeAOT 7 alpha bits and we don't see any errors at runtime there using the NativeAOT compiler/runtime. But during development, we use CoreCLR, and here, as mentioned, the error occurs – not intermittently but on every launch. We're using the official .NET 6 SDK 6.0.100 osx-x64.

Please note that this is running x64 code via Rosetta on an M1 chip. We're using the x64 dotnet SDK since our code is relying on external dynamic libs with x64 code.

Thread 29 Crashed:: .NET ThreadPool Worker
0   ???                                 0x7ff89478e940 ???
1   libsystem_kernel.dylib              0x7ff8042d3112 __pthread_kill + 10
2   libsystem_c.dylib                   0x7ff804255d10 abort + 123
3   libcoreclr.dylib                       0x10b0b78db PROCAbort + 43
4   libcoreclr.dylib                       0x10b08332c InjectActivationInternal(CorUnix::CPalThread*) + 44
5   libcoreclr.dylib                       0x10b0bdb0b PAL_InjectActivation + 235
6   libcoreclr.dylib                       0x10b2da55c ThreadSuspend::SuspendRuntime(ThreadSuspend::SUSPEND_REASON) + 492
7   libcoreclr.dylib                       0x10b2dc3b2 ThreadSuspend::SuspendEE(ThreadSuspend::SUSPEND_REASON) + 402
8   libcoreclr.dylib                       0x10b23c1b3 GCToEEInterface::SuspendEE(SUSPEND_REASON) + 35
9   libcoreclr.dylib                       0x10b37bdb6 WKS::GCHeap::GarbageCollectGeneration(unsigned int, gc_reason) + 470
10  libcoreclr.dylib                       0x10b37e033 WKS::gc_heap::try_allocate_more_space(alloc_context*, unsigned long, unsigned int, int) + 739
11  libcoreclr.dylib                       0x10b3a4000 WKS::GCHeap::Alloc(gc_alloc_context*, unsigned long, unsigned int) + 160
12  libcoreclr.dylib                       0x10b2412b9 AllocateString(unsigned int) + 201
13  libcoreclr.dylib                       0x10b25efad FramedAllocateString(unsigned int) + 141

Is this useful? Right now this issue is completely blocking us from adopting .NET 6, so I'd be happy to try to help isolate this and work with you on troubleshooting it if we can.

janvorli commented 2 years ago

@christianscheuer this looks rather like #58111. A longer stack trace would likely confirm that. Do you have more stack frames logged?

christianscheuer commented 2 years ago

Hi @janvorli. Thanks so much for the quick reply. Hm if I read #58111 correctly, that's about this happening on a macOS dispatch queue thread, correct? If you notice the header of the stacktrace, this is happening on a .NET ThreadPool Worker thread. Or maybe I'm misunderstanding?

The full stack trace is here (unfortunately, right now unsymbolicated, I'll have to figure out how to locate those symbols). ``` Thread 29 Crashed:: .NET ThreadPool Worker 0 ??? 0x7ff89478e940 ??? 1 libsystem_kernel.dylib 0x7ff8042d3112 __pthread_kill + 10 2 libsystem_c.dylib 0x7ff804255d10 abort + 123 3 libcoreclr.dylib 0x10b0b78db PROCAbort + 43 4 libcoreclr.dylib 0x10b08332c InjectActivationInternal(CorUnix::CPalThread*) + 44 5 libcoreclr.dylib 0x10b0bdb0b PAL_InjectActivation + 235 6 libcoreclr.dylib 0x10b2da55c ThreadSuspend::SuspendRuntime(ThreadSuspend::SUSPEND_REASON) + 492 7 libcoreclr.dylib 0x10b2dc3b2 ThreadSuspend::SuspendEE(ThreadSuspend::SUSPEND_REASON) + 402 8 libcoreclr.dylib 0x10b23c1b3 GCToEEInterface::SuspendEE(SUSPEND_REASON) + 35 9 libcoreclr.dylib 0x10b37bdb6 WKS::GCHeap::GarbageCollectGeneration(unsigned int, gc_reason) + 470 10 libcoreclr.dylib 0x10b37e033 WKS::gc_heap::try_allocate_more_space(alloc_context*, unsigned long, unsigned int, int) + 739 11 libcoreclr.dylib 0x10b3a4000 WKS::GCHeap::Alloc(gc_alloc_context*, unsigned long, unsigned int) + 160 12 libcoreclr.dylib 0x10b2412b9 AllocateString(unsigned int) + 201 13 libcoreclr.dylib 0x10b25efad FramedAllocateString(unsigned int) + 141 14 ??? 0x114eeef05 ??? 15 ??? 0x114ed9df1 ??? 16 ??? 0x114ed87b2 ??? 17 ??? 0x113737d92 ??? 18 ??? 0x11373699d ??? 19 ??? 0x114e90a82 ??? 20 ??? 0x1137359ee ??? 21 ??? 0x1133dca81 ??? 22 ??? 0x111902175 ??? 23 ??? 0x114ea4a69 ??? 24 ??? 0x1119022aa ??? 25 ??? 0x111902209 ??? 26 ??? 0x114e9385b ??? 27 ??? 0x114eabcd5 ??? 28 ??? 0x114e98e38 ??? 29 ??? 0x111902175 ??? 30 ??? 0x114ea4a69 ??? 31 ??? 0x1119022aa ??? 32 ??? 0x111902209 ??? 33 ??? 0x114e9385b ??? 34 ??? 0x114eeebb5 ??? 35 ??? 0x11192d3a6 ??? 36 ??? 0x11192d35d ??? 37 ??? 0x11373e53b ??? 38 ??? 0x111954e45 ??? 39 ??? 0x114ea4a69 ??? 40 ??? 0x111954f7a ??? 41 ??? 0x111954ed9 ??? 42 ??? 0x114e9385b ??? 43 ??? 0x114eeebb5 ??? 44 ??? 0x11192d3a6 ??? 45 ??? 0x11192d35d ??? 46 ??? 0x1137408e2 ??? 47 ??? 0x111954e45 ??? 48 ??? 0x114ea4a69 ??? 49 ??? 0x111954f7a ??? 50 ??? 0x111954ed9 ??? 51 ??? 0x114e9385b ??? 52 ??? 0x114eeebb5 ??? 53 ??? 0x11192a44f ??? 54 ??? 0x111789405 ??? 55 ??? 0x114e93a69 ??? 56 ??? 0x114eeebb5 ??? 57 ??? 0x11192d3a6 ??? 58 ??? 0x11192d35d ??? 59 ??? 0x114ef62c3 ??? 60 ??? 0x114f05fb5 ??? 61 ??? 0x114ea4a69 ??? 62 ??? 0x114f05d51 ??? 63 ??? 0x114f05b99 ??? 64 ??? 0x114ea4c61 ??? 65 ??? 0x114e9385b ??? 66 ??? 0x114eeebb5 ??? 67 ??? 0x11192acfa ??? 68 ??? 0x11192aa7a ??? 69 ??? 0x111789405 ??? 70 ??? 0x114e93a69 ??? 71 ??? 0x114eeebb5 ??? 72 ??? 0x11192d3a6 ??? 73 ??? 0x11192d35d ??? 74 ??? 0x114efb64e ??? 75 ??? 0x114f04b55 ??? 76 ??? 0x114ea4a69 ??? 77 ??? 0x114f048f1 ??? 78 ??? 0x114f04739 ??? 79 ??? 0x114ea4c61 ??? 80 ??? 0x114e9385b ??? 81 ??? 0x114eeebb5 ??? 82 ??? 0x11192acfa ??? 83 ??? 0x11192aa7a ??? 84 ??? 0x111789405 ??? 85 ??? 0x114e93a69 ??? 86 ??? 0x114eeebb5 ??? 87 ??? 0x11192d3a6 ??? 88 ??? 0x11192d35d ??? 89 ??? 0x114efc920 ??? 90 ??? 0x111954e45 ??? 91 ??? 0x114ea4a69 ??? 92 ??? 0x111954f7a ??? 93 ??? 0x111954ed9 ??? 94 ??? 0x114e9385b ??? 95 ??? 0x114eeebb5 ??? 96 ??? 0x11192d3a6 ??? 97 ??? 0x11192d35d ??? 98 ??? 0x114efea18 ??? 99 ??? 0x111954e45 ??? 100 ??? 0x114ea4a69 ??? 101 ??? 0x111954f7a ??? 102 ??? 0x111954ed9 ??? 103 ??? 0x114e9385b ??? 104 ??? 0x114eeebb5 ??? 105 ??? 0x11192d3a6 ??? 106 ??? 0x111832fc9 ??? 107 ??? 0x114effc80 ??? 108 ??? 0x114ea4a69 ??? 109 ??? 0x114eff6cf ??? 110 ??? 0x114eff189 ??? 111 ??? 0x114eb1030 ??? 112 ??? 0x111660edc ??? 113 ??? 0x114e8d5aa ??? 114 ??? 0x111783375 ??? 115 ??? 0x1117679fe ??? 116 libcoreclr.dylib 0x10b3aaa19 CallDescrWorkerInternal + 124 117 libcoreclr.dylib 0x10b1f551f DispatchCallSimple(unsigned long*, unsigned int, unsigned long, unsigned int) + 239 118 libcoreclr.dylib 0x10b20cd38 ThreadNative::KickOffThread_Worker(void*) + 136 119 libcoreclr.dylib 0x10b1bcf3e ManagedThreadBase_DispatchOuter(ManagedThreadCallState*) + 318 120 libcoreclr.dylib 0x10b1bd4e0 ManagedThreadBase::KickOff(void (*)(void*), void*) + 32 121 libcoreclr.dylib 0x10b20ce0f ThreadNative::KickOffThread(void*) + 175 122 libcoreclr.dylib 0x10b0bc9a7 CorUnix::CPalThread::ThreadEntry(void*) + 407 123 libsystem_pthread.dylib 0x7ff804309514 _pthread_start + 125 124 libsystem_pthread.dylib 0x7ff80430502f thread_start + 15 ```
janvorli commented 2 years ago

The problem is actually not the thread that executes the activation injection, but rather the target thread. I guess you have another thread running managed code on a dispatch queue thread that's causing this problem. Current .NET main has a temporary fix that ignores the failure and lets the thread get suspended cooperatively (which may take a significant time). Based on the Apple doc on the dispatch queue threads, running managed code seems to be a bad idea anyways, as the dispatch threads have more limitations than the one we are hitting. So if it is really your problem and you can prevent it, it would be the best fix for your issue. I am planning to try to implement a different way for thread suspension for GC that would not have the problem with pthread_kill, however I am not sure how soon I'll have time to look into it.

christianscheuer commented 2 years ago

Hi Jan,

Thanks again for the quick follow up. I was just testing this and everything works well with .NET5 (as it also did in .NET 3.1). Unfortunately, we're very reliant on dispatch queue threads because of various Apple APIs we need to integrate with, and yes you're right the target thread is likely the issue:

Thread 11::  Dispatch queue: org.soundflow.sfsx.backend.automationqueue
0   ???                                 0x7ff89478e940 ???
1   libsystem_kernel.dylib              0x7ff8042cf506 __psynch_cvwait + 10
2   libcoreclr.dylib                       0x10b0af8db CorUnix::CPalSynchronizationManager::ThreadNativeWait(CorUnix::_ThreadNativeWaitData*, unsigned int, CorUnix::ThreadWakeupReason*, unsigned int*) + 315
3   libcoreclr.dylib                       0x10b0af5aa CorUnix::CPalSynchronizationManager::BlockThread(CorUnix::CPalThread*, unsigned int, bool, bool, CorUnix::ThreadWakeupReason*, unsigned int*) + 458
4   libcoreclr.dylib                       0x10b0b3e2a CorUnix::InternalWaitForMultipleObjectsEx(CorUnix::CPalThread*, unsigned int, void* const*, int, unsigned int, int, int) + 1946
5   libcoreclr.dylib                       0x10b0b404d WaitForSingleObjectEx + 77
6   libcoreclr.dylib                       0x10b2d5118 CLREventBase::WaitEx(unsigned int, WaitMode, PendingSync*) + 200
7   libcoreclr.dylib                       0x10b2d9b58 Thread::RareDisablePreemptiveGC() + 344
8   libcoreclr.dylib                       0x10b15b519 UnsafeJitFunction(PrepareCodeConfig*, COR_ILMETHOD_DECODER*, CORJIT_FLAGS, unsigned int*) + 1337
9   libcoreclr.dylib                       0x10b1940f4 MethodDesc::JitCompileCodeLocked(PrepareCodeConfig*, ListLockEntryBase<NativeCodeVersion>*, unsigned int*, CORJIT_FLAGS*) + 308
10  libcoreclr.dylib                       0x10b193be6 MethodDesc::JitCompileCodeLockedEventWrapper(PrepareCodeConfig*, ListLockEntryBase<NativeCodeVersion>*) + 630
11  libcoreclr.dylib                       0x10b193306 MethodDesc::JitCompileCode(PrepareCodeConfig*) + 454
12  libcoreclr.dylib                       0x10b192e87 MethodDesc::PrepareILBasedCode(PrepareCodeConfig*) + 631
13  libcoreclr.dylib                       0x10b109095 CodeVersionManager::PublishVersionableCodeIfNecessary(MethodDesc*, CallerGCMode, bool*, bool*) + 517
14  libcoreclr.dylib                       0x10b195f94 MethodDesc::DoPrestub(MethodTable*, CallerGCMode) + 212
15  libcoreclr.dylib                       0x10b195a37 PreStubWorker + 535
16  libcoreclr.dylib                       0x10b3ab64b ThePreStub + 92
17  ???                                    0x114e67833 ???
18  ???                                    0x11374bf36 ???
19  ???                                    0x11374b30f ???
20  ???                                    0x11374b018 ???
21  ???                                    0x11374afc3 ???
22  ???                                    0x11374a088 ???
23  libdispatch.dylib                   0x7ff80414fcc9 _dispatch_client_callout + 8
24  libdispatch.dylib                   0x7ff804155cee _dispatch_lane_serial_drain + 696
25  libdispatch.dylib                   0x7ff8041567c8 _dispatch_lane_invoke + 366
26  libdispatch.dylib                   0x7ff8041607e1 _dispatch_workloop_worker_thread + 758
27  libsystem_pthread.dylib             0x7ff80430608f _pthread_wqthread + 326
28  libsystem_pthread.dylib             0x7ff80430501b start_wqthread + 15

Do you happen to know why this works (and has worked for 4+ years) reliably in .NET3.1 (both CoreCLR + CoreRT) and also appears to work reliably in .NET5 (CoreCLR) and works reliably in .NET6 (NativeAOT) but only not in CoreCLR .NET6?

Edit: Just to clarify this statement. The app we're migrating to .NET6 here has been used in production with CoreRT and used in debugging/development with CoreCLR on .NET3.1 since 2017 and never hitting any errors here.

janvorli commented 2 years ago

It is due to a change in activation injection that I've made in .NET 6. It was not using signals (and pthread_kill to send them) before. We were rather basically emulating async signal delivery to the thread by suspending the thread, manually modifying its context to redirect it to our suspension helper and then resuming the thread. But it turned out to be causing intermittent crashes when another process was sending async signals to the .NET process and the OS signal handling raced with the activation injection. The issue #44498 made us aware of the problem and in #46657 I've made the change.

To fix your problem, I'll try to get approval to port the #59045 that ignores the pthread_kill failure to .NET 6. But IIRC, the next update is not going to be out until February. So I wonder if you could use a modified .NET 6 runtime with that patch applied (just libcoreclr.dylib would need to be updated), built either by you or me. Except of not running .NET code on the dispatch queue threads, I don't see a workaround for the problem.

christianscheuer commented 2 years ago

Hi @janvorli,

That's awesome if you could get that approved! We can live with just upgrading to .NET5 for now and then wait until February (or whenever the next update comes out) to move on to .NET6. That timeframe is no problem at all.

I will definitely also do some research to see what it would take for us to move away from dispatch queue threads, but since we've been relying on them for so long, I am afraid I could be introducing subtle bugs if we do that too hastily, so it's nice to see if the scenario we have been relying on could be made to work again, so we can do any potential move away from DQ threads in our own time.

Really appreciate the quick & detailed responses. Thank you :)

davidwrighton commented 2 years ago

@christianscheuer, @janvorli's change was ported to the 6.0.2 release with #63393