Closed sdmaclea closed 2 years ago
I will disable these in #50117
@sdmaclea is the thread 10 the one with the OS id 0xebf02? The one that is marked as thread 4 in the managed thread list?
Looking at the Flip method, I wonder if there is a possibility of missing memory barrier in the loop or some bad optimization that would not honor the fact that the s_timeUp is volatile. Can you please share the disass of that method?
clru -gcinfo 0x00000002812bdb14
Normal JIT generated code
TestCSE.Flip()
ilAddr is 0000000100A64538 pImport is 0000000142E29BC0
Begin 00000002812BDAD8, size b8
/Users/stmaclea/git/runtime/src/tests/JIT/jit64/opt/cse/VolatileTest.cs @ 211:
Prolog size: 0
Security object: <none>
GS cookie: <none>
PSPSym: <none>
Generics inst context: <none>
PSP slot: <none>
GenericInst slot: <none>
Varargs: 0
Frame pointer: Fp
Has tailcalls: 0
Size of parameter area: 0
Return Kind: Scalar
Code size: b8
00000002812bdad8 fd7bbfa9 stp x29, x30, [sp, #-0x10]!
00000002812bdadc fd030091 mov x29, sp
00000008 interruptible
00000008 +X0
00000002812bdae0 01008052 mov w1, #0x0
00000002812bdae4 62ce90d2 mov x2, #0x8673
00000002812bdae8 421fb0f2 movk x2, #0x80fa, lsl #16
00000002812bdaec 4200c0f2 movk x2, #0x2, lsl #32
00000002812bdaf0 43fcdf08 ldarb w3, [x2]
00000002812bdaf4 43040035 cbnz w3, 0x2812bdb7c
/Users/stmaclea/git/runtime/src/tests/JIT/jit64/opt/cse/VolatileTest.cs @ 213:
00000002812bdaf8 039893d2 mov x3, #0x9cc0
00000002812bdafc 8330a3f2 movk x3, #0x1984, lsl #16
00000002812bdb00 2300c0f2 movk x3, #0x1, lsl #32
00000002812bdb04 640040f9 ldr x4, [x3]
00000030 +X4
00000002812bdb08 257c0153 lsr w5, w1, #1
00000002812bdb0c a5781f53 lsl w5, w5, #1
00000002812bdb10 2500054b sub w5, w1, w5
>>> 00000002812bdb14 860840b9 ldr w6, [x4, #0x8]
00000002812bdb18 bf00066b cmp w5, w6
00000002812bdb1c 62030054 b.hs 0x2812bdb88
00000002812bdb20 a67c4093 sxtw x6, w5
00000002812bdb24 c6f47ed3 lsl x6, x6, #2
00000002812bdb28 070280d2 mov x7, #0x10
00000002812bdb2c c600078b add x6, x6, x7
00000002812bdb30 846866b8 ldr w4, [x4, x6]
0000005c -X4
00000002812bdb34 bf3b03d5 dmb ish
00000002812bdb38 040800b9 str w4, [x0, #0x8]
/Users/stmaclea/git/runtime/src/tests/JIT/jit64/opt/cse/VolatileTest.cs @ 215:
00000002812bdb3c 630040f9 ldr x3, [x3]
00000068 +X3
00000002812bdb40 24008052 mov w4, #0x1
00000002812bdb44 a400040b add w4, w5, w4
00000002812bdb48 650840b9 ldr w5, [x3, #0x8]
00000002812bdb4c 9f00056b cmp w4, w5
00000002812bdb50 c2010054 b.hs 0x2812bdb88
00000002812bdb54 847c4093 sxtw x4, w4
00000002812bdb58 84f47ed3 lsl x4, x4, #2
00000002812bdb5c 8400078b add x4, x4, x7
00000002812bdb60 636864b8 ldr w3, [x3, x4]
0000008c -X3
00000002812bdb64 bf3b03d5 dmb ish
00000002812bdb68 030c00b9 str w3, [x0, #0xc]
/Users/stmaclea/git/runtime/src/tests/JIT/jit64/opt/cse/VolatileTest.cs @ 211:
00000002812bdb6c 23008052 mov w3, #0x1
00000002812bdb70 2100030b add w1, w1, w3
00000002812bdb74 43fcdf08 ldarb w3, [x2]
00000002812bdb78 03fcff34 cbz w3, 0x2812bdaf8
/Users/stmaclea/git/runtime/src/tests/JIT/jit64/opt/cse/VolatileTest.cs @ 217:
000000a4 not interruptible
000000a4 -X0
00000002812bdb7c fd7bc1a8 ldp x29, x30, [sp], #0x10
00000002812bdb80 c0035fd6 ret
000000ac interruptible
00000002812bdb84 e00ddcba .long 0xbadc0de0
00000002812bdb88 e00ddcba .long 0xbadc0de0
00000002812bdb8c e00ddcba .long 0xbadc0de0
000000b8 not interruptible
@sdmaclea is the thread 10 the one with the OS id 0xebf02? The one that is marked as thread 4 in the managed thread list?
That would be my guess. Per MikeM, that info is missing from the MachO core dumps.
/Users/stmaclea/git/runtime/src/tests/JIT/jit64/opt/cse/VolatileTest.cs @ 68:
Prolog size: 0
Security object: <none>
GS cookie: <none>
PSPSym: <none>
Generics inst context: <none>
PSP slot: <none>
GenericInst slot: <none>
Varargs: 0
Frame pointer: Fp
Has tailcalls: 0
Size of parameter area: 0
Return Kind: Scalar
Code size: 294
00000002812bdba8 fd7bbba9 stp x29, x30, [sp, #-0x50]!
00000002812bdbac f3d301a9 stp x19, x20, [sp, #0x18]
00000002812bdbb0 f5db02a9 stp x21, x22, [sp, #0x28]
00000002812bdbb4 f7e303a9 stp x23, x24, [sp, #0x38]
00000002812bdbb8 f92700f9 str x25, [sp, #0x48]
00000002812bdbbc fd030091 mov x29, sp
00000002812bdbc0 f30300aa mov x19, x0
0000001c interruptible
0000001c +X19
00000002812bdbc4 00008052 mov w0, #0x0
00000002812bdbc8 ae000094 bl 0x2812bde80
/Users/stmaclea/git/runtime/src/tests/JIT/jit64/opt/cse/VolatileTest.cs @ 71:
00000002812bdbcc 80808052 mov w0, #0x404
00000002812bdbd0 bf3b03d5 dmb ish
00000002812bdbd4 600a00b9 str w0, [x19, #0x8]
/Users/stmaclea/git/runtime/src/tests/JIT/jit64/opt/cse/VolatileTest.cs @ 72:
00000002812bdbd8 bf3b03d5 dmb ish
00000002812bdbdc 600e00b9 str w0, [x19, #0xc]
/Users/stmaclea/git/runtime/src/tests/JIT/jit64/opt/cse/VolatileTest.cs @ 73:
00000002812bdbe0 140080d2 mov x20, #0x0
/Users/stmaclea/git/runtime/src/tests/JIT/jit64/opt/cse/VolatileTest.cs @ 79:
00000002812bdbe4 600a40b9 ldr w0, [x19, #0x8]
00000002812bdbe8 bf3903d5 dmb ishld
00000002812bdbec 610e40b9 ldr w1, [x19, #0xc]
00000002812bdbf0 bf3903d5 dmb ishld
00000002812bdbf4 0000014b sub w0, w0, w1
00000002812bdbf8 610a40b9 ldr w1, [x19, #0x8]
00000002812bdbfc bf3903d5 dmb ishld
00000002812bdc00 620e40b9 ldr w2, [x19, #0xc]
00000002812bdc04 bf3903d5 dmb ishld
00000002812bdc08 2100024b sub w1, w1, w2
00000002812bdc0c 1f00016b cmp w0, w1
00000002812bdc10 21020054 b.ne 0x2812bdc54
/Users/stmaclea/git/runtime/src/tests/JIT/jit64/opt/cse/VolatileTest.cs @ 80:
00000002812bdc14 600a40b9 ldr w0, [x19, #0x8]
00000002812bdc18 bf3903d5 dmb ishld
00000002812bdc1c 610e40b9 ldr w1, [x19, #0xc]
00000002812bdc20 bf3903d5 dmb ishld
00000002812bdc24 0000014b sub w0, w0, w1
>>> 00000002812bdc28 e00ddcba .long 0xbadc0de0
00000002812bdc2c e00ddcba .long 0xbadc0de0
00000002812bdc30 e00ddcba .long 0xbadc0de0
00000002812bdc34 e00ddcba .long 0xbadc0de0
00000002812bdc38 e00ddcba .long 0xbadc0de0
00000002812bdc3c e00ddcba .long 0xbadc0de0
00000002812bdc40 e00ddcba .long 0xbadc0de0
...
...
...
/Users/stmaclea/git/runtime/src/tests/JIT/jit64/opt/cse/VolatileTest.cs @ 205:
00000278 -X0
00000002812bde20 e00ddcba .long 0xbadc0de0
0000027c not interruptible
00000002812bde24 f92740f9 ldr x25, [sp, #0x48]
00000002812bde28 f7e343a9 ldp x23, x24, [sp, #0x38]
00000002812bde2c f5db42a9 ldp x21, x22, [sp, #0x28]
00000002812bde30 f3d341a9 ldp x19, x20, [sp, #0x18]
00000002812bde34 fd7bc5a8 ldp x29, x30, [sp], #0x50
00000002812bde38 c0035fd6 ret
In case it is helpful, this is Thread 1 too.
Ah, I have missed that you've mentioned that the method is interruptible. Then it should not prevent GC suspension even if there was the issue I was thinking about. Then I wonder if there is a possible bug in the suspension of interruptible code. Either the g_safeActivationCheckFunction call returning false (the actual function is CheckActivationSafePoint) or the HandleGCSuspensionForInterruptedThread not working properly for interruptible code. The GC mode of the thread 10 is correct, as it is running managed code, so it needs to be cooperative.
Have you tried to repro the issue locally?
I just managed to get the issue to repro in lldb. (It doesn't fail as often in lldb.)
I can confirm the thread 10 is the fourth managed thread
From the live process...
(lldb) clrthreads
ThreadCount: 4
UnstartedThread: 0
BackgroundThread: 2
PendingThread: 0
DeadThread: 0
Hosted Runtime: no
Lock
DBG ID OSID ThreadOBJ State GC Mode GC Alloc Context Domain Count Apt Exception
1 1 f86ee 0000000102818610 20020 Preemptive 0000000000000000:0000000000000000 0000000102808210 -00001 Ukn (GC)
7 2 f8712 0000000102821A10 21220 Preemptive 0000000000000000:0000000000000000 0000000102808210 -00001 Ukn (Finalizer)
8 3 f8714 0000000102827410 21220 Preemptive 0000000000000000:0000000000000000 0000000102808210 -00001 Ukn
10 4 f8718 000000010102DC10 21022 Cooperative 0000000000000000:0000000000000000 0000000102808210 -00001 Ukn
(lldb) threads
*1 0xF86EE (1017582)
2 0xF870D (1017613)
3 0xF870E (1017614)
4 0xF870F (1017615)
5 0xF8710 (1017616)
6 0xF8711 (1017617)
7 0xF8712 (1017618)
8 0xF8714 (1017620)
9 0xF8717 (1017623)
10 0xF8718 (1017624)
(lldb) thread select 10
* thread #10
frame #0: 0x000000028132db48
-> 0x28132db48: ldr w5, [x3, #0x8]
0x28132db4c: cmp w4, w5
0x28132db50: b.hs 0x28132db88
0x28132db54: sxtw x4, w4
(lldb) clrstack
OS Thread Id: 0xf8718 (10)
Child SP IP Call Site
00000001702E6AF0 000000028132DB48 TestCSE.Flip() [/Users/stmaclea/git/runtime/src/tests/JIT/jit64/opt/cse/VolatileTest.cs @ 215]
00000001702E6B00 000000028132D9EC System.Threading.Thread+StartHelper.RunWorker()
00000001702E6B50 000000028132D918 System.Threading.Thread+StartHelper.Run()
00000001702E6B80 000000028132D870 System.Threading.Thread.StartCallback()
00000001702E6DA0 0000000101c40d0c [DebuggerU2MCatchHandlerFrame: 00000001702e6da0]
I have spent some time debugging this. The root cause is that the SIGUSR1
signals are not being received reliably.
There are at least two issues.
I set a breakpoint here https://github.com/dotnet/runtime/blob/1adf9adab7f128becc9cd3550f4a604f87d9a7bc/src/coreclr/pal/src/exception/signal.cpp#L747
And it was hit.
Looking at the conditions for https://github.com/dotnet/runtime/blob/1adf9adab7f128becc9cd3550f4a604f87d9a7bc/src/coreclr/pal/src/exception/signal.cpp#L716-L722
The code shouldn't have ever reached 747. g_activationFunction
was set appropriately and siginfo->si_pid == 0
. I think there is a race between signal delivery and signaling.
If I modify the condition to ignore siginfo->si_pid
the CheckActivationSafePoint
and HandleGCSuspensionForInterruptedThread
are sometimes hit.
However, the signal eventually stop being delivered. Apparently the kernel is not very tolerant of flooding of signals.
18b554 40.834417250 : `SYNC` InjectGcSuspension InjectGcSuspension for thread 0x155038210
18b554 40.833156917 : `SYNC` InjectGcSuspension InjectGcSuspension for thread 0x155038210
18b554 40.831898292 : `SYNC` InjectGcSuspension InjectGcSuspension for thread 0x155038210
18b554 40.830643334 : `SYNC` InjectGcSuspension InjectGcSuspension for thread 0x155038210
18b554 40.829385750 : `SYNC` InjectGcSuspension InjectGcSuspension for thread 0x155038210
18b554 40.828128709 : `SYNC` InjectGcSuspension InjectGcSuspension for thread 0x155038210
18b554 40.826851584 : `SYNC` InjectGcSuspension InjectGcSuspension for thread 0x155038210
...
...
...
18b554 0.840398834 : `SYNC` InjectGcSuspension InjectGcSuspension for thread 0x155038210
18b554 0.839140875 : `SYNC` InjectGcSuspension InjectGcSuspension for thread 0x155038210
18b554 0.837882500 : `SYNC` InjectGcSuspension InjectGcSuspension for thread 0x155038210
18b554 0.836622959 : `SYNC` InjectGcSuspension InjectGcSuspension for thread 0x155038210
18b554 0.835364334 : `SYNC` InjectGcSuspension InjectGcSuspension for thread 0x155038210
18b554 0.834083209 : `SYNC` InjectGcSuspension InjectGcSuspension for thread 0x155038210
18b554 0.833838792 : `SYNC` InjectGcSuspension InjectGcSuspension for thread 0x155038210
18b560 0.833715125 : `SYNC` InjectGcSuspension InjectGcSuspension for thread 0x15600ac10
18b560 0.833520250 : `SYNC` InjectGcSuspension InjectGcSuspension for thread 0x15600ac10
18b560 0.833319542 : `SYNC` InjectGcSuspension InjectGcSuspension for thread 0x15600ac10
18b560 0.832895875 : `SYNC` InjectGcSuspension InjectGcSuspension for thread 0x15600ac10
18b560 0.832688875 : `SYNC` InjectGcSuspension InjectGcSuspension for thread 0x15600ac10
18b560 0.832467459 : `SYNC` InjectGcSuspension InjectGcSuspension for thread 0x15600ac10
18b560 0.832268250 : `SYNC` InjectGcSuspension InjectGcSuspension for thread 0x15600ac10
18b560 0.832044042 : `SYNC` InjectGcSuspension InjectGcSuspension for thread 0x15600ac10
18b560 0.831842250 : `SYNC` InjectGcSuspension InjectGcSuspension for thread 0x15600ac10
18b560 0.831638834 : `SYNC` InjectGcSuspension InjectGcSuspension for thread 0x15600ac10
18b560 0.831427875 : `SYNC` InjectGcSuspension InjectGcSuspension for thread 0x15600ac10
18b560 0.831233042 : `SYNC` InjectGcSuspension InjectGcSuspension for thread 0x15600ac10
18b560 0.830820750 : `SYNC` InjectGcSuspension InjectGcSuspension for thread 0x15600ac10
18b560 0.830587334 : `SYNC` InjectGcSuspension InjectGcSuspension for thread 0x15600ac10
18b560 0.830365250 : `SYNC` InjectGcSuspension InjectGcSuspension for thread 0x15600ac10
18b560 0.830125584 : `SYNC` InjectGcSuspension InjectGcSuspension for thread 0x15600ac10
18b560 0.829898167 : `SYNC` InjectGcSuspension InjectGcSuspension for thread 0x15600ac10
18b560 0.829700709 : `SYNC` InjectGcSuspension InjectGcSuspension for thread 0x15600ac10
18b560 0.829482125 : `SYNC` InjectGcSuspension InjectGcSuspension for thread 0x15600ac10
18b560 0.829261292 : `SYNC` InjectGcSuspension InjectGcSuspension for thread 0x15600ac10
18b560 0.829037042 : `SYNC` InjectGcSuspension InjectGcSuspension for thread 0x15600ac10
18b560 0.828849125 : `SYNC` InjectGcSuspension InjectGcSuspension for thread 0x15600ac10
18b560 0.828648667 : `SYNC` InjectGcSuspension InjectGcSuspension for thread 0x15600ac10
18b560 0.828414584 : `SYNC` InjectGcSuspension InjectGcSuspension for thread 0x15600ac10
18b560 0.828209209 : `SYNC` InjectGcSuspension InjectGcSuspension for thread 0x15600ac10
18b554 0.827932834 : `SYNC` InjectGcSuspension InjectGcSuspension for thread 0x155038210
18b560 0.825339709 : `SYNC` InjectGcSuspension InjectGcSuspension for thread 0x15600ac10
18b554 0.825135625 : `SYNC` InjectGcSuspension InjectGcSuspension for thread 0x155038210
18b560 0.824734084 : `SYNC` InjectGcSuspension InjectGcSuspension for thread 0x15600ac10
18b560 0.824184250 : `SYNC` InjectGcSuspension InjectGcSuspension for thread 0x15600ac10
18b554 0.823864042 : `SYNC` HandleGCSuspensionForInterruptedThread isAtSafePoint for thread 0x15600ac10
18b554 0.823863625 : `SYNC` CheckActivationSafePoint ip=0x28126d5b8 current=1 thread=0x15600ac10 cfmc=1 r=1
18b554 0.823863417 : `SYNC` CheckActivationSafePoint ip=0x28126d5b8 current=1 thread=0x15600ac10 cfmc=1 r=1
18b560 0.823855625 : `SYNC` InjectGcSuspension InjectGcSuspension for thread 0x15600ac10
18b554 0.823281375 : `SYNC` CheckActivationSafePoint ip=0x1013f1450 current=1 thread=0x15600ac10 cfmc=1 r=0
18b560 0.823259000 : `SYNC` InjectGcSuspension InjectGcSuspension for thread 0x15600ac10
18b554 0.553450792 : `SYNC` InjectGcSuspension InjectGcSuspension for thread 0x156014010
18b554 0.538785000 : `SYNC` InjectGcSuspension InjectGcSuspension for thread 0x156014010
18b554 0.537529792 : `SYNC` InjectGcSuspension InjectGcSuspension for thread 0x156014010
18b554 0.537142417 : `SYNC` InjectGcSuspension InjectGcSuspension for thread 0x156014010
18b55b 0.536811375 : `SYNC` HandleGCSuspensionForInterruptedThread isAtSafePoint for thread 0x156014010
18b55b 0.536810917 : `SYNC` CheckActivationSafePoint ip=0x280d66a40 current=1 thread=0x156014010 cfmc=1 r=1
18b55b 0.536810834 : `SYNC` CheckActivationSafePoint ip=0x280d66a40 current=1 thread=0x156014010 cfmc=1 r=1
18b554 0.536806375 : `SYNC` InjectGcSuspension InjectGcSuspension for thread 0x156014010
18b554 0.536651500 : `SYNC` HandleGCSuspensionForInterruptedThread isAtSafePoint for thread 0x15600ac10
18b554 0.536651042 : `SYNC` CheckActivationSafePoint ip=0x280d656b8 current=1 thread=0x15600ac10 cfmc=1 r=1
18b554 0.536650917 : `SYNC` CheckActivationSafePoint ip=0x280d656b8 current=1 thread=0x15600ac10 cfmc=1 r=1
18b55b 0.536644750 : `SYNC` InjectGcSuspension InjectGcSuspension for thread 0x15600ac10
18b554 0.536438375 : `SYNC` CheckActivationSafePoint ip=0x1013f1450 current=1 thread=0x15600ac10 cfmc=1 r=0
18b55b 0.536435917 : `SYNC` InjectGcSuspension InjectGcSuspension for thread 0x15600ac10
18b554 0.536217667 : `SYNC` CheckActivationSafePoint ip=0x18e159638 current=1 thread=0x15600ac10 cfmc=1 r=0
18b55b 0.536214459 : `SYNC` InjectGcSuspension InjectGcSuspension for thread 0x15600ac10
18b554 0.535858250 : `SYNC` CheckActivationSafePoint ip=0x1013e08c8 current=1 thread=0x15600ac10 cfmc=0 r=0
18b55b 0.535855959 : `SYNC` InjectGcSuspension InjectGcSuspension for thread 0x15600ac10
18b554 0.535384167 : `SYNC` CheckActivationSafePoint ip=0x1013f1450 current=1 thread=0x15600ac10 cfmc=1 r=0
18b55b 0.535376084 : `SYNC` InjectGcSuspension InjectGcSuspension for thread 0x15600ac10
18b554 0.535165125 : `SYNC` CheckActivationSafePoint ip=0x1013f1450 current=1 thread=0x15600ac10 cfmc=1 r=0
18b55b 0.535161000 : `SYNC` InjectGcSuspension InjectGcSuspension for thread 0x15600ac10
18b55b 0.534950667 : `SYNC` CheckActivationSafePoint ip=0x1017ab9b0 current=1 thread=0x156014010 cfmc=1 r=0
18b554 0.534948625 : `SYNC` InjectGcSuspension InjectGcSuspension for thread 0x156014010
18b55b 0.534506709 : `SYNC` CheckActivationSafePoint ip=0x1013f1450 current=1 thread=0x156014010 cfmc=1 r=0
18b554 0.534502584 : `SYNC` InjectGcSuspension InjectGcSuspension for thread 0x156014010
18b554 0.534144209 : `SYNC` HandleGCSuspensionForInterruptedThread isAtSafePoint for thread 0x15600ac10
18b554 0.534143875 : `SYNC` CheckActivationSafePoint ip=0x280d6568c current=1 thread=0x15600ac10 cfmc=1 r=1
18b554 0.534143792 : `SYNC` CheckActivationSafePoint ip=0x280d6568c current=1 thread=0x15600ac10 cfmc=1 r=1
18b55b 0.534139334 : `SYNC` InjectGcSuspension InjectGcSuspension for thread 0x15600ac10
18b554 0.533924834 : `SYNC` CheckActivationSafePoint ip=0x1013f1450 current=1 thread=0x15600ac10 cfmc=1 r=0
18b55b 0.533915375 : `SYNC` InjectGcSuspension InjectGcSuspension for thread 0x15600ac10
There is another strange thing - CheckActivationSafePoint
being called repeatedly for the same IP at few places within few microseconds from each other. It looks as if the SIGUSR1
was not blocked for re-entering the handler, as it should. Unless I am missing something, by default, the signal being handled should be blocked until the signal handler returns or the signal is explicitly re-enabled using pthread_sigmask
.
I guess it might be the core of the problem. I would suggest trying the following:
inject_activation_handler
entry (it can be queried using the pthread_sigmask
) to see if the kernel just forgets to block itinject_activation_handler
. This check would have to use TLS variable, which should not be used in async signal handlers in general, but if that variable is touched at thread creation time, it should be safe.So the second call to CheckActivationSafePoint
occurs when its result is true
and HandleGCSuspensionForInterruptedThread
is called. Because of the _ASSERTE(CheckActivationSafePoint(ip, /* checkingCurrentThread */ TRUE));
in HandleGCSuspensionForInterruptedThread
.
Ah, ok, I've forgotten about that assert.
I've incrementally tried a few more things.
siginfo->si_pid
ThreadState::TS_HijackRequested
to limit SIGUSR1 signals to each thread.SA_NODEFER
to avoid a hypothetical race with sigprocmask
I did not see any particular improvement.
I don't have any other ideas on how to work around/fix this issue.
I pushed the last debug code to https://github.com/sdmaclea/runtime/tree/debug50381.
We probably need to report this to Apple and/or revert #46657.
I think there are a few other tests failing with similar symptoms. I will look at these then start looking at other unrelated failures.
Hmm, looking at the events you've logged again, it seems as if the thread has entered the suspension handler and stayed there, but the thread that runs the suspension didn't detect it and kept trying to send the signal. That would explain why the signal stops being delivered.
.
.
.
18b560 0.824184250 : `SYNC` InjectGcSuspension InjectGcSuspension for thread 0x15600ac10
18b554 0.823864042 : `SYNC` HandleGCSuspensionForInterruptedThread isAtSafePoint for thread 0x15600ac10
I would suggest adding a stress log in HandleGCSuspensionForInterruptedThread
after the following:
pThread->PulseGCMode();
That would prove whether we are sitting in the HandleGCSuspensionForInterruptedThread
which means properly suspended or not.
In the core dumps the thread we are trying to interrupt shows up as still running managed code.
I created a feedback ticket to Apple.
@sdmaclea @janvorli Hi! Do you have any updates on this from Apple?
I didn't get any feedback from Apple.
@janvorli, assume this is intermittent even with GCStress? If so should move to 7.
Hi @janvorli,
I'm not sure if this is the same issue (I hope, if not please ignore), but our production code (450k lines app) runs into this on every single launch after upgrading to .NET6 from 3.1. Our code makes quite extensive use of threading, tasks etc. It appears that when the first GC is run the SuspendRuntime activation injection issue gets triggered (which for our app happens within the first few seconds of launching).
For release builds we build with latest NativeAOT 7 alpha bits and we don't see any errors at runtime there using the NativeAOT compiler/runtime. But during development, we use CoreCLR, and here, as mentioned, the error occurs – not intermittently but on every launch. We're using the official .NET 6 SDK 6.0.100 osx-x64.
Please note that this is running x64 code via Rosetta on an M1 chip. We're using the x64 dotnet SDK since our code is relying on external dynamic libs with x64 code.
Thread 29 Crashed:: .NET ThreadPool Worker
0 ??? 0x7ff89478e940 ???
1 libsystem_kernel.dylib 0x7ff8042d3112 __pthread_kill + 10
2 libsystem_c.dylib 0x7ff804255d10 abort + 123
3 libcoreclr.dylib 0x10b0b78db PROCAbort + 43
4 libcoreclr.dylib 0x10b08332c InjectActivationInternal(CorUnix::CPalThread*) + 44
5 libcoreclr.dylib 0x10b0bdb0b PAL_InjectActivation + 235
6 libcoreclr.dylib 0x10b2da55c ThreadSuspend::SuspendRuntime(ThreadSuspend::SUSPEND_REASON) + 492
7 libcoreclr.dylib 0x10b2dc3b2 ThreadSuspend::SuspendEE(ThreadSuspend::SUSPEND_REASON) + 402
8 libcoreclr.dylib 0x10b23c1b3 GCToEEInterface::SuspendEE(SUSPEND_REASON) + 35
9 libcoreclr.dylib 0x10b37bdb6 WKS::GCHeap::GarbageCollectGeneration(unsigned int, gc_reason) + 470
10 libcoreclr.dylib 0x10b37e033 WKS::gc_heap::try_allocate_more_space(alloc_context*, unsigned long, unsigned int, int) + 739
11 libcoreclr.dylib 0x10b3a4000 WKS::GCHeap::Alloc(gc_alloc_context*, unsigned long, unsigned int) + 160
12 libcoreclr.dylib 0x10b2412b9 AllocateString(unsigned int) + 201
13 libcoreclr.dylib 0x10b25efad FramedAllocateString(unsigned int) + 141
Is this useful? Right now this issue is completely blocking us from adopting .NET 6, so I'd be happy to try to help isolate this and work with you on troubleshooting it if we can.
@christianscheuer this looks rather like #58111. A longer stack trace would likely confirm that. Do you have more stack frames logged?
Hi @janvorli. Thanks so much for the quick reply. Hm if I read #58111 correctly, that's about this happening on a macOS dispatch queue thread, correct? If you notice the header of the stacktrace, this is happening on a .NET ThreadPool Worker thread. Or maybe I'm misunderstanding?
The problem is actually not the thread that executes the activation injection, but rather the target thread. I guess you have another thread running managed code on a dispatch queue thread that's causing this problem. Current .NET main has a temporary fix that ignores the failure and lets the thread get suspended cooperatively (which may take a significant time).
Based on the Apple doc on the dispatch queue threads, running managed code seems to be a bad idea anyways, as the dispatch threads have more limitations than the one we are hitting. So if it is really your problem and you can prevent it, it would be the best fix for your issue.
I am planning to try to implement a different way for thread suspension for GC that would not have the problem with pthread_kill
, however I am not sure how soon I'll have time to look into it.
Hi Jan,
Thanks again for the quick follow up. I was just testing this and everything works well with .NET5 (as it also did in .NET 3.1). Unfortunately, we're very reliant on dispatch queue threads because of various Apple APIs we need to integrate with, and yes you're right the target thread is likely the issue:
Thread 11:: Dispatch queue: org.soundflow.sfsx.backend.automationqueue
0 ??? 0x7ff89478e940 ???
1 libsystem_kernel.dylib 0x7ff8042cf506 __psynch_cvwait + 10
2 libcoreclr.dylib 0x10b0af8db CorUnix::CPalSynchronizationManager::ThreadNativeWait(CorUnix::_ThreadNativeWaitData*, unsigned int, CorUnix::ThreadWakeupReason*, unsigned int*) + 315
3 libcoreclr.dylib 0x10b0af5aa CorUnix::CPalSynchronizationManager::BlockThread(CorUnix::CPalThread*, unsigned int, bool, bool, CorUnix::ThreadWakeupReason*, unsigned int*) + 458
4 libcoreclr.dylib 0x10b0b3e2a CorUnix::InternalWaitForMultipleObjectsEx(CorUnix::CPalThread*, unsigned int, void* const*, int, unsigned int, int, int) + 1946
5 libcoreclr.dylib 0x10b0b404d WaitForSingleObjectEx + 77
6 libcoreclr.dylib 0x10b2d5118 CLREventBase::WaitEx(unsigned int, WaitMode, PendingSync*) + 200
7 libcoreclr.dylib 0x10b2d9b58 Thread::RareDisablePreemptiveGC() + 344
8 libcoreclr.dylib 0x10b15b519 UnsafeJitFunction(PrepareCodeConfig*, COR_ILMETHOD_DECODER*, CORJIT_FLAGS, unsigned int*) + 1337
9 libcoreclr.dylib 0x10b1940f4 MethodDesc::JitCompileCodeLocked(PrepareCodeConfig*, ListLockEntryBase<NativeCodeVersion>*, unsigned int*, CORJIT_FLAGS*) + 308
10 libcoreclr.dylib 0x10b193be6 MethodDesc::JitCompileCodeLockedEventWrapper(PrepareCodeConfig*, ListLockEntryBase<NativeCodeVersion>*) + 630
11 libcoreclr.dylib 0x10b193306 MethodDesc::JitCompileCode(PrepareCodeConfig*) + 454
12 libcoreclr.dylib 0x10b192e87 MethodDesc::PrepareILBasedCode(PrepareCodeConfig*) + 631
13 libcoreclr.dylib 0x10b109095 CodeVersionManager::PublishVersionableCodeIfNecessary(MethodDesc*, CallerGCMode, bool*, bool*) + 517
14 libcoreclr.dylib 0x10b195f94 MethodDesc::DoPrestub(MethodTable*, CallerGCMode) + 212
15 libcoreclr.dylib 0x10b195a37 PreStubWorker + 535
16 libcoreclr.dylib 0x10b3ab64b ThePreStub + 92
17 ??? 0x114e67833 ???
18 ??? 0x11374bf36 ???
19 ??? 0x11374b30f ???
20 ??? 0x11374b018 ???
21 ??? 0x11374afc3 ???
22 ??? 0x11374a088 ???
23 libdispatch.dylib 0x7ff80414fcc9 _dispatch_client_callout + 8
24 libdispatch.dylib 0x7ff804155cee _dispatch_lane_serial_drain + 696
25 libdispatch.dylib 0x7ff8041567c8 _dispatch_lane_invoke + 366
26 libdispatch.dylib 0x7ff8041607e1 _dispatch_workloop_worker_thread + 758
27 libsystem_pthread.dylib 0x7ff80430608f _pthread_wqthread + 326
28 libsystem_pthread.dylib 0x7ff80430501b start_wqthread + 15
Do you happen to know why this works (and has worked for 4+ years) reliably in .NET3.1 (both CoreCLR + CoreRT) and also appears to work reliably in .NET5 (CoreCLR) and works reliably in .NET6 (NativeAOT) but only not in CoreCLR .NET6?
Edit: Just to clarify this statement. The app we're migrating to .NET6 here has been used in production with CoreRT and used in debugging/development with CoreCLR on .NET3.1 since 2017 and never hitting any errors here.
It is due to a change in activation injection that I've made in .NET 6. It was not using signals (and pthread_kill
to send them) before. We were rather basically emulating async signal delivery to the thread by suspending the thread, manually modifying its context to redirect it to our suspension helper and then resuming the thread. But it turned out to be causing intermittent crashes when another process was sending async signals to the .NET process and the OS signal handling raced with the activation injection. The issue #44498 made us aware of the problem and in #46657 I've made the change.
To fix your problem, I'll try to get approval to port the #59045 that ignores the pthread_kill
failure to .NET 6. But IIRC, the next update is not going to be out until February. So I wonder if you could use a modified .NET 6 runtime with that patch applied (just libcoreclr.dylib would need to be updated), built either by you or me. Except of not running .NET code on the dispatch queue threads, I don't see a workaround for the problem.
Hi @janvorli,
That's awesome if you could get that approved! We can live with just upgrading to .NET5 for now and then wait until February (or whenever the next update comes out) to move on to .NET6. That timeframe is no problem at all.
I will definitely also do some research to see what it would take for us to move away from dispatch queue threads, but since we've been relying on them for so long, I am afraid I could be introducing subtle bugs if we do that too hastily, so it's nice to see if the scenario we have been relying on could be made to work again, so we can do any potential move away from DQ threads in our own time.
Really appreciate the quick & detailed responses. Thank you :)
@christianscheuer, @janvorli's change was ported to the 6.0.2 release with #63393
The JIT/jit64/opt/cse/VolatileTestop* family of tests sometimes deadlock in
ThreadSuspend::SuspendRuntime(...)
on Apple Silicon when inGCStress=c
mode.Looking at a core dump.
Thread 1 --
TestOp()
times out trying to suspend for GC. Thread 8 --TieredCompilationManager::TryDeactivateTieringDelay()
is constructing aGCCoop()
and while waiting in native code for a pthread cond wait. I tlooks like it is trying to force Thread 1 pThread=0x000000015c82f210
into preemptive mode. Thread 10 -- Is runningFlip()
. It is running GC interruptible code.Looks like the GC Mode for thread 10 is incorrect.
/cc @janvorli
Also affects