Closed filipnavara closed 2 years ago
Dump: https://1drv.ms/u/s!AgaMhbe7wFq8m41s3OXKH34CLOoa7Q?e=q931lK
cc @janvorli
I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.
@filipnavara I can see that there is managed code executed on a worker queue thread. Apple disallows sending async signals to such threads, so we cannot interrupt these and GC has to rely on those threads to sync with GC explicitly. If that doesn't happen, the runtime suspension would hang. I can see two threads running managed code on a worker queue thread in your thread sample. One of them is already suspended, but the other (Thread_350741) is not. Its call stack doesn't have managed code, but to hit the problem, it would be sufficient if it was ever executing managed code before. I should be able to see that from the dump once I get to it.
We have discussed this problem with the Apple threading team recently and they have came up with a solution to the problem. But that will require us to change the way we suspend runtime and it will work only on new macOS after they make a specific change.
Would there be a way for your app to not to run managed code on a worker queue thread?
Anyways, I'll investigate the dump you've shared later today to confirm that it is really what's causing the trouble here.
There are several things executed on the worker threads. One of them is some background indexing scheduled through NSBackgroundActivityScheduler
which is thread Thread_349538
that got suspended properly. We have alternative code path on Windows to run that. The other thing are callbacks from WebView
and NSUrlSession
. These callbacks are scheduled by Apple APIs and there doesn't seem to be a simple workaround for that.
Btw, I still have the process running so if there's anything else that I can collect I'd be happy to do so.
Its call stack doesn't have managed code, but to hit the problem, it would be sufficient if it was ever executing managed code before.
Why is that? If the thread is not executing managed code anymore, it should be in preemptive mode and considered to be suspended. The thread suspension may take a little bit longer in this case, but it should not be causing the process to freeze.
it should be in preemptive mode and considered to be suspended
The preemptive mode doesn't work on dispatch/worker threads. Ref: https://github.com/dotnet/runtime/pull/63393
That said, I would still expect it not to freeze on that condition. I am well aware that I need to depend on coop suspension on the dispatch threads [when managed code is on the stack] and that it may cause the GC pauses to take longer than usual.
(Btw, in the state hit due to #63393 it ends up in busy loop and takes 100% CPU time. That's quite suboptimal even in the case where it eventually unlocks.)
I don't know what triggered it (SDK update, workload update?) but now I am literally unable to run the app for more than a minute until I get this freeze. I disabled all code that knowingly called APIs on dispatcher threads (NSUrlSession, NSBackgroundActivityScheduler) and I still hit one thread like this eventually:
1737 Thread_643010
1737 start_wqthread (in libsystem_pthread.dylib) + 8 [0x19237c018]
1737 _pthread_wqthread (in libsystem_pthread.dylib) + 364 [0x19237d350]
1737 __workq_kernreturn (in libsystem_kernel.dylib) + 8 [0x192347604]
Why is that? If the thread is not executing managed code anymore, it should be in preemptive mode and considered to be suspended. The thread suspension may take a little bit longer in this case, but it should not be causing the process to freeze.
The problem is that until the signal is delivered, we cannot tell whether the thread is executing native or managed code. And since we cannot deliver the signal, all we know is that the thread is in our managed thread list, so we think we need to suspend it.
The problem is that until the signal is delivered, we cannot tell whether the thread is executing native or managed code
That sounds like a bug. Once we notice that the thread is preemptive, we can assume that it is suspended - we do not have to wait for the signal to be actually delivered.
Maybe we actually do that and I may be mistaken. I'll take a look to refresh my memory on what we do there.
I looked at the code a bit more in detail and realized that there's one important thing I omitted in the description. The process was started from VS Code with attached debugger. There could be something fishy in the code path with the attached debugger:
I tried to run the process again with no debugger attached and it seemed to behave much better. I didn't have time to do any extensive testing though.
@filipnavara when you say it behaves much better, do you actually mean that it still hangs sometimes?
I've noticed that in the dump you've shared, many threads have TS_DebugSuspendPending
state, but the actual thread suspension was triggered by GC. And none of the threads has TS_GCSuspendPending
. Hopefully I am decoding the thread state correctly. So I wonder if that could be some bug in the thread suspension leaving this state uncleared and later influencing the suspension process.
(lldb) clrthreads
ThreadCount: 117
UnstartedThread: 0
BackgroundThread: 60
PendingThread: 0
DeadThread: 8
Hosted Runtime: no
Lock
DBG ID OSID ThreadOBJ State GC Mode GC Alloc Context Domain Count Apt Exception
1 1 55378 0000000120012000 20028 Preemptive 0000000142B08148:0000000142B08158 000000012080BA00 -00001 Ukn
8 2 5539d 000000012001EC00 21228 Preemptive 0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn (Finalizer)
9 3 55428 0000000122035E00 21228 Preemptive 0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn <Invalid Object> (0000000142372518)
10 4 55495 000000012085BA00 10a0228 Preemptive 0000000142916118:0000000142917D20 000000012080BA00 -00001 Ukn (Threadpool Worker)
11 5 55496 0000000122045400 2021228 Preemptive 0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn
XXXX 7 0 0000000122058000 31820 Preemptive 0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn
XXXX 8 0 0000000122388600 1031820 Preemptive 0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn (Threadpool Worker)
15 9 55566 0000000161AE7200 10a1228 Preemptive 0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn (Threadpool Worker)
16 10 55569 0000000108838200 10a1228 Preemptive 0000000142AD0720:0000000142AD0FA0 000000012080BA00 -00001 Ukn (Threadpool Worker)
17 11 5556a 00000001204D7000 a1028 Preemptive 0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn
XXXX 12 0 00000001182DFC00 1031820 Preemptive 0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn (Threadpool Worker)
XXXX 13 0 0000000161B24A00 1031820 Preemptive 0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn (Threadpool Worker)
XXXX 14 0 0000000108816800 1031820 Preemptive 0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn (Threadpool Worker)
18 15 555a3 00000001182E3600 10a1228 Preemptive 0000000142AB9750:0000000142ABAFA0 000000012080BA00 -00001 Ukn (Threadpool Worker)
XXXX 16 0 000000012186E600 1031820 Preemptive 0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn (Threadpool Worker)
XXXX 17 0 0000000120550400 1031820 Preemptive 0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn (Threadpool Worker)
19 18 555af 0000000161B68A00 2021228 Preemptive 0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn
XXXX 19 0 00000001210D0800 1031820 Preemptive 0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn (Threadpool Worker)
20 20 555d4 00000001183D7200 21228 Preemptive 0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn
21 21 555e8 0000000161C8D400 a1228 Preemptive 0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn
XXXX 6 55551 0000000108938200 20228 Preemptive 0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn
14 22 55562 00000001219AA800 a0228 Preemptive 0000000142AD5950:0000000142AD6FA0 000000012080BA00 -00001 Ukn
22 23 55975 0000000120788A00 3021228 Preemptive 0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn (Threadpool Worker)
23 24 55976 0000000108941400 3021228 Preemptive 0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn (Threadpool Worker)
24 25 55978 00000001219B9600 3021228 Preemptive 0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn (Threadpool Worker)
25 26 5598b 00000001207B8600 3021228 Preemptive 0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn (Threadpool Worker)
26 27 5598c 000000012006E400 3021228 Preemptive 0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn (Threadpool Worker)
XXXX 28 559d7 000000012278B400 20228 Preemptive 0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn
27 29 559f0 000000012276B800 20228 Preemptive 0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn
28 30 559f2 0000000120784400 3021228 Preemptive 0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn (Threadpool Worker)
29 31 559f3 0000000118729400 3021228 Preemptive 0000000142A9B0D8:0000000142A9BD30 000000012080BA00 -00001 Ukn (Threadpool Worker)
XXXX 32 55991 0000000122733200 20228 Preemptive 0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn
30 33 559f6 0000000120369400 2021228 Preemptive 0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn
31 34 559f7 0000000118232400 a1228 Preemptive 000000014294E360:000000014294FE00 000000012080BA00 -00001 Ukn
32 35 559ff 0000000161F1BE00 2021028 Preemptive 0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn
39 36 55a06 000000010894CC00 2021028 Preemptive 0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn
38 37 55a05 0000000118727400 2021028 Preemptive 0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn
34 38 55a01 0000000121896200 2021028 Preemptive 0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn
35 39 55a02 00000001208A7800 2021028 Preemptive 0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn <Invalid Object> (000000014266ac48)
36 40 55a03 00000001219B5400 2021028 Preemptive 0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn
37 41 55a04 000000011896EA00 2021028 Preemptive 0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn
33 42 55a00 0000000161D34600 2021028 Preemptive 0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn
40 43 55a08 00000001219CD000 2021028 Preemptive 0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn
41 44 55a09 000000011864A000 2021028 Preemptive 0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn
42 45 55a0a 0000000118982C00 2021028 Preemptive 0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn
44 46 55a0c 00000001219DC200 2021028 Preemptive 0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn
43 47 55a0b 0000000166090C00 2021028 Preemptive 0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn
45 48 55a0d 0000000118980E00 2021028 Preemptive 0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn
46 49 55a0e 0000000121195600 2021028 Preemptive 00000001429BDD90:00000001429BEA78 000000012080BA00 -00001 Ukn
47 50 55a0f 0000000118957C00 2021028 Preemptive 0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn
48 51 55a10 0000000108949800 2021028 Preemptive 0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn
49 53 55a11 00000001211B1800 2021028 Preemptive 0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn
50 54 55a12 000000012088D200 2021028 Preemptive 0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn <Invalid Object> (0000000142732488)
51 52 55a13 0000000108959000 2021028 Preemptive 0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn
52 55 55a14 0000000161C88C00 2021028 Preemptive 0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn
54 56 55a17 0000000118999800 2021028 Preemptive 0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn
55 57 55a19 00000001211AF400 2021028 Preemptive 0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn <Invalid Object> (0000000142665878)
56 58 55a1a 00000001186B3200 21028 Preemptive 0000000142B10178:0000000142B10190 000000012080BA00 -00001 Ukn (GC)
57 59 55a1b 00000001219CA000 21028 Preemptive 00000001429D6388:00000001429D6AE0 000000012080BA00 -00001 Ukn
58 60 55a1d 0000000108959800 2021028 Preemptive 0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn
59 61 55a1e 0000000118961000 2021028 Preemptive 0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn
60 62 55a1f 000000010895D200 2021028 Preemptive 0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn
61 63 55a21 000000010895E800 2021028 Preemptive 0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn
62 64 55a22 0000000118639A00 2021028 Preemptive 0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn
63 65 55a25 000000011899D000 2021028 Preemptive 0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn
64 66 55a26 0000000108961A00 2021028 Preemptive 0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn
65 67 55a27 000000011899D800 a1028 Preemptive 0000000142B09780:0000000142B0A158 000000012080BA00 -00001 Ukn
66 68 55a28 00000001189A1600 2021028 Preemptive 0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn
67 69 55a29 00000001189A0C00 21028 Preemptive 0000000142B0B408:0000000142B0C158 000000012080BA00 -00001 Ukn
68 70 55a2d 00000001189A2600 3021228 Preemptive 0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn (Threadpool Worker)
69 71 55a2e 000000011899F200 3021228 Preemptive 0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn (Threadpool Worker)
70 72 55a31 00000001219E1400 3021228 Preemptive 0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn (Threadpool Worker)
71 73 55a32 0000000161F6F000 2021028 Preemptive 0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn
72 74 55a33 0000000108962200 2021028 Preemptive 0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn
73 75 55a34 0000000108963C00 2021028 Preemptive 0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn
74 76 55a35 00000001186CDA00 2021028 Preemptive 0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn
75 77 55a36 000000011870C200 2021028 Preemptive 0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn
76 78 55a37 000000010896AE00 2021028 Preemptive 0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn
77 79 55a38 0000000118840A00 2021028 Preemptive 0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn
78 80 55a39 0000000118734400 2021028 Preemptive 0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn
79 81 55a3a 00000001189A4800 2021028 Preemptive 0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn
80 82 55a3b 0000000166036C00 2021028 Preemptive 0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn
81 83 55a3c 0000000165873400 2021028 Preemptive 0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn
82 84 55a3d 0000000118735A00 2021028 Preemptive 0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn
83 85 55a48 0000000165922200 3021228 Preemptive 0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn (Threadpool Worker)
XXXX 86 55990 00000001189A7400 20228 Preemptive 0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn
XXXX 87 55a16 00000001219DA600 20228 Preemptive 0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn
53 88 55a15 0000000161D55000 20228 Preemptive 0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn
XXXX 89 55993 000000011898D000 20228 Preemptive 0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn
85 90 55a6e 0000000108965400 10a1228 Preemptive 0000000142A9C7F8:0000000142A9DD30 000000012080BA00 -00001 Ukn (Threadpool Worker)
86 91 55a70 0000000108970200 10a1228 Preemptive 0000000142B01AB8:0000000142B02158 000000012080BA00 -00001 Ukn (Threadpool Worker)
87 92 55a97 0000000161FD2E00 10a1228 Preemptive 0000000142AB68B0:0000000142AB7F70 000000012080BA00 -00001 Ukn (Threadpool Worker)
88 93 55af2 0000000161FA4E00 21228 Preemptive 0000000142A4E3B8:0000000142A4F568 000000012080BA00 -00001 Ukn
90 94 55afe 000000016588CC00 21228 Preemptive 0000000142A5C810:0000000142A5D598 000000012080BA00 -00001 Ukn
89 95 55afd 00000001219F1E00 10a1228 Preemptive 0000000142ADADF0:0000000142ADAFA0 000000012080BA00 -00001 Ukn (Threadpool Worker)
91 96 55b04 00000001189ACC00 1121228 Preemptive 0000000142AF0120:0000000142AF0120 000000012080BA00 -00001 Ukn (Threadpool Worker)
92 97 55b05 0000000108978800 10a1228 Preemptive 000000014291DC78:000000014291DD20 000000012080BA00 -00001 Ukn (Threadpool Worker)
93 98 55b06 00000001211D9E00 10a1228 Preemptive 0000000142915788:0000000142915D20 000000012080BA00 -00001 Ukn (Threadpool Worker)
94 99 55b07 0000000108966600 10a1228 Preemptive 00000001429B20C8:00000001429B2A78 000000012080BA00 -00001 Ukn (Threadpool Worker)
95 100 55bbb 00000001660E1600 1021228 Preemptive 0000000142988770:00000001429889D0 000000012080BA00 -00001 Ukn (Threadpool Worker)
96 101 55bbd 0000000121A02400 10a1228 Preemptive 0000000142AFF3C0:0000000142B00158 000000012080BA00 -00001 Ukn (Threadpool Worker)
98 102 55be2 0000000165905600 10a1228 Preemptive 0000000142ADB158:0000000142ADCFA0 000000012080BA00 -00001 Ukn (Threadpool Worker)
97 103 55be1 0000000118634600 10a1228 Preemptive 000000014293F170:000000014293FE00 000000012080BA00 -00001 Ukn (Threadpool Worker)
100 104 55c02 000000016596C000 10a1228 Preemptive 00000001429D9050:00000001429DAAE0 000000012080BA00 -00001 Ukn (Threadpool Worker) <Invalid Object> (00000001429d8af8) (nested exceptions)
99 105 55bfa 0000000166022000 10a1228 Preemptive 0000000142AA2978:0000000142AA3F70 000000012080BA00 -00001 Ukn (Threadpool Worker)
101 106 55c0c 0000000161F92200 10a1228 Preemptive 0000000142AE5B70:0000000142AE6968 000000012080BA00 -00001 Ukn (Threadpool Worker)
105 107 55c1c 0000000161F94800 10a1228 Preemptive 0000000142AA64B0:0000000142AA7F70 000000012080BA00 -00001 Ukn (Threadpool Worker)
106 108 55c1d 0000000165914A00 10a1228 Preemptive 0000000142AD3408:0000000142AD4FA0 000000012080BA00 -00001 Ukn (Threadpool Worker)
102 109 55c15 00000001211DCE00 10a1228 Preemptive 0000000142AE6A10:0000000142AE8968 000000012080BA00 -00001 Ukn (Threadpool Worker)
103 110 55c16 0000000165915200 10a1228 Preemptive 0000000142AE8A10:0000000142AEA968 000000012080BA00 -00001 Ukn (Threadpool Worker)
104 111 55c17 0000000161F95000 10a1228 Preemptive 0000000142AF01C8:0000000142AF2120 000000012080BA00 -00001 Ukn (Threadpool Worker)
107 112 55c25 00000001200B8600 10a1228 Preemptive 00000001429DAFA8:00000001429DCAE0 000000012080BA00 -00001 Ukn (Threadpool Worker)
108 113 55c26 00000001189C1C00 21228 Preemptive 0000000142AFE038:0000000142AFE158 000000012080BA00 -00001 Ukn
109 114 55c2e 00000001659BAC00 21228 Preemptive 0000000142A7C538:0000000142A7D620 000000012080BA00 -00001 Ukn
110 115 591ea 0000000161F19A00 21220 Preemptive 0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn
111 116 591ec 00000001211EE600 21220 Preemptive 0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn
4 117 55393 000000011877CA00 20220 Preemptive 0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn
when you say it behaves much better, do you actually mean that it still hangs sometimes?
I had just couple of minutes to try so I cannot tell for sure one way or another. It was not behaving jerky and I didn't hit a hang outside of debugger yet.
I've already been at this for couple of hours so I cannot tell for certain what I did before the freeze. The application was definitely freezing a lot under the debugger so I may have tried to pause it after the livelock occured.
From what I can see in the dump, it was actually not stuck waiting for the runtime to suspend. It was just starting the suspension here: https://github.com/dotnet/runtime/blob/048da75f59c975e9bf3c346bee4ce4a9edd11e78/src/coreclr/vm/threadsuspend.cpp#L3308
So I wonder what's causing the hang. I can see the thread 110 also wanting to suspend runtime in ep_rt_coreclr_sample_profiler_write_sampling_event_for_threads. I wonder if there is some kind of suspension train happening between the GC suspension and the profiling suspension. As can be seen above, there is over 100 threads running so maybe it takes the profiling a long time to walk all the stacks of the threads and then a time to take a next sample comes right after or something along these lines. The sampling rate seems to be set to 1ms.
I also wonder why is the profiler sampling running. sampling_thread
@filipnavara how did you get the dump? Have you used the createdump tool? It seems that the dump is missing some memory areas and SOS cannot find managed function names in most cases, which is usually due to the dump taken by the OS (and maybe lldb).
The dump was created with createdump
manually. The sampling profiler was likely a futile attempt of me to diagnose the already stuck process. I'll try to reproduce the same thing without all these extra tools.
Here's a new one without the sampling profiler (at least I didn't invoke it externally) and with createdump --full
:
I run it to the point where it appeared freezing. Then I hit "pause" in the debugger which never finished. After that I manually run sample
and createdump
.
Great, thank you! I'll take a look.
@filipnavara the dump looks similar to the previous one, except for the sampling profiler stuff not being there (as expected) and the managed method names being shown correctly via SOS clrstack now.
It still has many threads with the TS_DebugSuspendPending
state. So it is remotely possible it is some problem related to the debugger thread suspension. Could you please try to repro the problem without the debugger? The coredump should be able to take a dump of a live process. This would help me to find out where to focus the investigations.
It would also be great to take two dumps one after another so that I can see whether the thread that's suspending for GC is making any progress. There is one more thing that's the same between those two dumps - in both cases the thread that's suspending for GC is in the libsystem_kernel.dylib task_threads
called from the FlushProcessWriteBuffers
, which is kind of a strange coincidence. I wonder if it is possible that this OS call ends up waiting for something forever for some reason.
So it is remotely possible it is some problem related to the debugger thread suspension. Could you please try to repro the problem without the debugger?
I will try but so far that has been unsuccessful.
It would also be great to take two dumps one after another so that I can see whether the thread that's suspending for GC is making any progress.
Oh, I still have the process from morning running so I can do more dumps. One of the threads is stuck in a busy loop.
I tried to attach with LLDB and put couple of breakpoints but none of them seem to be hit 🤷🏻
1: name = 'RestartEE', locations = 3, resolved = 3, hit count = 0
1.1: where = libcoreclr.dylib`GCToEEInterface::RestartEE(bool), address = 0x0000000104348dfc, resolved, hit count = 0
1.2: where = libcoreclr.dylib`ThreadSuspend::RestartEE(int, int), address = 0x00000001043dd494, resolved, hit count = 0
1.3: where = libcoreclr.dylib`standalone::GCToEEInterface::RestartEE(bool), address = 0x00000001043f8948, resolved, hit count = 0
2: name = 'CORDebuggerAttached', locations = 0 (pending)
3: name = 'ThreadsAtUnsafePlaces', locations = 1, resolved = 1, hit count = 0
3.1: where = libcoreclr.dylib`Debugger::ThreadsAtUnsafePlaces(), address = 0x00000001044dc1f0, resolved, hit count = 0
4: name = 'FlushProcessWriteBuffers', locations = 5, resolved = 5, hit count = 0
4.1: where = libclrjit.dylib`FlushProcessWriteBuffers, address = 0x0000000103a4df18, resolved, hit count = 0
4.2: where = libcoreclr.dylib`FlushProcessWriteBuffers, address = 0x00000001041e4228, resolved, hit count = 0
4.3: where = libcoreclr.dylib`GCToOSInterface::FlushProcessWriteBuffers(), address = 0x000000010434c5c0, resolved, hit count = 0
4.4: where = libdbgshim.dylib`FlushProcessWriteBuffers, address = 0x0000000103763160, resolved, hit count = 0
4.5: where = libmscordaccore.dylib`FlushProcessWriteBuffers, address = 0x0000000103f068cc, resolved, hit count = 0
5: name = 'ThreadSuspend::SuspendRuntime', locations = 1, resolved = 1, hit count = 0
5.1: where = libcoreclr.dylib`ThreadSuspend::SuspendRuntime(ThreadSuspend::SUSPEND_REASON), address = 0x00000001043dbd58, resolved, hit count = 0
6: name = 'ThreadSuspend::SuspendEE', locations = 1, resolved = 1, hit count = 0
6.1: where = libcoreclr.dylib`ThreadSuspend::SuspendEE(ThreadSuspend::SUSPEND_REASON), address = 0x00000001043dd738, resolved, hit count = 0
(it could be LLDB misbehaving so I would not necessarily draw conclusions from it)
Nevertheless, there is something interesting in the LLDB stack trace:
(lldb) thread select 62
* thread #62, name = '.NET ThreadPool Worker'
frame #0: 0x0000000192345a5c libsystem_kernel.dylib`swtch_pri + 8
libsystem_kernel.dylib`swtch_pri:
-> 0x192345a5c <+8>: ret
libsystem_kernel.dylib`swtch:
0x192345a60 <+0>: mov x16, #-0x3c
0x192345a64 <+4>: svc #0x80
0x192345a68 <+8>: ret
(lldb) bt
* thread #62, name = '.NET ThreadPool Worker'
* frame #0: 0x0000000192345a5c libsystem_kernel.dylib`swtch_pri + 8
frame #1: 0x000000019237e0e4 libsystem_pthread.dylib`cthread_yield + 20
frame #2: 0x00000001041e7ec8 libcoreclr.dylib`MachSetThreadContext + 52
frame #3: 0x00000001041e7e60 libcoreclr.dylib`PAL_DispatchException + 292
frame #4: 0x00000001041e7a54 libcoreclr.dylib`PAL_DispatchExceptionWrapper + 16
frame #5: 0x00000001044dc1f0 libcoreclr.dylib`Debugger::SetModuleDefaultJMCStatus(Module*, bool) + 284
frame #6: 0x0000000104348de4 libcoreclr.dylib`GCToEEInterface::SuspendEE(SUSPEND_REASON) + 48
frame #7: 0x000000010446bc28 libcoreclr.dylib`WKS::GCHeap::GarbageCollectGeneration(unsigned int, gc_reason) + 420
frame #8: 0x000000010446dde0 libcoreclr.dylib`WKS::gc_heap::try_allocate_more_space(alloc_context*, unsigned long, unsigned int, int) + 792
frame #9: 0x0000000104490b58 libcoreclr.dylib`WKS::GCHeap::Alloc(gc_alloc_context*, unsigned long, unsigned int) + 164
frame #10: 0x000000010434df70 libcoreclr.dylib`AllocateObject(MethodTable*) + 216
frame #11: 0x0000000104369bdc libcoreclr.dylib`JIT_New(CORINFO_CLASS_STRUCT_*) + 124
... and after putting few more breakpoints I started getting this:
* thread #2, stop reason = EXC_BAD_ACCESS (code=2, address=0x34c19bef8)
* frame #0: 0x00000001923981b8 libsystem_platform.dylib`_platform_memmove + 168
frame #1: 0x00000001041e8428 libcoreclr.dylib`SEHExceptionThread(void*) + 1372
frame #2: 0x0000000192381240 libsystem_pthread.dylib`_pthread_start + 148
and the process died :/
The crash in memmove
migh be related to a stack overflow handling issue I've fixed few days ago. Just a wild guess though, maybe fiddling with the debugging resulted in stack overflow somewhere somehow.
Anyways, the two extra dumps have revealed that the thread that tries to suspend the runtime is actually progressing. In one case, it was running in SuspendRuntime, in other in RestartEE:
case 1:
(lldb) bt
* thread #62, stop reason = signal SIGSTOP
* frame #0: 0x0000000192345954 libsystem_kernel.dylib`mach_msg_trap + 8
frame #1: 0x0000000192345d00 libsystem_kernel.dylib`mach_msg + 76
frame #2: 0x000000019234e23c libsystem_kernel.dylib`thread_get_state + 140
frame #3: 0x00000001923558a0 libsystem_kernel.dylib`thread_get_register_pointer_values + 132
frame #4: 0x00000001041e4294 libcoreclr.dylib`::FlushProcessWriteBuffers() at process.cpp:3569:23 [opt]
frame #5: 0x00000001043dbe6c libcoreclr.dylib`ThreadSuspend::SuspendRuntime(reason=SUSPEND_FOR_GC) at threadsuspend.cpp:3297:5 [opt]
frame #6: 0x00000001043dd858 libcoreclr.dylib`ThreadSuspend::SuspendEE(reason=SUSPEND_FOR_GC) at threadsuspend.cpp:5702:5 [opt]
frame #7: 0x0000000104348de4 libcoreclr.dylib`GCToEEInterface::SuspendEE(reason=<unavailable>) at gcenv.ee.cpp:26:5 [opt]
frame #8: 0x000000010446bc28 libcoreclr.dylib`WKS::GCHeap::GarbageCollectGeneration(this=<unavailable>, gen=0, reason=reason_alloc_soh) at gc.cpp:44676:9 [opt]
case 2:
(lldb) bt
* thread #62, stop reason = signal SIGSTOP
* frame #0: 0x0000000192345954 libsystem_kernel.dylib`mach_msg_trap + 8
frame #1: 0x0000000192345d00 libsystem_kernel.dylib`mach_msg + 76
frame #2: 0x000000019234e23c libsystem_kernel.dylib`thread_get_state + 140
frame #3: 0x00000001923558a0 libsystem_kernel.dylib`thread_get_register_pointer_values + 132
frame #4: 0x00000001041e4294 libcoreclr.dylib`::FlushProcessWriteBuffers() at process.cpp:3569:23 [opt]
frame #5: 0x00000001043dd4e4 libcoreclr.dylib`ThreadSuspend::RestartEE(bFinishedGC=NO, SuspendSucceded=NO) at threadsuspend.cpp:5523:5 [opt]
frame #6: 0x00000001043dd890 libcoreclr.dylib`ThreadSuspend::SuspendEE(reason=SUSPEND_FOR_GC) at threadsuspend.cpp:5736:9 [opt]
frame #7: 0x0000000104348de4 libcoreclr.dylib`GCToEEInterface::SuspendEE(reason=<unavailable>) at gcenv.ee.cpp:26:5 [opt]
frame #8: 0x000000010446bc28 libcoreclr.dylib`WKS::GCHeap::GarbageCollectGeneration(this=<unavailable>, gen=0, reason=reason_alloc_soh) at gc.cpp:44676:9 [opt]
So it seems it is spinning in the ThreadSuspend::SuspendEE
in this loop:
https://github.com/dotnet/runtime/blob/4d39501439d438b9c749c7d2bca5b3c1a60c823b/src/coreclr/vm/threadsuspend.cpp#L5661-L5768
I keep investigating ...
The g_pDebugInterface->ThreadsAtUnsafePlaces() is returning true because the underlying Debugger::m_threadsAtUnsafePlaces
is 2. It is possible that there is a bug causing an imbalance in this count (we maintain the count by incrementing / decrementing it in the debugger related code) or that there is a bug in evaluation of the unsafe places in the code called from Debugger::IsThreadAtSafePlace
.
Getting stuck in that loop seems to be in line with the fact that I can easily reproduce it with a debugger but not without one. That's also why I set the specific breakpoints in LLDB. I assume the process was already somehow broken so the debugger didn't work correctly and did not hit them. I'll check if there's some way to provide more logging around m_threadsAtUnsafePlaces
.
I have also found that there was a strange exception attached to one of the threads:
(lldb) pe
Exception object: 0000000131c8db28
Exception type: System.Security.SecurityException
Message: ECall methods must be packaged into a system module.
InnerException: <none>
StackTrace (generated):
<none>
StackTraceString: <none>
HResult: 8013150a
@jkotas do you have any idea what could this mean? I am not sure how a FCall could occur in something else than S.P.C.
@filipnavara we could add stresslog logging at the places where the m_threadsAtUnsafePlaces
is incremented and decremented (we would need to put it at the call sites of the Inc/DecThreadsAtUnsafePlaces so that we can log the thread pointer).
Like this (at some places, the variable holding the thread pointer is thread
and at some pThread
STRESS_LOG1(LF_CORDB, LL_EVERYTHING, "Calling IncThreadsAtUnsafePlaces for thread %p\n", thread);
STRESS_LOG1(LF_CORDB, LL_EVERYTHING, "Calling DecThreadsAtUnsafePlaces for thread %p\n", thread);
Would you like me to compile a patched version of libcoreclr.dylib for you or would you do that yourself?
I can compile it myself.
Great! Once you have the patched version, then before running your app, couple of COMPlus variables need to be set to enable the stresslog. This is what I usually do - it sets the maximum size of the stresslog and enables it for the LF_CORDBG facility:
COMPlus_StressLog=1 COMPlus_LogLevel=A COMPlus_LogFacility=200 COMPlus_StressLogSize=2000000 COMPlus_TotalStressLogSize=40000000
This ends up recording the stresslog events in memory and they are later dumped into the crash dump and I can dump them using the dumplog
SOS command.
@jkotas do you have any idea what could this mean? I am not sure how a FCall could occur in something else than S.P.C.
The built-in COM interop uses [MethodImpl(MethodImplOptions.InternalCall)]
extern
methods. The runtime fills in the implementation on Windows. Built-in COM interop does not exist on non-Windows and you can end up with this exception if you try to execute the Windows-specific COM interop code. I guess the app has Windows-specific code that uses COM interop that is wrapped inside try/catch.
For reference, this is an example that will generate this exception:
using System.Runtime.CompilerServices;
foo();
[MethodImpl(MethodImplOptions.InternalCall)]
static extern void foo();
I guess the app has Windows-specific code that uses COM interop that is wrapped inside try/catch.
I will check on that. We do have COM code but it should be compiled-out on the macOS builds. It's possible that we missed some place though.
Sorry, had a bit of a pain with Xcode and Debug/Release builds (the Debug ones do some additional validation that is extremely slow in combination with how Xamarin does binding calls).
Hopefully the dump should contain the stress logs. This time I didn't even pause it in the debugger [I think].
sample.48456.txt coredump.48456.zip libmscor.zip
I left the process running just in case there's something that can be inspected with LLDB.
Thank you, I am taking a look.
@filipnavara I've forgotten that I'll need the libcoreclr.dylib, libcoreclr.dylib.dwarf and libmscordaccore.dylib.dwarf files that you've rebuilt to enable SOS work with the dump. Could you please share those with me? You can also pack all the .dylib and .dylib.dwarf files.
Ah, I expected that but I packed just couple of them in the post above 😅 I will pack the rest and upload it in a moment.
It is interesting. The m_threadsAtUnsafePlaces = 0x00000004, but dumping all the threads reported in the stresslog, there is only one thread that has inc and doesn't have dec - the 000000033DB1AC00 .
Are you sure you've put the logging at all places where we call the IncThreadsAtUnsafePlaces
and DecThreadsAtUnsafePlaces
? I can see there are 5 such places for each.
Based on the date stamps in the log, it seems the log is not missing any records.
But even looking at this single thread, I think it revealed the root of issue. There is a hardware exception being processed on this thread and the debugger machinery hook for the exception ended up waiting for the runtime suspension event. The problem is that the IncThreadsAtUnsafePlaces was called at this thread, but the corresponding DecThreadsAtUnsafePlaces was not called yet. So the runtime suspension can never succeed, since there is at least 1 in the counter of threads at unsafe places. Hence the deadlock.
(lldb) bt
* thread #119, stop reason = signal SIGSTOP
* frame #0: 0x00000001923490c0 libsystem_kernel.dylib`__psynch_cvwait + 8
frame #1: 0x0000000192381808 libsystem_pthread.dylib`_pthread_cond_wait + 1228
frame #2: 0x0000000101d5be1c libcoreclr.dylib`CorUnix::CPalSynchronizationManager::ThreadNativeWait(ptnwdNativeWaitData=0x000000033db1ca38, dwTimeout=4294967295, ptw rWakeupReason=0x00000003498012fc, pdwSignaledObject=0x00000003498012f8) at synchmanager.cpp:478:28 [opt]
frame #3: 0x0000000101d5ba8c libcoreclr.dylib`CorUnix::CPalSynchronizationManager::BlockThread(this=0x0000000125008e00, pthrCurrent=0x000000033db1c800, dwTimeout=429 4967295, fAlertable=false, fIsSleep=<unavailable>, ptwrWakeupReason=0x0000000349801388, pdwSignaledObject=0x000000034980138c) at synchmanager.cpp:301:22 [opt]
frame #4: 0x0000000101d5fba0 libcoreclr.dylib`CorUnix::InternalWaitForMultipleObjectsEx(pThread=0x000000033db1c800, nCount=<unavailable>, lpHandles=<unavailable>, bW aitAll=<unavailable>, dwMilliseconds=<unavailable>, bAlertable=<unavailable>, bPrioritize=<unavailable>) at wait.cpp:637:45 [opt]
frame #5: 0x0000000101d5fd80 libcoreclr.dylib`::WaitForSingleObjectEx(hHandle=0x00000000000018c8, dwMilliseconds=4294967295, bAlertable=NO) at wait.cpp:138:13 [opt]
frame #6: 0x0000000101f55c2c libcoreclr.dylib`CLREventBase::WaitEx(unsigned int, WaitMode, PendingSync*) [inlined] CLREventWaitHelper2(handle=<unavailable>, dwMillis econds=<unavailable>, alertable=<unavailable>) at synch.cpp:372:12 [opt]
frame #7: 0x0000000101f55c28 libcoreclr.dylib`CLREventBase::WaitEx(unsigned int, WaitMode, PendingSync*) [inlined] CLREventWaitHelper(this=<unavailable>, pParam=0x00 00000349801588)::$_1::operator()(CLREventWaitHelper(void*, unsigned int, int)::Param*) const at synch.cpp:397:26 [opt]
frame #8: 0x0000000101f55c20 libcoreclr.dylib`CLREventBase::WaitEx(unsigned int, WaitMode, PendingSync*) at synch.cpp:399:5 [opt]
frame #9: 0x0000000101f55bc8 libcoreclr.dylib`CLREventBase::WaitEx(this=<unavailable>, dwMilliseconds=<unavailable>, mode=<unavailable>, syncState=<unavailable>) at synch.cpp:466:20 [opt]
frame #10: 0x0000000101f5b8d0 libcoreclr.dylib`Thread::WaitSuspendEventsHelper(this=0x000000033db1ac00) at threadsuspend.cpp:4626:50 [opt]
frame #11: 0x0000000101f5a538 libcoreclr.dylib`Thread::RareEnablePreemptiveGC() at threadsuspend.cpp:4663:13 [opt]
frame #12: 0x0000000101f5a530 libcoreclr.dylib`Thread::RareEnablePreemptiveGC(this=0x000000033db1ac00) at threadsuspend.cpp:2414:13 [opt]
frame #13: 0x0000000101f59e58 libcoreclr.dylib`Thread::RareDisablePreemptiveGC() [inlined] Thread::EnablePreemptiveGC(this=0x000000033db1ac00) at threads.h:2044:13 [ opt]
frame #14: 0x0000000101f59e3c libcoreclr.dylib`Thread::RareDisablePreemptiveGC(this=0x000000033db1ac00) at threadsuspend.cpp:2147:13 [opt]
frame #15: 0x000000010204ccc8 libcoreclr.dylib`DebuggerController::DispatchPatchOrSingleStep(Thread*, _CONTEXT*, unsigned char const*, SCAN_TRIGGER) [inlined] GCHold erEEInterface<0, 1, 1>::LeaveInternal(this=<unavailable>) at debugger.h:260:29 [opt]
frame #16: 0x000000010204cca0 libcoreclr.dylib`DebuggerController::DispatchPatchOrSingleStep(Thread*, _CONTEXT*, unsigned char const*, SCAN_TRIGGER) [inlined] GCHold erEEInterface<0, 1, 1>::~GCHolderEEInterface(this=<unavailable>) at debugger.h:303:15 [opt]
frame #17: 0x000000010204cca0 libcoreclr.dylib`DebuggerController::DispatchPatchOrSingleStep(Thread*, _CONTEXT*, unsigned char const*, SCAN_TRIGGER) [inlined] GCHold erEEInterface<0, 1, 1>::~GCHolderEEInterface(this=<unavailable>) at debugger.h:300:5 [opt]
frame #18: 0x000000010204cca0 libcoreclr.dylib`DebuggerController::DispatchPatchOrSingleStep(thread=0x000000033db1ac00, context=0x000000033c1f7000, address="\xfd{\xb e\xa9\xf3S\U00000001\xa9\xfd\U00000003", which=<unavailable>) at controller.cpp:3028:9 [opt]
frame #19: 0x000000010204decc libcoreclr.dylib`DebuggerController::DispatchNativeException(pException=<unavailable>, pContext=0x000000033c1f7000, dwCode=2147483651, pCurThread=0x000000033db1ac00) at controller.cpp:0:24 [opt]
frame #20: 0x000000010205a5e4 libcoreclr.dylib`Debugger::FirstChanceNativeException(this=0x0000000124707ae0, exception=0x000000033c1f7390, context=0x000000033c1f7000 , code=2147483651, thread=0x000000033db1ac00) at debugger.cpp:5574:18 [opt]
frame #21: 0x0000000101f87914 libcoreclr.dylib`HandleHardwareException(ex=0x0000000349801ca0) at exceptionhandling.cpp:5313:36 [opt]
frame #22: 0x0000000101d34f3c libcoreclr.dylib`SEHProcessException(exception=0x0000000349801ca0) at seh.cpp:267:21 [opt]
frame #23: 0x0000000101d68e88 libcoreclr.dylib`::PAL_DispatchException(pContext=0x00000003498028c0, pExRecord=0x0000000349802828, pMachExceptionInfo=0x00000003498022 e8) at machexception.cpp:382:29 [opt]
frame #24: 0x0000000101d68b18 libcoreclr.dylib`PAL_DispatchExceptionWrapper + 16
frame #25: 0x00000002888f1358
frame #26: 0x0000000101e880c8 libcoreclr.dylib`DispatchCallSimple(unsigned long*, unsigned int, unsigned long, unsigned int) at callhelpers.cpp:67:5 [opt]
frame #27: 0x0000000101e8806c libcoreclr.dylib`DispatchCallSimple(pSrc=<unavailable>, numStackSlotsToCopy=<unavailable>, pTargetAddress=<unavailable>, dwDispatchCall SimpleFlags=<unavailable>) at callhelpers.cpp:220:9 [opt]
frame #28: 0x0000000101e9d180 libcoreclr.dylib`ThreadNative::KickOffThread_Worker(ptr=<unavailable>) at comsynchronizable.cpp:157:5 [opt]
frame #29: 0x0000000101e53184 libcoreclr.dylib`ManagedThreadBase_DispatchOuter(ManagedThreadCallState*) [inlined] ManagedThreadBase_DispatchInner(pCallState=<unavail able>) at threads.cpp:7321:5 [opt]
frame #30: 0x0000000101e53180 libcoreclr.dylib`ManagedThreadBase_DispatchOuter(ManagedThreadCallState*) at threads.cpp:7365:9 [opt]
frame #31: 0x0000000101e53170 libcoreclr.dylib`ManagedThreadBase_DispatchOuter(ManagedThreadCallState*) [inlined] ManagedThreadBase_DispatchOuter(this=<unavailable>, pParam=<unavailable>)::$_6::operator()(ManagedThreadBase_DispatchOuter(ManagedThreadCallState*)::TryArgs*) const::'lambda'(Param*)::operator()(Param*) const at threads. cpp:7523:13 [opt]
frame #32: 0x0000000101e53170 libcoreclr.dylib`ManagedThreadBase_DispatchOuter(ManagedThreadCallState*) at threads.cpp:7525:9 [opt]
frame #33: 0x0000000101e53114 libcoreclr.dylib`ManagedThreadBase_DispatchOuter(pCallState=0x0000000349802f08) at threads.cpp:7549:5 [opt]
frame #34: 0x0000000101e536f0 libcoreclr.dylib`ManagedThreadBase::KickOff(void (*)(void*), void*) [inlined] ManagedThreadBase_FullTransition(pTarget=<unavailable>, a rgs=<unavailable>, filterType=ManagedThread)(void*), void*, UnhandledExceptionLocation) at threads.cpp:7569:5 [opt]
frame #35: 0x0000000101e536dc libcoreclr.dylib`ManagedThreadBase::KickOff(pTarget=<unavailable>, args=<unavailable>)(void*), void*) at threads.cpp:7604:5 [opt]
frame #36: 0x0000000101e9d258 libcoreclr.dylib`ThreadNative::KickOffThread(pass=0x000000033db1ac00) at comsynchronizable.cpp:228:9 [opt]
frame #37: 0x0000000101d66bc0 libcoreclr.dylib`CorUnix::CPalThread::ThreadEntry(pvParam=0x000000033db1c800) at thread.cpp:1862:16 [opt]
frame #38: 0x0000000192381240 libsystem_pthread.dylib`_pthread_start + 148
(lldb) bt
* thread #119, stop reason = signal SIGSTOP
* frame #0: 0x00000001923490c0 libsystem_kernel.dylib`__psynch_cvwait + 8
frame #1: 0x0000000192381808 libsystem_pthread.dylib`_pthread_cond_wait + 1228
frame #2: 0x0000000101d5be1c libcoreclr.dylib`CorUnix::CPalSynchronizationManager::ThreadNativeWait(ptnwdNativeWaitData=0x000000033db1ca38, dwTimeout=4294967295, ptwrWakeupReason=0x00000003498012fc, pdwSignaledObject=0x00000003498012f8) at synchmanager.cpp:478:28 [opt]
frame #3: 0x0000000101d5ba8c libcoreclr.dylib`CorUnix::CPalSynchronizationManager::BlockThread(this=0x0000000125008e00, pthrCurrent=0x000000033db1c800, dwTimeout=4294967295, fAlertable=false, fIsSleep=<unavailable>, ptwrWakeupReason=0x0000000349801388, pdwSignaledObject=0x000000034980138c) at synchmanager.cpp:301:22 [opt]
frame #4: 0x0000000101d5fba0 libcoreclr.dylib`CorUnix::InternalWaitForMultipleObjectsEx(pThread=0x000000033db1c800, nCount=<unavailable>, lpHandles=<unavailable>, bWaitAll=<unavailable>, dwMilliseconds=<unavailable>, bAlertable=<unavailable>, bPrioritize=<unavailable>) at wait.cpp:637:45 [opt]
frame #5: 0x0000000101d5fd80 libcoreclr.dylib`::WaitForSingleObjectEx(hHandle=0x00000000000018c8, dwMilliseconds=4294967295, bAlertable=NO) at wait.cpp:138:13 [opt]
frame #6: 0x0000000101f55c2c libcoreclr.dylib`CLREventBase::WaitEx(unsigned int, WaitMode, PendingSync*) [inlined] CLREventWaitHelper2(handle=<unavailable>, dwMilliseconds=<unavailable>, alertable=<unavailable>) at synch.cpp:372:12 [opt]
frame #7: 0x0000000101f55c28 libcoreclr.dylib`CLREventBase::WaitEx(unsigned int, WaitMode, PendingSync*) [inlined] CLREventWaitHelper(this=<unavailable>, pParam=0x0000000349801588)::$_1::operator()(CLREventWaitHelper(void*, unsigned int, int)::Param*) const at synch.cpp:397:26 [opt]
frame #8: 0x0000000101f55c20 libcoreclr.dylib`CLREventBase::WaitEx(unsigned int, WaitMode, PendingSync*) at synch.cpp:399:5 [opt]
frame #9: 0x0000000101f55bc8 libcoreclr.dylib`CLREventBase::WaitEx(this=<unavailable>, dwMilliseconds=<unavailable>, mode=<unavailable>, syncState=<unavailable>) at synch.cpp:466:20 [opt]
frame #10: 0x0000000101f5b8d0 libcoreclr.dylib`Thread::WaitSuspendEventsHelper(this=0x000000033db1ac00) at threadsuspend.cpp:4626:50 [opt]
frame #11: 0x0000000101f5a538 libcoreclr.dylib`Thread::RareEnablePreemptiveGC() at threadsuspend.cpp:4663:13 [opt]
frame #12: 0x0000000101f5a530 libcoreclr.dylib`Thread::RareEnablePreemptiveGC(this=0x000000033db1ac00) at threadsuspend.cpp:2414:13 [opt]
frame #13: 0x0000000101f59e58 libcoreclr.dylib`Thread::RareDisablePreemptiveGC() [inlined] Thread::EnablePreemptiveGC(this=0x000000033db1ac00) at threads.h:2044:13 [opt]
frame #14: 0x0000000101f59e3c libcoreclr.dylib`Thread::RareDisablePreemptiveGC(this=0x000000033db1ac00) at threadsuspend.cpp:2147:13 [opt]
frame #15: 0x000000010204ccc8 libcoreclr.dylib`DebuggerController::DispatchPatchOrSingleStep(Thread*, _CONTEXT*, unsigned char const*, SCAN_TRIGGER) [inlined] GCHolderEEInterface<0, 1, 1>::LeaveInternal(this=<unavailable>) at debugger.h:260:29 [opt]
frame #16: 0x000000010204cca0 libcoreclr.dylib`DebuggerController::DispatchPatchOrSingleStep(Thread*, _CONTEXT*, unsigned char const*, SCAN_TRIGGER) [inlined] GCHolderEEInterface<0, 1, 1>::~GCHolderEEInterface(this=<unavailable>) at debugger.h:303:15 [opt]
frame #17: 0x000000010204cca0 libcoreclr.dylib`DebuggerController::DispatchPatchOrSingleStep(Thread*, _CONTEXT*, unsigned char const*, SCAN_TRIGGER) [inlined] GCHolderEEInterface<0, 1, 1>::~GCHolderEEInterface(this=<unavailable>) at debugger.h:300:5 [opt]
frame #18: 0x000000010204cca0 libcoreclr.dylib`DebuggerController::DispatchPatchOrSingleStep(thread=0x000000033db1ac00, context=0x000000033c1f7000, address="\xfd{\xbe\xa9\xf3S\U00000001\xa9\xfd\U00000003", which=<unavailable>) at controller.cpp:3028:9 [opt]
frame #19: 0x000000010204decc libcoreclr.dylib`DebuggerController::DispatchNativeException(pException=<unavailable>, pContext=0x000000033c1f7000, dwCode=2147483651, pCurThread=0x000000033db1ac00) at controller.cpp:0:24 [opt]
frame #20: 0x000000010205a5e4 libcoreclr.dylib`Debugger::FirstChanceNativeException(this=0x0000000124707ae0, exception=0x000000033c1f7390, context=0x000000033c1f7000, code=2147483651, thread=0x000000033db1ac00) at debugger.cpp:5574:18 [opt]
frame #21: 0x0000000101f87914 libcoreclr.dylib`HandleHardwareException(ex=0x0000000349801ca0) at exceptionhandling.cpp:5313:36 [opt]
frame #22: 0x0000000101d34f3c libcoreclr.dylib`SEHProcessException(exception=0x0000000349801ca0) at seh.cpp:267:21 [opt]
frame #23: 0x0000000101d68e88 libcoreclr.dylib`::PAL_DispatchException(pContext=0x00000003498028c0, pExRecord=0x0000000349802828, pMachExceptionInfo=0x00000003498022e8) at machexception.cpp:382:29 [opt]
frame #24: 0x0000000101d68b18 libcoreclr.dylib`PAL_DispatchExceptionWrapper + 16
frame #25: 0x00000002888f1358
frame #26: 0x0000000101e880c8 libcoreclr.dylib`DispatchCallSimple(unsigned long*, unsigned int, unsigned long, unsigned int) at callhelpers.cpp:67:5 [opt]
frame #27: 0x0000000101e8806c libcoreclr.dylib`DispatchCallSimple(pSrc=<unavailable>, numStackSlotsToCopy=<unavailable>, pTargetAddress=<unavailable>, dwDispatchCallSimpleFlags=<unavailable>) at callhelpers.cpp:220:9 [opt]
frame #28: 0x0000000101e9d180 libcoreclr.dylib`ThreadNative::KickOffThread_Worker(ptr=<unavailable>) at comsynchronizable.cpp:157:5 [opt]
frame #29: 0x0000000101e53184 libcoreclr.dylib`ManagedThreadBase_DispatchOuter(ManagedThreadCallState*) [inlined] ManagedThreadBase_DispatchInner(pCallState=<unavailable>) at threads.cpp:7321:5 [opt]
frame #30: 0x0000000101e53180 libcoreclr.dylib`ManagedThreadBase_DispatchOuter(ManagedThreadCallState*) at threads.cpp:7365:9 [opt]
frame #31: 0x0000000101e53170 libcoreclr.dylib`ManagedThreadBase_DispatchOuter(ManagedThreadCallState*) [inlined] ManagedThreadBase_DispatchOuter(this=<unavailable>, pParam=<unavailable>)::$_6::operator()(ManagedThreadBase_DispatchOuter(ManagedThreadCallState*)::TryArgs*) const::'lambda'(Param*)::operator()(Param*) const at threads.cpp:7523:13 [opt]
frame #32: 0x0000000101e53170 libcoreclr.dylib`ManagedThreadBase_DispatchOuter(ManagedThreadCallState*) at threads.cpp:7525:9 [opt]
frame #33: 0x0000000101e53114 libcoreclr.dylib`ManagedThreadBase_DispatchOuter(pCallState=0x0000000349802f08) at threads.cpp:7549:5 [opt]
frame #34: 0x0000000101e536f0 libcoreclr.dylib`ManagedThreadBase::KickOff(void (*)(void*), void*) [inlined] ManagedThreadBase_FullTransition(pTarget=<unavailable>, args=<unavailable>, filterType=ManagedThread)(void*), void*, UnhandledExceptionLocation) at threads.cpp:7569:5 [opt]
frame #35: 0x0000000101e536dc libcoreclr.dylib`ManagedThreadBase::KickOff(pTarget=<unavailable>, args=<unavailable>)(void*), void*) at threads.cpp:7604:5 [opt]
frame #36: 0x0000000101e9d258 libcoreclr.dylib`ThreadNative::KickOffThread(pass=0x000000033db1ac00) at comsynchronizable.cpp:228:9 [opt]
frame #37: 0x0000000101d66bc0 libcoreclr.dylib`CorUnix::CPalThread::ThreadEntry(pvParam=0x000000033db1c800) at thread.cpp:1862:16 [opt]
frame #38: 0x0000000192381240 libsystem_pthread.dylib`_pthread_start + 148
Hmm, I likely missed some places. The search in VS Code definitely showed me less than 5... (not at computer right now to double check)
This is the code that called the IncThreadsAtUnsafePlaces and would call the corresponding DecThreadsAtUnsafePlaces if it didn't get suspended: https://github.com/dotnet/runtime/blob/2eb4efb129b2c0ed79069ac5dc9a46a64675b85e/src/coreclr/debug/ee/controller.cpp#L2977-L3029
Yeah, I placed the logs on this one for sure. Maybe I did some cut & paste error, will check.
UPD:
// Mark if we're at an unsafe place.
bool atSafePlace = g_pDebugger->IsThreadAtSafePlace(thread);
if (!atSafePlace)
{
STRESS_LOG1(LF_CORDB, LL_EVERYTHING, "Calling IncThreadsAtUnsafePlaces for thread %p\n", thread);
g_pDebugger->IncThreadsAtUnsafePlaces();
}
and
if (!atSafePlace)
{
g_pDebugger->DecThreadsAtUnsafePlaces();
STRESS_LOG1(LF_CORDB, LL_EVERYTHING, "Calling DecThreadsAtUnsafePlaces for thread %p\n", thread);
}
but I missed the code below:
if (used == DPOSS_USED_WITH_EVENT)
{
bool atSafePlace = g_pDebugger->IsThreadAtSafePlace(thread);
if (!atSafePlace)
{
g_pDebugger->IncThreadsAtUnsafePlaces();
}
// Always pulse the GC mode. This will allow an async break to complete even if we have a patch
// at an unsafe place.
// If we are at an unsafe place, then we can't do a GC.
thread->PulseGCMode();
if (!atSafePlace)
{
g_pDebugger->DecThreadsAtUnsafePlaces();
}
}
The hardware exception being handled is actually a debugger breakpoint.
Hmm. I am not quite sure what breakpoint could it be...
I think debugger places its own internal breakpoints in some cases. This seems to be a thread start breakpoint.
This is the managed part of the stack of the thread:
(lldb) clrstack
OS Thread Id: 0x112683 (119)
Child SP IP Call Site
0000000349802C60 00000002888F1358 System.Threading.Thread.StartCallback()
0000000349802E08 0000000102016b48 [DebuggerU2MCatchHandlerFrame: 0000000349802e08]
As you can see the IncThreadsAtUnsafePlaces
is called conditionally - only if the g_pDebugger->IsThreadAtSafePlace(thread)
returned false. The question is why it returned false in this case and whether it was correct. The breakpoint is set at the first instruction of the System.Threading.Thread.StartCallback()
, that means it is in its prolog.
Description
Our process routinely seems to freeze. There are some dispatcher threads which may be interfering with the thread suspension logic but in this particular case the only one seems to be running CoreCLR code.
Reproduction Steps
No idea so far, happens randomly but consistently in first few minutes of the process run.
Expected behavior
No process freeze.
Actual behavior
Process becomes locked and non-responsive.
Regression?
No response
Known Workarounds
No response
Configuration
.NET 6.0.3, Xamarin.Mac Preview 14 (happend on Preview 13 too)
Other information
Thread sample: https://gist.github.com/filipnavara/afdd426069dfa00f18efa5b8508dd34c