dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
15.26k stars 4.73k forks source link

GC deadlock/livelock? #66759

Closed filipnavara closed 2 years ago

filipnavara commented 2 years ago

Description

Our process routinely seems to freeze. There are some dispatcher threads which may be interfering with the thread suspension logic but in this particular case the only one seems to be running CoreCLR code.

Reproduction Steps

No idea so far, happens randomly but consistently in first few minutes of the process run.

Expected behavior

No process freeze.

Actual behavior

Process becomes locked and non-responsive.

Regression?

No response

Known Workarounds

No response

Configuration

.NET 6.0.3, Xamarin.Mac Preview 14 (happend on Preview 13 too)

Other information

Thread sample: https://gist.github.com/filipnavara/afdd426069dfa00f18efa5b8508dd34c

filipnavara commented 2 years ago

Dump: https://1drv.ms/u/s!AgaMhbe7wFq8m41s3OXKH34CLOoa7Q?e=q931lK

cc @janvorli

dotnet-issue-labeler[bot] commented 2 years ago

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

janvorli commented 2 years ago

@filipnavara I can see that there is managed code executed on a worker queue thread. Apple disallows sending async signals to such threads, so we cannot interrupt these and GC has to rely on those threads to sync with GC explicitly. If that doesn't happen, the runtime suspension would hang. I can see two threads running managed code on a worker queue thread in your thread sample. One of them is already suspended, but the other (Thread_350741) is not. Its call stack doesn't have managed code, but to hit the problem, it would be sufficient if it was ever executing managed code before. I should be able to see that from the dump once I get to it.

We have discussed this problem with the Apple threading team recently and they have came up with a solution to the problem. But that will require us to change the way we suspend runtime and it will work only on new macOS after they make a specific change.

Would there be a way for your app to not to run managed code on a worker queue thread?

Anyways, I'll investigate the dump you've shared later today to confirm that it is really what's causing the trouble here.

filipnavara commented 2 years ago

There are several things executed on the worker threads. One of them is some background indexing scheduled through NSBackgroundActivityScheduler which is thread Thread_349538 that got suspended properly. We have alternative code path on Windows to run that. The other thing are callbacks from WebView and NSUrlSession. These callbacks are scheduled by Apple APIs and there doesn't seem to be a simple workaround for that.

filipnavara commented 2 years ago

Btw, I still have the process running so if there's anything else that I can collect I'd be happy to do so.

jkotas commented 2 years ago

Its call stack doesn't have managed code, but to hit the problem, it would be sufficient if it was ever executing managed code before.

Why is that? If the thread is not executing managed code anymore, it should be in preemptive mode and considered to be suspended. The thread suspension may take a little bit longer in this case, but it should not be causing the process to freeze.

filipnavara commented 2 years ago

it should be in preemptive mode and considered to be suspended

The preemptive mode doesn't work on dispatch/worker threads. Ref: https://github.com/dotnet/runtime/pull/63393

That said, I would still expect it not to freeze on that condition. I am well aware that I need to depend on coop suspension on the dispatch threads [when managed code is on the stack] and that it may cause the GC pauses to take longer than usual.

(Btw, in the state hit due to #63393 it ends up in busy loop and takes 100% CPU time. That's quite suboptimal even in the case where it eventually unlocks.)

filipnavara commented 2 years ago

I don't know what triggered it (SDK update, workload update?) but now I am literally unable to run the app for more than a minute until I get this freeze. I disabled all code that knowingly called APIs on dispatcher threads (NSUrlSession, NSBackgroundActivityScheduler) and I still hit one thread like this eventually:

    1737 Thread_643010
      1737 start_wqthread  (in libsystem_pthread.dylib) + 8  [0x19237c018]
        1737 _pthread_wqthread  (in libsystem_pthread.dylib) + 364  [0x19237d350]
          1737 __workq_kernreturn  (in libsystem_kernel.dylib) + 8  [0x192347604]
janvorli commented 2 years ago

Why is that? If the thread is not executing managed code anymore, it should be in preemptive mode and considered to be suspended. The thread suspension may take a little bit longer in this case, but it should not be causing the process to freeze.

The problem is that until the signal is delivered, we cannot tell whether the thread is executing native or managed code. And since we cannot deliver the signal, all we know is that the thread is in our managed thread list, so we think we need to suspend it.

jkotas commented 2 years ago

The problem is that until the signal is delivered, we cannot tell whether the thread is executing native or managed code

That sounds like a bug. Once we notice that the thread is preemptive, we can assume that it is suspended - we do not have to wait for the signal to be actually delivered.

janvorli commented 2 years ago

Maybe we actually do that and I may be mistaken. I'll take a look to refresh my memory on what we do there.

janvorli commented 2 years ago

Seems you are right: https://github.com/dotnet/runtime/blob/048da75f59c975e9bf3c346bee4ce4a9edd11e78/src/coreclr/vm/threadsuspend.cpp#L3372-L3378

filipnavara commented 2 years ago

I looked at the code a bit more in detail and realized that there's one important thing I omitted in the description. The process was started from VS Code with attached debugger. There could be something fishy in the code path with the attached debugger:

https://github.com/dotnet/runtime/blob/048da75f59c975e9bf3c346bee4ce4a9edd11e78/src/coreclr/vm/threadsuspend.cpp#L5725

I tried to run the process again with no debugger attached and it seemed to behave much better. I didn't have time to do any extensive testing though.

janvorli commented 2 years ago

@filipnavara when you say it behaves much better, do you actually mean that it still hangs sometimes?

I've noticed that in the dump you've shared, many threads have TS_DebugSuspendPending state, but the actual thread suspension was triggered by GC. And none of the threads has TS_GCSuspendPending. Hopefully I am decoding the thread state correctly. So I wonder if that could be some bug in the thread suspension leaving this state uncleared and later influencing the suspension process.

(lldb) clrthreads
ThreadCount:      117
UnstartedThread:  0
BackgroundThread: 60
PendingThread:    0
DeadThread:       8
Hosted Runtime:   no
                                                                                                            Lock
 DBG   ID     OSID ThreadOBJ           State GC Mode     GC Alloc Context                  Domain           Count Apt Exception
   1    1    55378 0000000120012000    20028 Preemptive  0000000142B08148:0000000142B08158 000000012080BA00 -00001 Ukn
   8    2    5539d 000000012001EC00    21228 Preemptive  0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn (Finalizer)
   9    3    55428 0000000122035E00    21228 Preemptive  0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn <Invalid Object> (0000000142372518)
  10    4    55495 000000012085BA00  10a0228 Preemptive  0000000142916118:0000000142917D20 000000012080BA00 -00001 Ukn (Threadpool Worker)
  11    5    55496 0000000122045400  2021228 Preemptive  0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn
XXXX    7        0 0000000122058000    31820 Preemptive  0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn
XXXX    8        0 0000000122388600  1031820 Preemptive  0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn (Threadpool Worker)
  15    9    55566 0000000161AE7200  10a1228 Preemptive  0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn (Threadpool Worker)
  16   10    55569 0000000108838200  10a1228 Preemptive  0000000142AD0720:0000000142AD0FA0 000000012080BA00 -00001 Ukn (Threadpool Worker)
  17   11    5556a 00000001204D7000    a1028 Preemptive  0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn
XXXX   12        0 00000001182DFC00  1031820 Preemptive  0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn (Threadpool Worker)
XXXX   13        0 0000000161B24A00  1031820 Preemptive  0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn (Threadpool Worker)
XXXX   14        0 0000000108816800  1031820 Preemptive  0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn (Threadpool Worker)
  18   15    555a3 00000001182E3600  10a1228 Preemptive  0000000142AB9750:0000000142ABAFA0 000000012080BA00 -00001 Ukn (Threadpool Worker)
XXXX   16        0 000000012186E600  1031820 Preemptive  0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn (Threadpool Worker)
XXXX   17        0 0000000120550400  1031820 Preemptive  0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn (Threadpool Worker)
  19   18    555af 0000000161B68A00  2021228 Preemptive  0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn
XXXX   19        0 00000001210D0800  1031820 Preemptive  0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn (Threadpool Worker)
  20   20    555d4 00000001183D7200    21228 Preemptive  0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn
  21   21    555e8 0000000161C8D400    a1228 Preemptive  0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn
XXXX    6    55551 0000000108938200    20228 Preemptive  0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn
  14   22    55562 00000001219AA800    a0228 Preemptive  0000000142AD5950:0000000142AD6FA0 000000012080BA00 -00001 Ukn
  22   23    55975 0000000120788A00  3021228 Preemptive  0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn (Threadpool Worker)
  23   24    55976 0000000108941400  3021228 Preemptive  0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn (Threadpool Worker)
  24   25    55978 00000001219B9600  3021228 Preemptive  0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn (Threadpool Worker)
  25   26    5598b 00000001207B8600  3021228 Preemptive  0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn (Threadpool Worker)
  26   27    5598c 000000012006E400  3021228 Preemptive  0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn (Threadpool Worker)
XXXX   28    559d7 000000012278B400    20228 Preemptive  0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn
  27   29    559f0 000000012276B800    20228 Preemptive  0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn
  28   30    559f2 0000000120784400  3021228 Preemptive  0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn (Threadpool Worker)
  29   31    559f3 0000000118729400  3021228 Preemptive  0000000142A9B0D8:0000000142A9BD30 000000012080BA00 -00001 Ukn (Threadpool Worker)
XXXX   32    55991 0000000122733200    20228 Preemptive  0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn
  30   33    559f6 0000000120369400  2021228 Preemptive  0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn
  31   34    559f7 0000000118232400    a1228 Preemptive  000000014294E360:000000014294FE00 000000012080BA00 -00001 Ukn
  32   35    559ff 0000000161F1BE00  2021028 Preemptive  0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn
  39   36    55a06 000000010894CC00  2021028 Preemptive  0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn
  38   37    55a05 0000000118727400  2021028 Preemptive  0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn
  34   38    55a01 0000000121896200  2021028 Preemptive  0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn
  35   39    55a02 00000001208A7800  2021028 Preemptive  0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn <Invalid Object> (000000014266ac48)
  36   40    55a03 00000001219B5400  2021028 Preemptive  0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn
  37   41    55a04 000000011896EA00  2021028 Preemptive  0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn
  33   42    55a00 0000000161D34600  2021028 Preemptive  0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn
  40   43    55a08 00000001219CD000  2021028 Preemptive  0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn
  41   44    55a09 000000011864A000  2021028 Preemptive  0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn
  42   45    55a0a 0000000118982C00  2021028 Preemptive  0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn
  44   46    55a0c 00000001219DC200  2021028 Preemptive  0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn
  43   47    55a0b 0000000166090C00  2021028 Preemptive  0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn
  45   48    55a0d 0000000118980E00  2021028 Preemptive  0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn
  46   49    55a0e 0000000121195600  2021028 Preemptive  00000001429BDD90:00000001429BEA78 000000012080BA00 -00001 Ukn
  47   50    55a0f 0000000118957C00  2021028 Preemptive  0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn
  48   51    55a10 0000000108949800  2021028 Preemptive  0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn
  49   53    55a11 00000001211B1800  2021028 Preemptive  0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn
  50   54    55a12 000000012088D200  2021028 Preemptive  0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn <Invalid Object> (0000000142732488)
  51   52    55a13 0000000108959000  2021028 Preemptive  0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn
  52   55    55a14 0000000161C88C00  2021028 Preemptive  0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn
  54   56    55a17 0000000118999800  2021028 Preemptive  0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn
  55   57    55a19 00000001211AF400  2021028 Preemptive  0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn <Invalid Object> (0000000142665878)
  56   58    55a1a 00000001186B3200    21028 Preemptive  0000000142B10178:0000000142B10190 000000012080BA00 -00001 Ukn (GC)
  57   59    55a1b 00000001219CA000    21028 Preemptive  00000001429D6388:00000001429D6AE0 000000012080BA00 -00001 Ukn
  58   60    55a1d 0000000108959800  2021028 Preemptive  0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn
  59   61    55a1e 0000000118961000  2021028 Preemptive  0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn
  60   62    55a1f 000000010895D200  2021028 Preemptive  0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn
  61   63    55a21 000000010895E800  2021028 Preemptive  0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn
  62   64    55a22 0000000118639A00  2021028 Preemptive  0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn
  63   65    55a25 000000011899D000  2021028 Preemptive  0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn
  64   66    55a26 0000000108961A00  2021028 Preemptive  0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn
  65   67    55a27 000000011899D800    a1028 Preemptive  0000000142B09780:0000000142B0A158 000000012080BA00 -00001 Ukn
  66   68    55a28 00000001189A1600  2021028 Preemptive  0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn
  67   69    55a29 00000001189A0C00    21028 Preemptive  0000000142B0B408:0000000142B0C158 000000012080BA00 -00001 Ukn
  68   70    55a2d 00000001189A2600  3021228 Preemptive  0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn (Threadpool Worker)
  69   71    55a2e 000000011899F200  3021228 Preemptive  0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn (Threadpool Worker)
  70   72    55a31 00000001219E1400  3021228 Preemptive  0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn (Threadpool Worker)
  71   73    55a32 0000000161F6F000  2021028 Preemptive  0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn
  72   74    55a33 0000000108962200  2021028 Preemptive  0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn
  73   75    55a34 0000000108963C00  2021028 Preemptive  0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn
  74   76    55a35 00000001186CDA00  2021028 Preemptive  0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn
  75   77    55a36 000000011870C200  2021028 Preemptive  0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn
  76   78    55a37 000000010896AE00  2021028 Preemptive  0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn
  77   79    55a38 0000000118840A00  2021028 Preemptive  0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn
  78   80    55a39 0000000118734400  2021028 Preemptive  0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn
  79   81    55a3a 00000001189A4800  2021028 Preemptive  0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn
  80   82    55a3b 0000000166036C00  2021028 Preemptive  0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn
  81   83    55a3c 0000000165873400  2021028 Preemptive  0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn
  82   84    55a3d 0000000118735A00  2021028 Preemptive  0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn
  83   85    55a48 0000000165922200  3021228 Preemptive  0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn (Threadpool Worker)
XXXX   86    55990 00000001189A7400    20228 Preemptive  0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn
XXXX   87    55a16 00000001219DA600    20228 Preemptive  0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn
  53   88    55a15 0000000161D55000    20228 Preemptive  0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn
XXXX   89    55993 000000011898D000    20228 Preemptive  0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn
  85   90    55a6e 0000000108965400  10a1228 Preemptive  0000000142A9C7F8:0000000142A9DD30 000000012080BA00 -00001 Ukn (Threadpool Worker)
  86   91    55a70 0000000108970200  10a1228 Preemptive  0000000142B01AB8:0000000142B02158 000000012080BA00 -00001 Ukn (Threadpool Worker)
  87   92    55a97 0000000161FD2E00  10a1228 Preemptive  0000000142AB68B0:0000000142AB7F70 000000012080BA00 -00001 Ukn (Threadpool Worker)
  88   93    55af2 0000000161FA4E00    21228 Preemptive  0000000142A4E3B8:0000000142A4F568 000000012080BA00 -00001 Ukn
  90   94    55afe 000000016588CC00    21228 Preemptive  0000000142A5C810:0000000142A5D598 000000012080BA00 -00001 Ukn
  89   95    55afd 00000001219F1E00  10a1228 Preemptive  0000000142ADADF0:0000000142ADAFA0 000000012080BA00 -00001 Ukn (Threadpool Worker)
  91   96    55b04 00000001189ACC00  1121228 Preemptive  0000000142AF0120:0000000142AF0120 000000012080BA00 -00001 Ukn (Threadpool Worker)
  92   97    55b05 0000000108978800  10a1228 Preemptive  000000014291DC78:000000014291DD20 000000012080BA00 -00001 Ukn (Threadpool Worker)
  93   98    55b06 00000001211D9E00  10a1228 Preemptive  0000000142915788:0000000142915D20 000000012080BA00 -00001 Ukn (Threadpool Worker)
  94   99    55b07 0000000108966600  10a1228 Preemptive  00000001429B20C8:00000001429B2A78 000000012080BA00 -00001 Ukn (Threadpool Worker)
  95  100    55bbb 00000001660E1600  1021228 Preemptive  0000000142988770:00000001429889D0 000000012080BA00 -00001 Ukn (Threadpool Worker)
  96  101    55bbd 0000000121A02400  10a1228 Preemptive  0000000142AFF3C0:0000000142B00158 000000012080BA00 -00001 Ukn (Threadpool Worker)
  98  102    55be2 0000000165905600  10a1228 Preemptive  0000000142ADB158:0000000142ADCFA0 000000012080BA00 -00001 Ukn (Threadpool Worker)
  97  103    55be1 0000000118634600  10a1228 Preemptive  000000014293F170:000000014293FE00 000000012080BA00 -00001 Ukn (Threadpool Worker)
 100  104    55c02 000000016596C000  10a1228 Preemptive  00000001429D9050:00000001429DAAE0 000000012080BA00 -00001 Ukn (Threadpool Worker) <Invalid Object> (00000001429d8af8) (nested exceptions)
  99  105    55bfa 0000000166022000  10a1228 Preemptive  0000000142AA2978:0000000142AA3F70 000000012080BA00 -00001 Ukn (Threadpool Worker)
 101  106    55c0c 0000000161F92200  10a1228 Preemptive  0000000142AE5B70:0000000142AE6968 000000012080BA00 -00001 Ukn (Threadpool Worker)
 105  107    55c1c 0000000161F94800  10a1228 Preemptive  0000000142AA64B0:0000000142AA7F70 000000012080BA00 -00001 Ukn (Threadpool Worker)
 106  108    55c1d 0000000165914A00  10a1228 Preemptive  0000000142AD3408:0000000142AD4FA0 000000012080BA00 -00001 Ukn (Threadpool Worker)
 102  109    55c15 00000001211DCE00  10a1228 Preemptive  0000000142AE6A10:0000000142AE8968 000000012080BA00 -00001 Ukn (Threadpool Worker)
 103  110    55c16 0000000165915200  10a1228 Preemptive  0000000142AE8A10:0000000142AEA968 000000012080BA00 -00001 Ukn (Threadpool Worker)
 104  111    55c17 0000000161F95000  10a1228 Preemptive  0000000142AF01C8:0000000142AF2120 000000012080BA00 -00001 Ukn (Threadpool Worker)
 107  112    55c25 00000001200B8600  10a1228 Preemptive  00000001429DAFA8:00000001429DCAE0 000000012080BA00 -00001 Ukn (Threadpool Worker)
 108  113    55c26 00000001189C1C00    21228 Preemptive  0000000142AFE038:0000000142AFE158 000000012080BA00 -00001 Ukn
 109  114    55c2e 00000001659BAC00    21228 Preemptive  0000000142A7C538:0000000142A7D620 000000012080BA00 -00001 Ukn
 110  115    591ea 0000000161F19A00    21220 Preemptive  0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn
 111  116    591ec 00000001211EE600    21220 Preemptive  0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn
   4  117    55393 000000011877CA00    20220 Preemptive  0000000000000000:0000000000000000 000000012080BA00 -00001 Ukn
filipnavara commented 2 years ago

when you say it behaves much better, do you actually mean that it still hangs sometimes?

I had just couple of minutes to try so I cannot tell for sure one way or another. It was not behaving jerky and I didn't hit a hang outside of debugger yet.

I've already been at this for couple of hours so I cannot tell for certain what I did before the freeze. The application was definitely freezing a lot under the debugger so I may have tried to pause it after the livelock occured.

janvorli commented 2 years ago

From what I can see in the dump, it was actually not stuck waiting for the runtime to suspend. It was just starting the suspension here: https://github.com/dotnet/runtime/blob/048da75f59c975e9bf3c346bee4ce4a9edd11e78/src/coreclr/vm/threadsuspend.cpp#L3308

So I wonder what's causing the hang. I can see the thread 110 also wanting to suspend runtime in ep_rt_coreclr_sample_profiler_write_sampling_event_for_threads. I wonder if there is some kind of suspension train happening between the GC suspension and the profiling suspension. As can be seen above, there is over 100 threads running so maybe it takes the profiling a long time to walk all the stacks of the threads and then a time to take a next sample comes right after or something along these lines. The sampling rate seems to be set to 1ms.

I also wonder why is the profiler sampling running. sampling_thread

@filipnavara how did you get the dump? Have you used the createdump tool? It seems that the dump is missing some memory areas and SOS cannot find managed function names in most cases, which is usually due to the dump taken by the OS (and maybe lldb).

filipnavara commented 2 years ago

The dump was created with createdump manually. The sampling profiler was likely a futile attempt of me to diagnose the already stuck process. I'll try to reproduce the same thing without all these extra tools.

filipnavara commented 2 years ago

Here's a new one without the sampling profiler (at least I didn't invoke it externally) and with createdump --full:

sample.txt coredump

I run it to the point where it appeared freezing. Then I hit "pause" in the debugger which never finished. After that I manually run sample and createdump.

janvorli commented 2 years ago

Great, thank you! I'll take a look.

janvorli commented 2 years ago

@filipnavara the dump looks similar to the previous one, except for the sampling profiler stuff not being there (as expected) and the managed method names being shown correctly via SOS clrstack now. It still has many threads with the TS_DebugSuspendPending state. So it is remotely possible it is some problem related to the debugger thread suspension. Could you please try to repro the problem without the debugger? The coredump should be able to take a dump of a live process. This would help me to find out where to focus the investigations.

It would also be great to take two dumps one after another so that I can see whether the thread that's suspending for GC is making any progress. There is one more thing that's the same between those two dumps - in both cases the thread that's suspending for GC is in the libsystem_kernel.dylib task_threads called from the FlushProcessWriteBuffers, which is kind of a strange coincidence. I wonder if it is possible that this OS call ends up waiting for something forever for some reason.

filipnavara commented 2 years ago

So it is remotely possible it is some problem related to the debugger thread suspension. Could you please try to repro the problem without the debugger?

I will try but so far that has been unsuccessful.

It would also be great to take two dumps one after another so that I can see whether the thread that's suspending for GC is making any progress.

Oh, I still have the process from morning running so I can do more dumps. One of the threads is stuck in a busy loop.

filipnavara commented 2 years ago

Two more dumps of the process from the morning: 1 2

filipnavara commented 2 years ago

I tried to attach with LLDB and put couple of breakpoints but none of them seem to be hit 🤷🏻

1: name = 'RestartEE', locations = 3, resolved = 3, hit count = 0
  1.1: where = libcoreclr.dylib`GCToEEInterface::RestartEE(bool), address = 0x0000000104348dfc, resolved, hit count = 0 
  1.2: where = libcoreclr.dylib`ThreadSuspend::RestartEE(int, int), address = 0x00000001043dd494, resolved, hit count = 0 
  1.3: where = libcoreclr.dylib`standalone::GCToEEInterface::RestartEE(bool), address = 0x00000001043f8948, resolved, hit count = 0 

2: name = 'CORDebuggerAttached', locations = 0 (pending)

3: name = 'ThreadsAtUnsafePlaces', locations = 1, resolved = 1, hit count = 0
  3.1: where = libcoreclr.dylib`Debugger::ThreadsAtUnsafePlaces(), address = 0x00000001044dc1f0, resolved, hit count = 0 

4: name = 'FlushProcessWriteBuffers', locations = 5, resolved = 5, hit count = 0
  4.1: where = libclrjit.dylib`FlushProcessWriteBuffers, address = 0x0000000103a4df18, resolved, hit count = 0 
  4.2: where = libcoreclr.dylib`FlushProcessWriteBuffers, address = 0x00000001041e4228, resolved, hit count = 0 
  4.3: where = libcoreclr.dylib`GCToOSInterface::FlushProcessWriteBuffers(), address = 0x000000010434c5c0, resolved, hit count = 0 
  4.4: where = libdbgshim.dylib`FlushProcessWriteBuffers, address = 0x0000000103763160, resolved, hit count = 0 
  4.5: where = libmscordaccore.dylib`FlushProcessWriteBuffers, address = 0x0000000103f068cc, resolved, hit count = 0 

5: name = 'ThreadSuspend::SuspendRuntime', locations = 1, resolved = 1, hit count = 0
  5.1: where = libcoreclr.dylib`ThreadSuspend::SuspendRuntime(ThreadSuspend::SUSPEND_REASON), address = 0x00000001043dbd58, resolved, hit count = 0 

6: name = 'ThreadSuspend::SuspendEE', locations = 1, resolved = 1, hit count = 0
  6.1: where = libcoreclr.dylib`ThreadSuspend::SuspendEE(ThreadSuspend::SUSPEND_REASON), address = 0x00000001043dd738, resolved, hit count = 0 

(it could be LLDB misbehaving so I would not necessarily draw conclusions from it)

filipnavara commented 2 years ago

Nevertheless, there is something interesting in the LLDB stack trace:

(lldb) thread select 62
* thread #62, name = '.NET ThreadPool Worker'
    frame #0: 0x0000000192345a5c libsystem_kernel.dylib`swtch_pri + 8
libsystem_kernel.dylib`swtch_pri:
->  0x192345a5c <+8>: ret    

libsystem_kernel.dylib`swtch:
    0x192345a60 <+0>: mov    x16, #-0x3c
    0x192345a64 <+4>: svc    #0x80
    0x192345a68 <+8>: ret    
(lldb) bt
* thread #62, name = '.NET ThreadPool Worker'
  * frame #0: 0x0000000192345a5c libsystem_kernel.dylib`swtch_pri + 8
    frame #1: 0x000000019237e0e4 libsystem_pthread.dylib`cthread_yield + 20
    frame #2: 0x00000001041e7ec8 libcoreclr.dylib`MachSetThreadContext + 52
    frame #3: 0x00000001041e7e60 libcoreclr.dylib`PAL_DispatchException + 292
    frame #4: 0x00000001041e7a54 libcoreclr.dylib`PAL_DispatchExceptionWrapper + 16
    frame #5: 0x00000001044dc1f0 libcoreclr.dylib`Debugger::SetModuleDefaultJMCStatus(Module*, bool) + 284
    frame #6: 0x0000000104348de4 libcoreclr.dylib`GCToEEInterface::SuspendEE(SUSPEND_REASON) + 48
    frame #7: 0x000000010446bc28 libcoreclr.dylib`WKS::GCHeap::GarbageCollectGeneration(unsigned int, gc_reason) + 420
    frame #8: 0x000000010446dde0 libcoreclr.dylib`WKS::gc_heap::try_allocate_more_space(alloc_context*, unsigned long, unsigned int, int) + 792
    frame #9: 0x0000000104490b58 libcoreclr.dylib`WKS::GCHeap::Alloc(gc_alloc_context*, unsigned long, unsigned int) + 164
    frame #10: 0x000000010434df70 libcoreclr.dylib`AllocateObject(MethodTable*) + 216
    frame #11: 0x0000000104369bdc libcoreclr.dylib`JIT_New(CORINFO_CLASS_STRUCT_*) + 124

... and after putting few more breakpoints I started getting this:

* thread #2, stop reason = EXC_BAD_ACCESS (code=2, address=0x34c19bef8)
  * frame #0: 0x00000001923981b8 libsystem_platform.dylib`_platform_memmove + 168
    frame #1: 0x00000001041e8428 libcoreclr.dylib`SEHExceptionThread(void*) + 1372
    frame #2: 0x0000000192381240 libsystem_pthread.dylib`_pthread_start + 148

and the process died :/

janvorli commented 2 years ago

The crash in memmove migh be related to a stack overflow handling issue I've fixed few days ago. Just a wild guess though, maybe fiddling with the debugging resulted in stack overflow somewhere somehow.

Anyways, the two extra dumps have revealed that the thread that tries to suspend the runtime is actually progressing. In one case, it was running in SuspendRuntime, in other in RestartEE:

case 1:

(lldb) bt
* thread #62, stop reason = signal SIGSTOP
  * frame #0: 0x0000000192345954 libsystem_kernel.dylib`mach_msg_trap + 8
    frame #1: 0x0000000192345d00 libsystem_kernel.dylib`mach_msg + 76
    frame #2: 0x000000019234e23c libsystem_kernel.dylib`thread_get_state + 140
    frame #3: 0x00000001923558a0 libsystem_kernel.dylib`thread_get_register_pointer_values + 132
    frame #4: 0x00000001041e4294 libcoreclr.dylib`::FlushProcessWriteBuffers() at process.cpp:3569:23 [opt]
    frame #5: 0x00000001043dbe6c libcoreclr.dylib`ThreadSuspend::SuspendRuntime(reason=SUSPEND_FOR_GC) at threadsuspend.cpp:3297:5 [opt]
    frame #6: 0x00000001043dd858 libcoreclr.dylib`ThreadSuspend::SuspendEE(reason=SUSPEND_FOR_GC) at threadsuspend.cpp:5702:5 [opt]
    frame #7: 0x0000000104348de4 libcoreclr.dylib`GCToEEInterface::SuspendEE(reason=<unavailable>) at gcenv.ee.cpp:26:5 [opt]
    frame #8: 0x000000010446bc28 libcoreclr.dylib`WKS::GCHeap::GarbageCollectGeneration(this=<unavailable>, gen=0, reason=reason_alloc_soh) at gc.cpp:44676:9 [opt]

case 2:

(lldb) bt
* thread #62, stop reason = signal SIGSTOP
  * frame #0: 0x0000000192345954 libsystem_kernel.dylib`mach_msg_trap + 8
    frame #1: 0x0000000192345d00 libsystem_kernel.dylib`mach_msg + 76
    frame #2: 0x000000019234e23c libsystem_kernel.dylib`thread_get_state + 140
    frame #3: 0x00000001923558a0 libsystem_kernel.dylib`thread_get_register_pointer_values + 132
    frame #4: 0x00000001041e4294 libcoreclr.dylib`::FlushProcessWriteBuffers() at process.cpp:3569:23 [opt]
    frame #5: 0x00000001043dd4e4 libcoreclr.dylib`ThreadSuspend::RestartEE(bFinishedGC=NO, SuspendSucceded=NO) at threadsuspend.cpp:5523:5 [opt]
    frame #6: 0x00000001043dd890 libcoreclr.dylib`ThreadSuspend::SuspendEE(reason=SUSPEND_FOR_GC) at threadsuspend.cpp:5736:9 [opt]
    frame #7: 0x0000000104348de4 libcoreclr.dylib`GCToEEInterface::SuspendEE(reason=<unavailable>) at gcenv.ee.cpp:26:5 [opt]
    frame #8: 0x000000010446bc28 libcoreclr.dylib`WKS::GCHeap::GarbageCollectGeneration(this=<unavailable>, gen=0, reason=reason_alloc_soh) at gc.cpp:44676:9 [opt]

So it seems it is spinning in the ThreadSuspend::SuspendEE in this loop: https://github.com/dotnet/runtime/blob/4d39501439d438b9c749c7d2bca5b3c1a60c823b/src/coreclr/vm/threadsuspend.cpp#L5661-L5768

janvorli commented 2 years ago

I keep investigating ...

janvorli commented 2 years ago

The g_pDebugInterface->ThreadsAtUnsafePlaces() is returning true because the underlying Debugger::m_threadsAtUnsafePlaces is 2. It is possible that there is a bug causing an imbalance in this count (we maintain the count by incrementing / decrementing it in the debugger related code) or that there is a bug in evaluation of the unsafe places in the code called from Debugger::IsThreadAtSafePlace.

filipnavara commented 2 years ago

Getting stuck in that loop seems to be in line with the fact that I can easily reproduce it with a debugger but not without one. That's also why I set the specific breakpoints in LLDB. I assume the process was already somehow broken so the debugger didn't work correctly and did not hit them. I'll check if there's some way to provide more logging around m_threadsAtUnsafePlaces.

janvorli commented 2 years ago

I have also found that there was a strange exception attached to one of the threads:

(lldb) pe
Exception object: 0000000131c8db28
Exception type:   System.Security.SecurityException
Message:          ECall methods must be packaged into a system module.
InnerException:   <none>
StackTrace (generated):
<none>
StackTraceString: <none>
HResult: 8013150a

@jkotas do you have any idea what could this mean? I am not sure how a FCall could occur in something else than S.P.C.

janvorli commented 2 years ago

@filipnavara we could add stresslog logging at the places where the m_threadsAtUnsafePlaces is incremented and decremented (we would need to put it at the call sites of the Inc/DecThreadsAtUnsafePlaces so that we can log the thread pointer). Like this (at some places, the variable holding the thread pointer is thread and at some pThread

STRESS_LOG1(LF_CORDB, LL_EVERYTHING, "Calling IncThreadsAtUnsafePlaces for thread %p\n", thread);
STRESS_LOG1(LF_CORDB, LL_EVERYTHING, "Calling DecThreadsAtUnsafePlaces for thread %p\n", thread);

Would you like me to compile a patched version of libcoreclr.dylib for you or would you do that yourself?

filipnavara commented 2 years ago

I can compile it myself.

janvorli commented 2 years ago

Great! Once you have the patched version, then before running your app, couple of COMPlus variables need to be set to enable the stresslog. This is what I usually do - it sets the maximum size of the stresslog and enables it for the LF_CORDBG facility: COMPlus_StressLog=1 COMPlus_LogLevel=A COMPlus_LogFacility=200 COMPlus_StressLogSize=2000000 COMPlus_TotalStressLogSize=40000000

janvorli commented 2 years ago

This ends up recording the stresslog events in memory and they are later dumped into the crash dump and I can dump them using the dumplog SOS command.

jkotas commented 2 years ago

@jkotas do you have any idea what could this mean? I am not sure how a FCall could occur in something else than S.P.C.

The built-in COM interop uses [MethodImpl(MethodImplOptions.InternalCall)] extern methods. The runtime fills in the implementation on Windows. Built-in COM interop does not exist on non-Windows and you can end up with this exception if you try to execute the Windows-specific COM interop code. I guess the app has Windows-specific code that uses COM interop that is wrapped inside try/catch.

For reference, this is an example that will generate this exception:

using System.Runtime.CompilerServices;

foo();

[MethodImpl(MethodImplOptions.InternalCall)]
static extern void foo();
filipnavara commented 2 years ago

I guess the app has Windows-specific code that uses COM interop that is wrapped inside try/catch.

I will check on that. We do have COM code but it should be compiled-out on the macOS builds. It's possible that we missed some place though.

filipnavara commented 2 years ago

Sorry, had a bit of a pain with Xcode and Debug/Release builds (the Debug ones do some additional validation that is extremely slow in combination with how Xamarin does binding calls).

Hopefully the dump should contain the stress logs. This time I didn't even pause it in the debugger [I think].

sample.48456.txt coredump.48456.zip libmscor.zip

I left the process running just in case there's something that can be inspected with LLDB.

janvorli commented 2 years ago

Thank you, I am taking a look.

janvorli commented 2 years ago

@filipnavara I've forgotten that I'll need the libcoreclr.dylib, libcoreclr.dylib.dwarf and libmscordaccore.dylib.dwarf files that you've rebuilt to enable SOS work with the dump. Could you please share those with me? You can also pack all the .dylib and .dylib.dwarf files.

filipnavara commented 2 years ago

Ah, I expected that but I packed just couple of them in the post above 😅 I will pack the rest and upload it in a moment.

filipnavara commented 2 years ago

runtime-files.zip

janvorli commented 2 years ago

It is interesting. The m_threadsAtUnsafePlaces = 0x00000004, but dumping all the threads reported in the stresslog, there is only one thread that has inc and doesn't have dec - the 000000033DB1AC00 .

Are you sure you've put the logging at all places where we call the IncThreadsAtUnsafePlaces and DecThreadsAtUnsafePlaces? I can see there are 5 such places for each.

Based on the date stamps in the log, it seems the log is not missing any records.

janvorli commented 2 years ago

But even looking at this single thread, I think it revealed the root of issue. There is a hardware exception being processed on this thread and the debugger machinery hook for the exception ended up waiting for the runtime suspension event. The problem is that the IncThreadsAtUnsafePlaces was called at this thread, but the corresponding DecThreadsAtUnsafePlaces was not called yet. So the runtime suspension can never succeed, since there is at least 1 in the counter of threads at unsafe places. Hence the deadlock.

(lldb) bt
* thread #119, stop reason = signal SIGSTOP
  * frame #0: 0x00000001923490c0 libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x0000000192381808 libsystem_pthread.dylib`_pthread_cond_wait + 1228
    frame #2: 0x0000000101d5be1c libcoreclr.dylib`CorUnix::CPalSynchronizationManager::ThreadNativeWait(ptnwdNativeWaitData=0x000000033db1ca38, dwTimeout=4294967295, ptw                             rWakeupReason=0x00000003498012fc, pdwSignaledObject=0x00000003498012f8) at synchmanager.cpp:478:28 [opt]
    frame #3: 0x0000000101d5ba8c libcoreclr.dylib`CorUnix::CPalSynchronizationManager::BlockThread(this=0x0000000125008e00, pthrCurrent=0x000000033db1c800, dwTimeout=429                             4967295, fAlertable=false, fIsSleep=<unavailable>, ptwrWakeupReason=0x0000000349801388, pdwSignaledObject=0x000000034980138c) at synchmanager.cpp:301:22 [opt]
    frame #4: 0x0000000101d5fba0 libcoreclr.dylib`CorUnix::InternalWaitForMultipleObjectsEx(pThread=0x000000033db1c800, nCount=<unavailable>, lpHandles=<unavailable>, bW                             aitAll=<unavailable>, dwMilliseconds=<unavailable>, bAlertable=<unavailable>, bPrioritize=<unavailable>) at wait.cpp:637:45 [opt]
    frame #5: 0x0000000101d5fd80 libcoreclr.dylib`::WaitForSingleObjectEx(hHandle=0x00000000000018c8, dwMilliseconds=4294967295, bAlertable=NO) at wait.cpp:138:13 [opt]
    frame #6: 0x0000000101f55c2c libcoreclr.dylib`CLREventBase::WaitEx(unsigned int, WaitMode, PendingSync*) [inlined] CLREventWaitHelper2(handle=<unavailable>, dwMillis                             econds=<unavailable>, alertable=<unavailable>) at synch.cpp:372:12 [opt]
    frame #7: 0x0000000101f55c28 libcoreclr.dylib`CLREventBase::WaitEx(unsigned int, WaitMode, PendingSync*) [inlined] CLREventWaitHelper(this=<unavailable>, pParam=0x00                             00000349801588)::$_1::operator()(CLREventWaitHelper(void*, unsigned int, int)::Param*) const at synch.cpp:397:26 [opt]
    frame #8: 0x0000000101f55c20 libcoreclr.dylib`CLREventBase::WaitEx(unsigned int, WaitMode, PendingSync*) at synch.cpp:399:5 [opt]
    frame #9: 0x0000000101f55bc8 libcoreclr.dylib`CLREventBase::WaitEx(this=<unavailable>, dwMilliseconds=<unavailable>, mode=<unavailable>, syncState=<unavailable>) at                              synch.cpp:466:20 [opt]
    frame #10: 0x0000000101f5b8d0 libcoreclr.dylib`Thread::WaitSuspendEventsHelper(this=0x000000033db1ac00) at threadsuspend.cpp:4626:50 [opt]
    frame #11: 0x0000000101f5a538 libcoreclr.dylib`Thread::RareEnablePreemptiveGC() at threadsuspend.cpp:4663:13 [opt]
    frame #12: 0x0000000101f5a530 libcoreclr.dylib`Thread::RareEnablePreemptiveGC(this=0x000000033db1ac00) at threadsuspend.cpp:2414:13 [opt]
    frame #13: 0x0000000101f59e58 libcoreclr.dylib`Thread::RareDisablePreemptiveGC() [inlined] Thread::EnablePreemptiveGC(this=0x000000033db1ac00) at threads.h:2044:13 [                             opt]
    frame #14: 0x0000000101f59e3c libcoreclr.dylib`Thread::RareDisablePreemptiveGC(this=0x000000033db1ac00) at threadsuspend.cpp:2147:13 [opt]
    frame #15: 0x000000010204ccc8 libcoreclr.dylib`DebuggerController::DispatchPatchOrSingleStep(Thread*, _CONTEXT*, unsigned char const*, SCAN_TRIGGER) [inlined] GCHold                             erEEInterface<0, 1, 1>::LeaveInternal(this=<unavailable>) at debugger.h:260:29 [opt]
    frame #16: 0x000000010204cca0 libcoreclr.dylib`DebuggerController::DispatchPatchOrSingleStep(Thread*, _CONTEXT*, unsigned char const*, SCAN_TRIGGER) [inlined] GCHold                             erEEInterface<0, 1, 1>::~GCHolderEEInterface(this=<unavailable>) at debugger.h:303:15 [opt]
    frame #17: 0x000000010204cca0 libcoreclr.dylib`DebuggerController::DispatchPatchOrSingleStep(Thread*, _CONTEXT*, unsigned char const*, SCAN_TRIGGER) [inlined] GCHold                             erEEInterface<0, 1, 1>::~GCHolderEEInterface(this=<unavailable>) at debugger.h:300:5 [opt]
    frame #18: 0x000000010204cca0 libcoreclr.dylib`DebuggerController::DispatchPatchOrSingleStep(thread=0x000000033db1ac00, context=0x000000033c1f7000, address="\xfd{\xb                             e\xa9\xf3S\U00000001\xa9\xfd\U00000003", which=<unavailable>) at controller.cpp:3028:9 [opt]
    frame #19: 0x000000010204decc libcoreclr.dylib`DebuggerController::DispatchNativeException(pException=<unavailable>, pContext=0x000000033c1f7000, dwCode=2147483651,                              pCurThread=0x000000033db1ac00) at controller.cpp:0:24 [opt]
    frame #20: 0x000000010205a5e4 libcoreclr.dylib`Debugger::FirstChanceNativeException(this=0x0000000124707ae0, exception=0x000000033c1f7390, context=0x000000033c1f7000                             , code=2147483651, thread=0x000000033db1ac00) at debugger.cpp:5574:18 [opt]
    frame #21: 0x0000000101f87914 libcoreclr.dylib`HandleHardwareException(ex=0x0000000349801ca0) at exceptionhandling.cpp:5313:36 [opt]
    frame #22: 0x0000000101d34f3c libcoreclr.dylib`SEHProcessException(exception=0x0000000349801ca0) at seh.cpp:267:21 [opt]
    frame #23: 0x0000000101d68e88 libcoreclr.dylib`::PAL_DispatchException(pContext=0x00000003498028c0, pExRecord=0x0000000349802828, pMachExceptionInfo=0x00000003498022                             e8) at machexception.cpp:382:29 [opt]
    frame #24: 0x0000000101d68b18 libcoreclr.dylib`PAL_DispatchExceptionWrapper + 16
    frame #25: 0x00000002888f1358
    frame #26: 0x0000000101e880c8 libcoreclr.dylib`DispatchCallSimple(unsigned long*, unsigned int, unsigned long, unsigned int) at callhelpers.cpp:67:5 [opt]
    frame #27: 0x0000000101e8806c libcoreclr.dylib`DispatchCallSimple(pSrc=<unavailable>, numStackSlotsToCopy=<unavailable>, pTargetAddress=<unavailable>, dwDispatchCall                             SimpleFlags=<unavailable>) at callhelpers.cpp:220:9 [opt]
    frame #28: 0x0000000101e9d180 libcoreclr.dylib`ThreadNative::KickOffThread_Worker(ptr=<unavailable>) at comsynchronizable.cpp:157:5 [opt]
    frame #29: 0x0000000101e53184 libcoreclr.dylib`ManagedThreadBase_DispatchOuter(ManagedThreadCallState*) [inlined] ManagedThreadBase_DispatchInner(pCallState=<unavail                             able>) at threads.cpp:7321:5 [opt]
    frame #30: 0x0000000101e53180 libcoreclr.dylib`ManagedThreadBase_DispatchOuter(ManagedThreadCallState*) at threads.cpp:7365:9 [opt]
    frame #31: 0x0000000101e53170 libcoreclr.dylib`ManagedThreadBase_DispatchOuter(ManagedThreadCallState*) [inlined] ManagedThreadBase_DispatchOuter(this=<unavailable>,                              pParam=<unavailable>)::$_6::operator()(ManagedThreadBase_DispatchOuter(ManagedThreadCallState*)::TryArgs*) const::'lambda'(Param*)::operator()(Param*) const at threads.                             cpp:7523:13 [opt]
    frame #32: 0x0000000101e53170 libcoreclr.dylib`ManagedThreadBase_DispatchOuter(ManagedThreadCallState*) at threads.cpp:7525:9 [opt]
    frame #33: 0x0000000101e53114 libcoreclr.dylib`ManagedThreadBase_DispatchOuter(pCallState=0x0000000349802f08) at threads.cpp:7549:5 [opt]
    frame #34: 0x0000000101e536f0 libcoreclr.dylib`ManagedThreadBase::KickOff(void (*)(void*), void*) [inlined] ManagedThreadBase_FullTransition(pTarget=<unavailable>, a                             rgs=<unavailable>, filterType=ManagedThread)(void*), void*, UnhandledExceptionLocation) at threads.cpp:7569:5 [opt]
    frame #35: 0x0000000101e536dc libcoreclr.dylib`ManagedThreadBase::KickOff(pTarget=<unavailable>, args=<unavailable>)(void*), void*) at threads.cpp:7604:5 [opt]
    frame #36: 0x0000000101e9d258 libcoreclr.dylib`ThreadNative::KickOffThread(pass=0x000000033db1ac00) at comsynchronizable.cpp:228:9 [opt]
    frame #37: 0x0000000101d66bc0 libcoreclr.dylib`CorUnix::CPalThread::ThreadEntry(pvParam=0x000000033db1c800) at thread.cpp:1862:16 [opt]
    frame #38: 0x0000000192381240 libsystem_pthread.dylib`_pthread_start + 148
(lldb) bt
* thread #119, stop reason = signal SIGSTOP
  * frame #0: 0x00000001923490c0 libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x0000000192381808 libsystem_pthread.dylib`_pthread_cond_wait + 1228
    frame #2: 0x0000000101d5be1c libcoreclr.dylib`CorUnix::CPalSynchronizationManager::ThreadNativeWait(ptnwdNativeWaitData=0x000000033db1ca38, dwTimeout=4294967295, ptwrWakeupReason=0x00000003498012fc, pdwSignaledObject=0x00000003498012f8) at synchmanager.cpp:478:28 [opt]
    frame #3: 0x0000000101d5ba8c libcoreclr.dylib`CorUnix::CPalSynchronizationManager::BlockThread(this=0x0000000125008e00, pthrCurrent=0x000000033db1c800, dwTimeout=4294967295, fAlertable=false, fIsSleep=<unavailable>, ptwrWakeupReason=0x0000000349801388, pdwSignaledObject=0x000000034980138c) at synchmanager.cpp:301:22 [opt]
    frame #4: 0x0000000101d5fba0 libcoreclr.dylib`CorUnix::InternalWaitForMultipleObjectsEx(pThread=0x000000033db1c800, nCount=<unavailable>, lpHandles=<unavailable>, bWaitAll=<unavailable>, dwMilliseconds=<unavailable>, bAlertable=<unavailable>, bPrioritize=<unavailable>) at wait.cpp:637:45 [opt]
    frame #5: 0x0000000101d5fd80 libcoreclr.dylib`::WaitForSingleObjectEx(hHandle=0x00000000000018c8, dwMilliseconds=4294967295, bAlertable=NO) at wait.cpp:138:13 [opt]
    frame #6: 0x0000000101f55c2c libcoreclr.dylib`CLREventBase::WaitEx(unsigned int, WaitMode, PendingSync*) [inlined] CLREventWaitHelper2(handle=<unavailable>, dwMilliseconds=<unavailable>, alertable=<unavailable>) at synch.cpp:372:12 [opt]
    frame #7: 0x0000000101f55c28 libcoreclr.dylib`CLREventBase::WaitEx(unsigned int, WaitMode, PendingSync*) [inlined] CLREventWaitHelper(this=<unavailable>, pParam=0x0000000349801588)::$_1::operator()(CLREventWaitHelper(void*, unsigned int, int)::Param*) const at synch.cpp:397:26 [opt]
    frame #8: 0x0000000101f55c20 libcoreclr.dylib`CLREventBase::WaitEx(unsigned int, WaitMode, PendingSync*) at synch.cpp:399:5 [opt]
    frame #9: 0x0000000101f55bc8 libcoreclr.dylib`CLREventBase::WaitEx(this=<unavailable>, dwMilliseconds=<unavailable>, mode=<unavailable>, syncState=<unavailable>) at synch.cpp:466:20 [opt]
    frame #10: 0x0000000101f5b8d0 libcoreclr.dylib`Thread::WaitSuspendEventsHelper(this=0x000000033db1ac00) at threadsuspend.cpp:4626:50 [opt]
    frame #11: 0x0000000101f5a538 libcoreclr.dylib`Thread::RareEnablePreemptiveGC() at threadsuspend.cpp:4663:13 [opt]
    frame #12: 0x0000000101f5a530 libcoreclr.dylib`Thread::RareEnablePreemptiveGC(this=0x000000033db1ac00) at threadsuspend.cpp:2414:13 [opt]
    frame #13: 0x0000000101f59e58 libcoreclr.dylib`Thread::RareDisablePreemptiveGC() [inlined] Thread::EnablePreemptiveGC(this=0x000000033db1ac00) at threads.h:2044:13 [opt]
    frame #14: 0x0000000101f59e3c libcoreclr.dylib`Thread::RareDisablePreemptiveGC(this=0x000000033db1ac00) at threadsuspend.cpp:2147:13 [opt]
    frame #15: 0x000000010204ccc8 libcoreclr.dylib`DebuggerController::DispatchPatchOrSingleStep(Thread*, _CONTEXT*, unsigned char const*, SCAN_TRIGGER) [inlined] GCHolderEEInterface<0, 1, 1>::LeaveInternal(this=<unavailable>) at debugger.h:260:29 [opt]
    frame #16: 0x000000010204cca0 libcoreclr.dylib`DebuggerController::DispatchPatchOrSingleStep(Thread*, _CONTEXT*, unsigned char const*, SCAN_TRIGGER) [inlined] GCHolderEEInterface<0, 1, 1>::~GCHolderEEInterface(this=<unavailable>) at debugger.h:303:15 [opt]
    frame #17: 0x000000010204cca0 libcoreclr.dylib`DebuggerController::DispatchPatchOrSingleStep(Thread*, _CONTEXT*, unsigned char const*, SCAN_TRIGGER) [inlined] GCHolderEEInterface<0, 1, 1>::~GCHolderEEInterface(this=<unavailable>) at debugger.h:300:5 [opt]
    frame #18: 0x000000010204cca0 libcoreclr.dylib`DebuggerController::DispatchPatchOrSingleStep(thread=0x000000033db1ac00, context=0x000000033c1f7000, address="\xfd{\xbe\xa9\xf3S\U00000001\xa9\xfd\U00000003", which=<unavailable>) at controller.cpp:3028:9 [opt]
    frame #19: 0x000000010204decc libcoreclr.dylib`DebuggerController::DispatchNativeException(pException=<unavailable>, pContext=0x000000033c1f7000, dwCode=2147483651, pCurThread=0x000000033db1ac00) at controller.cpp:0:24 [opt]
    frame #20: 0x000000010205a5e4 libcoreclr.dylib`Debugger::FirstChanceNativeException(this=0x0000000124707ae0, exception=0x000000033c1f7390, context=0x000000033c1f7000, code=2147483651, thread=0x000000033db1ac00) at debugger.cpp:5574:18 [opt]
    frame #21: 0x0000000101f87914 libcoreclr.dylib`HandleHardwareException(ex=0x0000000349801ca0) at exceptionhandling.cpp:5313:36 [opt]
    frame #22: 0x0000000101d34f3c libcoreclr.dylib`SEHProcessException(exception=0x0000000349801ca0) at seh.cpp:267:21 [opt]
    frame #23: 0x0000000101d68e88 libcoreclr.dylib`::PAL_DispatchException(pContext=0x00000003498028c0, pExRecord=0x0000000349802828, pMachExceptionInfo=0x00000003498022e8) at machexception.cpp:382:29 [opt]
    frame #24: 0x0000000101d68b18 libcoreclr.dylib`PAL_DispatchExceptionWrapper + 16
    frame #25: 0x00000002888f1358
    frame #26: 0x0000000101e880c8 libcoreclr.dylib`DispatchCallSimple(unsigned long*, unsigned int, unsigned long, unsigned int) at callhelpers.cpp:67:5 [opt]
    frame #27: 0x0000000101e8806c libcoreclr.dylib`DispatchCallSimple(pSrc=<unavailable>, numStackSlotsToCopy=<unavailable>, pTargetAddress=<unavailable>, dwDispatchCallSimpleFlags=<unavailable>) at callhelpers.cpp:220:9 [opt]
    frame #28: 0x0000000101e9d180 libcoreclr.dylib`ThreadNative::KickOffThread_Worker(ptr=<unavailable>) at comsynchronizable.cpp:157:5 [opt]
    frame #29: 0x0000000101e53184 libcoreclr.dylib`ManagedThreadBase_DispatchOuter(ManagedThreadCallState*) [inlined] ManagedThreadBase_DispatchInner(pCallState=<unavailable>) at threads.cpp:7321:5 [opt]
    frame #30: 0x0000000101e53180 libcoreclr.dylib`ManagedThreadBase_DispatchOuter(ManagedThreadCallState*) at threads.cpp:7365:9 [opt]
    frame #31: 0x0000000101e53170 libcoreclr.dylib`ManagedThreadBase_DispatchOuter(ManagedThreadCallState*) [inlined] ManagedThreadBase_DispatchOuter(this=<unavailable>, pParam=<unavailable>)::$_6::operator()(ManagedThreadBase_DispatchOuter(ManagedThreadCallState*)::TryArgs*) const::'lambda'(Param*)::operator()(Param*) const at threads.cpp:7523:13 [opt]
    frame #32: 0x0000000101e53170 libcoreclr.dylib`ManagedThreadBase_DispatchOuter(ManagedThreadCallState*) at threads.cpp:7525:9 [opt]
    frame #33: 0x0000000101e53114 libcoreclr.dylib`ManagedThreadBase_DispatchOuter(pCallState=0x0000000349802f08) at threads.cpp:7549:5 [opt]
    frame #34: 0x0000000101e536f0 libcoreclr.dylib`ManagedThreadBase::KickOff(void (*)(void*), void*) [inlined] ManagedThreadBase_FullTransition(pTarget=<unavailable>, args=<unavailable>, filterType=ManagedThread)(void*), void*, UnhandledExceptionLocation) at threads.cpp:7569:5 [opt]
    frame #35: 0x0000000101e536dc libcoreclr.dylib`ManagedThreadBase::KickOff(pTarget=<unavailable>, args=<unavailable>)(void*), void*) at threads.cpp:7604:5 [opt]
    frame #36: 0x0000000101e9d258 libcoreclr.dylib`ThreadNative::KickOffThread(pass=0x000000033db1ac00) at comsynchronizable.cpp:228:9 [opt]
    frame #37: 0x0000000101d66bc0 libcoreclr.dylib`CorUnix::CPalThread::ThreadEntry(pvParam=0x000000033db1c800) at thread.cpp:1862:16 [opt]
    frame #38: 0x0000000192381240 libsystem_pthread.dylib`_pthread_start + 148
filipnavara commented 2 years ago

Hmm, I likely missed some places. The search in VS Code definitely showed me less than 5... (not at computer right now to double check)

janvorli commented 2 years ago

This is the code that called the IncThreadsAtUnsafePlaces and would call the corresponding DecThreadsAtUnsafePlaces if it didn't get suspended: https://github.com/dotnet/runtime/blob/2eb4efb129b2c0ed79069ac5dc9a46a64675b85e/src/coreclr/debug/ee/controller.cpp#L2977-L3029

filipnavara commented 2 years ago

Yeah, I placed the logs on this one for sure. Maybe I did some cut & paste error, will check.

UPD:

        // Mark if we're at an unsafe place.
        bool atSafePlace = g_pDebugger->IsThreadAtSafePlace(thread);
        if (!atSafePlace)
        {
            STRESS_LOG1(LF_CORDB, LL_EVERYTHING, "Calling IncThreadsAtUnsafePlaces for thread %p\n", thread);
            g_pDebugger->IncThreadsAtUnsafePlaces();
        }

and

        if (!atSafePlace)
        {
            g_pDebugger->DecThreadsAtUnsafePlaces();
            STRESS_LOG1(LF_CORDB, LL_EVERYTHING, "Calling DecThreadsAtUnsafePlaces for thread %p\n", thread);
        }

but I missed the code below:

    if (used == DPOSS_USED_WITH_EVENT)
    {
        bool atSafePlace = g_pDebugger->IsThreadAtSafePlace(thread);
        if (!atSafePlace)
        {
            g_pDebugger->IncThreadsAtUnsafePlaces();
        }

        // Always pulse the GC mode. This will allow an async break to complete even if we have a patch
        // at an unsafe place.
        // If we are at an unsafe place, then we can't do a GC.
        thread->PulseGCMode();

        if (!atSafePlace)
        {
            g_pDebugger->DecThreadsAtUnsafePlaces();
        }

    }
janvorli commented 2 years ago

The hardware exception being handled is actually a debugger breakpoint.

filipnavara commented 2 years ago

Hmm. I am not quite sure what breakpoint could it be...

image
janvorli commented 2 years ago

I think debugger places its own internal breakpoints in some cases. This seems to be a thread start breakpoint.

janvorli commented 2 years ago

This is the managed part of the stack of the thread:

(lldb) clrstack
OS Thread Id: 0x112683 (119)
        Child SP               IP Call Site
0000000349802C60 00000002888F1358 System.Threading.Thread.StartCallback()
0000000349802E08 0000000102016b48 [DebuggerU2MCatchHandlerFrame: 0000000349802e08]
janvorli commented 2 years ago

As you can see the IncThreadsAtUnsafePlaces is called conditionally - only if the g_pDebugger->IsThreadAtSafePlace(thread) returned false. The question is why it returned false in this case and whether it was correct. The breakpoint is set at the first instruction of the System.Threading.Thread.StartCallback(), that means it is in its prolog.