ahsonkhan closed this issue 3 years ago
This is happening during a GC STW when Mono is trying to stop a thread. Evidently `pthread_kill` is returning `EAGAIN` (11), which is an indication that the kernel's global real-time signal queue has overflowed. It's not clear whether this is a transient condition that we could retry. In mono/mono@e30184752b7047007352e7d384cc471875168a15 we added the ability to abort a preemptive suspend request (at which point GC STW will loop back around and try again), so we could try to recover from this `EAGAIN` on Linux.
What do you think @vargaz @lateralusX ?
Additional reference: here's how Boehm picks the suspend signal https://github.com/ivmai/bdwgc/blob/b44088c81ce1e864b4c1d05a523c0150beeb7b8b/include/private/gc_priv.h#L2635
On Windows we follow these semantics (I believe Mac does something similar): if suspend fails (or `GetThreadContext` fails), the thread is left running, `mono_threads_suspend_begin_async_suspend` returns FALSE, and the state machine is reset to a state reflecting that the thread runs as normal (fixed by mono/mono@e301847). On Linux, as you said, you currently get an error that takes down the process before any of that could happen (resetting the thread to the running state and returning FALSE to the caller of `mono_threads_suspend_begin_async_suspend`).

When STW is driving the suspend, returning FALSE marks that thread to be skipped. So even if the thread state machine correctly reflects the current state of the thread (preventing future asserts), I believe our STW implementation is not fully implemented to support such a scenario: STW mainly views those threads as currently shutting down or detaching from the runtime, not expecting them to continue running managed code.

The fix that resets the state machine on failure had a different use case: threads calling suspend/resume on other threads could trigger this failure, and the next GC would then trigger an assert in the state machine, even if the thread was currently running and could successfully be suspended by STW.
So if this is hit during normal STW, I believe the current machinery won't trigger a retry in STW when `mono_threads_suspend_begin_async_suspend` returns FALSE. If we are going to rely on that, it needs to be fixed in STW.
I know we had some sporadic failures calling `SuspendThread` that on some downstream Mono implementations were resolved with a retry loop (with a limited number of retries); if all retries were exhausted, it would still return FALSE and let the caller handle it. Maybe we could do something similar on Linux. There you also have a specific error code that you should retry on, but it still makes sense to have a limited number of retries.
I also believe we should look into fixing STW to better distinguish failures due to suspending a thread from threads shutting down, detaching from the runtime, etc., or we will always risk exiting a STW with running threads due to suspend failures.
Fails again in #33716
Different set of assemblies fail in three runs:
Failing in https://github.com/dotnet/runtime/pull/33816
Saw this in different tests upon rerun:
- Microsoft.Extensions.FileProviders.Physical.Tests
- System.Xml.XPath.Tests
- Microsoft.Extensions.DependencyInjection.Tests
- System.Xml.RW.ReaderSettings.Tests

https://github.com/dotnet/runtime/pull/33915
- Microsoft.CSharp.Tests
- System.Net.Ping.Functional.Tests
- System.Xml.RW.XmlReader.Tests

https://github.com/dotnet/runtime/pull/33917
- System.Xml.RW.XmlReader.Tests

https://github.com/dotnet/runtime/pull/33852
- System.Text.Encoding.Tests

Rolling build https://dev.azure.com/dnceng/public/_build/results?buildId=568429
- System.Net.Primitives.Functional.Tests
- System.Resources.ResourceManager.Tests
- System.Security.Cryptography.Algorithms.Tests
- System.Xml.XmlSchema.XmlSchemaValidatorApi.Tests
I just sampled the top failing builds for dotnet/runtime and all of them had the mono leg failing with this error. Seems like we need to have someone looking at this soon or disable the mono test leg. @marek-safar
@jaredpar, can you run your tool (or remind me how) to get the complete data on this?
All failing tests have console output with `mono_threads_pthread_kill: pthread_kill failed with error 11 - potential kernel OOM or signal queue overflow`
Here are the failures over the last 100 builds
Command: `runfo search-helix -d runtime -c 100 -pr -v "mono_threads_pthread_kill: pthread_kill failed with error 11"`
> Seems like we need to have someone looking at this soon or disable the mono test leg.
Could the default action be to disable the failing tests, followed by a draft PR re-enabling them that someone can be assigned to push fix commits to? That way only one PR would be red instead of all of them.
@lambdageek could you please take care of this?
Ok, we're trying to fix this with https://github.com/dotnet/runtime/pull/33966
The issue is: we're overflowing the kernel's queue of real-time (`SIGRTMIN`) signals.

How we're addressing the issue:
- Detect the `EAGAIN` error from `pthread_kill` and sleep and retry instead of failing. That will not be necessary with full coop. (But full coop requires an embedding API break, which is not something we're doing right now.)

Update: Another possibility is that the stop-the-world procedure is going haywire and just sending out a ton of `pthread_kill`s for some reason - I'm going to look over the console logs from Jared's comment to see if there's evidence of that.
Update 2: I was reading things about the signal queue wrong, I think - it's a per-thread queue. So more likely that we're flooding it somehow.
Closing as the issue looks like it's been addressed. Haven't seen this in recent runs.
@lambdageek it seems like this is happening again: https://github.com/dotnet/runtime/pull/34976 https://helix.dot.net/api/2019-06-17/jobs/45edf365-4514-43c6-a782-5a696070625c/workitems/System.Composition.Convention.Tests/console
https://github.com/dotnet/runtime/pull/35040 https://helix.dot.net/api/2019-06-17/jobs/4a9ad8fc-dd75-4b7f-ace9-401d8d08c4cd/workitems/System.Net.Http.Json.Functional.Tests/console
Ok, on a hunch I checked what `RLIMIT_SIGPENDING` is on the helix bots: `ulimit -i` is 31805, which is reasonable.
But nonetheless somehow we're flooding the realtime signal queue. I don't really have any good ideas for how it might be happening, though. If it's a mono problem, it doesn't seem like just bubbling up the pthread_kill error to the GC STW loop is a good plan (lots of effort and unclear that it'll fix the problem).
Going to try running some libraries test suites like `System.Net.Sockets.Tests` on a local Ubuntu VM to see if I can reproduce this locally.
Looking at the stack trace from Viktor's comment, I see a lot of threads in `mono_coop_cond_timedwait`, so I did an experiment where I send a realtime signal to a thread waiting on a condition variable. Unfortunately, unless I block the signal with `pthread_sigmask` or `sigprocmask`, it gets delivered to the thread without overflowing the queue, even while it's in the call.
The only uses of `pthread_sigmask` or `sigprocmask` in dotnet/runtime/src/mono and dotnet/runtime/src/libraries don't appear to block our GC suspend signal, so I don't think this is the problem.
There are a few other threads in other system calls that might be suspicious:
- There's Thread 25, the System.Native `SignalHandlerLoop` function, which is in a `read` call - this thread sometimes calls managed callbacks, which means that Mono can see it and would try to preemptively suspend it if it was in GC Safe mode. So next I'm going to see if `read` on a pipe is uninterruptible by realtime signals. If that's the case, we might need to deal with this thread specially (or maybe refactor it in some way).
- There's a thread in `epoll_wait`, but that is explicitly documented to return `EINTR`, so I would be surprised if the signal handler doesn't run here for some reason.
- There's a thread in `pthread_mutex_lock` - I expect that this is interruptible since the condvar example was interruptible; it would be surprising if different pthread primitives worked differently.

Hm. `read` from a pipe and `pthread_mutex_lock` are both interruptible. I don't think this line of investigation is going to pan out.
Will focus on `pthread_sigmask` - there's a place in `pal_process.c` where all the signals are blocked while the process forks. I wonder if under load we bombard this thread with signals while it doesn't get a chance to run.

Update:
- Couldn't confirm that the `pthread_sigmask` in `pal_process.c` was relevant.
- Couldn't reproduce even with `ulimit -i 2` (a real-time signal queue limit of 2 events) before running a testsuite.

Current plan:
Despite not having a reliable reproduction, the right thing to do is to extend the thread suspend machinery to deal with transient thread suspension failures. This comes up on non-POSIX-signals backends too - for example the Win32 `SuspendThread` function can sometimes return `-1`, in which case we call `mono_threads_transition_abort_async_suspend` to try to back out of suspending the thread. If we succeed, the thread is still in a running state and we can try again.
Currently, however, `mono_threads_suspend_begin_async_suspend` returns a boolean on all platforms, so if suspending doesn't work we don't know whether to try again. (Currently we mark the thread as skipped and exclude it from the remainder of STW. This is a bad idea when the thread is interacting with the managed heap.)
So the right solution is to return a more refined return code and give STW a chance to try again if multiple threads are in a transient failure state.
I can't find any cases of this happening in the last 2 weeks. Of course, it's possible that's just my tooling misbehaving, but let's close for now, given that it's quite plausibly no longer an issue.
From https://github.com/dotnet/runtime/pull/32364 on:
netcoreapp5.0-Linux-Debug-x64-Mono_release-Ubuntu.1804.Amd64.Open
See https://dev.azure.com/dnceng/public/_build/results?buildId=522989&view=logs&j=b4344b0d-0f92-5d69-ccaf-e0b24fbf14a2&t=555ea487-e6f6-5e9e-ac96-b1fa34f2540d&l=98
Test areas that fail:
- https://helix.dot.net/api/2019-06-17/jobs/049bba5a-3a50-4308-8d4f-043b35a7a750/workitems/System.IO.FileSystem.Tests/console
- https://helix.dot.net/api/2019-06-17/jobs/049bba5a-3a50-4308-8d4f-043b35a7a750/workitems/System.Linq.Queryable.Tests/console
- https://helix.dot.net/api/2019-06-17/jobs/049bba5a-3a50-4308-8d4f-043b35a7a750/workitems/System.Net.Primitives.Functional.Tests/console
- https://helix.dot.net/api/2019-06-17/jobs/049bba5a-3a50-4308-8d4f-043b35a7a750/workitems/System.Net.ServicePoint.Tests/console
- https://helix.dot.net/api/2019-06-17/jobs/049bba5a-3a50-4308-8d4f-043b35a7a750/workitems/System.Threading.Timer.Tests/console
- https://helix.dot.net/api/2019-06-17/jobs/049bba5a-3a50-4308-8d4f-043b35a7a750/workitems/System.Xml.Linq.xNodeReader.Tests/console
- https://helix.dot.net/api/2019-06-17/jobs/049bba5a-3a50-4308-8d4f-043b35a7a750/workitems/System.Xml.RW.XmlWriterApi.Tests/console
For example: https://helix.dot.net/api/2019-06-17/jobs/049bba5a-3a50-4308-8d4f-043b35a7a750/workitems/System.Composition.AttributeModel.Tests/console