Resque workers deadlocked from ddtrace after processing jobs

A few weeks ago we updated ddtrace in our application from version 0.53 to 1,11.1. Shortly after deploying this to production we noticed that some of our Resque workers would hang after they processed a job resulting in the worker to stop processing further jobs and causing a backups in the Resque queues. We identified the ddtrace upgrade as the source of this problem and reverted back to version 0.53 to resolve it for the time being.

Since then I've been working to identify what the root cause of this is. Here's what I know so far:

On versions of ddtrace 1.3.0 and beyond Resque workers will become stuck, apparently deadlocked on a futex, during a Ruby cleanup method after the threads are marked as killed. The forked Resque process continues sending a heartbeat however leading Resque to believe that it is simply processing a really long job. This will remain the case indefinitely until the process is manually killed.
This issue is only reproducible when many Resque workers are running. Initial attempts to reproduce with a single worker were unsuccessful, except for one anomalous case which was never reproduced a second time. In testing, at least 10 workers needed to be running in order to consistently reproduce the behavior.
Most of the time the Resque workers would hang almost immediately or within a few minutes of starting to process jobs. However, in some cases it would take multiple hours to demonstrate the behavior.
- Interestingly, tests with version 1.12 yielded hung workers almost always immediately and tests with version 1.3 typically took more time. Whether something is making the problem more frequent in version 1.12, the problem manifests itself differently, or there are multiple problems in later versions is undetermined.
Using GDB I got a stack trace of where the threads are getting stuck. See below for this. It appears to be deadlocked on a futex during a Ruby GC run, specifically while unloading a library. I assume this is the ddtrace's native extension library, but I cannot conclusively determine this.
I bisected the problematic behavior to being introduced by commit 26ac04c06f87918eae773692dd428491bf8a11a4. This appears to add tracking of CPU-time for profiling.
- I've tried going through the changes of this commit to identify what could be causing the deadlock, but don't see anything obvious. I also tried disabling the CPU-time sampling by commenting out the relevant code, but the problem persists.

Threads as displayed by GDB:

(gdb) info threads
  Id   Target Id                                                     Frame
* 1    Thread 0x7ff481009f80 (LWP 29068) "bundle"                    0x00007ff481afdc5c in futex_wait (private=0, expected=2, futex_word=0x7ff44c1ff65c) at ../sysdeps/nptl/futex-internal.h:146
  2    Thread 0x7ff442a13640 (LWP 29069) "async_logger.r*" (Exiting) futex_wait (private=0, expected=2, futex_word=0x7ff481b25108 <_rtld_global+4296>) at ../sysdeps/nptl/futex-internal.h:146
  3    Thread 0x7ff47a471640 (LWP 29070) "connection_poo*" (Exiting) futex_wait (private=0, expected=2, futex_word=0x7ff481b25108 <_rtld_global+4296>) at ../sysdeps/nptl/futex-internal.h:146
  4    Thread 0x7ff44c1ff640 (LWP 29071) "Timeout stdlib " (Exiting) futex_wait (private=0, expected=2, futex_word=0x7ff481b25108 <_rtld_global+4296>) at ../sysdeps/nptl/futex-internal.h:146
  5    Thread 0x7ff44c9a2640 (LWP 29072) "Statsd Sender" (Exiting)   futex_wait (private=0, expected=2, futex_word=0x7ff481b25108 <_rtld_global+4296>) at ../sysdeps/nptl/futex-internal.h:146
  6    Thread 0x7ff44455b640 (LWP 29073) "Datadog::Tracin" (Exiting) futex_wait (private=0, expected=2, futex_word=0x7ff481b25108 <_rtld_global+4296>) at ../sysdeps/nptl/futex-internal.h:146

GDB backtrace of all threads:

(gdb) thread apply all where

Thread 6 (Thread 0x7f73145ff640 (LWP 5414) "Datadog::Tracin" (Exiting)):
#0  futex_wait (private=0, expected=2, futex_word=0x7f7351c3d108 <_rtld_global+4296>) at ../sysdeps/nptl/futex-internal.h:146
#1  __GI___lll_lock_wait_private (futex=futex@entry=0x7f7351c3d108 <_rtld_global+4296>) at ./nptl/lowlevellock.c:34
#2  0x00007f73511b55ec in __GI___nptl_deallocate_stack (pd=pd@entry=0x7f73145ff640) at ./nptl/nptl-stack.c:112
#3  0x00007f73511b57fd in __GI___nptl_free_tcb (pd=0x7f73145ff640) at ./nptl/nptl_free_tcb.c:41
#4  0x00007f73511b8ad4 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:573
#5  0x00007f735124aa00 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

Thread 5 (Thread 0x7f731cd47640 (LWP 5413) "Statsd Sender" (Exiting)):
#0  futex_wait (private=0, expected=2, futex_word=0x7f7351c3d108 <_rtld_global+4296>) at ../sysdeps/nptl/futex-internal.h:146
#1  __GI___lll_lock_wait_private (futex=futex@entry=0x7f7351c3d108 <_rtld_global+4296>) at ./nptl/lowlevellock.c:34
#2  0x00007f73511b55ec in __GI___nptl_deallocate_stack (pd=pd@entry=0x7f731cd47640) at ./nptl/nptl-stack.c:112
#3  0x00007f73511b57fd in __GI___nptl_free_tcb (pd=0x7f731cd47640) at ./nptl/nptl_free_tcb.c:41
#4  0x00007f73511b8ad4 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:573
#5  0x00007f735124aa00 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

Thread 4 (Thread 0x7f731c40b640 (LWP 5412) "Timeout stdlib " (Exiting)):
#0  futex_wait (private=0, expected=2, futex_word=0x7f7351c3d108 <_rtld_global+4296>) at ../sysdeps/nptl/futex-internal.h:146
#1  __GI___lll_lock_wait_private (futex=futex@entry=0x7f7351c3d108 <_rtld_global+4296>) at ./nptl/lowlevellock.c:34
#2  0x00007f73511b55ec in __GI___nptl_deallocate_stack (pd=pd@entry=0x7f731c40b640) at ./nptl/nptl-stack.c:112
#3  0x00007f73511b57fd in __GI___nptl_free_tcb (pd=0x7f731c40b640) at ./nptl/nptl_free_tcb.c:41
#4  0x00007f73511b8ad4 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:573
#5  0x00007f735124aa00 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

Thread 3 (Thread 0x7f734ab9d640 (LWP 5411) "connection_poo*" (Exiting)):
#0  futex_wait (private=0, expected=2, futex_word=0x7f7351c3d108 <_rtld_global+4296>) at ../sysdeps/nptl/futex-internal.h:146
#1  __GI___lll_lock_wait_private (futex=futex@entry=0x7f7351c3d108 <_rtld_global+4296>) at ./nptl/lowlevellock.c:34
#2  0x00007f73511b55ec in __GI___nptl_deallocate_stack (pd=pd@entry=0x7f734ab9d640) at ./nptl/nptl-stack.c:112
#3  0x00007f73511b57fd in __GI___nptl_free_tcb (pd=0x7f734ab9d640) at ./nptl/nptl_free_tcb.c:41
#4  0x00007f73511b8ad4 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:573
#5  0x00007f735124aa00 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

Thread 2 (Thread 0x7f730f26f640 (LWP 5410) "async_logger.r*" (Exiting)):
#0  futex_wait (private=0, expected=2, futex_word=0x7f7351c3d108 <_rtld_global+4296>) at ../sysdeps/nptl/futex-internal.h:146
#1  __GI___lll_lock_wait_private (futex=futex@entry=0x7f7351c3d108 <_rtld_global+4296>) at ./nptl/lowlevellock.c:34
#2  0x00007f73511b55ec in __GI___nptl_deallocate_stack (pd=pd@entry=0x7f730f26f640) at ./nptl/nptl-stack.c:112
#3  0x00007f73511b57fd in __GI___nptl_free_tcb (pd=0x7f730f26f640) at ./nptl/nptl_free_tcb.c:41
#4  0x00007f73511b8ad4 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:573
#5  0x00007f735124aa00 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

Thread 1 (Thread 0x7f7351121f80 (LWP 5409) "bundle"):
#0  0x00007f7351c15c5c in futex_wait (private=0, expected=2, futex_word=0x7f731c40b65c) at ../sysdeps/nptl/futex-internal.h:146
#1  futex_wait_simple (private=0, expected=2, futex_word=0x7f731c40b65c) at ../sysdeps/nptl/futex-internal.h:177
#2  __thread_gscope_wait () at ../sysdeps/nptl/dl-thread_gscope_wait.c:51
#3  0x00007f7351c04d97 in _dl_close_worker (force=force@entry=false, map=<optimized out>, map=<optimized out>) at ./elf/dl-close.c:498
#4  0x00007f7351c052a2 in _dl_close_worker (force=false, map=0x7f73215aab00) at ./elf/dl-close.c:150
#5  _dl_close (_map=0x7f73215aab00) at ./elf/dl-close.c:818
#6  0x00007f7351298c28 in __GI__dl_catch_exception (exception=exception@entry=0x7ffd558e28d0, operate=<optimized out>, args=<optimized out>) at ./elf/dl-error-skeleton.c:208
#7  0x00007f7351298cf3 in __GI__dl_catch_error (objname=0x7ffd558e2928, errstring=0x7ffd558e2930, mallocedp=0x7ffd558e2927, operate=<optimized out>, args=<optimized out>) at ./elf/dl-error-skeleton.c:227
#8  0x00007f73511b41ae in _dlerror_run (operate=<optimized out>, args=<optimized out>) at ./dlfcn/dlerror.c:138
#9  0x00007f73511b3ed8 in __dlclose (handle=<optimized out>) at ./dlfcn/dlclose.c:31
#10 0x00007f73214ed4c5 in library_free (library=0x7f7323f76b48) at DynamicLibrary.c:171
#11 0x00007f7351610d2a in run_final (zombie=140132487292600, objspace=0x7f7350c32000) at gc.c:4019
#12 finalize_list (objspace=objspace@entry=0x7f7350c32000, zombie=140132487292600) at gc.c:4038
#13 0x00007f7351615c10 in rb_objspace_call_finalizer (objspace=0x7f7350c32000) at gc.c:4202
#14 0x00007f73515f5058 in rb_ec_finalize (ec=0x7f7350c3b050) at eval.c:164
#15 rb_ec_cleanup (ec=<optimized out>, ex0=<optimized out>) at eval.c:256
#16 0x00007f73515f5302 in ruby_cleanup (ex=<optimized out>) at eval.c:178
#17 0x00007f73515f531f in ruby_stop (ex=<optimized out>) at eval.c:291
#18 0x00007f73516eac48 in rb_f_fork (obj=<optimized out>) at process.c:4407
#19 rb_f_fork (obj=<optimized out>) at process.c:4397
#20 0x00007f73517c9567 in vm_call_cfunc_with_frame (ec=0x7f7350c3b050, reg_cfp=0x7f7350d3fc40, calling=<optimized out>) at /usr/src/ruby/vm_insnhelper.c:3037
#21 0x00007f73517dac5a in vm_sendish (method_explorer=<optimized out>, block_handler=<optimized out>, cd=<optimized out>, reg_cfp=<optimized out>, ec=<optimized out>) at /usr/src/ruby/vm_callinfo.h:349
#22 vm_exec_core (ec=0x7f7350c3b050, initial=128) at /usr/src/ruby/insns.def:880
#23 0x00007f73517de886 in rb_vm_exec (ec=0x7f7350c3b050, mjit_enable_p=true) at vm.c:2220
#24 0x00007f73517df3af in invoke_block (captured=<optimized out>, captured=<optimized out>, opt_pc=<optimized out>, type=<optimized out>, cref=0x0, self=140132116319240, iseq=0x7f73468d7500, ec=0x7f7350c3b050) at vm.c:1316
#25 invoke_iseq_block_from_c (me=0x0, is_lambda=<optimized out>, cref=0x0, passed_block_handler=0, kw_splat=0, argv=0x0, argc=0, self=140132116319240, captured=<optimized out>, ec=0x7f7350c3b050) at vm.c:1372
#26 invoke_block_from_c_bh (argc=<optimized out>, argv=<optimized out>, kw_splat=<optimized out>, passed_block_handler=<optimized out>, is_lambda=<optimized out>, force_blockarg=<optimized out>, cref=<optimized out>, block_handler=<optimized out>, ec=<optimized out>) at vm.c:1390
#27 vm_yield_with_cref (argc=0, argv=0x0, kw_splat=0, is_lambda=0, cref=0x0, ec=0x7f7350c3b050) at vm.c:1427
#28 vm_yield (kw_splat=0, argv=0x0, argc=0, ec=0x7f7350c3b050) at vm.c:1435
#29 rb_yield_0 (argv=0x0, argc=0) at /usr/src/ruby/vm_eval.c:1347
#30 loop_i (_=_@entry=0) at /usr/src/ruby/vm_eval.c:1446
#31 0x00007f73515f2a1e in rb_vrescue2 (b_proc=0x7f73517df170 <loop_i>, data1=0, r_proc=0x7f73517c17f0 <loop_stop>, data2=0, args=args@entry=0x7ffd558e3010) at eval.c:903
#32 0x00007f73515f2cae in rb_rescue2 (b_proc=<optimized out>, data1=<optimized out>, r_proc=<optimized out>, data2=<optimized out>) at eval.c:884
#33 0x00007f73517c9567 in vm_call_cfunc_with_frame (ec=0x7f7350c3b050, reg_cfp=0x7f7350d3fe40, calling=<optimized out>) at /usr/src/ruby/vm_insnhelper.c:3037
#34 0x00007f73517cec29 in vm_sendish (ec=0x7f7350c3b050, reg_cfp=0x7f7350d3fe40, cd=0x7f7346a650e0, block_handler=<optimized out>, method_explorer=mexp_search_method) at /usr/src/ruby/vm_callinfo.h:349
#35 0x00007f73517d87f2 in vm_exec_core (ec=0x7f7350c3b050, initial=128) at /usr/src/ruby/insns.def:759
#36 0x00007f73517ddf4f in rb_vm_exec (ec=0x7f7350c3b050, mjit_enable_p=true) at vm.c:2211
#37 0x00007f73517e35c9 in invoke_block (captured=<optimized out>, captured=<optimized out>, opt_pc=<optimized out>, type=<optimized out>, cref=<optimized out>, self=<optimized out>, iseq=<optimized out>, ec=<optimized out>) at vm.c:1316
#38 invoke_iseq_block_from_c (me=<optimized out>, is_lambda=<optimized out>, cref=<optimized out>, passed_block_handler=<optimized out>, kw_splat=<optimized out>, argv=<optimized out>, argc=<optimized out>, self=<optimized out>, captured=<optimized out>, ec=<optimized out>) at vm.c:1372
#39 invoke_block_from_c_bh (force_blockarg=<optimized out>, is_lambda=<optimized out>, cref=<optimized out>, passed_block_handler=<optimized out>, kw_splat=<optimized out>, argv=<optimized out>, argc=<optimized out>, block_handler=<optimized out>, ec=<optimized out>) at vm.c:1390
#40 vm_yield_with_cref (is_lambda=<optimized out>, cref=<optimized out>, kw_splat=<optimized out>, argv=<optimized out>, argc=<optimized out>, ec=<optimized out>) at vm.c:1427
#41 vm_yield (kw_splat=<optimized out>, argv=<optimized out>, argc=<optimized out>, ec=<optimized out>) at vm.c:1435
#42 rb_yield_0 (argv=<optimized out>, argc=<optimized out>) at /usr/src/ruby/vm_eval.c:1347
#43 rb_yield (val=<optimized out>) at /usr/src/ruby/vm_eval.c:1363
#44 0x00007f7351551f1c in rb_ary_each (ary=140132436390720) at array.c:2522
#45 0x00007f73517c9567 in vm_call_cfunc_with_frame (ec=0x7f7350c3b050, reg_cfp=0x7f7350d3ff40, calling=<optimized out>) at /usr/src/ruby/vm_insnhelper.c:3037
#46 0x00007f73517cec29 in vm_sendish (ec=0x7f7350c3b050, reg_cfp=0x7f7350d3ff40, cd=0x7f734cc6f460, block_handler=<optimized out>, method_explorer=mexp_search_method) at /usr/src/ruby/vm_callinfo.h:349
#47 0x00007f73517d87f2 in vm_exec_core (ec=0x7f7350c3b050, initial=128) at /usr/src/ruby/insns.def:759
#48 0x00007f73517ddf4f in rb_vm_exec (ec=0x7f7350c3b050, mjit_enable_p=true) at vm.c:2211
#49 0x00007f73515f2fd6 in rb_ensure (b_proc=0x7f7351044610 <monitor_sync_body>, data1=140132436390640, e_proc=<optimized out>, data2=<optimized out>) at eval.c:993
#50 0x00007f73517c9567 in vm_call_cfunc_with_frame (ec=0x7f7350c3b050, reg_cfp=0x7f7350d40080, calling=<optimized out>) at /usr/src/ruby/vm_insnhelper.c:3037
#51 0x00007f73517cec29 in vm_sendish (ec=0x7f7350c3b050, reg_cfp=0x7f7350d40080, cd=0x7f734c63d490, block_handler=<optimized out>, method_explorer=mexp_search_method) at /usr/src/ruby/vm_callinfo.h:349
#52 0x00007f73517d87f2 in vm_exec_core (ec=0x7f7350c3b050, initial=128) at /usr/src/ruby/insns.def:759
#53 0x00007f73517de886 in rb_vm_exec (ec=0x7f7350c3b050, mjit_enable_p=true) at vm.c:2220
#54 0x00007f73517e35c9 in invoke_block (captured=<optimized out>, captured=<optimized out>, opt_pc=<optimized out>, type=<optimized out>, cref=<optimized out>, self=<optimized out>, iseq=<optimized out>, ec=<optimized out>) at vm.c:1316
#55 invoke_iseq_block_from_c (me=<optimized out>, is_lambda=<optimized out>, cref=<optimized out>, passed_block_handler=<optimized out>, kw_splat=<optimized out>, argv=<optimized out>, argc=<optimized out>, self=<optimized out>, captured=<optimized out>, ec=<optimized out>) at vm.c:1372
#56 invoke_block_from_c_bh (force_blockarg=<optimized out>, is_lambda=<optimized out>, cref=<optimized out>, passed_block_handler=<optimized out>, kw_splat=<optimized out>, argv=<optimized out>, argc=<optimized out>, block_handler=<optimized out>, ec=<optimized out>) at vm.c:1390
#57 vm_yield_with_cref (is_lambda=<optimized out>, cref=<optimized out>, kw_splat=<optimized out>, argv=<optimized out>, argc=<optimized out>, ec=<optimized out>) at vm.c:1427
#58 vm_yield (kw_splat=<optimized out>, argv=<optimized out>, argc=<optimized out>, ec=<optimized out>) at vm.c:1435
#59 rb_yield_0 (argv=<optimized out>, argc=<optimized out>) at /usr/src/ruby/vm_eval.c:1347
#60 rb_yield (val=<optimized out>) at /usr/src/ruby/vm_eval.c:1363
#61 0x00007f7351551f1c in rb_ary_each (ary=140133177426480) at array.c:2522
#62 0x00007f73517c9567 in vm_call_cfunc_with_frame (ec=0x7f7350c3b050, reg_cfp=0x7f7350d40200, calling=<optimized out>) at /usr/src/ruby/vm_insnhelper.c:3037
#63 0x00007f73517cec29 in vm_sendish (ec=0x7f7350c3b050, reg_cfp=0x7f7350d40200, cd=0x7f734ccd6670, block_handler=<optimized out>, method_explorer=mexp_search_method) at /usr/src/ruby/vm_callinfo.h:349
#64 0x00007f73517d87f2 in vm_exec_core (ec=0x7f7350c3b050, initial=128) at /usr/src/ruby/insns.def:759
#65 0x00007f73517de886 in rb_vm_exec (ec=0x7f7350c3b050, mjit_enable_p=true) at vm.c:2220
#66 0x00007f7351656b16 in load_iseq_eval (fname=140133177451160, ec=0x7f7350c3b050) at load.c:660
#67 rb_load_internal (wrap=<optimized out>, fname=140133177451160) at load.c:721
#68 rb_f_load (argc=<optimized out>, argv=<optimized out>, _=<optimized out>) at load.c:796
#69 0x00007f73517c9567 in vm_call_cfunc_with_frame (ec=0x7f7350c3b050, reg_cfp=0x7f7350d40400, calling=<optimized out>) at /usr/src/ruby/vm_insnhelper.c:3037
#70 0x00007f73517d8774 in vm_sendish (method_explorer=<optimized out>, block_handler=<optimized out>, cd=<optimized out>, reg_cfp=<optimized out>, ec=<optimized out>) at /usr/src/ruby/vm_callinfo.h:349
#71 vm_exec_core (ec=0x7f7350c3b050, initial=128) at /usr/src/ruby/insns.def:778
#72 0x00007f73517ddf4f in rb_vm_exec (ec=0x7f7350c3b050, mjit_enable_p=true) at vm.c:2211
#73 0x00007f7351656b16 in load_iseq_eval (fname=140133183456280, ec=0x7f7350c3b050) at load.c:660
#74 rb_load_internal (wrap=<optimized out>, fname=140133183456280) at load.c:721
#75 rb_f_load (argc=<optimized out>, argv=<optimized out>, _=<optimized out>) at load.c:796
#76 0x00007f73517c9567 in vm_call_cfunc_with_frame (ec=0x7f7350c3b050, reg_cfp=0x7f7350d40480, calling=<optimized out>) at /usr/src/ruby
#77 0x00007f73517d8774 in vm_sendish (method_explorer=<optimized out>, block_handler=<optimized out>, cd=<optimized out>, reg_cfp=<optim
#78 vm_exec_core (ec=0x7f7350c3b050, initial=128) at /usr/src/ruby/insns.def:778
#79 0x00007f73517de886 in rb_vm_exec (ec=0x7f7350c3b050, mjit_enable_p=true) at vm.c:2220
#80 0x00007f7351656b16 in load_iseq_eval (fname=140133257554840, ec=0x7f7350c3b050) at load.c:660
#81 rb_load_internal (wrap=<optimized out>, fname=140133257554840) at load.c:721
#82 rb_f_load (argc=<optimized out>, argv=<optimized out>, _=<optimized out>) at load.c:796
#83 0x00007f73517c9567 in vm_call_cfunc_with_frame (ec=0x7f7350c3b050, reg_cfp=0x7f7350d407c0, calling=<optimized out>) at /usr/src/ruby
#84 0x00007f73517d8774 in vm_sendish (method_explorer=<optimized out>, block_handler=<optimized out>, cd=<optimized out>, reg_cfp=<optim
#85 vm_exec_core (ec=0x7f7350c3b050, initial=128) at /usr/src/ruby/insns.def:778
#86 0x00007f73517de886 in rb_vm_exec (ec=0x7f7350c3b050, mjit_enable_p=true) at vm.c:2220
#87 0x00007f73515eef71 in rb_ec_exec_node (ec=ec@entry=0x7f7350c3b050, n=n@entry=0x7f73510af980) at eval.c:280
#88 0x00007f73515f538a in ruby_run_node (n=0x7f73510af980) at eval.c:321
#89 0x000056068a1df17f in main (argc=<optimized out>, argv=<optimized out>) at ./main.c:47

Using GDB I also got a Ruby backtrace of the main thread. The other threads were marked as killed already by Ruby and did not have a control frame pointer anymore to read a backtrace from:

from /usr/local/bin/bundle:25:in `<main>'
from /usr/local/bin/bundle:25:in `load'
from /usr/local/lib/ruby/gems/3.1.0/gems/bundler-2.3.26/libexec/bundle:36:in `<top (required)>'
from /usr/local/lib/ruby/3.1.0/bundler/friendly_errors.rb:120:in `with_friendly_errors'
from /usr/local/lib/ruby/gems/3.1.0/gems/bundler-2.3.26/libexec/bundle:48:in `block in <top (required)>'
from /usr/local/lib/ruby/3.1.0/bundler/cli.rb:25:in `start'
from /usr/local/lib/ruby/3.1.0/bundler/vendor/thor/lib/thor/base.rb:485:in `start'
from /usr/local/lib/ruby/3.1.0/bundler/cli.rb:31:in `dispatch'
from /usr/local/lib/ruby/3.1.0/bundler/vendor/thor/lib/thor.rb:392:in `dispatch'
from /usr/local/lib/ruby/3.1.0/bundler/vendor/thor/lib/thor/invocation.rb:127:in `invoke_command'
from /usr/local/lib/ruby/3.1.0/bundler/vendor/thor/lib/thor/command.rb:27:in `run'
from /usr/local/lib/ruby/3.1.0/bundler/cli.rb:486:in `exec'
from /usr/local/lib/ruby/3.1.0/bundler/cli/exec.rb:23:in `run'
from /usr/local/lib/ruby/3.1.0/bundler/cli/exec.rb:58:in `kernel_load'
from /usr/local/lib/ruby/3.1.0/bundler/cli/exec.rb:58:in `load'
from /usr/local/bin/rake:25:in `<top (required)>'
from /usr/local/bin/rake:25:in `load'
from /usr/local/lib/ruby/gems/3.1.0/gems/rake-13.0.6/exe/rake:27:in `<top (required)>'
from /usr/local/lib/ruby/gems/3.1.0/gems/rake-13.0.6/lib/rake/application.rb:80:in `run'
from /usr/local/lib/ruby/gems/3.1.0/gems/rake-13.0.6/lib/rake/application.rb:186:in `standard_exception_handling'
from /usr/local/lib/ruby/gems/3.1.0/gems/rake-13.0.6/lib/rake/application.rb:83:in `block in run'
from /usr/local/lib/ruby/gems/3.1.0/gems/rake-13.0.6/lib/rake/application.rb:110:in `top_level'
from /usr/local/lib/ruby/gems/3.1.0/gems/rake-13.0.6/lib/rake/application.rb:125:in `run_with_threads'
from /usr/local/lib/ruby/gems/3.1.0/gems/rake-13.0.6/lib/rake/application.rb:116:in `block in top_level'
from /usr/local/lib/ruby/gems/3.1.0/gems/rake-13.0.6/lib/rake/application.rb:116:in `each'
from /usr/local/lib/ruby/gems/3.1.0/gems/rake-13.0.6/lib/rake/application.rb:116:in `block (2 levels) in top_level'
from /usr/local/lib/ruby/gems/3.1.0/gems/rake-13.0.6/lib/rake/application.rb:160:in `invoke_task'
from /usr/local/bundle/gems/ddtrace-1.11.1/lib/datadog/tracing/contrib/rake/instrumentation.rb:19:in `invoke'
from /usr/local/lib/ruby/gems/3.1.0/gems/rake-13.0.6/lib/rake/task.rb:188:in `invoke'
from /usr/local/lib/ruby/gems/3.1.0/gems/rake-13.0.6/lib/rake/task.rb:199:in `invoke_with_call_chain'
from /usr/local/lib/ruby/gems/3.1.0/gems/rake-13.0.6/lib/rake/task.rb:199:in `synchronize'
from /usr/local/lib/ruby/gems/3.1.0/gems/rake-13.0.6/lib/rake/task.rb:219:in `block in invoke_with_call_chain'
from /usr/local/bundle/gems/sentry-ruby-5.6.0/lib/sentry/rake.rb:26:in `execute'
from /usr/local/bundle/gems/ddtrace-1.11.1/lib/datadog/tracing/contrib/rake/instrumentation.rb:30:in `execute'
from /usr/local/lib/ruby/gems/3.1.0/gems/rake-13.0.6/lib/rake/task.rb:281:in `execute'
from /usr/local/lib/ruby/gems/3.1.0/gems/rake-13.0.6/lib/rake/task.rb:281:in `each'
from /usr/local/lib/ruby/gems/3.1.0/gems/rake-13.0.6/lib/rake/task.rb:281:in `block in execute'
from /usr/local/bundle/bundler/gems/resque-0eac2a8a07be/lib/resque/tasks.rb:20:in `block (2 levels) in <top (required)>'
from /usr/local/bundle/bundler/gems/resque-0eac2a8a07be/lib/resque/worker.rb:226:in `work'
from /usr/local/bundle/bundler/gems/resque-0eac2a8a07be/lib/resque/worker.rb:226:in `loop'
from /usr/local/bundle/bundler/gems/resque-0eac2a8a07be/lib/resque/worker.rb:229:in `block in work'
from /usr/local/bundle/bundler/gems/resque-0eac2a8a07be/lib/resque/worker.rb:255:in `work_one_job'
from /usr/local/bundle/bundler/gems/resque-0eac2a8a07be/lib/resque/worker.rb:865:in `perform_with_fork'
from /usr/local/bundle/gems/activesupport-6.1.7.4/lib/active_support/fork_tracker.rb:27:in `fork'
from /usr/local/bundle/gems/activesupport-6.1.7.4/lib/active_support/fork_tracker.rb:8:in `fork'
from /usr/local/bundle/gems/activesupport-6.1.7.4/lib/active_support/fork_tracker.rb:27:in `fork'
from /usr/local/bundle/gems/activesupport-6.1.7.4/lib/active_support/fork_tracker.rb:8:in `fork'
from /usr/local/bundle/gems/activesupport-6.1.7.4/lib/active_support/fork_tracker.rb:8:in `fork'

The one Datadog thread's full name is Datadog::Tracing::Workers::AsyncTransport per:

(gdb) p *((struct RString*) ruby_current_ec->thread_ptr->name)
$6 = {basic = {flags = 546449509, klass = 140133257733160}, as = {heap = {len = 41, ptr = 0x7f734c1ec700 "Datadog::Tracing::Workers::AsyncTransport", aux = {capa = 51, shared = 51}}, embed = {
      ary = ")\000\000\000\000\000\000\000\000\307\036Ls\177\000\000\063\000\000\000\000\000\000"}}}

I initially though the problem was with the AsyncTransport class, but after looking at the concurrency logic there I was unable to find any problems with it. Futhermore, returning to the backtraces above, the problem seems to lie elsewhere. Ruby has all threads other than the main thread marked as killed already so it would appear they joined the main thread cleanly.

The main thread on the other hand is attempting to unload a library when it becomes stuck per the backtrace. Namely, it is stopping the fork (rb_f_fork and ruby_stop) then cleaning up with ruby_cleanup and library_free before finally becoming deadlocked at futex_wait.

  0x00007f7351c15c5c in futex_wait (private=0, expected=2, futex_word=0x7f731c40b65c) at ../sysdeps/nptl/futex-internal.h:146
#1  futex_wait_simple (private=0, expected=2, futex_word=0x7f731c40b65c) at ../sysdeps/nptl/futex-internal.h:177
#2  __thread_gscope_wait () at ../sysdeps/nptl/dl-thread_gscope_wait.c:51
#3  0x00007f7351c04d97 in _dl_close_worker (force=force@entry=false, map=<optimized out>, map=<optimized out>) at ./elf/dl-close.c:498
#4  0x00007f7351c052a2 in _dl_close_worker (force=false, map=0x7f73215aab00) at ./elf/dl-close.c:150
#5  _dl_close (_map=0x7f73215aab00) at ./elf/dl-close.c:818
#6  0x00007f7351298c28 in __GI__dl_catch_exception (exception=exception@entry=0x7ffd558e28d0, operate=<optimized out>, args=<optimized out>) at ./elf/dl-error-skeleton.c:208
#7  0x00007f7351298cf3 in __GI__dl_catch_error (objname=0x7ffd558e2928, errstring=0x7ffd558e2930, mallocedp=0x7ffd558e2927, operate=<optimized out>, args=<optimized out>) at ./elf/dl-error-skeleton.c:227
#8  0x00007f73511b41ae in _dlerror_run (operate=<optimized out>, args=<optimized out>) at ./dlfcn/dlerror.c:138
#9  0x00007f73511b3ed8 in __dlclose (handle=<optimized out>) at ./dlfcn/dlclose.c:31
#10 0x00007f73214ed4c5 in library_free (library=0x7f7323f76b48) at DynamicLibrary.c:171
#11 0x00007f7351610d2a in run_final (zombie=140132487292600, objspace=0x7f7350c32000) at gc.c:4019
#12 finalize_list (objspace=objspace@entry=0x7f7350c32000, zombie=140132487292600) at gc.c:4038
#13 0x00007f7351615c10 in rb_objspace_call_finalizer (objspace=0x7f7350c32000) at gc.c:4202
#14 0x00007f73515f5058 in rb_ec_finalize (ec=0x7f7350c3b050) at eval.c:164
#15 rb_ec_cleanup (ec=<optimized out>, ex0=<optimized out>) at eval.c:256
#16 0x00007f73515f5302 in ruby_cleanup (ex=<optimized out>) at eval.c:178
#17 0x00007f73515f531f in ruby_stop (ex=<optimized out>) at eval.c:291
#18 0x00007f73516eac48 in rb_f_fork (obj=<optimized out>) at process.c:4407

This backtrace combined with the git bisecting of the ddtrace gem leads me to believe there is an issue with something in the native extension preventing Ruby from continuing with its cleanup of the fork resulting in a hung Resque worker process. This only appears to happen at high load with sufficient number of Resque workers. It's also worth noting that with 10 Resque workers, I never saw them all become stuck. At most 5-6 of them would become stuck over the timespan of ~12 hours. I assume this had to do with why multiple workers were needed to replicate the problem in the first place.

There's one other notable situation that happened too. I've inspected many of these deadlocked processes via GDB while debugging this and they all had the same backtrace. However, in a single case, which I never saw a second time, the backtrace was different with it being stuck on a getaddrinfo call. I don't know if this is relevant since it ultimately still ends stuck on a futex_wait call, but it stood out to me enough to save a copy of the backtrace when I saw it:

Attaching to process 20053
[New LWP 20058]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
__pthread_kill_implementation (no_tid=0, signo=26, threadid=140422026360384) at ./nptl/pthread_kill.c:50
50 ./nptl/pthread_kill.c: No such file or directory.
(gdb) info threads
  Id   Target Id                                           Frame
* 1    Thread 0x7fb6c9bcef80 (LWP 20053) "bundle"          __pthread_kill_implementation (no_tid=0, signo=26, threadid=140422026360384) at ./nptl/pthread_kill.c:50
  2    Thread 0x7fb68cfff640 (LWP 20058) "Datadog::Tracin" futex_wait (private=0, expected=2, futex_word=0x7fb6c9df3040 <lock>) at ../sysdeps/nptl/futex-internal.h:146
(gdb)

(gdb) thread 2
[Switching to thread 2 (Thread 0x7fb68cfff640 (LWP 20058))]
#0  futex_wait (private=0, expected=2, futex_word=0x7fb6c9df3040 <lock>) at ../sysdeps/nptl/futex-internal.h:146
146  ../sysdeps/nptl/futex-internal.h: No such file or directory.
(gdb) bt
#0  futex_wait (private=0, expected=2, futex_word=0x7fb6c9df3040 <lock>) at ../sysdeps/nptl/futex-internal.h:146
#1  __GI___lll_lock_wait_private (futex=futex@entry=0x7fb6c9df3040 <lock>) at ./nptl/lowlevellock.c:34
#2  0x00007fb6c9d14e14 in __check_pf (seen_ipv4=seen_ipv4@entry=0x7fb68cef9922, seen_ipv6=seen_ipv6@entry=0x7fb68cef9923, in6ai=in6ai@entry=0x7fb68cef9930, in6ailen=in6ailen@entry=0x7fb68cef9938)
    at ../sysdeps/unix/sysv/linux/check_pf.c:307
#3  0x00007fb6c9cdded1 in __GI_getaddrinfo (name=<optimized out>, service=<optimized out>, hints=<optimized out>, pai=<optimized out>) at ../sysdeps/posix/getaddrinfo.c:2289
#4  0x00007fb6c489ee3c in nogvl_getaddrinfo (arg=arg@entry=0x7fb68cef9fa0) at raddrinfo.c:190
#5  0x00007fb6ca240ba5 in rb_nogvl (func=0x7fb6c489ee20 <nogvl_getaddrinfo>, data1=0x7fb68cef9fa0, ubf=<optimized out>, data2=<optimized out>, flags=<optimized out>) at thread.c:1691
#6  0x00007fb6c48a16ae in rsock_getaddrinfo (host=140422157856960, port=<optimized out>, hints=hints@entry=0x7fb68cefa840, socktype_hack=socktype_hack@entry=1) at raddrinfo.c:562
#7  0x00007fb6c48a21d6 in rsock_addrinfo (host=<optimized out>, port=<optimized out>, family=family@entry=0, socktype=socktype@entry=1, flags=flags@entry=0) at raddrinfo.c:604
#8  0x00007fb6c489663b in init_inetsock_internal (v=v@entry=140422025292304) at ipsocket.c:63
#9  0x00007fb6ca09ffd6 in rb_ensure (b_proc=b_proc@entry=0x7fb6c48965c0 <init_inetsock_internal>, data1=data1@entry=140422025292304, e_proc=e_proc@entry=0x7fb6c4896570 <inetsock_cleanup>, 
    data2=data2@entry=140422025292304) at eval.c:993
#10 0x00007fb6c4896c7a in rsock_init_inetsock (sock=sock@entry=140421727738720, remote_host=remote_host@entry=140422157856960, remote_serv=remote_serv@entry=16253, local_host=local_host@entry=8, 
    local_serv=<optimized out>, type=type@entry=0, resolv_timeout=8, connect_timeout=8) at ipsocket.c:187
#11 0x00007fb6c489716e in tcp_init (argc=<optimized out>, argv=<optimized out>, sock=140421727738720) at tcpsocket.c:48
#12 0x00007fb6ca291161 in vm_call0_cfunc_with_frame (argv=0x7fb68cefd998, calling=<optimized out>, ec=0x7fb689a22050) at /usr/src/ruby/vm_eval.c:149
#13 vm_call0_cfunc (argv=0x7fb68cefd998, calling=<optimized out>, ec=0x7fb689a22050) at /usr/src/ruby/vm_eval.c:163
#14 vm_call0_body (ec=0x7fb689a22050, calling=<optimized out>, argv=0x7fb68cefd998) at /usr/src/ruby/vm_eval.c:209
#15 0x00007fb6ca293c64 in vm_call0_cc (kw_splat=<optimized out>, cc=<optimized out>, argv=<optimized out>, argc=4, id=3121, recv=140421727738720, ec=0x7fb689a22050) at /usr/src/ruby/vm_eval.c:86
#16 rb_call0 (ec=0x7fb689a22050, recv=140421727738720, mid=3121, argc=4, argv=<optimized out>, call_scope=<optimized out>, self=140422964243000) at /usr/src/ruby/vm_eval.c:550
#17 0x00007fb6ca294886 in rb_call (scope=CALL_FCALL, argv=0x7fb68cefd998, argc=4, mid=3121, recv=140421727738720) at /usr/src/ruby/vm_eval.c:876
#18 rb_funcallv_kw (recv=recv@entry=140421727738720, mid=mid@entry=3121, argc=argc@entry=4, argv=argv@entry=0x7fb68cefd998, kw_splat=kw_splat@entry=0) at /usr/src/ruby/vm_eval.c:1073
#19 0x00007fb6ca0a24e4 in rb_obj_call_init_kw (obj=obj@entry=140421727738720, argc=argc@entry=4, argv=argv@entry=0x7fb68cefd998, kw_splat=kw_splat@entry=0) at eval.c:1575
#20 0x00007fb6ca1582ee in rb_class_new_instance_kw (argc=argc@entry=4, argv=argv@entry=0x7fb68cefd998, klass=klass@entry=140422964243000, kw_splat=0) at object.c:2003
#21 0x00007fb6ca0d9d87 in rb_io_s_open (argc=4, argv=0x7fb68cefd998, klass=140422964243000) at io.c:7477

Our environment is as follows:

AWS ECS
Ubuntu 22.04
Ruby: 3.1.3 (MRI)
Rails: 6.1.7.4
Resque: 1.27.2.1
ddtrace: tested versions 1.2, 1.3, 1.11, and 1.12 (only 1.2.0 does not exhibit the issues described above)
Glibc: 2.35

Our ddtrace initialization config:

options = {analytics_enabled: false}

Datadog.configure do |c|
  c.runtime_metrics.enabled = true
  c.tracing.analytics.enabled = true
  c.profiling.enabled = true

  c.tracing.instrument :action_mailer, options
  c.tracing.instrument :action_pack, options
  c.tracing.instrument :action_view, options
  c.tracing.instrument :active_model_serializers, options
  c.tracing.instrument :active_record, options
  c.tracing.instrument :active_support, options
  c.tracing.instrument :active_job, options
  c.tracing.instrument :aws, options
  c.tracing.instrument :dalli, options
  c.tracing.instrument :delayed_job, options
  c.tracing.instrument :faraday, options
  c.tracing.instrument :rails, options

  c.tracing.instrument :racecar, options
  c.tracing.instrument :redis, options
  c.tracing.instrument :resque, options

  c.tracing.instrument :rack, options.merge(
    analytics_enabled: true,
    request_queuing: true,
  )
end

Datadog::Tracing.before_flush do |trace|
  trace.spans.each { |span| span.set_tag('custom.request_uuid', RequestStore.store[:request_uuid]) }
  trace
end

It's also worth noting that I was only able to replicate this behavior on AWS ECS. I tried reproducing it on a local Linux desktop (Arch rather than Ubuntu) and even when running 10 Resque workers could not replicate the issue. I don't know if that makes it environment specific or, since it only happens in high-load, that the considerably higher hardware capabilities of my desktop compared to the limited AWS containers prevented it from occurring.

I've also reviewed https://github.com/DataDog/dd-trace-rb/issues/466 which seems like a similar issue, but appears to be unrelated.

Overall, I've spent nearly two weeks getting to this point hoping to identify/solve the issue, but at least to obtain enough information to intelligently open a bug report. At this point I'm not sure what else to debug and am hoping for advice on what might be causing these deadlocks and how to further diagnose the problem. Thank you for reviewing.

Thanks @shanet for the detailed breakdown, and sorry that you're running into this issue!

Indeed it looks like the library_free from the ffi gem seems to be running and getting stuck somewhere in the glibc code.

Looking at __thread_gscope_wait it seems to grab the dl_stack_cache_lock before getting itself stuck on line 51 which is the same lock that __nptl_deallocate_stack needs so that seems to point at why the other threads are stuck as well.

To be honest, I'm not entirely sure that glibc is doing at this point, but before going deeper on glibc, there's a few things I'd suggest we look into at the Ruby level.

You mentioned bisecting the issue to 26ac04c06f87918eae773692dd428491bf8a11a4. That's... surprising since that change affected the new "CPU Profiling 2.0" that in 1.3.0 could not be turned on at all. (Enabling it happened in #2209, which was shipped in 1.4.0, and it was very experimental and not recommended).

But I suggest trying to remove the profiler from the equation -- can you set DD_PROFILING_NO_EXTENSION=true while installing ddtrace? This stops the native extension from being built at all (thus disabling the profiler), and so we can check if the issue happens without it.

You should see this in your logs after installing the gem with that option:

W, [2023-07-31T20:51:19.507276 #570693] WARN -- ddtrace: [ddtrace] Profiling was requested but is not supported, profiling disabled: Your ddtrace installation is missing support for the Continuous Profiler because the DD_PROFILING_NO_EXTENSION environment variable is/was set to true during installation. If you needed to use this, please tell us why on https://github.com/DataDog/dd-trace-rb/issues/new so we can fix it :)

Note that this setting is applied at gem installation time, so if the gem is already installed, it won't do anything.
Currently, ffi is an indirect dependency of ddtrace (via libddwaf). But libddwaf only gets enalbed if you enable the appsec functionality, which I don't see in your configuration. This left me really curious -- what library is getting garbage collected by ffi?

Since you're able to attach to the process, can you try to call dlinfo on the handle passed to library_free (reference) so we can figure out which library is being garbage collected?

I hope this will point us at either libddwaf, or at another gem that seems to, together with ddtrace, cause this incompatibility.

If needed, as an alternative, I can prepare a branch of ffi to log this information without needing gdb.

I hope that either one of these may provide clues to the next steps. Thanks again for the patience on this issue.

Hey @ivoanjo, thanks for taking a look at this so quickly and providing some ideas on what else to check.

You mentioned bisecting the issue to 26ac04c. That's... surprising since that change affected the new "CPU Profiling 2.0" that in 1.3.0 could not be turned on at all. (Enabling it happened in #2209, which was shipped in 1.4.0, and it was very experimental and not recommended).

I wouldn't place a ton of weight on this as bisecting this was very difficult. I learned that in some cases I had to let it run Resque jobs overnight before a worker would become hung. It's trivial to mark a commit as bad once a single worker hangs, but verifying one as good means accepting that a sufficient amount of time has passed without a worker hanging. As such, I cannot 100% conclusively say that this is where the problematic behavior was introduced, just that after letting many of these commits run for 12+ hours without problems this is where I landed.

Disabling ddtrace native extension:

I built a new image with the DD_PROFILING_NO_EXTENSION flag enabled, then started a Rails console to ensure it was set as expected in this image:

# rails c
[...]
W, [2023-07-31T23:36:49.952941 #3623]  WARN -- ddtrace: [ddtrace] Profiling was requested but is not supported, profiling disabled: Your ddtrace installation is missing support for the Continuous Profiler because the `DD_PROFILING_NO_EXTENSION` environment variable is/was set to `true` during installation. If you needed to use this, please tell us why on <https://github.com/DataDog/dd-trace-rb/issues/new> so we can fix it :)
[...]
irb(staging):001:0>

I then observed a hung Resque worker within about 15 minutes of starting this container. I guess that means the ddtrace extension is not responsible.

dlinfo check:

I went to the library_free frame and called dlinfo as shown below:

(gdb) set $info = malloc(256)
(gdb) call dlinfo(library->handle, RTLD_DI_ORIGIN, $info)
$3 = 0
(gdb) p (char*)$info
$5 = 0x7f8108464d00 "/usr/local/bundle/gems/http-parser-1.2.3/ext/x86_64-linux"

I checked seven hung Resque worker processes across two servers and found that the libraries in question were either:

/usr/local/bundle/bundler/gems/rdkafka-ruby-926b26d0a02e/lib/rdkafka/../../ext (three workers)
/usr/local/bundle/gems/http-parser-1.2.3/ext/x86_64-linux (four workers)

Why the library being unloaded is inconsistent and why it's these two specifically raises more questions than answers...

If it weren't for the fact that I know running ddtrace 0.x or (I believe) 1.2.0 fixes the issue I would be doubting it's even a ddtrace problem at this point.

Is there anything in ddtrace that may be affecting the loading/unloading of shared objects with Ruby? I'm really struggling to understand how this is related to ddtrace now.

Thanks for the replies!

If it weren't for the fact that I know running ddtrace 0.x or (I believe) 1.2.0 fixes the issue I would be doubting it's even a ddtrace problem at this point.

Is there anything in ddtrace that may be affecting the loading/unloading of shared objects with Ruby? I'm really struggling to understand how this is related to ddtrace now.

Yeah, I'm very curious about this myself.

There's nothing in ddtrace that tries to intercept loading/unloading of stuff. I half-wonder if what may be happening here is that ddtrace may be the "straw that breaks the camel's back", but for now let's proceed assuming that the issue is caused directly by ddtrace.

I was looking into the ffi gem code, and the stack you shared, and realized that Ruby has decided to run all of the finalizers before it exits. E.g. what's happened is that a Ruby process forked, then did its work, and now is cleaning up before the process shuts down.

This means that actually the unloading of libraries doesn't really need to happen, in fact ffi doesn't even do it on macOS: https://github.com/ffi/ffi/blob/a480ff6db209713fc73a3dbbb4f1c31a376d6e6a/ext/ffi_c/DynamicLibrary.c#L201-L205

...but this also suggests to me that, as long as http-parser and rdkafka are also loaded when you're using ddtrace 0.x, ffi is probably doing the same thing there as well. So why does it behave differently?

Looking at glibc code again, there seems to be a lot of stuff that sets the "thread gscope flag" (in a way, it looks to be a per-thread lock-lookalike). Because the flag/lock-lookalike in __thread_gscope_wait is per-thread, maybe we can spot which thread is the main thread waiting on.

That is, can you try checking in gdb what's the ((struct pthread *) p)->tid that the main thread is waiting on? Maybe that will provide an hint on what thread are we waiting on...

One other thing that came to mind is -- are you running resque with or without RUN_AT_EXIT_HOOKS? Since the impact is when the process is shutting down, it may be worth trying if it changes something here.

Aside from that, I guess two "avoid the issue fast" workarounds could be:

Avoid loading those gems in the Resque process if they aren't needed
Use a patched ffi that doesn't unload the libraries at shutdown

I think I understand what's happening here. The issue is that, by default, Resque forks before each job it processes. According to POSIX, forking a multi-threaded program and then calling a non-async-signal-safe function, like dlclose, results in undefined behavior including potential deadlocks.

Specifically, from the fork man page:

A process shall be created with a single thread. If a multi-threaded process calls fork(), the new process shall contain a replica of the calling thread and its entire address space, possibly including the states of mutexes and other resources. Consequently, to avoid errors, the child process may only execute async-signal-safe operations until such time as one of the exec functions is called.

So then in this case, ddtrace is running its background threads, resque forks to run a job, and then when the exited process is being cleaned up, it gets deadlocked on a mutex inside of the dlclose call.

After all this I think we've been having this problem of deadlocked Resque workers for a long time now, but never noticed it because it was so difficult to reproduce. Something in ddtrace version 1.3.0 and beyond made it start happening far more frequently. I suppose maybe because there's more background threads now? I'm not sure, I just know now that it's extremely difficult, but possible, to reproduce this problem with versions prior to 1.3.0, including 0.x versions as well.

The solution I came up with was to simply set Resque's FORK_PER_JOB env var to false so that it doesn't fork before processing a job. That avoids this whole failure mode and is also noticeably faster too due to the lack of needing to fork. The downside being potential memory bloat though. I also ran across a note about this from New Relic as well which suggests the same workaround. Although they don't mention why so maybe they don't understand the problem with their library?

In terms of ddtrace, I'm not sure if there's a fix here, or really if it's even a ddtrace problem as the issue lies with Resque's default forking model I believe. Maybe it would be good to stop the ddtrace threads before forking and then start them again in the parent process to avoid this problem? There's already a patched fork method in ddtrace so adding that logic in there wouldn't be too much work I suppose.

Regardless, thank you for your help. Once I realized the library being closed was not ddtrace's it made me turn my attention to glibc and only then did I realize the issue was not with the ddtrace upgrade we did, but with Resque's forking model in multithreaded environments.

Thanks for the feedback!

Indeed the discussion on newrelic that you mentioned, + the one on rescue side as well as sidekiq even does sound suspiciously familiar...

It's somewhat weird that is seems to happen that much more often with resque -- it seems the timing for it forking seems to cause this issue more often than with other tools in the Ruby ecosystem.

I'll make sure to open a PR to document this on our "Known issues and suggested configurations" section of the docs, so hopefully other users won't run into this really sharp edge.

It's somewhat weird that is seems to happen that much more often with resque -- it seems the timing for it forking seems to cause this issue more often than with other tools in the Ruby ecosystem.

I've been thinking about this a bit more and I suspect I understand why this seems to show up Resque.

I guess it's right there in the name of the feature -- fork per job. Considering you may have a bunch of machines running resque, and they're processing many jobs, then what the Resque folks accidentally built is a Ruby app fork safety fuzzer -- it probably forks so much more often than anything else in the Ruby ecosystem that any once-in-a-blue-moon forking bugs start showing up every day. That would explain why these kinds of issues seem to show up around Resque first...

Right, it is happening consistently in production for us, but outside of prod the only way I could reproduce it was by putting extremely high load through Resque, like on the order of hundred of thousands of jobs being worked through by 10-15 workers. That's definitely a lot of forking!

Also, for anyone else that stumbles across this in the future, our security team wasn't super thrilled with setting FORK_PER_JOB to false so another solution I came up with was stopping ddtrace's threads before the fork and restarting them afterward with Resque's before_fork and after_fork callbacks as such:

Rails.application.config.datadog_initialize = proc do
  Datadog.configure do |c|
     [...]
  end
end

Rails.application.config.datadog_initialize.call

  Resque.before_fork do
    Datadog.shutdown!
  end

  Resque.after_fork do |job|
    Rails.application.config.datadog_initialize.call
  end

We're still testing the above as I write this, but it's looking good so far. The downside to this is that it only handles ddtrace's threads. If there's any other unsafe-fork threads running before the fork the same issue may occur and those would need to be stopped & restarted as well.

Hi there, just chiming to say we are experiencing this problem as well...except we're using Sidekiq Swarm. Sidekiq Swarm boots the app, then forks a desired number of child processes. We're going to try and rollout the same trick above and will report back.

@ivoanjo would it be possible to do what Rails does and use Process._fork and reset things in Datadog internals?

Thanks for letting us know about Swarm as well.

@ngan, is the issue for you easily reproduceable, or something that happens more rarely? Having a reproducer for this issue would definitely help in us experimenting a bit with a few options (such as hooking on fork).

Also, could you share your Datadog configuration as well? We'd like to look into if some settings may correlate to this happening more often vs less often.

is the issue for you easily reproduceable

It's fairly reproducible on production for us. We process millions of jobs a day with about ~1400 workers. We'll see 5 of them get stuck after 20 or so minutes. When we send a TTIN signal, we get...

Thread TID-1fyft Datadog::Tracing::Workers::AsyncTransport
/usr/local/bundle/ruby/3.2.0/gems/sidekiq-6.5.8/lib/sidekiq/cli.rb:199:in `backtrace'
/usr/local/bundle/ruby/3.2.0/gems/sidekiq-6.5.8/lib/sidekiq/cli.rb:199:in `block (2 levels) in class:CLI'
/usr/local/bundle/ruby/3.2.0/gems/sidekiq-6.5.8/lib/sidekiq/cli.rb:196:in `each'
/usr/local/bundle/ruby/3.2.0/gems/sidekiq-6.5.8/lib/sidekiq/cli.rb:196:in `block in class:CLI'
/usr/local/bundle/ruby/3.2.0/gems/sidekiq-6.5.8/lib/sidekiq/cli.rb:211:in `handle_signal'
/usr/local/bundle/ruby/3.2.0/gems/sidekiq-6.5.8/lib/sidekiq/cli.rb:136:in `launch'
/usr/local/bundle/ruby/3.2.0/gems/sidekiq-6.5.8/lib/sidekiq/cli.rb:121:in `run'
/usr/local/bundle/ruby/3.2.0/gems/sidekiq-ent-2.5.3/lib/sidekiq-ent/swarm.rb:237:in `block in forkit'
/usr/local/bundle/ruby/3.2.0/gems/sidekiq-ent-2.5.3/lib/sidekiq-ent/swarm.rb:234:in `fork'
/usr/local/bundle/ruby/3.2.0/gems/sidekiq-ent-2.5.3/lib/sidekiq-ent/swarm.rb:234:in `forkit'
/usr/local/bundle/ruby/3.2.0/gems/sidekiq-ent-2.5.3/lib/sidekiq-ent/swarm.rb:230:in `spawn'
/usr/local/bundle/ruby/3.2.0/gems/sidekiq-ent-2.5.3/lib/sidekiq-ent/swarm.rb:80:in `block in start'
/usr/local/bundle/ruby/3.2.0/gems/sidekiq-ent-2.5.3/lib/sidekiq-ent/swarm.rb:79:in `times'
/usr/local/bundle/ruby/3.2.0/gems/sidekiq-ent-2.5.3/lib/sidekiq-ent/swarm.rb:79:in `start'
/usr/local/bundle/ruby/3.2.0/gems/sidekiq-ent-2.5.3/lib/sidekiq-ent/swarm.rb:90:in `start_and_monitor'
/usr/local/bundle/ruby/3.2.0/gems/sidekiq-ent-2.5.3/bin/sidekiqswarm:49:in `top (required)'
/var/www/bin/sidekiqswarm:27:in `load'
/var/www/bin/sidekiqswarm:27:in `main'

When we rolled out the workaround I mentioned above, we lost all Sidekiq traces 😓 so we had to revert. What's the official API/way to tell ddtrace to restart everything after forking?

Ok so turns out we did it incorrectly. This is the proper way to apply the workaround for sidekiq swarm:

config/initializers/sidekiq.rb

Sidekiq.configure_server do |config|
  if defined?(Sidekiq::Enterprise::Swarm)
    Datadog.shutdown!
  end

  config.on(:fork) do
    WhateverMethodToReconfigureDatadog.call
  end
end

We're going to roll this out now and will report back...

@ngan For reference, that solution I posted above did not end up working for us. In fact, I completely removed ddtrace from our Rails app and still experienced the deadlocked Resque workers. The severity is far less and I'm still not sure of the connection, but overall I believe the problem runs deeper with forking Ruby processes when there are loaded shared objects from gems' native extensions. The only solution I could find was to stop Resque from forking per the env var I wrote about above as I believe there's a fundamental architecture flaw with Resque's forking model in these situations. This is not a tenable solution for us due to security reasons however so I still need to find another solution that is to be determined. But the point is that I don't believe this is unique to ddtrace anymore at this point.

@shanet Ah. Welp, we can now also confirm that we are still seeing stuck workers with the workaround in place. 😅

Yeah, our current suspicion is that this may not be entirely under ddtrace's control, but something we do may make the issue more likely to show up.

If you shut down ddtrace before forking, then no ddtrace background threads should be running.

@ngan:

Can you confirm that after Datadog.shutdown! there are no threads with names starting with Datadog in Thread.list? That way we can validate if there's still some left over background work or not.
Can you share your Ruby version and ddtrace version and configuration as well? This information will allow us to identify if there's specific Ruby/ddtrace versions that seem to cause this more often.

Hi all, On our side we fixed it by calling an exit!(0) in a at_exit hook My investigation result were :

it's in the ruby implementation of the fork that seems having issue about handling of forked threads.
I created an issue in glibc bugzilla : https://sourceware.org/bugzilla/show_bug.cgi?id=30530 and having and answer about the fact that "Threads and fork don't mix well. After a fork only async-signal-safe functions can be called in the child." and it seems that in rb_fork_ruby2 of ruby project it use some exec_async_signal_safe, but don't sure it well completely implemented (https://github.com/ruby/ruby/blob/50c6cabadca44b7b034eae5dcc8017154a2858bd/process.c).

Maybe we have to check in https://bugs.ruby-lang.org/projects/ruby-master/ and create an issue if needed.

My two cents.

On our side we fixed it by calling an exit!(0) in a at_exit hook

This seems dangerous to me. For example, with ddtrace, what if you have a background thread with data still in its buffer to be sent over the network? By forcefully aborting the program you're not giving any threads the ability to exit gracefully and opening up to potential loss.

Threads and fork don't mix well. After a fork only async-signal-safe functions can be called in the child.

I ran across that glibc bug report when I started investigating this (it's actually how I first knew I wasn't crazy with this whole thing and someone else was having the same issue!). The comment on it regarding async-safe-safe functions appears correct to me though. dlclose is not a signal-safe function and that's what FFI is calling.

That's not to say FFI is at fault either, it's correctly cleaning up after itself by closing the shared objects. That's partly why this issue is so tricky, there's no single piece of software that's doing something wrong, but the confluence of all of them put together creates this extremely odd edge case.

That said, I haven't given up on this either. I've recently been trying an approach to kill all threads in an at_exit block with a SIGTERM so that glibc can't get stuck on a thread in the dlclose call and the threads can exit gracefully. I'm still validating and testing this and will update this thread with an example if it pans out. It would be specific to Linux though as it relies on the tgkill syscall, which is not portable.

I believe we solved our deadlocked resque workers issue with a combination of approaches in this thread. Here's a summary:

First and foremost, I believe the proper fix to this all is to not have Resque fork per job by setting the FORK_PER_JOB env var to false. There's a fundamental flaw with Resque's forking model if you have a multithreaded Ruby program. In my case, our security team was not accepting of losing the forking model because they liked the data isolation provided by a separate process and its subsequent cleanup after a job was run.

So, if you need to keep Resque forking, here's what I did: Define Resque before and after fork blocks as such:

task "resque:setup" => :environment do
  Resque.before_fork do
    Datadog.shutdown!
  end

  Resque.after_fork do |job|
    # Restart Datadog's threads after the fork
    Rails.application.config.datadog_initialize.call

    at_exit do
      Datadog.shutdown!
      Rails.logger.close if Rails.logger.respond_to?(:close)
      exit!(0)
    end
  end
end

And then for the Datadog initialization:

Rails.application.config.datadog_initialize = proc do
  Datadog.configure do |c|
    [ddtrace configuration...]
  end
end

Rails.application.config.datadog_initialize.call

The above is a combination of my previous attempts to shut down any background threads before the forked process exits and @CedricCouton's suggestion to make use of exit!. exit! will prevent Ruby from calling the finalizer method from the FFI gem which calls the poisonous dlclose method for any opened shared objects.

However, calling exit! alone risks your background threads not flushing any data in their buffers and leading to data loss. As such, I audited all of the at_exit handlers in our application (and all gems we use) to manually call their relevant shutdown methods in Resque's at_exit block. This ensures that they close cleanly before using the blunt hammer that is exit!. The above example only has calls for the threads relevant here; your application likely has others.

The other component here is that this approach solved 95% of the deadlocked workers, but months ago I observed a single deadlocked worker stuck on a getaddrinfo call which I never reproduced afterward. This failure mode became more apparent now and I started seeing deadlocked workers stuck on it. I believe https://github.com/DataDog/dd-trace-rb/issues/466 has more on that. To fix this, I added the Datadog.shutdown! call in the before_fork hook. This prevents the ddtrace thread from being put in an inconsistent state if the process forks at the wrong time. The downside here being that we won't get any telemetry from the parent process anymore, but we don't really need/use that anyway so that's okay. If it is needed Resque's worker_exit hook could be used to start it again I believe (I have not tested this).

I think that about wraps it up. Since deploying all of this and the 1.x ddtrace gem there have been no deadlocked workers which is a huge relief. My previous idea of using tgkill to kill all threads may or may not have worked, I abandoned it after realizing how effective exit! was. Through my reading of glibc to find a workaround, it seems that it sets an internal multiple_threads flag as soon as a second thread is started. This flag controls the cleanup logic in dlclose that causes the deadlock. Unfortunately, even if killing all threads this flag remains set to true so it's possible that the deadlocks would still happen. I didn't bother to validate this further as I had another, cleaner solution at this point.

Thanks to everyone in this thread for your ideas. This was certainly an exceedingly difficult problem to debug and find an acceptable fix for that didn't involve intrusive patches to Ruby, FFI, or even glibc.

Thanks @shanet for sharing your conclusions. Indeed, it's a really tough nut to crack, a combination of weaknesses in the whole ecosystem, with dd-trace-rb doing also its part to trigger the issue by having background threads sending data :/

For now this has been added as a known issue to https://docs.datadoghq.com/tracing/trace_collection/dd_libraries/ruby/#resque-workers-hang-on-exit with our recommended workaround, so I'll close this ticket for now.

Please feel free to re-open if the workaround doesn't work or you need any help with it :)

@ivoanjo I wonder if you've ever considered decorating Process._fork (Ruby 3.1+) to automatically shutdown and restart the monitoring thread?

Some example of that approach: https://github.com/Shopify/grpc_fork_safety

Adopting the monkey patching of _fork is a good idea -- we're actually still monkey patching Process + Kernel + Object to catch forks because of old Rubies, but we could definitely have a separate code path for the newer Rubies :smile:

So, we're automatically restarting the threads currently as well...

The annoying problem in this case is the need to shut down the library on the parent process. This is where I hesitate a bit more:

It could mean losing visibility on some of the processes, although depending on what those processes are this could be irrelevant (and we could restart after the fork...)
shutdown! as discussed here is in a way a workaround for getting the library to a quiescent state, which does work in practice, but seems rather heavy-handed :thinking:. But would that be the default? Intuitively it looks like we'd be better off with a more specific solution to get the library state ready for forking than relying on shut down...

we're actually still monkey patching Process + Kernel + Object to catch forks because of old Rubies

Interesting. I'm not using the gem myself, was just asking questions because I heard of some fork issues (cc @beauraF). And they had to call shutdown on before_fork.

we'd be better off with a more specific solution to get the library state ready for forking than relying on shut down...

Yeah, something else I've used is simply to have a Mutex that wrap all the unsafe call (typically anything that can call getaddrinfo), so that you can jut acquire it without forking and fork without having to shutdown the thread.

... I heard of some fork issues (cc @beauraF)

@beauraF did you have an issue with Resque as well, or was it a different situation? I'd definitely like to know more, if you're up for sharing.

Yeah, something else I've used is simply to have a Mutex that wrap all the unsafe call (typically anything that can call getaddrinfo), so that you can jut acquire it without forking and fork without having to shutdown the thread.

This makes a lot of sense! It occurs to me that a read/write lock would be the perfect fit for something like this -- we could have as many concurrent "readers" of getaddrinfo as needed, but only one "writer" of fork that should exclude all others.

We're discussing moving more of our networking code into the native rust-built libdatadog gem, and such a refactoring would provide a good opportunity to also solve this issue.

I'll go ahead and create an issue to revisit this.

It occurs to me that a read/write lock would be the perfect fit for something like this

Yup: https://bugs.ruby-lang.org/issues/20590 https://github.com/ruby/ruby/pull/10864

Oh wow that's cool! I hadn't seen your PR haha, that's amazing :)

Hey 👋

@beauraF did you have an issue with Resque as well, or was it a different situation?

Yes, we had issue with resque. @CedricCouton and I are working together (see https://github.com/DataDog/dd-trace-rb/issues/3015#issuecomment-1709857797). To be precise, we in fact had issue with resque and pitchfork (without reforking).

At the end, we fixed it with:

module Doctolib
  module O11y
    module Extension
      module ForkSafety
        module Datadog
          def _fork
            Doctolib::Datadog.stop # ::Datadog.shutdown!
            pid = super
            Doctolib::Datadog.start # ::Datadog.configure(&configuration)
            pid
          end
        end
      end
    end
  end
end

Also, we did the same for dogstatsd and, I suspect https://docs.datadoghq.com/tracing/metrics/runtime_metrics/ruby/ to be what corrupted our pitchfork mold.

Happy to share everything you need, and even jump on a call if needed!

@beauraF Interesting, I hadn't seen reports of it affecting pitchfork before, but thinking back to how it works, I can see how it may also trigger the issue.

Happy to share everything you need, and even jump on a call if needed!

I'd like to take you up on your offer! Can you drop me a quick e-mail at ivo.anjo at employerhq so we can set that up?

Quick update here. While getaddrinfo can definitely result in deadlocked child, IIRC @beauraF said it started happening after upgrading to 3.3, and we also had deadlocked child in the Rails test suite on 3.3 specifically, and it was caused by https://github.com/ruby/ruby/pull/11356 which was backported into 3.3.5.

So anyone experiencing this on 3.3.x, I highly suggest to upgrade to 3.3.5.

As for getaddrinfo, I merged https://github.com/ruby/ruby/pull/10864, but that will only be in Ruby 3.4.

:bow: thanks for the heads-up! (And I'm VERY excited about https://github.com/ruby/ruby/pull/10864 )

DataDog / dd-trace-rb

Resque workers deadlocked from ddtrace after processing jobs #3015