DataDog / dd-trace-rb

Datadog Tracing Ruby Client
https://docs.datadoghq.com/tracing/
Other
310 stars 375 forks source link

Crash tracker spawning long-running child processes? #3954

Closed luc-financeit closed 1 week ago

luc-financeit commented 1 month ago

Current behaviour

Upgrading the datadog gem from 2.2.0 to 2.3.0 causes a spec in our test suite to hang. The code being tested does something like

6.times do
  Process.fork { ... }
end

Process.waitall # Hangs

Looking at ps output, there's a libdatadog-crashtracking-receiver child process which our waitall is hung up on — manually killing the crashtracker process, or disabling the crashtracker entirely (via DD_CRASHTRACKING_ENABLED) allows the test to complete successfully.

F   UID     PID    PPID PRI  NI    VSZ   RSS WCHAN  STAT TTY        TIME COMMAND
4     0     328       0  20   0  16460  7680 -      Ss   ?          0:00 bash -l
4     0     557     328  20   0 190964 29656 -      Sl   ?          0:00  \_ ruby /usr/local/rvm/gems/ruby-3.1.6/bin/parallel_rspec --verbose spec
4     0     559     557  20   0 1414052 518752 -    Sl   ?          1:18      \_ ruby bin/rspec -O .rspec_parallel spec/the_hanging_spec.rb
0     0     560     559  20   0  24128  3844 -      S    ?          0:00          \_ /usr/local/rvm/gems/ruby-3.1.6/gems/libdatadog-11.0.0.1.0-x86_64-linux/vendor/libdatadog-11.0.0/x86_64-linux/libdatadog-x86_64-unknown-linux-gnu/bin/libdatadog-crashtracking-receiver 

Expected behaviour

No hang.

Steps to reproduce

I'm afraid I don't have a stand-alone reproduction — the actual code is significantly more complicated, and I haven't tried slimming it down to see if there's additional triggers. I can dig in to it more if the above isn't enough to go on.

Environment

anmarchenko commented 1 month ago

Hi @luc-financeit! Thanks for reporting this and sorry you've run into this issue.

You are absolutely correct, the crashtracker starts an external process that sends out crash report: if Ruby VM crashes, it is unable to send the report to us, so we need an additional process to send out these reports. I think there is really no other way of doing this.

Given that this is not critical for gem datadog or datadog-ci to work correctly, disabling crashtracker via DD_CRASHTRACKING_ENABLED=0 looks like a sensible thing to do. Would this workaround work for you?

delner commented 1 month ago

Thanks for reporting @luc-financeit ! And to @anmarchenko for a quick response and work around.

I'm sharing this with our team, see if we can diagnose/plan on what to do about this. We will share updates here as we have them.

TonyCTHsu commented 1 month ago

👋 @luc-financeit Thanks for reporting.

I can reproduce it with

require "bundler/inline"
gemfile { gem "datadog", "2.3.0", source: "https://rubygems.org" }

require "datadog"
Datadog.configure {}

Process.waitall

Indeed, Process.waitall is hanging when crashtracking is enabled. In addition, I have found that this behaviour does not applied to other of methods from Process (.wait, wait2, waitpid and waitpid2) when provided with specific pid.

For example:

pid = Process.fork do
  puts Process.pid
end

Process.wait(pid)

We will be looking into this issue and I suggest circumvent this by setting DD_CRASHTRACKING_ENABLED=false temporarily in your test suite.

p-datadog commented 1 month ago

waitall waits for all child processes, and will attempt to wait for child processes spawned by crash tracker.

It is convenient to use but does make the assumption that there aren't any processes spawned in the background by dependent libraries or the core runtime. A more robust solution is to keep track of pids for processes spawned by the application and wait for them explicitly using one of the other wait* methods.

luc-financeit commented 1 month ago

Sure, I agree that our code doing waitall is questionable, for several reasons. But this was still frustrating to debug, given that there is, as far as I can tell, zero documentation about the crash tracker — even the env var to turn it off isn't documented in the additional configuration section? (I should also note that turning it off via code-based configuration didn't seem to work for us — if I had to guess, probably similar load-order reasons as startup logs.)

"Don't use waitall" or "turn off the crash tracker if you do" are both reasonable solutions, but perhaps that tradeoff can be documented somewhere?

delner commented 1 month ago

Hey @luc-financeit : we're going to disable this crash tracker feature by default until we can test and deploy a suitable alternative. Look for this in our next release. Sorry for the frustration and thanks for your patience!

luc-financeit commented 1 week ago

We've been able to upgrade successfully. Thanks all for your help!