DataDog / dogstatsd-ruby

A Ruby client for DogStatsd
https://www.datadoghq.com/
MIT License
179 stars 137 forks source link

Segfault(s) with 5.5.0 #265

Closed aharpervc closed 2 years ago

aharpervc commented 2 years ago

Hi, I'm in the process of upgrading a Rails app, and updated dogstatd-ruby from 3.3.0 to 5.5.0. The former version had been working fine without issue for years, but now with the upgraded version I'm now seeing a segfault. The only code that's changed is related to this gem. I'm on Rails 5.2.2.1 although hopefully that part shouldn't matter.

My code is basically this (somewhat simplified):

# config\initializers\datadog_instrumentation.rb
require 'datadog/statsd'
STATSD = Datadog::Statsd.new('localhost', 8125, buffer_flush_interval: 60)

class DataDogInstrumentation
  def self.notify app, event, *data
    STATSD.count "#{app}.#{event}", data.first[:count]
  end
end

# newly added to follow the v5 requirement of always calling `close`
at_exit do
  STATSD.close
end
# code elsewhere in the Rails app does this (simplified)
DataDogInstrumentation.notify "my_app", "my_event", count: 5

Here's the log (slightly edited & truncated for brevity; line 30 in the original code is STATSD.close in the at_exit block):

[BUG]
Segmentation fault at 0x00005563f7bb5f21

ruby 2.5.3p105 (2018-10-18 revision 65156) [x86_64-linux]

-- Ruby level backtrace information ----------------------------------------
/my-apps-path-here/releases/20220923182902/config/initializers/datadog_instrumentation.rb:30:in `block in <main>'
/my-apps-path-here/shared/bundle/ruby/2.5.0/gems/dogstatsd-ruby-5.5.0/lib/datadog/statsd.rb:380:in `close'
/my-apps-path-here/shared/bundle/ruby/2.5.0/gems/dogstatsd-ruby-5.5.0/lib/datadog/statsd/forwarder.rb:109:in `close'
/my-apps-path-here/shared/bundle/ruby/2.5.0/gems/dogstatsd-ruby-5.5.0/lib/datadog/statsd/sender.rb:107:in `stop'
/my-apps-path-here/shared/bundle/ruby/2.5.0/gems/dogstatsd-ruby-5.5.0/lib/datadog/statsd/timer.rb:37:in `stop'
/my-apps-path-here/shared/bundle/ruby/2.5.0/gems/dogstatsd-ruby-5.5.0/lib/datadog/statsd/timer.rb:37:in `synchronize'
/my-apps-path-here/shared/bundle/ruby/2.5.0/gems/dogstatsd-ruby-5.5.0/lib/datadog/statsd/timer.rb:38:in `block in stop'
/my-apps-path-here/shared/bundle/ruby/2.5.0/gems/dogstatsd-ruby-5.5.0/lib/datadog/statsd/timer.rb:38:in `signal'

That's this code:

https://github.com/DataDog/dogstatsd-ruby/blob/5cb9a87154674f66ac740606d58315213ab7775b/lib/datadog/statsd/timer.rb#L33-L42

I thought maybe there was something weird going on with the at_exit block, so I removed that part for testing, but I'm still getting a segfault. The stack trace is a bit different (again, somewhat edited/snipped for clarity/brevity):

/my-apps-path-here/shared/bundle/ruby/2.5.0/gems/dogstatsd-ruby-5.5.0/lib/datadog/statsd/sender.rb:76:

[BUG]
Segmentation fault at 0x0000562bf467b1b5

ruby 2.5.3p105 (2018-10-18 revision 65156) [x86_64-linux]

-- Ruby level backtrace information ----------------------------------------
/my-apps-path-here/releases/20220923194611/config/initializers/datadog_instrumentation.rb:13:in `notify'
/my-apps-path-here/shared/bundle/ruby/2.5.0/gems/dogstatsd-ruby-5.5.0/lib/datadog/statsd.rb:162:in `increment'
/my-apps-path-here/shared/bundle/ruby/2.5.0/gems/dogstatsd-ruby-5.5.0/lib/datadog/statsd.rb:190:in `count'
/my-apps-path-here/shared/bundle/ruby/2.5.0/gems/dogstatsd-ruby-5.5.0/lib/datadog/statsd.rb:430:in `send_stats'
/my-apps-path-here/shared/bundle/ruby/2.5.0/gems/dogstatsd-ruby-5.5.0/lib/datadog/statsd/forwarder.rb:75:in `send_message'
/my-apps-path-here/shared/bundle/ruby/2.5.0/gems/dogstatsd-ruby-5.5.0/lib/datadog/statsd/sender.rb:70:in `add'
/my-apps-path-here/shared/bundle/ruby/2.5.0/gems/dogstatsd-ruby-5.5.0/lib/datadog/statsd/sender.rb:70:in `synchronize'
/my-apps-path-here/shared/bundle/ruby/2.5.0/gems/dogstatsd-ruby-5.5.0/lib/datadog/statsd/sender.rb:76:in `block in add'
/my-apps-path-here/shared/bundle/ruby/2.5.0/gems/dogstatsd-ruby-5.5.0/lib/datadog/statsd/sender.rb:76:in `close'

That's this:

https://github.com/DataDog/dogstatsd-ruby/blob/5cb9a87154674f66ac740606d58315213ab7775b/lib/datadog/statsd/sender.rb#L63-L89

So, different code, but line 76 is message_queue.close if CLOSEABLE_QUEUES, so maybe the same root cause if it's related to closing something.

At first glance this seems like an issue with dogstatsd-ruby itself, especially since there had been no issues at all on the prior version.

Thoughts?

aharpervc commented 2 years ago

A workaround suggested by DataDog technical support was to add the single_thread: true option to the initializer. This does appear to make the segfault(s) go away.

scottopell commented 2 years ago

Thanks for the detailed report! I was able to reproduce this crash with the following ruby program:

require './datadog_instrumentation.rb' # using your snippet from the first post untouched

def do_work
        puts "Counting... " + Time.new.to_s
        DataDogInstrumentation.notify "my_app", "my_event", count: 5
end

def crash
        do_work
        sleep 1
        Process.fork() # leak the child
        do_work
        Process.exit()
end

def dontcrash
        do_work
        sleep 1
        do_work
        Process.exit()
end

puts ARGV.first
if ARGV.first == "crash"
        crash
else
        dontcrash
end

With uname -a:

Linux dev-ruby-repro 4.15.0-193-generic #204-Ubuntu SMP Fri Aug 26 19:25:57 UTC 2022 aarch64 aarch64 aarch64 GNU/Linux

I'm able to see a crash almost every single time running ruby main.rb crash against ruby 2.5.3. Upgrading to ruby 2.5.9 fixes the crash for me, I'm completely unable to reproduce it. (Note I'm testing on a slightly different platform, aarch64 vs amd64, so you may see slightly different results)

I would recommend upgrading ruby to2.5.9 version of ruby (or newer, 2.5 and 2.6 are both EOL according to this)

If this is not an option, then single_thread: true should help, however this doesn't completely remove the crash for me, it just makes it less frequent.

aharpervc commented 2 years ago

Thanks for investigating!

I would recommend upgrading ruby to2.5.9 version of ruby (or newer, 2.5 and 2.6 are both EOL according to this)

Oh I am well aware 😆. That is a separate project for us though. I'm satisfied that this is a reasonable answer to close this issue, though.