bugsnag / bugsnag-ruby

BugSnag error monitoring & reporting software for rails, sinatra, rack and ruby
https://docs.bugsnag.com/platforms/ruby
MIT License
246 stars 174 forks source link

Connection pool exhaustion from delivery thread serializing reports #812

Open evanlok opened 6 months ago

evanlok commented 6 months ago

Describe the bug

Bugsnag's delivery thread can checkout an ActiveRecord database connection that is never returned to the pool. This causes subsequent connections to time out waiting for a connection from the pool.

Steps to reproduce

  1. An ActiveJob job receives an ActiveRecord instance as a job argument
  2. The job raises an exception
  3. Bugsnag creates a report that contains the serialized job arguments
  4. The ActiveRecord class overrides to_s which is called when it gets serialized by bugsnag. The to_s method contains code that queries the database.
  5. Bugsnag's delivery thread checks out a database connection from the connection pool when the to_s method is invoked by Bugsnag::Cleaner#traverse_object
  6. The connection pool has now permanently lost a connection to the bugsnag delivery thread

Environment

Example code snippet

# Run a single instance of sidekiq worker with concurrency 2

class Foo < ApplicationRecord
  def to_s
    Bar.first # Database query triggered during bugsnag report delivery
    "foo"
  end
end

class Bar < ApplicationRecord
end

class MyJob < ApplicationJob
  def perform(foo)
     Bar.first # Query will time out waiting for connection
     sleep 10
     raise "Fail"
  end
end

Foo.create
Bar.create
10.times { MyJob.perform_later(Foo.first) }
Error messages: ``` ```
mclack commented 5 months ago

Hi @evanlok

Apologies for the delay in response, and thanks for your patience.

In order for us to investigate this further, can you please confirm whether our understanding of what you are doing in the example is correct?

  1. You're running Sidekiq supporting 2 concurrent jobs
  2. You're enqueueing 10 jobs to run, passing in the reloaded Foo instance
  3. The jobs are sleeping for 10 seconds, then raising a failure
  4. When the failure is raised, BugSnag serializes the argument, which is a Foo
  5. Serializing the argument calls to_s on it
  6. This makes further network requests, and doesn’t return the connection to the pool

Please elaborate on anything here that may not be exactly correct.

evanlok commented 5 months ago

Yes, that is correct.

clr182 commented 5 months ago

Hi @evanlok

Can you please elaborate on why you are making a database query in to_s ? We generally wouldn't recommend making a database call in such a way as to_s is a standard method in Ruby used to cast the thing you're looking at to a string representation. You wouldn't really expect this to do anything other than represent the in-memory object.

evanlok commented 5 months ago

The to_s method was querying for related configuration to determine the string output. We removed the database call from the method to resolve the issue but it's possible this can happen again in the future. It was quite difficult to debug initially so if there's a way bugsnag can prevent or warn when this occurs it would be helpful.

mclack commented 4 months ago

Hi @evanlok

Thanks for your patience on this.

We now have a task on our backlog to discuss and explore our options here when priorities allow. I can't currently provide an ETA on when this could be looked at, but we will make sure to post any further updates on this thread regarding future discussions or developments on this.