Shopify / statsd-instrument

A StatsD client for Ruby apps. Provides metaprogramming methods to inject StatsD instrumentation into your code.
http://shopify.github.io/statsd-instrument
MIT License

ThreadError: can't alloc thread #291

Closed: casperisfine closed this issue 2 years ago

casperisfine commented 3 years ago

We're investigating a weird issue that sometimes happens in job workers.

ThreadError: can't alloc thread
~/gems/statsd-instrument-3.1.1/lib/statsd/instrument/batched_udp_sink.rb:68:in `new'
~/gems/statsd-instrument-3.1.1/lib/statsd/instrument/batched_udp_sink.rb:68:in `<<'
~/gems/statsd-instrument-3.1.1/lib/statsd/instrument/batched_udp_sink.rb:36:in `<<'
~/gems/statsd-instrument-3.1.1/lib/statsd/instrument/client.rb:437:in `emit'
~/gems/statsd-instrument-3.1.1/lib/statsd/instrument/client.rb:317:in `ensure in latency'

Looking at MRI, this error seems to happen when the main thread is dead: https://github.com/ruby/ruby/blob/0fb782ee38ea37fd5fe8b1f775f8ad866a82a3f0/thread.c#L1027-L1029

Which initially sounded like nonsense, but after digging more, it turns out that this error is raised from another background thread, e.g.:

Thread.new do
  loop do
    StatsD.measure ...
  end
end

So the MRI error makes some sense.

Theory

I assume that somehow the main thread is indeed dead, but that a background thread is still alive and working, causing this issue.

I have to dig more to replicate this kind of scenario; until now I thought the Ruby process would abort once the main thread died.

tenderlove commented 3 years ago

I'm not sure how this could happen, but I wonder if there's somehow a race condition where the main thread gets killed (or put in a "dead" state) yet another thread gets scheduled to run even though the main thread is in the process of "dying".

From the slack comments (and just thinking about this) we have the main thread, which has a running heartbeat thread, and the heartbeat thread allocates a new thread to hit statsd. It's the allocation of the statsd thread that fails. Is that correct? @cardy31

casperisfine commented 3 years ago

It's the allocation of the statsd thread that fails. Is that correct?

That is correct yes.

I wonder if there's somehow a race condition where the main thread gets killed (or put in a "dead" state) yet another thread gets scheduled to run even though the main thread is in the process of "dying".

I have the same suspicion, yes. I have a few ideas on how to try to replicate that. I'll try my hand at it tomorrow.

cardy31 commented 3 years ago

@tenderlove yes that seems to be the case. I'm not sure if the failed allocation of the statsd thread is the cause though, or just a symptom.

The worker never sends a log for having exited the setup phase, so the main thread being dead with only the heartbeat thread left behind would be consistent with that.

casperisfine commented 3 years ago

Ok, I really couldn't figure out how to replicate this :/

MatthewRBruce commented 2 years ago

We've managed to reproduce this now, it appears to require the following set of circumstances:

1) The thread needs to contain an infinite loop
2) The loop needs to include a begin/rescue block
3) Within the begin block it needs another begin with an ensure
4) An exception has to happen within the embedded ensure, which is then rescued by the top-level rescue

We satisfy all these conditions when the Ruby VM is shutting down and there is a call to StatsD::Instrument::Client#latency: the emit call within the ensure can raise the ThreadError: can't alloc thread exception when batched_udp_sink tries to spin up a new dispatcher thread.
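
To make that mechanism concrete, here is a simplified sketch of the shape of that code path, inferred from the stack trace above (not the actual client.rb source; the emit signature is invented for illustration):

def latency(name)
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  begin
    yield
  ensure
    # The metric is emitted from an ensure block, so it runs even while an
    # exception is propagating through the block, and emitting may in turn
    # try to allocate the dispatcher thread.
    duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
    emit(name, duration) # hypothetical helper and signature, for illustration only
  end
end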

In the main part of our app, we have an infinite loop similar to the one below that measures RTT; it rescues the exception from the inner block and continues to loop.

def call
  Thread.new do
    loop do
      ping_engine
      sleep(KEEPALIVE_TIMEOUT)
    end
  end
end

private

def ping_engine
  StatsD.distribution("client.ping_rtt") do
    ping_endpoint()
  end
rescue StandardError => e
  Rails.logger.info("#{e.backtrace}")
end

There's a good write-up here on basically the same issue: https://medium.com/@zanker/finding-a-ruby-bug-with-gdb-56d6b321bc86

And a Ruby bug describing the behaviour too: https://bugs.ruby-lang.org/issues/13882

I've attached a minimal example script that demonstrates this behaviour: test.txt
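
For readers without access to the attachment, here is a hedged sketch of what such a minimal reproduction might look like, reconstructed from the four conditions above (this is not the attached test.txt, and the sleep durations are arbitrary):

Thread.new do
  loop do
    begin
      begin
        sleep(0.1)
      ensure
        # During VM shutdown, allocating a thread here raises
        # ThreadError: can't alloc thread ...
        Thread.new {}.join
      end
    rescue StandardError => e
      # ... and this top-level rescue swallows it, so the loop never exits
      # and the process hangs instead of shutting down cleanly.
    end
  end
end

# The main thread exits shortly afterwards; the VM then starts killing the
# background thread, which runs the embedded ensure above.
sleep(0.5)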

casperisfine commented 2 years ago

Thanks @MatthewRBruce, not sure how to fix this though.

We could rescue the ThreadError and fall back to emitting from the current thread, but from what I understood of the issue, the process would still be stuck, right?
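
Something like the following, roughly (a hedged sketch only; @dispatcher and @fallback_udp_sink are hypothetical names, not the actual BatchedUDPSink internals):

def <<(datagram)
  @dispatcher << datagram
  self
rescue ThreadError
  # The VM is shutting down and no new dispatcher thread can be allocated;
  # emit from the calling thread instead. As noted above, this avoids the
  # exception but a rescuing loop in the caller could still keep the
  # process stuck.
  @fallback_udp_sink << datagram
  self
end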

casperisfine commented 2 years ago

Hmm, another cause for this might also be that we check Thread#alive?, but that doesn't account for Thread#status == "aborting": https://ruby-doc.org/core-2.5.0/Thread.html#method-i-status
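
For context, a small illustrative helper (not statsd-instrument code): an aborting thread still reports alive? == true, so a check based only on alive? would not notice that the thread is being torn down.

def dispatcher_usable?(thread)
  # Thread#alive? is true for "run", "sleep" and "aborting" threads alike;
  # excluding "aborting" explicitly avoids treating a thread the VM is in
  # the middle of killing as healthy.
  !thread.nil? && thread.alive? && thread.status != "aborting"
end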

MatthewRBruce commented 2 years ago

Ah, but that does give us a potential workaround for now outside of statsd-instrument. We can alter the loop I mentioned above to check the thread state in the rescue:

def call
  Thread.new do
    loop do
      ping_engine
      sleep(KEEPALIVE_TIMEOUT)
    end
  end
end

private

def ping_engine
  StatsD.distribution("client.ping_rtt") do
    ping_endpoint()
  end
rescue StandardError => e
  Rails.logger.info("#{e.backtrace}")
  # Re-raise if this thread is no longer in the "run" state (e.g. it is
  # "aborting" because the VM is shutting down), so the loop exits instead
  # of swallowing the shutdown.
  if Thread.current.status != "run"
    raise e
  end
end

Mangara commented 2 years ago

We could rescue the ThreadError and fall back to emitting from the current thread, but from what I understood of the issue, the process would still be stuck, right?

Instead of rescuing ThreadError, can we check whether the main thread is running before trying to create the new thread? That would keep the original exception from being shadowed, right?
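
Perhaps something along these lines (a hedged sketch; start_dispatcher_thread, dispatch_loop and emit_inline are hypothetical names, not a patch to BatchedUDPSink):

def start_dispatcher_thread
  main = Thread.main
  if main.nil? || !main.alive? || main.status == "aborting"
    # Main thread gone or VM shutting down: skip thread allocation entirely,
    # so no ThreadError is raised and whatever exception is already
    # propagating is not shadowed.
    emit_inline
  else
    @dispatcher_thread ||= Thread.new { dispatch_loop }
  end
end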