Closed casperisfine closed 2 years ago
I'm not sure how this could happen, but I wonder if there's somehow a race condition where the main thread gets killed (or put in a "dead" state) yet another thread gets scheduled to run even though the main thread is in the process of "dying".
From the slack comments (and just thinking about this) we have the main thread, which has a running heartbeat thread, and the heartbeat thread allocates a new thread to hit statsd. It's the allocation of the statsd thread that fails. Is that correct? @cardy31
> It's the allocation of the statsd thread that fails. Is that correct?
That is correct yes.
> I wonder if there's somehow a race condition where the main thread gets killed (or put in a "dead" state) yet another thread gets scheduled to run even though the main thread is in the process of "dying".
I have the same suspicion, yes. I have a few ideas for how to try to replicate that. I'll try my hand at it tomorrow.
@tenderlove yes that seems to be the case. I'm not sure if the failed allocation of the statsd thread is the cause though, or just a symptom.
The worker never ~moves out of~ sends a log for having exited the setup phase, so the main thread being dead with only the heartbeat thread left behind would be consistent with that.
Ok, I really couldn't figure out how to replicate this :/
We've managed to reproduce this now, it appears to require the following set of circumstances:
1) The thread needs to contain an infinite loop
2) The loop needs to include a `begin`/`rescue` block
3) Within the `begin` block it needs another `begin` with an `ensure`
4) An exception has to happen within the embedded `ensure`, which is then rescued by the top-level `rescue`
We satisfy all these conditions when the Ruby VM is shutting down and there is a call to `Statsd::Instrument::Client.latency`.
- The `emit` call within the `ensure` can raise the `ThreadError: can't alloc thread` exception when `batched_udp_sink` tries to spin up a new dispatcher thread.
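The failure shape can be sketched in isolation. This is a hypothetical reduction, not the real app code: a plain `raise` stands in for the `ThreadError` that MRI raises during shutdown, and the loop is bounded so the example terminates:

```ruby
# Minimal sketch of the reproduction shape (hypothetical): an exception
# raised inside an embedded ensure is swallowed by the enclosing rescue,
# so the loop keeps spinning instead of surfacing the problem.
swallowed = []
attempts = 0
loop do
  begin
    begin
      attempts += 1
      break if attempts > 3 # bound the "infinite" loop for the example
    ensure
      # Stand-in for "ThreadError: can't alloc thread" raised while
      # the VM is shutting down.
      raise ThreadError, "can't alloc thread" if attempts == 2
    end
  rescue ThreadError => e
    swallowed << e.message
  end
end
puts swallowed.inspect # => ["can't alloc thread"]
```

The key point is that the `rescue` never sees anything unusual: the error looks like any other transient failure, so the loop happily continues.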
In the main part of our app, we have an infinite loop similar to this that `rescue`s the exception from the inner block and continues to loop:
```ruby
def call
  Thread.new do
    loop do
      ping_engine
      sleep(KEEPALIVE_TIMEOUT)
    end
  end
end

private

def ping_engine
  StatsD.distribution("client.ping_rtt") do
    ping_endpoint()
  end
rescue StandardError => e
  Rails.logger.info("#{e.backtrace}")
end
end # closes the enclosing class (not shown)
```
There's a good write up here on basically the same issue: https://medium.com/@zanker/finding-a-ruby-bug-with-gdb-56d6b321bc86
And a Ruby bug describing the behaviour too: https://bugs.ruby-lang.org/issues/13882
I've attached a minimal example script that demonstrates this behaviour: test.txt
Thanks @MatthewRBruce, not sure how to fix this though.
We could rescue the `ThreadError` and fall back to emitting from the current thread, but from what I understood of the issue, the process would still be stuck, right?
Hum, another cause for this might also be that we check `Thread#alive?`, but that doesn't account for `Thread#status == "aborting"`:
https://ruby-doc.org/core-2.5.0/Thread.html#method-i-status
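For illustration, the difference between `Thread#alive?` and `Thread#status` can be seen on a killed thread. The transient `"aborting"` state itself is racy and hard to capture deterministically, so this sketch only shows the states on either side of it:

```ruby
t = Thread.new { sleep }
sleep 0.01 until t.status == "sleep" # wait for the thread to park

puts t.alive? # => true
puts t.status # => "sleep"

t.kill
t.join
# While dying, the thread's status passes through "aborting"; once dead:
puts t.alive? # => false
puts t.status # => false (it would be nil if it had died from an exception)
```

So a liveness check built on `alive?` alone would report `true` for a thread that is already in the middle of being torn down.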
Ah, but that does give us a potential workaround for now outside of statsd-instrument. In the loop I mentioned above, we can alter it to check the thread state in the `rescue`:
```ruby
def call
  Thread.new do
    loop do
      ping_engine
      sleep(KEEPALIVE_TIMEOUT)
    end
  end
end

private

def ping_engine
  StatsD.distribution("client.ping_rtt") do
    ping_endpoint()
  end
rescue StandardError => e
  Rails.logger.info("#{e.backtrace}")
  # Thread#status is the actual method (there is no Thread#state);
  # re-raise when this thread is no longer in the "run" state, e.g.
  # while it is "aborting" during VM shutdown.
  if Thread.current.status != "run"
    raise e
  end
end
end # closes the enclosing class (not shown)
```
> We could rescue the `ThreadError` and fall back to emitting from the current thread, but from what I understood of the issue, the process would still be stuck, right?
Instead of rescuing `ThreadError`, can we check if the main thread is running before trying to create the new thread? That would avoid the original exception from being shadowed, right?
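A sketch of that idea, with a hypothetical helper name (this is not the actual statsd-instrument code): check `Thread.main.status` before spawning, and fall back to running the work inline when the main thread is no longer runnable, so the original exception is never shadowed by a `ThreadError`:

```ruby
# Hypothetical dispatcher guard: only spawn a background thread while the
# main thread is still runnable ("run" or "sleep"); otherwise run the work
# inline instead of risking "ThreadError: can't alloc thread".
def safe_dispatch(&work)
  if %w[run sleep].include?(Thread.main.status)
    Thread.new(&work)
  else
    work.call # VM shutting down: emit from the current thread
    nil
  end
end

result = safe_dispatch { :emitted }
# Called from a live main thread, so a Thread is returned here.
result.join if result
```

During shutdown `Thread.main.status` would be `"aborting"` (or `false`), the guard would take the inline branch, and no thread allocation is attempted.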
We're investigating a weird issue that sometime happens in job workers.
Looking at MRI, this error seems to happen when the main thread is dead: https://github.com/ruby/ruby/blob/0fb782ee38ea37fd5fe8b1f775f8ad866a82a3f0/thread.c#L1027-L1029
Which initially sounded like nonsense, but after digging more, it turns out that this error is raised from another background thread, e.g.:
So the MRI error makes some sense.
Theory
I assume that somehow the main thread is indeed dead, but that a background thread is still alive and working, causing this issue.
I have to dig more to replicate this kind of scenario, until now I thought the Ruby process would abort once the main thread is dead.