Closed ioquatix closed 8 years ago
Looks related to the GSoC work? /cc @digitalextremist
I'm getting literally hundreds of these errors from all "it behaves like an actor" specs.
I feel there is something wrong with 4448543 or thereabouts. It's a bit hard for me to understand this new code because the documentation is non-existent. Can people please document classes and functions with at least one line explaining what they do and how they fit in with the rest of the codebase? Thanks.
Okay so I ran git bisect and found that the first commit to cause this problem is d8fd5b2f5e13fe93392d2d170a9377ad2337242d
I think this problem comes from
+ Specs.assert_no_loose_threads(ex.description) do
which is now wrapping all specs. So the question is, is this the correct behaviour?
I found this line in celluloid-io
ALLOW_SLOW_MAILBOXES = true # TODO: Remove hax.
It's not clear why this is needed but it appears the functionality is broken because this appears to suppress the loose threads but the backtrace is different (off by one):
** /Users/samuel/.rvm/gems/ruby-2.3.0/bundler/gems/celluloid-85f850a41bb5/lib/celluloid/mailbox.rb:63:in `sleep'
** /Users/samuel/.rvm/gems/ruby-2.3.0/bundler/gems/celluloid-85f850a41bb5/lib/celluloid/mailbox.rb:63:in `wait'
** /Users/samuel/.rvm/gems/ruby-2.3.0/bundler/gems/celluloid-85f850a41bb5/lib/celluloid/mailbox.rb:63:in `block in check'
IS not matched by:
if Specs::ALLOW_SLOW_MAILBOXES
next if thread.backtrace[1] =~ /mailbox\.rb/ && thread.backtrace[1] =~ /sleep/
end
So, I've fixed the hack which allows tests to pass, simply by ignoring the sleeping mailboxes. Um, can we have a discussion about this? What is going on here and why is the hack necessary?
No idea what's going on here
Yeah it looks like something @digitalextremist but it's not well documented or explained anywhere.
Sorry for the delay in replying. No, it was not GSoC
related, and no, I didn't introduce it -- I just consolidated its many different forms in different gems. This was introduced by @e2 during test cleaning, and I too have problems with it/am refactoring it when possible
Could you explain what the loose thread is and why it needs to be ignored for celluloid-io?
Loose threads means threads leaking between tests.
It means things weren't cleaned up properly.
This caused hideously hard to debug cases (where some tests failed because another random test failed to kill all its threads).
It wasn't just "test cleaning", it was about getting tests to stop failing randomly.
Here, it looks like the actor wasn't killed properly - that's why it was detected as sleeping on the mailbox.
I wouldn't ignore it - I'd make sure the actor is killed during test teardown.
Apparently some require is where it naught to be. I see the Fanout and IncidentReporter actors there if I boot the system.
Apparently the loose_threads method is not ruling out the root services actors (fanout, incident reporter, etc), hence the slow mailboxes hack (it rules them out). But the condition seems not to work properly for ruby 2.0.0:
next if thread.backtrace[0] =~ /mailbox\.rb/ && thread.backtrace[0] =~ /sleep/
# backtrace
# 2.0.0
/opt/rubies/ruby-2.0.0-p645/lib/ruby/2.0.0/thread.rb:72:in `sleep'
/opt/rubies/ruby-2.0.0-p645/lib/ruby/2.0.0/thread.rb:72:in `block (2 levels) in wait'
/opt/rubies/ruby-2.0.0-p645/lib/ruby/2.0.0/thread.rb:68:in `handle_interrupt'
/opt/rubies/ruby-2.0.0-p645/lib/ruby/2.0.0/thread.rb:68:in `block in wait'
/opt/rubies/ruby-2.0.0-p645/lib/ruby/2.0.0/thread.rb:66:in `handle_interrupt'
/opt/rubies/ruby-2.0.0-p645/lib/ruby/2.0.0/thread.rb:66:in `wait'
/home/taacati1/celluloid/lib/celluloid/mailbox.rb:63:in `block in check'
...
#ruby 2.1.6
/home/taacati1/celluloid/lib/celluloid/mailbox.rb:63:in `sleep'
/home/taacati1/celluloid/lib/celluloid/mailbox.rb:63:in `wait'
/home/taacati1/celluloid/lib/celluloid/mailbox.rb:63:in `block in check'
/home/taacati1/celluloid-io/.bundle/ruby/2.1.0/bundler/gems/timers-41145ed260e4/lib/timers/wait.rb:14:in `for'
/home/taacati1/celluloid/lib/celluloid/mailbox.rb:58:in `check'
...
I think that the "slow mailbox" designation is quite wrong. The whole concept is that one doesn't count service threads as loose threads, right?
Or maybe the fix would be to remove them from the booting altogether. If the goal is to isolate the core functionality from the supervision, It doesn't make sense to be using supervision around here. @digitalextremist , what's the plan exactly?
@TiagoCardoso1983 - the idea was to prevent Celluloid threads from leaking between specs.
The only threads ALLOWED to survive between specs should be core RBX and JRuby threads ONLY.
All Celluloid should be killed before the next spec starts, or debugging failed tests can be hard.
Celluloid should be COMPLETELY shut down between specs - so leaving Celluloid-owned threads is a bad practice, ESPECIALLY with persistent things like fanout, probing, etc.
NOTE: I didn't introduce the Specs::ALLOW_SLOW_MAILBOXES
hack. I don't see a reason for it other than as a workaround for "forcing failing specs to pass". Maybe there was a reason due to rushing for GSoC, but it's best to remove the "mail thread hack" and fix the cause.
If there are threads stuck on the mailbox AFTER a spec, that's a design issue or a bug.
@TiagoCardoso1983 - I didn't bother to check the affected specs here, but when I worked on the tests months ago, I made sure all the tests passed without any Celluloid-created threads still running afterwards ("loose threads").
RBX uses extra threads for timeouts (that's just how they are implemented), while JRuby runs some extra core threads (e.g. in the way it implements a futex, which is managed internally+globally in JRuby).
MRI runs without extra threads, so the thread count after every spec should be 1 - (just the main thread).
What you mentioned about other services (supervision) - removing those from a spec is just a for optimizing how long the tests run. Tests should be green with or without them.
(Unless they're testing e.g. supervision, etc..)
@e2 it's not only for optimizing, they are the reason why the tests were failing. In celluloid-io (not celluloid) all specs are wrapped inside an around hook which calls boot
and shutdown
. Shutdown is kinda asynchronous, i.e. killing the thread is dependent of the current message being processed and wrapped inside a ruby timeout (the most dangerous weapon yada yada).
Celluloid bootstrapping is starting those root services by default, IMO needlessly, this has been subject of debate before, why are they being started is beyond me, that's what the autostart file was here for, I think. Booting the system shouldn't start silent actors by default.
To sum it up, i'm from your opinion, one must not leak threads in between specs, but more important, one should try really hard not to allocate resources which are out of the scope of the specs.
My proposal:
just removed the hack. one does not need to boot the system every time. @digitalextremist , I saw that the commit came from you. I left the one for Celluloid::ZMQ. Any specific reason why you call boot instead of init?
@TiagoCardoso1983 - I agree 100% with you on setting up at little as possible for the tests.
@TiagoCardoso1983 - months ago I started implementing "graceful and complete shutdown", but there was too much integration work in Celluloid at that time.
(I spent more time rebasing that doing anything useful, so I gave up).
Just saying a "clean shutdown" is feasible (though complex), but no one needs it in production code.
That's because no one is interested beyond Celluloid specs (because in production services are expected to run forever, and apps kill threads when they die anyway).
@e2 FWIW, I suggested removing the default at_exit
handler in celluloid/celluloid#698
I think in general the whole lifecycle should use some re-evaluation for a subsequent release. Most people are using Celluloid as a library, not a framework, and we generally have not provided good tooling for using it as the latter.
@tarcieri it seems that removing the at_exit
handler would make people opt in and eventually build their own clean shutdown a la what @e2 suggested. I'm more than ok with that.
I'd really like to push for supervision and the essentials bit non-inclusion in basic celluloid require. They are opt-in services which could be "plugged-in" a la what sequel does. Hell, they don't even need to be in a separate repo.
Most people don't use them anyways. As you said, most celluloid usages don't require a 'system', just a set of tools. Celluloid Actors end up being used as isolated workers (as sidekiq used to do), and why not? One should therefore try to get out of the way and provide the lowest common denominator. Specs will eventually solve themselves.
If there's fear of breaking current clients, I'd like to remember that without sidekiq, celluloid loses a lot of weight in the deparment of breaking people's production environment. And it's still not even 1.0.0. Aim for stability then?
To change subject and cut the conversation short: the build passed. Can someone merge it to master? :)
I agree, I'd like to see celluloid come not as a global system but as something that can be integrated in a modular fashion.
All "it behaviour like a celluloid actor" crashing like this. Any ideas?