Closed by digitalextremist 9 years ago
Correction: on MRI 2.2.1 it's actually hanging at `Kernel.select([io])`, hence the `strace` output.
I made some change to how `TCPSocket` works somewhere: https://github.com/celluloid/celluloid-io/commit/3d3f10a467c7b900ab4fdec7dec11901fea6b4b4 ... but I think that actually fixes problems rather than creating them :/
Yeah, that's not it. I reverted that one line and it's still there. Plus, it was still fine as of a month after your change.
It really seems like the issue is outside Celluloid::IO
If you want heavy IO tests, use celluloid-dns; it does a crapload of IO in some of its tests.
I added a timeout to be able to see the already-failing errors, and it's tests clobbering each other's ports. So I'm randomizing per test. No idea how this "suddenly" started... you'd think it'd have been going on all along.
Yeah, I had some issues with tests running servers on the same port; I just gave each test its own port.
I have it doing that now... and sometimes it will completely pass now. Sometimes. I wonder what changed, or if it's always been like this. The issue seems to be the `ECONNREFUSED` mystery error in `ssl_socket_spec.rb`.
It's a miracle*

\* Unlikely but inevitable; an improbable given.
GOT IT :shipit:
What was the bug?
It was with the tests themselves, but it reveals an issue with SSL support that might come up. The server takes about 0.126s on CI to spin up, and the method we were using didn't actually catch the spinup. There is a `Thread.pass` approach (which I left there), and it's noted in the test that it's probably responsible for `ECONNREFUSED`... which it is. It caused an interplay between port assignment between tests, and inability to use a port once obtained, in the test that was failing. I changed the port assignment method to test whether the port was actually available before assigning it... otherwise it picks another port (up to N retries) before failing with `EADDRINUSE`, and I inserted the number of retries into the error message. It was showing 45-70 retry attempts at times, and getting either `EADDRINUSE` or `ECONNREFUSED`... that is, after I added a timeout wrapping all the tests in an `around(:each) {}`...

The ultimate solution came down to inserting that `sleep 0.222` right after an SSL server was instantiated in tests, just before the client tried to open a socket to it (which is where the tests previously locked up, because the test instance had both client and server in one process; the thread was locking, and when it wasn't locking it failed either on port assignment or with connection refused because the server had not spun up). It was only one issue, but it ended up taking a lot to find... even including `strace`.
So the SSL socket has some kind of timeout/pause which is causing other tests to fail because the port isn't released?
It is both unreleased (which caused `EADDRINUSE`) and unprepared (causing `ECONNREFUSED`)... without `sleep 0.222`, for now. I think we need to handle this in the ducktype itself, but for now I fixed the test. In general, SSL needs some attention.
I didn't change the ducktype because the failing test was not technically even using `Celluloid::IO`... it was making sure it failed without `Celluloid::IO`.
So it's a bug in Ruby itself? The SSL server constructor returns before it's usable?
It seems so, in all Rubies tested. It's confusing how, because it's implemented differently in each Ruby, but ultimately with the same result.
Or wait, are the server and client in different threads? If so, you could use a condition variable or something to wait before you fire up the client to connect to the server.
Yeah, arbitrary values for `sleep` are a definite smell. I'd suggest doing what @Asmod4n suggests and using actual thread synchronization.
I think I'd actually suggest something like a `Queue` with a timeout (so problems with tests don't indefinitely hang the test suite). The semantics of `ConditionVariable`s are a bit tricky, as they're effectively level-triggered interrupts, and what you want here is edge-triggered, which is what a `Queue` provides. Unfortunately Ruby's built-in `Queue` does not have native timeout support, though `ConditionVariable` does.
I believe @mperham already wrote something like a `QueueWithTimeout` for Sidekiq tests. I think it'd be nice if Celluloid had something like that built in.
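A `QueueWithTimeout` along those lines can be built from `Mutex` and `ConditionVariable`, since `ConditionVariable#wait` accepts a timeout. This is a sketch of the idea, not @mperham's actual implementation:

```ruby
# Sketch: a queue whose pop accepts a timeout, so a test that never
# receives its signal fails fast instead of hanging the whole suite.
class QueueWithTimeout
  def initialize
    @mutex = Mutex.new
    @cond  = ConditionVariable.new
    @items = []
  end

  def push(item)
    @mutex.synchronize do
      @items << item
      @cond.signal
    end
  end

  # Pops an item, waiting up to `timeout` seconds; raises if nothing arrives.
  def pop(timeout = 5)
    deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
    @mutex.synchronize do
      while @items.empty?
        remaining = deadline - Process.clock_gettime(Process::CLOCK_MONOTONIC)
        raise ThreadError, "queue still empty after #{timeout}s" if remaining <= 0
        @cond.wait(@mutex, remaining)
      end
      @items.shift
    end
  end
end
```

In the test scenario above, the server thread would `push(:ready)` once it is listening, and the client thread would `pop(2)` before connecting, so a server that never comes up fails the test within the timeout instead of deadlocking it.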
I do think something like a native `QueueWithTimeout` in our test suite would be good. But I also think using a `Condition` could be enough. The trouble in this case is knowing when to signal that the server is ready: the SSL server class gives false positives. I've implemented timeouts, but determining when the server truly is ready, and only then signalling, is a problem.
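One way to sidestep those false positives without a fixed `sleep` is to probe the server from the outside: keep attempting a TCP connect until one is accepted or a deadline passes. A rough sketch (the helper name and default timeout are assumptions, not anything from the suite):

```ruby
require "socket"

# Sketch: block until something is actually accepting connections on
# host:port, retrying on ECONNREFUSED until the deadline expires.
def wait_until_listening(host, port, timeout: 2)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  begin
    TCPSocket.new(host, port).close
    true
  rescue Errno::ECONNREFUSED, Errno::ETIMEDOUT
    # Re-raise the original error once the deadline is blown.
    raise if Process.clock_gettime(Process::CLOCK_MONOTONIC) > deadline
    sleep 0.01
    retry
  end
end
```

For an SSL server this only proves the TCP listener is up, not that the TLS handshake will succeed, but it is enough to rule out `ECONNREFUSED` without guessing at spin-up time.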
I'll check that out. Thanks @mperham
Where "connection" can be any ruby object, really.
This seems to be in nio4r, but I cannot see what changed, or trace it into that library completely. But this release of `Celluloid::IO` is dying intermittently, more often than surviving: within `TCPSocket`, in `accept` at `wait_readable`; in that, at the `mailbox.reactor.wait_readable`; and finally in that, at `monitor.close` inside the `ensure` block...

With MRI 2.2.1 this can be interrupted (but failing, obviously) by `SIGINT`, but with rbx-2.5.8 this even prevents `SIGINT`, and the `rspec` process needs `SIGTERM` to fully be killed. I think the thing that opens this a bit wider is that on jRuby 1.7.20 sending `SIGINT` does work, and reveals a suppressed `STDERR` message: `Bad file descriptor`... this is in the case of failing at `IO::SSLServer`...

But then it gets weirder... under jRuby the `SSLServer` test always passes, if run by itself. Under rbx-2.5.8 and MRI 2.2.1 it never passes, alone or with other tests before it, and it dies inside the `ensure` block mentioned above (in the case of rbx-2.5.8... elsewhere for MRI, see first comment).

In `strace` it hangs up at:

- jRuby 1.7.20: `futex(0x7fce8dbab9d0, FUTEX_WAIT, 18860, NULL)`
- rbx 2.5.8: `futex(0x6a86a4, FUTEX_WAIT_PRIVATE, 25, NULL)`
- MRI 2.2.1: `select(12, [11], NULL, NULL, NULL)`

I can't figure out what changed yet. It's hard to bisect because changes out in `Celluloid` at earlier points break tests that succeeded in the past. Any ideas @tarcieri?

This is the cause of the 10m wait time for tests to die on 0.17.0 of this gem.