Closed by digitalextremist 9 years ago
Correction: on MRI 2.2.1 it's actually hanging at `Kernel.select([io])`, hence the `strace` output.
I made some change to how `TCPSocket` works somewhere: https://github.com/celluloid/celluloid-io/commit/3d3f10a467c7b900ab4fdec7dec11901fea6b4b4 ... but I think that actually fixes problems rather than creating them :/
Yeah, that's not it. I reverted that one line and it's still there. Plus, it was still fine as of a month after your change.
It really seems like the issue is outside Celluloid::IO
If you want heavy IO tests, use celluloid-dns; it does a crapload of IO in some of its tests.
I added a timeout to be able to see the already-failing errors, and it's tests clobbering each other's ports. So I'm randomizing per test. No idea how this "suddenly" started... you'd think it'd have been going on all along.
Yeah, I had some issues with tests running servers on the same port; I just gave each test its own port.
I have it doing that now... and sometimes it will completely pass now. Sometimes. I wonder what changed, or if it's always been like this. The issue seems to be the `ECONNREFUSED` mystery error in `ssl_socket_spec.rb`.
It's a miracle*

\* Unlikely but inevitable; an improbable given.
GOT IT :shipit:
What was the bug?
It was with the tests themselves, but it reveals an issue with SSL support that might come up. The server takes about 0.126s on CI to spin up, and the method we were using didn't actually catch the spinup. There is a `Thread.pass` approach (which I left there), and it's noted in the test that it's probably responsible for `ECONNREFUSED`... which it is. It caused an interplay between port assignment between tests, and inability to use a port once obtained, in the test that was failing. I changed the port assignment method to test whether the port was actually available before assigning it... otherwise it picks another port (up to N retries) before failing with `EADDRINUSE`, and I inserted the number of retries into the error message. It was showing 45-70 retry attempts at times, and getting either `EADDRINUSE` or `ECONNREFUSED`... that is, after I added a timeout wrapping all the tests in an `around(:each) {}`...

The ultimate solution came down to inserting that `sleep 0.222` right after an SSL server was instantiated in tests, just before the client tried to open a socket to it (which is where the tests previously locked up, because the test instance had both client and server in one process; the thread was locking, and when it wasn't locking it failed either on port assignment or with connection refused because the server had not spun up). It was only one issue, but it ended up taking a lot to find... even including `strace`.
So the SSL socket has some kind of timeout/pause which is causing other tests to fail because the port isn't released?
It is both unreleased (which caused `EADDRINUSE`) and unprepared (causing `ECONNREFUSED`)... without `sleep 0.222`, for now. I think we need to handle this in the ducktype itself, but for now I fixed the test. In general, SSL needs some attention.
I didn't change the ducktype because the failing test was not technically even using `Celluloid::IO`... it was making sure it failed without `Celluloid::IO`.
So it's a bug in Ruby itself? The SSL server constructor returns before it's usable?
It seems so, in all Rubies tested. It's confusing how, because it's implemented differently in each Ruby, but ultimately with the same result.
Or wait, are the server and client in different threads? If so, you could use a condition variable or something to wait before you fire up the client to connect to the server.
Yeah, arbitrary values for `sleep` are a definite smell. I'd suggest doing what @Asmod4n suggests and using actual thread synchronization.
I think I'd actually suggest something like a `Queue` with a timeout (so problems with tests don't indefinitely hang the test suite). The semantics of `ConditionVariable`s are a bit tricky, as they're effectively level-triggered interrupts, and what you want here is edge-triggered, which is what a `Queue` provides. Unfortunately Ruby's built-in `Queue` does not have native timeout support, though `ConditionVariable` does.
I believe @mperham already wrote something like a `QueueWithTimeout` for Sidekiq tests. I think it'd be nice if Celluloid had something like that built in.
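A `QueueWithTimeout` along those lines can be built from `Mutex` and `ConditionVariable`, since `ConditionVariable#wait` accepts a timeout. This is a sketch of the idea, not @mperham's actual implementation:

```ruby
# Sketch: a queue whose pop accepts a timeout, so a test that never
# receives its signal fails fast instead of hanging the whole suite.
class QueueWithTimeout
  def initialize
    @mutex = Mutex.new
    @cond  = ConditionVariable.new
    @items = []
  end

  def push(item)
    @mutex.synchronize do
      @items << item
      @cond.signal
    end
  end

  # Pops an item, waiting up to `timeout` seconds; raises if nothing arrives.
  def pop(timeout = 5)
    deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
    @mutex.synchronize do
      while @items.empty?
        remaining = deadline - Process.clock_gettime(Process::CLOCK_MONOTONIC)
        raise ThreadError, "queue still empty after #{timeout}s" if remaining <= 0
        @cond.wait(@mutex, remaining)
      end
      @items.shift
    end
  end
end
```

In the test scenario above, the server thread would `push(:ready)` once it is listening, and the client thread would `pop(2)` before connecting, so a server that never comes up fails the test within the timeout instead of deadlocking it.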
I do think something like a native `QueueWithTimeout` in our test suite would be good. But I also think using a `Condition` could be enough. The trouble in this case is knowing when to signal that the server is ready: the SSL server class gives false positives. I've implemented timeouts, but determining when the server truly is ready, and only then signalling, is a problem.
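One way to sidestep those false positives without a fixed `sleep` is to probe the server from the outside: keep attempting a TCP connect until one is accepted or a deadline passes. A rough sketch (the helper name and default timeout are assumptions, not anything from the suite):

```ruby
require "socket"

# Sketch: block until something is actually accepting connections on
# host:port, retrying on ECONNREFUSED until the deadline expires.
def wait_until_listening(host, port, timeout: 2)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  begin
    TCPSocket.new(host, port).close
    true
  rescue Errno::ECONNREFUSED, Errno::ETIMEDOUT
    # Re-raise the original error once the deadline is blown.
    raise if Process.clock_gettime(Process::CLOCK_MONOTONIC) > deadline
    sleep 0.01
    retry
  end
end
```

For an SSL server this only proves the TCP listener is up, not that the TLS handshake will succeed, but it is enough to rule out `ECONNREFUSED` without guessing at spin-up time.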
I'll check that out. Thanks @mperham
Where "connection" can be any ruby object, really.
This seems to be in nio4r, but I cannot see what changed, or trace it into that library completely. But this release of `Celluloid::IO` is dying intermittently, more often than surviving: within `TCPSocket`, in `accept` at `wait_readable`; in that, at the `mailbox.reactor.wait_readable`; and finally in that, at `monitor.close` inside the `ensure` block...

With MRI 2.2.1 this can be interrupted (but failing, obviously) by `SIGINT`, but with rbx-2.5.8 this even prevents `SIGINT`, and the `rspec` process needs `SIGTERM` to fully be killed. I think the thing that opens this a bit wider is that on jRuby 1.7.20 sending `SIGINT` does work, and reveals a suppressed `STDERR` message: `Bad file descriptor`... this is in the case of failing at `IO::SSLServer`...

But then it gets weirder... under jRuby the `SSLServer` test always passes, if run by itself. Under rbx-2.5.8 and MRI 2.2.1 it never passes, alone or with other tests before it, and it dies inside the `ensure` block mentioned above (in the case of rbx-2.5.8... elsewhere for MRI, see first comment).

In `strace` it hangs up at:

- jRuby 1.7.20: `futex(0x7fce8dbab9d0, FUTEX_WAIT, 18860, NULL)`
- rbx 2.5.8: `futex(0x6a86a4, FUTEX_WAIT_PRIVATE, 25, NULL)`
- MRI 2.2.1: `select(12, [11], NULL, NULL, NULL)`

I can't figure out what changed yet. It's hard to bisect because changes out in `Celluloid` at earlier points break tests that succeeded in the past. Any ideas @tarcieri?

This is the cause of the 10m wait time for tests to die on 0.17.0 of this gem.