Closed ioquatix closed 9 years ago
The function that causes the failure is here:
Not sure how that could possibly be failing.. perhaps the ensure block? Some kind of weird flow control due to a timeout somewhere?
Okay, yeah, might be exactly the same issue as #101, because
Hmmmmmm...
Ah, damn it, it's 5:30AM already.
I believe there is a bug in celluloid-io
where sockets are not being deregistered correctly.
I made a test case in RubyDNS which mostly hits the error condition systematically: https://github.com/ioquatix/rubydns/blob/master/spec/rubydns/celluloid_bug_spec.rb
I augmented celluloid-io with some debugging output:
# Wait for the given IO operation to complete
def wait(io, set)
# zomg ugly type conversion :(
unless io.is_a?(::IO) or io.is_a?(OpenSSL::SSL::SSLSocket)
if io.respond_to? :to_io
io = io.to_io
elsif ::IO.respond_to? :try_convert
io = ::IO.try_convert(io)
end
raise TypeError, "can't convert #{io.class} into IO" unless io.is_a?(::IO)
end
puts "-> wait(#{io} -> #{io.inspect}, #{set})"
monitor = @selector.register(io, set)
monitor.value = Task.current
Task.suspend :iowait
puts "<- wait(#{io}, #{set})"
end
I checked several times and I had the argument error but never noticed entering @selector.register
more than once, i.e. -> wait
and <- wait
were balanced for every socket, but still got error.
I added @selector.deregister io
after Task.suspend
and everything is working correctly now.
I'm going to assume that somehow, the register
(https://github.com/celluloid/nio4r/blob/3e4e8cb90856502c84e896fb667cb407ff68d353/ext/nio4r/selector.c#L201-L232) function needs to be matched with deregister
(https://github.com/celluloid/nio4r/blob/3e4e8cb90856502c84e896fb667cb407ff68d353/ext/nio4r/selector.c#L241-L257) in order to work correctly in all conditions?
Perhaps you can give me some feedback on what is going on here.
@tarcieri Can you please check this?
@ioquatix oh, I see what might be happening. Yeah, you're right, except:
https://github.com/celluloid/celluloid-io/blob/master/lib/celluloid/io/reactor.rb#L51
monitor.close
gets called there, except if something else resumes the task first, it won't be
We should probably move monitor.close
out of the reactor, and add it to an ensure block after Task.suspend
Okay, do you want me to submit a PR? I'm pretty keen to get this sorted as it's causing bugs in the latest RubyDNS release.
@ioquatix sure! Delete the monitor.close
out of the reactor, and close it in #wait
I tried adding it to an ensure block like you said but Ruby segfaults. Removing the ensure block, but just leaving it at the end (i.e. monitor.close
as the last line in #wait
) appears to work fine. Any ideas?
Sorry to bump this (don't mean to be annoying), but have you made any progress on this?
I haven't had a chance to take a look
@tarcieri Did you get a chance to take a look?
no
bump :)
No progress, sorry
Bumped into it as well. It seems #101 is not as fixed as we thought... Here is a small gist:
# ran with ruby 2.0.0
require 'celluloid/io'
class Dad
include Celluloid::IO
def initialize
@sock = TCPSocket.new("localhost", 22)
end
def wiriwiri
begin
timeout(2) do
loop do
@sock.wait_readable
end
end
rescue
@sock.wait_readable
end
end
end
Dad.new.wiriwiri
I theory, this script shouldn't blow. I force the endless state of waiting on socket to be readable. When the timeout happens, the socket should be fred from the reactor. But this is not happening. My "theory": the timeout failed message doesn't get prioritized in the mailbox, or does not trigger the reactor select. And the next evaluated message uses the socket which is still under the reactor registered collection.
One flaw IMHO from the celluloid io API when comparing to the native ruby implementation is that: IO::Select's last parameter is a timeout for waiting on descriptors. This is how most evented implementations handle timeouts: they IO::select for a certain time period. An example:
https://github.com/net-ssh/net-ssh/blob/043bbca9b65cb7f8d862371f66cd4e163287291b/lib/net/ssh/transport/packet_stream.rb#L75 https://github.com/net-ssh/net-ssh/blob/441d07882645114b5acc764d2b0b83930c6cacd7/lib/net/ssh/connection/session.rb#L209
First example checks whether socket is available for reading by passing 0. Second example waits for a definite period on descriptors.
So, this is the main thing about it: celluloid-io doesn't allow me to wait on a socket for a definite period of time.
My proposal would be to pass an optional 3rd parameter to the #wait method: a timeout. when not passed (nil), it defaults to the current behaviour and waits indefinitely (like IO::select). when passed a parameter, it times out, deregisters the socket and return true/false, whether socket is available or not.
Actually, @ioquatix and me have been using the timeout feature because this API is not available (correct me if I'm wrong).
Let me know what you think. I think this would mean some changes on nio4r, probably. Anyway, the "recover from timeout" thing must succeed first.
@ioquatix, I used the monitor.close call in the ensure block and it worked well for me. Can you troubleshoot exactly where does the segfault come from? My reasoning for it is: if an exception occurs while the fiber is paused, what happens to it? Does it get resumed? Can we say for sure that it will always safely close?
@ioquatix , found one way to "force" the stackdump when the monitor.close is in the ensure block: when on timeout loop, you send a SIGTERM to the process it.
Also found out what I wanted: if not inside the ensure block, the time out will not jump to the instructions after task suspending, hence it will not close the monitor and fail with "ArgumentError: this IO is already registered with selector".
here's the stack dump (the relevant nio4r part):
...
7f1b01525000-7f1b017a7000 rw-p 00000000 00:00 0
7f1b017a7000-7f1b017b6000 r-xp 00000000 00:1d 9442576991091956877 /home/taacati1/sem/.bundle/ruby/1.9.1/gems/nio4r-1.0.1/lib/nio4r_ext.so
7f1b017b6000-7f1b019b5000 ---p 0000f000 00:1d 9442576991091956877 /home/taacati1/sem/.bundle/ruby/1.9.1/gems/nio4r-1.0.1/lib/nio4r_ext.so
7f1b019b5000-7f1b019b6000 rw-p 0000e000 00:1d 9442576991091956877 /home/taacati1/sem/.bundle/ruby/1.9.1/gems/nio4r-1.0.1/lib/nio4r_ext.so
7f1b019b6000-7f1b019b7000 rw-p 00000000 00:00 0
7f1b019b7000-7f1b019ba000 r-xp 00000000 00:1d 12052790693188749692 /home/taacati1/sem/.bundle/ruby/1.9.1/gems/hitimes-1.2.2/lib/hitimes/1.9/hitimes.so
7f1b019ba000-7f1b01bb9000 ---p 00003000 00:1d 12052790693188749692 /home/taacati1/sem/.bundle/ruby/1.9.1/gems/hitimes-1.2.2/lib/hitimes/1.9/hitimes.so
7f1b01bb9000-7f1b01bba000 rw-p 00002000 00:1d 12052790693188749692 /home/taacati1/sem/.bundle/ruby/1.9.1/gems/hitimes-1.2.2/lib/hitimes/1.9/hitimes.so
7f1b01bba000-7f1b01bbb000 r-xp 00000000 fd:00 2093087 /opt/rubies/ruby-1.9.3-p448/lib/ruby/1.9.1/x86_64-linux/fiber.so
7f1b01bbb000-7f1b01dba000 ---p 00001000 fd:00 2093087 /opt/rubies/ruby-1.9.3-p448/lib/ruby/1.9.1/x86_64-linux/fiber.so
7f1b01dba000-7f1b01dbb000 rw-p 00000000 fd:00 2093087 /opt/rubies/ruby-1.9.3-p448/lib/ruby/1.9.1/x86_64-linux/fiber.so
7f1b01dbb000-7f1b01dc0000 r-xp 00000000 fd:00 2093153 /opt/rubies/ruby-1.9.3-p448/lib/ruby/1.9.1/x86_64-linux/strscan.so
7f1b01dc0000-7f1b01fc0000 ---p 00005000 fd:00 2093153 /opt/rubies/ruby-1.9.3-p448/lib/ruby/1.9.1/x86_64-linux/strscan.so
7f1b01fc0000-7f1b01fc1000 rw-p 00005000 fd:00 2093153 /opt/rubies/ruby-1.9.3-p448/lib/ruby/1.9.1/x86_64-linux/strscan.so
7f1b01fc1000-7f1b02006000 r-xp 00000000 fd:00 2093157 /opt/rubies/ruby-1.9.3-p448/lib/ruby/1.9.1/x86_64-linux/openssl.sozsh: abort bundle exec ruby timeout_script.rb
@tarcieri , if this might be helpful.
Correction: "timers" seems to be the issue (not handling SIGTERM properly on some c extension??)
One flaw IMHO from the celluloid io API when comparing to the native ruby implementation is that: IO::Select's last parameter is a timeout for waiting on descriptors.
...and the first thing anyone wants to do is wrap this up into an event loop that can calculate that timeout based on any number of potential timers.
So, this is the main thing about it: celluloid-io doesn't allow me to wait on a socket for a definite period of time.
Celluloid is a full-featured event system complete with timers, and expects you to use that functionality to implement timeout handling.
Adding timeouts with Celluloid::IO (now that they actually work) is as simple as:
timeout(n) { io_operation_here }
Timeouts are handled by the reactor:
https://github.com/celluloid/celluloid-io/blob/master/lib/celluloid/io/reactor.rb#L48
@tarcieri , not what I meant at all. I'm fully aware that the reactor handles the timeouts, I meant more of a way of "replacing" native calls with celluloid calls to select, like in the net-ssh example, which would "in theory" involve overwriting the ruby_compat module. This could be then applied to other "native" libraries that use the IO::select approach. But I don't want to move the attention away from the bug above :) Tomorrow I'll get back to this.
Hi Guys, I suggest we have a group Google Hangout to bump heads on this issue and see if we can resolve it. Are you interested?
@TiagoCardoso1983 can you please confirm you were using Timers 4.0.1 for your example above?
@TiagoCardoso1983 I tend to agree with @tarcieri on the timer issue, explicit timeouts are a real pain especially with nested APIs. In RubyDNS, this is used in the resolver:
https://github.com/ioquatix/rubydns/blob/master/lib/rubydns/resolver.rb#L101-L103
In this example, it would be a pain to track the timeout through several read/write calls.
@ioquatix , I mostly agree with both, I'm just getting frustrated with net-ssh (another topic, another already existing issue ;) ). But I've been working around it.
Concerning this, yes, I confirm it was the timers 4.0.1 (and nio4r 1.0.1) It seems by the stack dump it is coming from either hitimes or the nio4r libev extensions.
Forgot also to had this missing dump:
-- C level backtrace information -------------------------------------------
ruby() [0x523d05]
ruby() [0x56b7b6]
ruby(rb_bug+0xb3) [0x56b953]
ruby() [0x4b0bc0]
/lib64/libpthread.so.0(+0xf710) [0x7fe658d18710]
/home/taacati1/sem/.bundle/ruby/1.9.1/gems/nio4r-1.0.1/lib/nio4r_ext.so(ev_io_stop+0x53) [0x7fe64e4baa93] ../libev/ev.c:3659
/home/taacati1/sem/.bundle/ruby/1.9.1/gems/nio4r-1.0.1/lib/nio4r_ext.so(+0x4bcc) [0x7fe64e4b8bcc] monitor.c:130
ruby() [0x51267e]
ruby() [0x514d6d]
ruby() [0x5189c8]
ruby() [0x521689]
ruby(rb_rescue2+0x15b) [0x41546b]
ruby() [0x50e319]
ruby() [0x51267e]
ruby() [0x514d6d]
ruby() [0x5189c8]
ruby() [0x519f23]
ruby() [0x51abc5]
ruby() [0x51267e]
ruby() [0x514d6d]
ruby() [0x5189c8]
ruby() [0x51226f]
ruby() [0x52fb73]
/lib64/libc.so.6(+0x43bf0) [0x7fe6580f1bf0]
-- Other runtime information -----------------------------------------------
seems also relevant, even though I must admit I have 0 experience with C extensions.
The offender might then have been this line:
https://github.com/celluloid/nio4r/blob/master/ext/libev/ev.c#L3659
@tarcieri , shouldn't the assert call "bubble up" to ruby?
asserts get disabled when compiled with -DN(O?)DEBUG
Von einem mobilen Gerät gesendet
Am 01.12.2014 um 16:34 schrieb Tiago notifications@github.com:
The offender might then have been this line:
https://github.com/celluloid/nio4r/blob/master/ext/libev/ev.c#L3659
@tarcieri , shouldn't the assert call "bubble up" to ruby?
— Reply to this email directly or view it on GitHub.
@TiagoCardoso1983 my proposed changes fix the problem https://github.com/ioquatix/celluloid-io/commits/fix-monitor but I'm not convinced it is the correct solution. Can you please try it out and let me know what you think? I can run your example given above and it works. Correctly? Not sure about that, but it doesn't blow up.
Specifically this commit: https://github.com/ioquatix/celluloid-io/commit/6eb2bd8aa765890c5bc28f810b71b8f0d7ac5acf
@ioquatix it does correct the segfault. I do get an IOError: selector is closed, which is probably ok (I don't see where the socket gets closed though). I'd say it's good to go!
So the ensure block triggered the segfault?
Does that mean that one of the C extensions isn't handling ensure properly?
We could merge https://github.com/ioquatix/celluloid-io/commit/6eb2bd8aa765890c5bc28f810b71b8f0d7ac5acf for now and investigate adding an ensure later
My proposed solution using ensure is https://github.com/ioquatix/celluloid-io/commit/a6a8ad7a0c265fdfd49136665c6c82940cd9e9b6
I think merging this but not releasing it might be a good first step. We can test it. I believe that we need to make some specific tests to hit this case so we don't have regressions in the future. We also want to make sure this is the correct logic. I just moved lines around to see if it would work based on @tarcieri's advice, but I don't actually know WHY it fixed the issue completely, or if it HAS fixed the issue..
Regardless of why it corrects the problem, it's still a better, easier-to-understand factoring, IMO
Want to open a PR?
Okay cap'in, I'll open a PArrrrrgh. Yarrrr.
Can't for the life of me figure out why this is happening, albeit very rarely:
e.g. https://travis-ci.org/ioquatix/rubydns/jobs/38832552
I believe we may have discussed this a bit here too: https://github.com/celluloid/celluloid-io/issues/101 but I'm raising a new issue since that stuff is all so old.