ArgumentError: this IO is already registered with selector

ioquatix commented 9 years ago

Can't for the life of me figure out why this is happening, albeit very rarely:

ArgumentError: this IO is already registered with selector
    /home/travis/.rvm/gems/ruby-2.0.0-p576/gems/celluloid-io-0.16.1/lib/celluloid/io/reactor.rb:42:in `register'
    /home/travis/.rvm/gems/ruby-2.0.0-p576/gems/celluloid-io-0.16.1/lib/celluloid/io/reactor.rb:42:in `wait'
    /home/travis/.rvm/gems/ruby-2.0.0-p576/gems/celluloid-io-0.16.1/lib/celluloid/io/reactor.rb:21:in `wait_readable'
    /home/travis/.rvm/gems/ruby-2.0.0-p576/gems/celluloid-io-0.16.1/lib/celluloid/io.rb:53:in `wait_readable'
    /home/travis/.rvm/gems/ruby-2.0.0-p576/gems/celluloid-io-0.16.1/lib/celluloid/io/udp_socket.rb:13:in `wait_readable'
    /home/travis/.rvm/gems/ruby-2.0.0-p576/gems/celluloid-io-0.16.1/lib/celluloid/io/udp_socket.rb:28:in `rescue in recvfrom'
    /home/travis/.rvm/gems/ruby-2.0.0-p576/gems/celluloid-io-0.16.1/lib/celluloid/io/udp_socket.rb:20:in `recvfrom'
    /home/travis/build/ioquatix/rubydns/lib/rubydns/resolver.rb:151:in `try_udp_server'
    /home/travis/build/ioquatix/rubydns/lib/rubydns/resolver.rb:124:in `try_server'
    /home/travis/build/ioquatix/rubydns/lib/rubydns/resolver.rb:99:in `block (2 levels) in dispatch_request'
    /home/travis/.rvm/gems/ruby-2.0.0-p576/gems/celluloid-0.16.0/lib/celluloid.rb:407:in `block in timeout'
    /home/travis/.rvm/gems/ruby-2.0.0-p576/gems/celluloid-0.16.0/lib/celluloid/actor.rb:252:in `timeout'
    /home/travis/.rvm/gems/ruby-2.0.0-p576/gems/celluloid-0.16.0/lib/celluloid.rb:406:in `timeout'
    /home/travis/build/ioquatix/rubydns/lib/rubydns/resolver.rb:98:in `block in dispatch_request'
    /home/travis/build/ioquatix/rubydns/lib/rubydns/resolver.rb:208:in `block in each'
    /home/travis/build/ioquatix/rubydns/lib/rubydns/resolver.rb:205:in `each'
    /home/travis/build/ioquatix/rubydns/lib/rubydns/resolver.rb:205:in `each'
    /home/travis/build/ioquatix/rubydns/lib/rubydns/resolver.rb:92:in `dispatch_request'
    /home/travis/build/ioquatix/rubydns/lib/rubydns/resolver.rb:62:in `query'
    /home/travis/build/ioquatix/rubydns/lib/rubydns/resolver.rb:68:in `block in addresses_for'
    /home/travis/build/ioquatix/rubydns/lib/rubydns/resolver.rb:67:in `times'
    /home/travis/build/ioquatix/rubydns/lib/rubydns/resolver.rb:67:in `addresses_for'
    /home/travis/.rvm/gems/ruby-2.0.0-p576/gems/celluloid-0.16.0/lib/celluloid/calls.rb:26:in `public_send'
    /home/travis/.rvm/gems/ruby-2.0.0-p576/gems/celluloid-0.16.0/lib/celluloid/calls.rb:26:in `dispatch'
    /home/travis/.rvm/gems/ruby-2.0.0-p576/gems/celluloid-0.16.0/lib/celluloid/calls.rb:63:in `dispatch'
    /home/travis/.rvm/gems/ruby-2.0.0-p576/gems/celluloid-0.16.0/lib/celluloid/cell.rb:60:in `block in invoke'
    /home/travis/.rvm/gems/ruby-2.0.0-p576/gems/celluloid-0.16.0/lib/celluloid/cell.rb:71:in `block in task'
    /home/travis/.rvm/gems/ruby-2.0.0-p576/gems/celluloid-0.16.0/lib/celluloid/actor.rb:357:in `block in task'
    /home/travis/.rvm/gems/ruby-2.0.0-p576/gems/celluloid-0.16.0/lib/celluloid/tasks.rb:57:in `block in initialize'
    /home/travis/.rvm/gems/ruby-2.0.0-p576/gems/celluloid-0.16.0/lib/celluloid/tasks/task_fiber.rb:15:in `block in create'

e.g. https://travis-ci.org/ioquatix/rubydns/jobs/38832552

I believe we may have discussed this a bit here too: https://github.com/celluloid/celluloid-io/issues/101 but I'm raising a new issue since that stuff is all so old.

ioquatix commented 9 years ago

The function that causes the failure is here:

https://github.com/ioquatix/rubydns/blob/7cedb543ab7d28561ff0f4eed509fdfab7ca5199/lib/rubydns/resolver.rb#L146-L162

Not sure how that could possibly be failing.. perhaps the ensure block? Some kind of weird flow control due to a timeout somewhere?

ioquatix commented 9 years ago

Okay, yeah, might be exactly the same issue as #101, because

https://github.com/ioquatix/rubydns/blob/7cedb543ab7d28561ff0f4eed509fdfab7ca5199/lib/rubydns/resolver.rb#L98-L100

Hmmmmmm...

ioquatix commented 9 years ago

Ah, damn it, it's 5:30AM already.

ioquatix commented 9 years ago

I believe there is a bug in celluloid-io where sockets are not being deregistered correctly.

I made a test case in RubyDNS which mostly hits the error condition systematically: https://github.com/ioquatix/rubydns/blob/master/spec/rubydns/celluloid_bug_spec.rb

I augmented celluloid-io with some debugging output:

      # Wait for the given IO operation to complete
      def wait(io, set)
        # zomg ugly type conversion :(
        unless io.is_a?(::IO) or io.is_a?(OpenSSL::SSL::SSLSocket)
          if io.respond_to? :to_io
            io = io.to_io
          elsif ::IO.respond_to? :try_convert
            io = ::IO.try_convert(io)
          end

          raise TypeError, "can't convert #{io.class} into IO" unless io.is_a?(::IO)
        end

        puts "-> wait(#{io} -> #{io.inspect}, #{set})"
        monitor = @selector.register(io, set)
        monitor.value = Task.current
        Task.suspend :iowait        
        puts "<- wait(#{io}, #{set})"
      end

I checked several times and I had the argument error but never noticed entering @selector.register more than once, i.e. -> wait and <- wait were balanced for every socket, but still got error.

I added @selector.deregister io after Task.suspend and everything is working correctly now.

I'm going to assume that somehow, the register (https://github.com/celluloid/nio4r/blob/3e4e8cb90856502c84e896fb667cb407ff68d353/ext/nio4r/selector.c#L201-L232) function needs to be matched with deregister (https://github.com/celluloid/nio4r/blob/3e4e8cb90856502c84e896fb667cb407ff68d353/ext/nio4r/selector.c#L241-L257) in order to work correctly in all conditions?

Perhaps you can give me some feedback on what is going on here.

ioquatix commented 9 years ago

@tarcieri Can you please check this?

tarcieri commented 9 years ago

@ioquatix oh, I see what might be happening. Yeah, you're right, except:

https://github.com/celluloid/celluloid-io/blob/master/lib/celluloid/io/reactor.rb#L51

monitor.close gets called there, except if something else resumes the task first, it won't be

We should probably move monitor.close out of the reactor, and add it to an ensure block after Task.suspend

ioquatix commented 9 years ago

Okay, do you want me to submit a PR? I'm pretty keen to get this sorted as it's causing bugs in the latest RubyDNS release.

tarcieri commented 9 years ago

@ioquatix sure! Delete the monitor.close out of the reactor, and close it in #wait

ioquatix commented 9 years ago

I tried adding it to an ensure block like you said but Ruby segfaults. Removing the ensure block, but just leaving it at the end (i.e. monitor.close as the last line in #wait) appears to work fine. Any ideas?

ioquatix commented 9 years ago

Sorry to bump this (don't mean to be annoying), but have you made any progress on this?

tarcieri commented 9 years ago

I haven't had a chance to take a look

ioquatix commented 9 years ago

@tarcieri Did you get a chance to take a look?

tarcieri commented 9 years ago

no

ioquatix commented 9 years ago

bump :)

tarcieri commented 9 years ago

No progress, sorry

HoneyryderChuck commented 9 years ago

Bumped into it as well. It seems #101 is not as fixed as we thought... Here is a small gist:

# ran with ruby 2.0.0
require 'celluloid/io'                                      

class Dad                                                   
  include Celluloid::IO                                     

  def initialize                                            
    @sock = TCPSocket.new("localhost", 22)                  
  end                                                       

  def wiriwiri                                              
    begin                                                   
      timeout(2) do                                         
        loop do                                             
          @sock.wait_readable                               
        end                                                 
      end                                                   
    rescue                                                  
      @sock.wait_readable                                   
    end                                                     
  end                                                       
end                                                         

Dad.new.wiriwiri

I theory, this script shouldn't blow. I force the endless state of waiting on socket to be readable. When the timeout happens, the socket should be fred from the reactor. But this is not happening. My "theory": the timeout failed message doesn't get prioritized in the mailbox, or does not trigger the reactor select. And the next evaluated message uses the socket which is still under the reactor registered collection.

One flaw IMHO from the celluloid io API when comparing to the native ruby implementation is that: IO::Select's last parameter is a timeout for waiting on descriptors. This is how most evented implementations handle timeouts: they IO::select for a certain time period. An example:

https://github.com/net-ssh/net-ssh/blob/043bbca9b65cb7f8d862371f66cd4e163287291b/lib/net/ssh/transport/packet_stream.rb#L75 https://github.com/net-ssh/net-ssh/blob/441d07882645114b5acc764d2b0b83930c6cacd7/lib/net/ssh/connection/session.rb#L209

First example checks whether socket is available for reading by passing 0. Second example waits for a definite period on descriptors.

So, this is the main thing about it: celluloid-io doesn't allow me to wait on a socket for a definite period of time.

My proposal would be to pass an optional 3rd parameter to the #wait method: a timeout. when not passed (nil), it defaults to the current behaviour and waits indefinitely (like IO::select). when passed a parameter, it times out, deregisters the socket and return true/false, whether socket is available or not.

Actually, @ioquatix and me have been using the timeout feature because this API is not available (correct me if I'm wrong).

Let me know what you think. I think this would mean some changes on nio4r, probably. Anyway, the "recover from timeout" thing must succeed first.

HoneyryderChuck commented 9 years ago

@ioquatix, I used the monitor.close call in the ensure block and it worked well for me. Can you troubleshoot exactly where does the segfault come from? My reasoning for it is: if an exception occurs while the fiber is paused, what happens to it? Does it get resumed? Can we say for sure that it will always safely close?

HoneyryderChuck commented 9 years ago

@ioquatix , found one way to "force" the stackdump when the monitor.close is in the ensure block: when on timeout loop, you send a SIGTERM to the process it.

Also found out what I wanted: if not inside the ensure block, the time out will not jump to the instructions after task suspending, hence it will not close the monitor and fail with "ArgumentError: this IO is already registered with selector".

HoneyryderChuck commented 9 years ago

here's the stack dump (the relevant nio4r part):

...
7f1b01525000-7f1b017a7000 rw-p 00000000 00:00 0
7f1b017a7000-7f1b017b6000 r-xp 00000000 00:1d 9442576991091956877        /home/taacati1/sem/.bundle/ruby/1.9.1/gems/nio4r-1.0.1/lib/nio4r_ext.so
7f1b017b6000-7f1b019b5000 ---p 0000f000 00:1d 9442576991091956877        /home/taacati1/sem/.bundle/ruby/1.9.1/gems/nio4r-1.0.1/lib/nio4r_ext.so
7f1b019b5000-7f1b019b6000 rw-p 0000e000 00:1d 9442576991091956877        /home/taacati1/sem/.bundle/ruby/1.9.1/gems/nio4r-1.0.1/lib/nio4r_ext.so
7f1b019b6000-7f1b019b7000 rw-p 00000000 00:00 0
7f1b019b7000-7f1b019ba000 r-xp 00000000 00:1d 12052790693188749692       /home/taacati1/sem/.bundle/ruby/1.9.1/gems/hitimes-1.2.2/lib/hitimes/1.9/hitimes.so
7f1b019ba000-7f1b01bb9000 ---p 00003000 00:1d 12052790693188749692       /home/taacati1/sem/.bundle/ruby/1.9.1/gems/hitimes-1.2.2/lib/hitimes/1.9/hitimes.so
7f1b01bb9000-7f1b01bba000 rw-p 00002000 00:1d 12052790693188749692       /home/taacati1/sem/.bundle/ruby/1.9.1/gems/hitimes-1.2.2/lib/hitimes/1.9/hitimes.so
7f1b01bba000-7f1b01bbb000 r-xp 00000000 fd:00 2093087                    /opt/rubies/ruby-1.9.3-p448/lib/ruby/1.9.1/x86_64-linux/fiber.so
7f1b01bbb000-7f1b01dba000 ---p 00001000 fd:00 2093087                    /opt/rubies/ruby-1.9.3-p448/lib/ruby/1.9.1/x86_64-linux/fiber.so
7f1b01dba000-7f1b01dbb000 rw-p 00000000 fd:00 2093087                    /opt/rubies/ruby-1.9.3-p448/lib/ruby/1.9.1/x86_64-linux/fiber.so
7f1b01dbb000-7f1b01dc0000 r-xp 00000000 fd:00 2093153                    /opt/rubies/ruby-1.9.3-p448/lib/ruby/1.9.1/x86_64-linux/strscan.so
7f1b01dc0000-7f1b01fc0000 ---p 00005000 fd:00 2093153                    /opt/rubies/ruby-1.9.3-p448/lib/ruby/1.9.1/x86_64-linux/strscan.so
7f1b01fc0000-7f1b01fc1000 rw-p 00005000 fd:00 2093153                    /opt/rubies/ruby-1.9.3-p448/lib/ruby/1.9.1/x86_64-linux/strscan.so
7f1b01fc1000-7f1b02006000 r-xp 00000000 fd:00 2093157                    /opt/rubies/ruby-1.9.3-p448/lib/ruby/1.9.1/x86_64-linux/openssl.sozsh: abort      bundle exec ruby timeout_script.rb

@tarcieri , if this might be helpful.

HoneyryderChuck commented 9 years ago

Correction: "timers" seems to be the issue (not handling SIGTERM properly on some c extension??)

tarcieri commented 9 years ago

One flaw IMHO from the celluloid io API when comparing to the native ruby implementation is that: IO::Select's last parameter is a timeout for waiting on descriptors.

...and the first thing anyone wants to do is wrap this up into an event loop that can calculate that timeout based on any number of potential timers.

So, this is the main thing about it: celluloid-io doesn't allow me to wait on a socket for a definite period of time.

Celluloid is a full-featured event system complete with timers, and expects you to use that functionality to implement timeout handling.

Adding timeouts with Celluloid::IO (now that they actually work) is as simple as:

timeout(n) { io_operation_here }

Timeouts are handled by the reactor:

https://github.com/celluloid/celluloid-io/blob/master/lib/celluloid/io/reactor.rb#L48

HoneyryderChuck commented 9 years ago

@tarcieri , not what I meant at all. I'm fully aware that the reactor handles the timeouts, I meant more of a way of "replacing" native calls with celluloid calls to select, like in the net-ssh example, which would "in theory" involve overwriting the ruby_compat module. This could be then applied to other "native" libraries that use the IO::select approach. But I don't want to move the attention away from the bug above :) Tomorrow I'll get back to this.

ioquatix commented 9 years ago

Hi Guys, I suggest we have a group Google Hangout to bump heads on this issue and see if we can resolve it. Are you interested?

ioquatix commented 9 years ago

@TiagoCardoso1983 can you please confirm you were using Timers 4.0.1 for your example above?

ioquatix commented 9 years ago

@TiagoCardoso1983 I tend to agree with @tarcieri on the timer issue, explicit timeouts are a real pain especially with nested APIs. In RubyDNS, this is used in the resolver:

https://github.com/ioquatix/rubydns/blob/master/lib/rubydns/resolver.rb#L101-L103

In this example, it would be a pain to track the timeout through several read/write calls.

HoneyryderChuck commented 9 years ago

@ioquatix , I mostly agree with both, I'm just getting frustrated with net-ssh (another topic, another already existing issue ;) ). But I've been working around it.

Concerning this, yes, I confirm it was the timers 4.0.1 (and nio4r 1.0.1) It seems by the stack dump it is coming from either hitimes or the nio4r libev extensions.

HoneyryderChuck commented 9 years ago

Forgot also to had this missing dump:

HoneyryderChuck commented 9 years ago

-- C level backtrace information -------------------------------------------
ruby() [0x523d05]
ruby() [0x56b7b6]
ruby(rb_bug+0xb3) [0x56b953]
ruby() [0x4b0bc0]
/lib64/libpthread.so.0(+0xf710) [0x7fe658d18710]
/home/taacati1/sem/.bundle/ruby/1.9.1/gems/nio4r-1.0.1/lib/nio4r_ext.so(ev_io_stop+0x53) [0x7fe64e4baa93] ../libev/ev.c:3659
/home/taacati1/sem/.bundle/ruby/1.9.1/gems/nio4r-1.0.1/lib/nio4r_ext.so(+0x4bcc) [0x7fe64e4b8bcc] monitor.c:130
ruby() [0x51267e]
ruby() [0x514d6d]
ruby() [0x5189c8]
ruby() [0x521689]
ruby(rb_rescue2+0x15b) [0x41546b]
ruby() [0x50e319]
ruby() [0x51267e]
ruby() [0x514d6d]
ruby() [0x5189c8]
ruby() [0x519f23]
ruby() [0x51abc5]
ruby() [0x51267e]
ruby() [0x514d6d]
ruby() [0x5189c8]
ruby() [0x51226f]
ruby() [0x52fb73]
/lib64/libc.so.6(+0x43bf0) [0x7fe6580f1bf0]

-- Other runtime information -----------------------------------------------

seems also relevant, even though I must admit I have 0 experience with C extensions.

HoneyryderChuck commented 9 years ago

The offender might then have been this line:

https://github.com/celluloid/nio4r/blob/master/ext/libev/ev.c#L3659

@tarcieri , shouldn't the assert call "bubble up" to ruby?

Asmod4n commented 9 years ago

asserts get disabled when compiled with -DN(O?)DEBUG

Von einem mobilen Gerät gesendet

Am 01.12.2014 um 16:34 schrieb Tiago notifications@github.com:

The offender might then have been this line:

https://github.com/celluloid/nio4r/blob/master/ext/libev/ev.c#L3659

@tarcieri , shouldn't the assert call "bubble up" to ruby?

— Reply to this email directly or view it on GitHub.

ioquatix commented 9 years ago

@TiagoCardoso1983 my proposed changes fix the problem https://github.com/ioquatix/celluloid-io/commits/fix-monitor but I'm not convinced it is the correct solution. Can you please try it out and let me know what you think? I can run your example given above and it works. Correctly? Not sure about that, but it doesn't blow up.

ioquatix commented 9 years ago

Specifically this commit: https://github.com/ioquatix/celluloid-io/commit/6eb2bd8aa765890c5bc28f810b71b8f0d7ac5acf

HoneyryderChuck commented 9 years ago

@ioquatix it does correct the segfault. I do get an IOError: selector is closed, which is probably ok (I don't see where the socket gets closed though). I'd say it's good to go!

tarcieri commented 9 years ago

So the ensure block triggered the segfault?

Does that mean that one of the C extensions isn't handling ensure properly?

tarcieri commented 9 years ago

We could merge https://github.com/ioquatix/celluloid-io/commit/6eb2bd8aa765890c5bc28f810b71b8f0d7ac5acf for now and investigate adding an ensure later

ioquatix commented 9 years ago

My proposed solution using ensure is https://github.com/ioquatix/celluloid-io/commit/a6a8ad7a0c265fdfd49136665c6c82940cd9e9b6

ioquatix commented 9 years ago

I think merging this but not releasing it might be a good first step. We can test it. I believe that we need to make some specific tests to hit this case so we don't have regressions in the future. We also want to make sure this is the correct logic. I just moved lines around to see if it would work based on @tarcieri's advice, but I don't actually know WHY it fixed the issue completely, or if it HAS fixed the issue..

tarcieri commented 9 years ago

Regardless of why it corrects the problem, it's still a better, easier-to-understand factoring, IMO

tarcieri commented 9 years ago

Want to open a PR?

ioquatix commented 9 years ago

Okay cap'in, I'll open a PArrrrrgh. Yarrrr.

celluloid / celluloid-io

ArgumentError: this IO is already registered with selector #121