chuckremes / ffi-rzmq

FFI bindings for ZeroMQ so the library can be used under JRuby and other FFI-compliant ruby runtimes
242 stars 62 forks source link

Correct way to exit #127

Closed black-snow closed 6 years ago

black-snow commented 7 years ago

Hey @chuckremes,

what's the correct way to close sockets / kill contexts / exit programs with ffi-rzmq? For ruby has no destructors I wonder what's the right way to close sockets.

I have a method that uses zmq to get data from another service and I'd like to keep that socket open after the method call for I'll use it often. Do I have to close sockets in order to prevent leaks or will the GC do this for me? I could move the responsibility for closing the socket out of the context but it shouldn't be the caller's responsibility.

chuckremes commented 7 years ago

Sockets have a finalizer defined, so the GC should close them when no more references to the socket remain. However, there is no guarantee WHEN or HOW OFTEN the GC will run, so sockets could hang around a long time. If you are opening and abandoning a lot of sockets, it is recommended that you explicitly close them yourself. If you only open a few sockets, then relying on the GC is probably sufficient.

See here: https://github.com/chuckremes/ffi-rzmq/blob/master/lib/ffi-rzmq/socket.rb#L599

Likewise, a Context also has a finalizer defined. See here:

https://github.com/chuckremes/ffi-rzmq/blob/master/lib/ffi-rzmq/context.rb#L124

However, it is very rare for a program to create/terminate more than a single context in the life of the program. A Context should be created once and reused until the program exits. Sometimes the program will hang on exit (usually due to sockets having a ZMQ_LINGER value set), so you need to be careful. Zeromq isn't really designed for managed languages like Ruby so we can't be as carefree as we'd like.

Unless you have further questions, please close this ticket.

black-snow commented 7 years ago

Thanks for your detailed answer. So setting linger is always a best practice, eh?

black-snow commented 6 years ago

Hey there @chuckremes

I'm still having issues. When I use ffi-rzmq my rack server hangs on exit and sometime stops responding. When I hit crtl+c a couple times I get:

ERROR ThreadError: Attempt to unlock a mutex which is locked by another thread
        /Users/xxx/.rbenv/versions/2.2.3/lib/ruby/gems/2.2.0/gems/rack-1.6.8/lib/rack/lock.rb:22:in `unlock'
        /Users/xxx/.rbenv/versions/2.2.3/lib/ruby/gems/2.2.0/gems/rack-1.6.8/lib/rack/lock.rb:22:in `ensure in call'
        /Users/xxx/.rbenv/versions/2.2.3/lib/ruby/gems/2.2.0/gems/rack-1.6.8/lib/rack/lock.rb:23:in `call'
        /Users/xxx/.rbenv/versions/2.2.3/lib/ruby/gems/2.2.0/gems/rack-1.6.8/lib/rack/content_length.rb:15:in `call'
        /Users/xxx/.rbenv/versions/2.2.3/lib/ruby/gems/2.2.0/gems/rack-1.6.8/lib/rack/handler/webrick.rb:88:in `service'
        /Users/xxx/.rbenv/versions/2.2.3/lib/ruby/2.2.0/webrick/httpserver.rb:138:in `service'
        /Users/xxx/.rbenv/versions/2.2.3/lib/ruby/2.2.0/webrick/httpserver.rb:94:in `run'
        /Users/xxx/.rbenv/versions/2.2.3/lib/ruby/2.2.0/webrick/server.rb:294:in `block in start_thread'

a couple times and finally

/Users/xxx/.rbenv/versions/2.2.3/lib/ruby/gems/2.2.0/gems/rack-1.6.8/lib/rack/handler/webrick.rb:48:in `shutdown': undefined method `shutdown' for nil:NilClass (NoMethodError)
        from /Users/xxx/.rbenv/versions/2.2.3/lib/ruby/gems/2.2.0/gems/rack-1.6.8/lib/rack/server.rb:281:in `block in start'
        from /Users/xxx/.rbenv/versions/2.2.3/lib/ruby/2.2.0/webrick/server.rb:206:in `call'

        from /Users/xxx/.rbenv/versions/2.2.3/lib/ruby/2.2.0/webrick/server.rb:206:in `join'
        from /Users/xxx/.rbenv/versions/2.2.3/lib/ruby/2.2.0/webrick/server.rb:206:in `block (2 levels) in start'
        from /Users/xxx/.rbenv/versions/2.2.3/lib/ruby/2.2.0/webrick/server.rb:206:in `each'
        from /Users/xxx/.rbenv/versions/2.2.3/lib/ruby/2.2.0/webrick/server.rb:206:in `block in start'
        from /Users/xxx/.rbenv/versions/2.2.3/lib/ruby/2.2.0/webrick/server.rb:32:in `start'
        from /Users/xxx/.rbenv/versions/2.2.3/lib/ruby/2.2.0/webrick/server.rb:162:in `start'
        from /Users/xxx/.rbenv/versions/2.2.3/lib/ruby/gems/2.2.0/gems/rack-1.6.8/lib/rack/handler/webrick.rb:34:in `run'
        from /Users/xxx/.rbenv/versions/2.2.3/lib/ruby/gems/2.2.0/gems/rack-1.6.8/lib/rack/server.rb:287:in `start'
        from /Users/xxx/.rbenv/versions/2.2.3/lib/ruby/gems/2.2.0/gems/railties-4.2.10/lib/rails/commands/server.rb:80:in `start'
        from /Users/xxx/.rbenv/versions/2.2.3/lib/ruby/gems/2.2.0/gems/railties-4.2.10/lib/rails/commands/commands_tasks.rb:80:in `block in server'
        from /Users/xxx/.rbenv/versions/2.2.3/lib/ruby/gems/2.2.0/gems/railties-4.2.10/lib/rails/commands/commands_tasks.rb:75:in `tap'
        from /Users/xxx/.rbenv/versions/2.2.3/lib/ruby/gems/2.2.0/gems/railties-4.2.10/lib/rails/commands/commands_tasks.rb:75:in `server'
        from /Users/xxx/.rbenv/versions/2.2.3/lib/ruby/gems/2.2.0/gems/railties-4.2.10/lib/rails/commands/commands_tasks.rb:39:in `run_command!'
        from /Users/xxx/.rbenv/versions/2.2.3/lib/ruby/gems/2.2.0/gems/railties-4.2.10/lib/rails/commands.rb:17:in `<top (required)>'
        from /Users/Shared/xxx/xxx/bin/rails:8:in `require'
        from /Users/Shared/xxx/xxx/bin/rails:8:in `<top (required)>'
        from /Users/xxx/.rbenv/versions/2.2.3/lib/ruby/gems/2.2.0/gems/spring-2.0.2/lib/spring/client/rails.rb:28:in `load'
        from /Users/xxx/.rbenv/versions/2.2.3/lib/ruby/gems/2.2.0/gems/spring-2.0.2/lib/spring/client/rails.rb:28:in `call'
        from /Users/xxx/.rbenv/versions/2.2.3/lib/ruby/gems/2.2.0/gems/spring-2.0.2/lib/spring/client/command.rb:7:in `call'
        from /Users/xxx/.rbenv/versions/2.2.3/lib/ruby/gems/2.2.0/gems/spring-2.0.2/lib/spring/client.rb:30:in `run'
        from /Users/xxx/.rbenv/versions/2.2.3/lib/ruby/gems/2.2.0/gems/spring-2.0.2/bin/spring:49:in `<top (required)>'
        from /Users/xxx/.rbenv/versions/2.2.3/lib/ruby/gems/2.2.0/gems/spring-2.0.2/lib/spring/binstub.rb:31:in `load'
        from /Users/xxx/.rbenv/versions/2.2.3/lib/ruby/gems/2.2.0/gems/spring-2.0.2/lib/spring/binstub.rb:31:in `<top (required)>'
        from /Users/xxx/.rbenv/versions/2.2.3/lib/ruby/2.2.0/rubygems/core_ext/kernel_require.rb:69:in `require'
        from /Users/xxx/.rbenv/versions/2.2.3/lib/ruby/2.2.0/rubygems/core_ext/kernel_require.rb:69:in `require'
        from /Users/xxx/xxx/xxx/bin/spring:13:in `<top (required)>'
        from bin/rails:3:in `load'
        from bin/rails:3:in `<main>'
chuckremes commented 6 years ago

I don't see zmq in the backtrace so it isn't obvious to me what's going on. If it's hanging on exit, how are you shutting down your zeromq sockets? Are you doing so explicitly or relying on the finalizer to do it for you?

Have you tried putting any logging into your systems to indicate when a zeromq socket is open/closed and keep a running count? This could help narrow down the issue if you have 1+ sockets that haven't been closed and no linger is set for them.

Honestly, without more info it's impossible for me to help here. If you can answer some of the above questions we can maybe narrow it down.

black-snow commented 6 years ago

Wow, thanks for the quick reply!

I close all my sockets right after I use them (unless they timeout). I have a service-class that keeps a context at instance level.

atm linger is set to 100 sock.setsockopt(ZMQ::LINGER, 100) - gonna try and see if 0 helps.

black-snow commented 6 years ago

Changing linger didn't do anything. However, I noticed that the server keeps running fine until I try to load another page. Ajax requests work but no full page loads ... And it only occurs when using ffi-rzmq.

After this it hangs - can't see why.

Started GET "/admin/xxx" for ::1 at 2018-02-05 18:37:31 +0100
Processing by Admin::xxxController#index as HTML
  User Load (1.1ms)  SELECT  `users`.* FROM `users` WHERE `users`.`id` = 1  ORDER BY `users`.`id` ASC LIMIT 1
black-snow commented 6 years ago

Might be this https://github.com/rails/rails/issues/10517

black-snow commented 6 years ago

Here's what I have (stripped down version):

# encoding: utf-8

class PredictionService

    SOCK_TIMEOUT = 2
    STATUS_KEEP_ALIVE = 5.minutes

    def initialize(address = nil)
        @context = ZMQ::Context.new
        @address = address ? address : Rails.application.config.prediction_service_address
    end

    def predict(a)
        Timeout::timeout(SOCK_TIMEOUT) do
            if (!@sock.nil?)
                sock = @sock
            else
                sock = get_req_sock(@context, @address)
            end
            return false unless sock

            sock.send_string(format_message(a))
            res = ''
            sock.recv_string(res)
            sock.close if (@sock.nil?)
            res = JSON.parse(res)
            return res['label']
        end
    end

    def connect()
        Timeout::timeout(SOCK_TIMEOUT) do
            @sock = get_req_sock(@context, @address)
            return false unless @sock
        end
        true
    end

    def disconnect()
        @sock.close unless (@sock.nil?)
        @sock = nil
        true
    end

    def status
        begin
            Timeout::timeout(1) do
                sock = get_req_sock(@context, @address)
                return false unless sock

                sock.send_string('{"status":true}')
                status = ''
                rcv = sock.recv_string(status)
                sock.close

                return false unless rcv > 0

                status_object = JSON.parse(status)
                return false unless status_object

                @@status[@address] = [true, DateTime.now]
                return status_object
            end
        rescue Timeout::Error
            false
        end
    end

    def available?(force_check = false)
        return true if (!force_check && @@status[@address] && @@status[@address][0] && @@status[@address][1] > STATUS_KEEP_ALIVE.ago)
        return false if (!force_check && @@status[@address] && !@@status[@address][0] && @@status[@address][1] > STATUS_KEEP_ALIVE.ago)

        begin
            Timeout::timeout(3) do
                sock = get_req_sock(@context, @address)
                unless sock
                    @@status[@address] = [false, DateTime.now]
                    return false
                end

                sock.send_string('{"status":true}')
                rcv = sock.recv_string('')
                sock.close

                if (rcv > 0)
                    @@status[@address] = [true, DateTime.now]
                    return true
                else
                    @@status[@address] = [false, DateTime.now]
                    return false
                end
            end
        rescue Timeout::Error
            @@status[@address] = [false, DateTime.now]
            false
        end
    end

    def self.start_service(address = nil)
        Timeout::timeout(5) do
            system('../xyz/manager', 'start')
            sleep(2)
        end
        true
    end

    def self.stop_service
        Timeout::timeout(5) do
            system('../xyz/manager', 'stop')
        end
        true
    end

    private

    def format_message(a)
        return {a: a}.to_json
    end

    def get_req_sock(context, address)
        sock = context.socket(ZMQ::REQ)
        sock.setsockopt(ZMQ::LINGER, 0)
        conn = sock.connect(address)
        return false unless ZMQ::Util.resultcode_ok?(conn)
        return sock
    end

end

ruby 2.2.3 rails 4.2.10 rack 1.6.8

Gems are up to date. Removing Timeout didn't help. Linger did not make a difference. I'm just calling status. Moving the context to class-level didn't help.

chuckremes commented 6 years ago

@black-snow Is stop_service called at some point to clean up? If so, try adding @context.terminate there (or wherever you handle shutting down the service) and see if it hangs. If it does, then there is an open socket somewhere and the context is waiting for it to terminate. If it does not hang, then this isn't a ffi-rzmq issue.

chuckremes commented 6 years ago

@black-snow ping?

black-snow commented 6 years ago

@chuckremes Interesting, server hangs as soon as I call @context.terminate. The only call I make is status and as you can see in the above code the socket should be closed after and linger is zero, too ... There should not be any open socket.

Sockets are tcp, btw.

black-snow commented 6 years ago

This is where it hangs: https://github.com/chuckremes/ffi-rzmq/blob/master/lib/ffi-rzmq/context.rb#L85

I don't see a LibZMQ.zmq_ctx_term in ffi-rzmq-core/libzmq.rb - there's attach_function :zmq_term, [:pointer], :int, :blocking => true and attach_function :zmq_ctx_destroy, [:pointer], :int, :blocking => true

Damn, zmq_term and zmq_ctx_destroy also lead to a hang.

/edit: found zmq_ctx_term in libzmq4.rb

chuckremes commented 6 years ago

@black-snow You probably already double-checked this, but just in case here is the man page for zmq_ctx_term.

Note that it will block until TWO conditions are satisfied.

  1. All sockets have been closed with zmq_close. AND
  2. All zmq_send have completed or the have timed out due to the ZMQ_LINGER option.

I know you said that you set ZMQ_LINGER to 0. That means that as soon as zmq_close is called then any pending messages should be discarded and the socket will close.

If you are getting a hang on zmq_ctx_term then you still have at least one socket open. Double-check your code.

Debugging hint... for these situations I abuse $global_variables to track a counter. Whenever you create a socket, increment $SocketCount. When you close a socket, decrement $SocketCount. In your termination routine, print out the value of $SocketCount and make sure it is zero. If it isn't, you missed something.

black-snow commented 6 years ago

@chuckremes thanks for your support!

I indeed found a sock being left open. I don't quite see why but by double-checking that it's gone the server seems to run stable again. So thank you very much! (for poiting me to the obvious once again) Now I can easily plug in my micro-services :)

P.S.: Do you still suggest calling @context.terminate explicitly?

chuckremes commented 6 years ago

Glad to help. Yes, I suggest calling @context.terminate explicitly. If you leave it to the finalizer to call it, you are at the mercy of the Ruby runtime to decide when that should happen. Hangs can occur. If you are doing a sequential shutdown, handle the terminate yourself to take the guesswork out of it.