Discovered this while browsing through some logfiles:
I, [2014-01-03T20:29:35.954755 #2900] INFO -- Nerve::Nerve: nerve: starting up!
I, [2014-01-03T20:29:35.954903 #2900] INFO -- Nerve::Nerve: nerve: starting run
I, [2014-01-03T20:29:35.955280 #2900] INFO -- Nerve::ServiceWatcher: nerve: starting service watch ministry-worker
I, [2014-01-03T20:29:35.955376 #2900] INFO -- Nerve::Reporter: nerve: waiting to connect to zookeeper at 1.zookeeper.qa1.climate.net:2181,0.zookeeper.qa1.climate.net:2181,2.zookeeper.qa1.climate.net:2181/nerve/services/qa1/ministry-worker/default
I, [2014-01-03T20:29:36.015104 #2900] INFO -- Nerve::Reporter: nerve: successfully created zk connection to 1.zookeeper.qa1.climate.net:2181,0.zookeeper.qa1.climate.net:2181,2.zookeeper.qa1.climate.net:2181/nerve/services/qa1/ministry-worker/default
I, [2014-01-03T20:29:36.020927 #2900] INFO -- Nerve::ServiceCheck::HttpServiceCheck: nerve: service check http-i-f76498d9.us-east-1.qa1.climate.net:13050/ initial check returned false
I, [2014-01-03T20:30:08.431404 #2900] INFO -- Nerve::ServiceCheck::HttpServiceCheck: nerve: service check http-i-f76498d9.us-east-1.qa1.climate.net:13050/ transitions to up after 3 successes
I, [2014-01-03T20:30:08.439403 #2900] INFO -- Nerve::ServiceWatcher: nerve: service ministry-worker is now up
E, [2014-01-03T20:44:13.056572 #2900] ERROR -- Nerve::ServiceWatcher: nerve: error in service watcher ministry-worker: response for meth: :exists, args: [425, "/", nil, nil], not received within 30 seconds
I, [2014-01-03T20:44:13.056680 #2900] INFO -- Nerve::ServiceWatcher: nerve: ending service watch ministry-worker
...and then nerve apparently stops probing the service state for as long as the nerve process remains up. Digging into this, the culprit appears to be a hard-coded 30-second timeout (OPERATION_TIMEOUT) in the zookeeper gem's continuation.rb, around line 107 in the version we're running:
      @mutex.synchronize do
        while true
          now = Time.now
          break if @rval or @error or (now > time_to_stop)

          deadline = time_to_stop.to_f - now.to_f
          @cond.wait(deadline)
        end

        if (now > time_to_stop) and !@rval and !@error
          raise Exceptions::ContinuationTimeoutError, "response for meth: #{meth.inspect}, args: #{@args.inspect}, not received within #{OPERATION_TIMEOUT} seconds"
        end

        case @error
        when nil
          # ok, nothing to see here, carry on
        when :shutdown
          raise Exceptions::NotConnected, "the connection is shutting down"
        when ZOO_EXPIRED_SESSION_STATE
          raise Exceptions::SessionExpired, "connection has expired"
        else
          raise Exceptions::NotConnected, "connection state is #{STATE_NAMES[@error]}"
        end

        case @rval.length
        when 1
          return @rval.first
        else
          return @rval
        end
      end
    end
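To spell out why this is terminal: once that exception escapes the gem, the service watcher logs it and ends its watch, and nothing ever restarts it. Here is a tiny self-contained illustration of that pattern; the exception namespace is taken from the excerpt above, and check_and_report plus the loop timing are made-up stand-ins rather than nerve's actual code:

    require 'zookeeper'

    # Illustration only, not nerve's code: simulate the gem giving up after
    # OPERATION_TIMEOUT on a single call.
    def check_and_report
      raise Zookeeper::Exceptions::ContinuationTimeoutError,
            'response for meth: :exists, not received within 30 seconds'
    end

    watcher = Thread.new do
      loop do
        begin
          check_and_report
        rescue Zookeeper::Exceptions::ContinuationTimeoutError => e
          warn "error in service watcher: #{e.message}"
          break # the watch ends here and nothing restarts it
        end
        sleep 2
      end
    end

    watcher.join
    # The process is still alive at this point, but the service is no longer
    # being checked -- which matches the "ending service watch" line above.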
The good news, such as it is, is that if this happens while the node is healthy, nerve will "fail open" -- the node will continue to be registered in ZK, and synapse clients will still see it and route traffic to it as long as it passes /ping. The bad news is that if this happens while the node is unhealthy, it will stay that way from synapse's perspective until the nerve process is manually restarted.
I am playing with a code change that simply has the nerve process abort if any of the service watchers die in this fashion (we run nerve under supervise here, so it would be restarted automatically in that case), but I would like some feedback on what you think the proper way to handle this is before submitting a pull request.
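To make the proposal concrete, here is roughly the shape I have in mind. This is a hypothetical sketch rather than the actual patch: "watchers" and the per-watcher run method are stand-ins for nerve's real objects, not its API.

    require 'logger'

    log = Logger.new($stderr)

    # Hypothetical sketch: run each service watcher in its own thread, and if
    # any thread dies -- e.g. from the continuation timeout above -- exit the
    # whole process so supervise can restart it from a clean state.
    watcher_threads = watchers.map do |watcher|
      Thread.new { watcher.run }
    end

    loop do
      if watcher_threads.any? { |t| !t.alive? }
        log.error('nerve: a service watcher died; exiting so supervise can restart us')
        exit!(1)
      end
      sleep 5
    end

A smaller variant would be to set abort_on_exception on the watcher threads instead of polling them; part of what I am asking is which of these fits nerve's structure better.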