airbnb / nerve

A service registration daemon that performs health checks; companion to airbnb/synapse
MIT License

ZK connectivity failure with multiple watchers leads to permanent failure #92

Closed: minkovich closed this issue 4 years ago

minkovich commented 7 years ago

Setup: 6 nerve service watchers on the same instance connected to the same ZK pool

How to reproduce:

  1. The instance has a problem connecting to ZK
  2. Nerve logs:
       Nerve::Nerve: nerve: watcher service1 not alive; reaping and relaunching
       Nerve::ServiceWatcher: nerve: stopping service watch service1
       Nerve::Nerve: nerve: could not reap service1, got #<Zookeeper::Exceptions::NotConnected: Zookeeper::Exceptions::NotConnected>
  3. This continues in a loop for each service watcher until nerve is restarted.

Actual problem: start() in zookeeper.rb does not check whether the ZK connection is alive before re-using it.
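To illustrate the kind of check that seems to be missing, here is a minimal sketch. It is not nerve's actual code: `FakeZkClient` and `ConnectionPool` are hypothetical stand-ins, and the sketch assumes the real client offers some `connected?`-style liveness check. The idea is only that a shared, cached connection is verified before it is handed to a watcher, and replaced if it is dead.

```ruby
# Hypothetical stand-in for a ZooKeeper client (not the real client API).
class FakeZkClient
  attr_reader :server

  def initialize(server)
    @server = server
    @connected = true
    @closed = false
  end

  # Liveness check; assumed to exist in some form on the real client.
  def connected?
    @connected && !@closed
  end

  def closed?
    @closed
  end

  # Simulate losing connectivity to the ZK pool.
  def drop!
    @connected = false
  end

  def close
    @closed = true
  end
end

# Hypothetical pool shared by all service watchers, keyed by connection string.
class ConnectionPool
  def initialize(&factory)
    @factory = factory
    @pool = {}
    @lock = Mutex.new
  end

  # Return a live connection, replacing the cached one if it is dead.
  def checkout(server)
    @lock.synchronize do
      conn = @pool[server]
      if conn && !conn.connected?
        conn.close # throw away the bad cached connection
        conn = nil
      end
      @pool[server] = conn || @factory.call(server)
    end
  end
end

pool = ConnectionPool.new { |server| FakeZkClient.new(server) }
first = pool.checkout("zk1:2181")
first.drop!                         # connectivity is lost
second = pool.checkout("zk1:2181")  # a fresh connection replaces the dead one
```

With six watchers sharing one pool, each `checkout` after the failure would get the same fresh connection instead of repeatedly re-using the dead one.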

jolynch commented 7 years ago

@minkovich what is your desired behavior here? I suppose that we would like it if Nerve threw out the bad cached connection and tried again?

If the cluster is just not reachable this would lead to a similar infinite retry loop, but perhaps crash-recover is sufficient here?

minkovich commented 7 years ago

@jolynch The efficient solution would be for nerve to throw away the bad connection, but honestly a crash and recovery would be equivalent in this situation, since connectivity was already lost.
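The two options discussed here can be combined. The sketch below retries a bounded number of times and then re-raises, so the process exits and a supervisor can restart it. It is an illustration under assumptions, not nerve's behavior: the local `NotConnected` class stands in for Zookeeper::Exceptions::NotConnected, and `start_with_retries` is a hypothetical helper.

```ruby
# Hypothetical error type standing in for Zookeeper::Exceptions::NotConnected.
class NotConnected < StandardError; end

# Run the block, retrying on NotConnected. After max_attempts failures the
# error is re-raised, so the caller can crash and be restarted (crash recovery).
def start_with_retries(max_attempts:, delay: 0)
  attempts = 0
  begin
    attempts += 1
    yield attempts
  rescue NotConnected
    raise if attempts >= max_attempts
    sleep(delay)
    retry
  end
end

# Usage: the simulated connection succeeds on the third attempt.
result = start_with_retries(max_attempts: 5) do |attempt|
  raise NotConnected, "attempt #{attempt}" if attempt < 3
  "registered after #{attempt} attempts"
end
```

Bounding the retries avoids the infinite loop mentioned above when the cluster is simply unreachable, while still recovering from a short outage without a restart.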

panchr commented 4 years ago

Closed because this was fixed in #113.