ZK connectivity failure with multiple watchers leads to permanent failure

minkovich commented 7 years ago

Setup: 6 nerve service watchers on the same instance connected to the same ZK pool

How to reproduce:

The instance has a problem connecting to ZK
Nerve ->
Nerve::Nerve: nerve: watcher service1 not alive; reaping and relaunching
Nerve::ServiceWatcher: nerve: stopping service watch service1
Nerve::Nerve: nerve: could not reap service1, got #<Zookeeper::Exceptions::NotConnected: Zookeeper::Exceptions::NotConnected>
This continues in a loop for each service watcher until nerve is restarted.

Actual problem: start() in zookeeper.rb does not check whether the ZK connection is alive before re-using it.

jolynch commented 7 years ago

@minkovich what is your desired behavior here? I suppose that we would like it if Nerve threw out the bad cached connection and tried again?

If the cluster is just not reachable this would lead to a similar infinite retry loop, but perhaps crash-recover is sufficient here?

minkovich commented 7 years ago

@jolynch The efficient solution would be for nerve to throw away the bad connection, but honestly in this situation a crash recovery would also be equivalent since connectivity was already lost.
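
To illustrate the "throw away the bad connection" option, here is a minimal Ruby sketch. It is not nerve's code and not the fix from #113. FakeConnection and ConnectionPool are hypothetical names, and the sketch assumes the client exposes connected? and close. It only shows the idea of checking a cached connection for liveness before handing it to another watcher.

```ruby
# Hypothetical stand-in for a ZooKeeper client. Only the methods the
# sketch relies on (connected? and close) are modelled.
class FakeConnection
  attr_reader :hosts

  def initialize(hosts)
    @hosts = hosts
    @connected = true
  end

  def connected?
    @connected
  end

  # Simulates the session being lost underneath the pool.
  def drop!
    @connected = false
  end

  def close
    @connected = false
  end
end

# Hypothetical pool shared by several watchers, keyed by connection string.
class ConnectionPool
  def initialize(&factory)
    @factory = factory
    @lock = Mutex.new
    @pool = {}
  end

  # Returns a live connection, replacing a cached one that is no longer
  # connected instead of re-using it.
  def checkout(hosts)
    @lock.synchronize do
      conn = @pool[hosts]
      if conn && !conn.connected?
        begin
          conn.close # best-effort cleanup of the dead connection
        rescue StandardError
          # Errors while closing a dead connection are ignored.
        end
        conn = nil
      end
      @pool[hosts] = conn || @factory.call(hosts)
    end
  end
end

pool = ConnectionPool.new { |hosts| FakeConnection.new(hosts) }
first = pool.checkout('zk1:2181')
second = pool.checkout('zk1:2181') # healthy connection is re-used
first.drop!                        # simulate losing ZK connectivity
third = pool.checkout('zk1:2181')  # stale connection is discarded and replaced
```

If the ZK cluster itself is unreachable, the factory call would keep failing on every checkout, which is the retry loop mentioned above. In that case letting the process crash and be restarted by a supervisor may be the simpler choice.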

panchr commented 4 years ago

Closed because this was fixed in #113.

airbnb / nerve

ZK connectivity failure with multiple watchers leads to permanent failure #92