airbnb / nerve

A service registration daemon that performs health checks; companion to airbnb/synapse
MIT License
942 stars 151 forks source link

Gracefully handle Zookeeper::Exceptions::NotConnected #113

Closed anson627 closed 5 years ago

anson627 commented 5 years ago

Issue

When load / network saturation causes zookeeper connection issues, nerve becomes flapping: even though service watcher is always healthy, nerve service reporter keeps reporting up over and over again, putting more load on zookeeper and making things worse

The symptom is very similar to what was reported before: https://github.com/airbnb/nerve/issues/92

Root Cause

zk-ruby/zk as a high level wrapper doesn't wrap all the exceptions from the low-level binding zk-ruby/zookeeper, for example, Zookeeper::Exceptions::NotConnected

Looking at the documentation, it doesn't mentioned that native exception could be thrown by high level wrapper either.

Therefore the current implementation does NOT handle native error Zookeeper::Exceptions::NotConnected at all, which is supposed be handled like high level errors, for example, OperationTimeOut and ConnectionLoss: https://github.com/airbnb/nerve/blob/master/lib/nerve/reporter/zookeeper.rb#L8

Instead Zookeeper::Exceptions::NotConnected is treated as StandardError, which caused the thread that runs service watcher and reporter to be killed and restarted later. The restarted service watcher sets the health status to nil by default, and perform health check successfully, causing service reporter tries to report up again.

Fix

Treat Zookeeper::Exceptions::NotConnected the same as connection errors, which goes through re-try cycle with delay in between before killing and restarting service watch thread until reach upper limit

@Ramyak @Jason-Jian @austin-zhu