When load / network saturation causes zookeeper connection issues, nerve becomes flapping: even though service watcher is always healthy, nerve service reporter keeps reporting up over and over again, putting more load on zookeeper and making things worse
zk-ruby/zk as a high level wrapper doesn't wrap all the exceptions from the low-level binding zk-ruby/zookeeper, for example, Zookeeper::Exceptions::NotConnected
Looking at the documentation, it doesn't mentioned that native exception could be thrown by high level wrapper either.
Instead Zookeeper::Exceptions::NotConnected is treated as StandardError, which caused the thread that runs service watcher and reporter to be killed and restarted later. The restarted service watcher sets the health status to nil by default, and perform health check successfully, causing service reporter tries to report up again.
Fix
Treat Zookeeper::Exceptions::NotConnected the same as connection errors, which goes through re-try cycle with delay in between before killing and restarting service watch thread until reach upper limit
Issue
When load / network saturation causes zookeeper connection issues, nerve becomes flapping: even though service watcher is always healthy, nerve service reporter keeps reporting up over and over again, putting more load on zookeeper and making things worse
The symptom is very similar to what was reported before: https://github.com/airbnb/nerve/issues/92
Root Cause
zk-ruby/zk as a high level wrapper doesn't wrap all the exceptions from the low-level binding zk-ruby/zookeeper, for example, Zookeeper::Exceptions::NotConnected
Looking at the documentation, it doesn't mentioned that native exception could be thrown by high level wrapper either.
Therefore the current implementation does NOT handle native error Zookeeper::Exceptions::NotConnected at all, which is supposed be handled like high level errors, for example, OperationTimeOut and ConnectionLoss: https://github.com/airbnb/nerve/blob/master/lib/nerve/reporter/zookeeper.rb#L8
Instead Zookeeper::Exceptions::NotConnected is treated as StandardError, which caused the thread that runs service watcher and reporter to be killed and restarted later. The restarted service watcher sets the health status to nil by default, and perform health check successfully, causing service reporter tries to report up again.
Fix
Treat Zookeeper::Exceptions::NotConnected the same as
connection
errors, which goes through re-try cycle with delay in between before killing and restarting service watch thread until reach upper limit@Ramyak @Jason-Jian @austin-zhu