airbnb / nerve

A service registration daemon that performs health checks; companion to airbnb/synapse
MIT License
942 stars 151 forks source link

[Nerve] avoid zk reporter/watcher reaping on network disconnection #96

Closed lap1817 closed 6 years ago

lap1817 commented 6 years ago

This PR is to follow the similar idea in the PR on Synapse (https://github.com/airbnb/synapse/pull/250).

The current logic in Nerve is that: if an error is raised from the reporter (e.g. zookeeper reporter) when ping?, report_up, report_down, the watcher thread will stop the reporter (so it tries to remove the znode). Then the Nerve main thread will reap the watcher and relaunch it.

This caused unnecessary Nerve status flipping while in short network disconnection. The ping? function on reporter is most frequently called by the watcher thread. E.g.,

Flipping nerve status triggers more Synapse read, which make the network saturation more likely to happen.

The fix is to let the reporter to swallow the error if it is connection error. The reporter is mostly controlled by the watcher thread to report up/down status. Thus:

Reviewers: @alexism @jolynch @darnaut

igor47 commented 6 years ago

specs that explain the expected behavior would be more clear than the long PR description. i would like spec cases that show what happens when a connection is totally lost and when it's briefly interrupted. the specs would set up a ZK object that returns the expected values/raise the expected exceptions, and then show that nerve behaves correctly under those conditions

lap1817 commented 6 years ago

@igor47 , rspec tests are added. PTAL

lap1817 commented 6 years ago

@igor47 @jolynch: please take a look. rspec tests have been updated to test the new behavior of zookeeper reporter

lap1817 commented 6 years ago

@igor47 @jolynch : can you take a look at this one. I have added new rspec tests to cover this change

juchem commented 6 years ago

LGTM just make sure to address the typo and check the two other comments

lap1817 commented 6 years ago

@igor47, I accidentally deleted my repo. Here is the re-created PR: https://github.com/airbnb/nerve/pull/99

I have addressed your comments on the tests. It also also some new change, which allows the connection errors to bubble up when it happened too many