airbnb / nerve

A service registration daemon that performs health checks; companion to airbnb/synapse
MIT License
942 stars 151 forks source link

[Nerve] avoid zk reporter/watcher reaping on network disconnection (cont.) #99

Closed lap1817 closed 6 years ago

lap1817 commented 6 years ago

This PR is to follow the similar idea in the PR on Synapse (https://github.com/airbnb/synapse/pull/250).

The current logic in Nerve is that: if an error is raised from the reporter (e.g. zookeeper reporter) when ping?, report_up, report_down, the watcher thread will stop the reporter (so it tries to remove the znode). Then the Nerve main thread will reap the watcher and relaunch it.

This caused unnecessary Nerve status flipping while in short network disconnection. The ping? function on reporter is most frequently called by the watcher thread. E.g.,

Flipping nerve status triggers more Synapse read, which make the network saturation more likely to happen.

Changes

Test

Rspec added/updated Tested on mango-test with simulated packet loss

Reviewers: @igor47 @jolynch @juchem @ken-experiment

lap1817 commented 6 years ago

@juchem , I added the logic to bubble up the connection errors when they happened too many. PTAL

lap1817 commented 6 years ago

@igor47 , PTAL. Rspec are updated to address your previous comments on using context

lap1817 commented 6 years ago

@igor47 , addressed the comments. please take a look

lap1817 commented 6 years ago

@igor47 , PTAL

igor47 commented 6 years ago

the change and accompanying specs look much better, thanks