airbnb / nerve

A service registration daemon that performs health checks; companion to airbnb/synapse
MIT License
942 stars 151 forks source link

Skip checking service status if pinging zk fails #105

Closed ken-experiment-zz closed 5 years ago

ken-experiment-zz commented 5 years ago

Try to fix the "stuck" issue.

Background

We bumped into this twice recently.

The first one happened when an instance that zookeeper runs on reported running on degraded hardware and we replaced that instance and performed a rolling restart.

The second one happened when there's a network blip happened and it caused unstable connections to zookeeper.

Explanation

After digging into this, I found out this is a rare case that didn't get covered in the code.

Below I will use a simplified workflow to explain this.

In file lib/nerve/service_watcher.rb, nerve is reporting health status in this way.

until <some condition>
  check_and_report
end

In check_and_report, we firstly check connection to zk then check service status, and finally report it to zk in the way below:

if <ping zk fails>
  @was_up = nil
end

is_up = <check service status>

if is_up != @was_up
  <either report_up or report_down>
end
@was_up = is_up

In the scenarios above, where we had bad zk node at some point but it came back after a while, the workflow above looks like this:

Proposal

If ping to zk fails, set @was_up to false and return immediately. This way it will skip checking service health status as well as attempting to report either up or down. Because if nerve couldn't connect to zk, then it makes no sense to even try.

The logic stays almost the same, but slightly simpler.

Test

Review

@juchem @jolynch @Jason-Jian @igor47 @darnaut