Try to fix the "stuck" issue.

Background

We bumped into this twice recently.

The first one happened when an instance that ZooKeeper runs on reported degraded hardware; we replaced that instance and performed a rolling restart.

The second one happened when a network blip caused unstable connections to ZooKeeper.
Explanation
After digging into this, I found that it is a rare case that isn't covered in the code. Below I use a simplified workflow to explain it.
In lib/nerve/service_watcher.rb, nerve reports health status this way:
until <some condition>
  check_and_report
end
In check_and_report, we first check the connection to zk, then check the service status, and finally report it to zk, roughly like this:
if <ping zk fails>
  @was_up = nil
end
is_up = <check service status>
if is_up != @was_up
  <either report_up or report_down>
end
@was_up = is_up
In the scenarios above, where a zk node went bad at some point but came back after a while, the workflow plays out like this:

- While the zk node is down, every ping fails, so @was_up is reset to nil; meanwhile is_up is true, so a report is attempted but cannot reach zk, and @was_up = is_up still sets @was_up to true.
- At some point after @was_up = is_up runs, the bad zk node comes back. Now @was_up is true and is_up is true.
- Since the health status never reached zk, and is_up != @was_up will never be met again because both are now true, this health status is never reported.
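Under the assumption that a report attempted while zk is down is simply lost, the stuck scenario can be reproduced with a small, self-contained simulation (StuckWatcherSim, zk_up, and service_up are hypothetical names for illustration, not nerve's actual API):

```ruby
# Hypothetical simulation of the current check_and_report logic.
# zk_up stands in for the zk ping, service_up for the service health check.
class StuckWatcherSim
  attr_reader :reports

  def initialize
    @was_up = nil
    @reports = []
  end

  def check_and_report(zk_up:, service_up:)
    @was_up = nil unless zk_up                    # ping failed: forget last state
    is_up = service_up
    if is_up != @was_up
      @reports << (is_up ? :up : :down) if zk_up  # a report while zk is down is lost
    end
    @was_up = is_up                               # updated even when the report was lost
  end
end

sim = StuckWatcherSim.new
3.times { sim.check_and_report(zk_up: false, service_up: true) } # zk node is down
5.times { sim.check_and_report(zk_up: true,  service_up: true) } # zk node came back
```

After the zk node recovers, sim.reports is still empty: the up status was swallowed while zk was down, @was_up is already true, and the status is never re-sent.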
Proposal
If the ping to zk fails, set @was_up to false and return immediately. This skips checking the service health status as well as attempting to report either up or down: if nerve can't connect to zk, there's no point in even trying.

The logic stays almost the same, but slightly simpler:
- If the ping to zk fails, @was_up is set to false and we return. (Previously @was_up was set to is_up, which could be either true or false but was never reported.) The loop then comes back and repeats this process.
- If the ping succeeds, @was_up stays as it is at the beginning (assume it's false), and is_up still does its job:
  - if is_up is true, then is_up != @was_up holds and the status is reported;
  - if it's false, nothing happens, per the current logic, until the check succeeds.
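The proposed early-return can be sketched with a small simulation (FixedWatcherSim, zk_up, and service_up are hypothetical names for illustration, not nerve's actual API):

```ruby
# Hypothetical simulation of check_and_report with the proposed fix applied.
class FixedWatcherSim
  attr_reader :reports

  def initialize
    @was_up = nil
    @reports = []
  end

  def check_and_report(zk_up:, service_up:)
    unless zk_up
      @was_up = false   # proposed fix: mark as down and bail out early
      return
    end
    is_up = service_up
    @reports << (is_up ? :up : :down) if is_up != @was_up
    @was_up = is_up
  end
end

sim = FixedWatcherSim.new
3.times { sim.check_and_report(zk_up: false, service_up: true) } # zk node is down
sim.check_and_report(zk_up: true, service_up: true)              # zk node came back
```

On the first successful ping after recovery, is_up (true) differs from @was_up (false), so the up status is finally reported instead of being stuck.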
Test
Review
@juchem @jolynch @Jason-Jian @igor47 @darnaut