Try to fix the "stuck" issue.

Background

We bumped into this twice recently.

The first one happened when an instance that ZooKeeper runs on reported degraded hardware; we replaced that instance and performed a rolling restart.

The second one happened when a network blip caused unstable connections to ZooKeeper.
Explanation
After digging into this, I found that it is a rare case that isn't covered in the code. Below I use a simplified workflow to explain it.
In lib/nerve/service_watcher.rb, nerve reports health status this way:
until <some condition>
  check_and_report
end
In check_and_report, we first check the connection to zk, then check the service status, and finally report it to zk, roughly like this:
if <ping zk fails>
  @was_up = nil
end
is_up = <check service status>
if is_up != @was_up
  <either report_up or report_down>
end
@was_up = is_up
In the scenarios above, where a zk node went bad at some point but came back after a while, the workflow plays out like this:

- While the zk node is down, every ping fails, so @was_up is reset to nil; meanwhile is_up is true, so a report is attempted but cannot reach zk, and @was_up = is_up still sets @was_up to true.
- At some point after @was_up = is_up runs, the bad zk node comes back. Now @was_up is true and is_up is true.
- Since the health status never reached zk, and is_up != @was_up will never be met again because both are now true, this health status is never reported.
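Under the assumption that a report attempted while zk is down is simply lost, the stuck scenario can be reproduced with a small, self-contained simulation (StuckWatcherSim, zk_up, and service_up are hypothetical names for illustration, not nerve's actual API):

```ruby
# Hypothetical simulation of the current check_and_report logic.
# zk_up stands in for the zk ping, service_up for the service health check.
class StuckWatcherSim
  attr_reader :reports

  def initialize
    @was_up = nil
    @reports = []
  end

  def check_and_report(zk_up:, service_up:)
    @was_up = nil unless zk_up                    # ping failed: forget last state
    is_up = service_up
    if is_up != @was_up
      @reports << (is_up ? :up : :down) if zk_up  # a report while zk is down is lost
    end
    @was_up = is_up                               # updated even when the report was lost
  end
end

sim = StuckWatcherSim.new
3.times { sim.check_and_report(zk_up: false, service_up: true) } # zk node is down
5.times { sim.check_and_report(zk_up: true,  service_up: true) } # zk node came back
```

After the zk node recovers, sim.reports is still empty: the up status was swallowed while zk was down, @was_up is already true, and the status is never re-sent.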
Proposal
If the ping to zk fails, set @was_up to false and return immediately. This skips checking the service health status as well as attempting to report either up or down: if nerve can't connect to zk, there's no point in even trying.

The logic stays almost the same, but slightly simpler:
- If the ping to zk fails, @was_up is set to false and we return. (Previously @was_up was set to is_up, which could be either true or false but was never reported.) The loop then comes back and repeats this process.
- If the ping succeeds, @was_up stays as it is at the beginning (assume it's false), and is_up still does its job:
  - if is_up is true, then is_up != @was_up holds and the status is reported;
  - if it's false, nothing happens, per the current logic, until the check succeeds.
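The proposed early-return can be sketched with a small simulation (FixedWatcherSim, zk_up, and service_up are hypothetical names for illustration, not nerve's actual API):

```ruby
# Hypothetical simulation of check_and_report with the proposed fix applied.
class FixedWatcherSim
  attr_reader :reports

  def initialize
    @was_up = nil
    @reports = []
  end

  def check_and_report(zk_up:, service_up:)
    unless zk_up
      @was_up = false   # proposed fix: mark as down and bail out early
      return
    end
    is_up = service_up
    @reports << (is_up ? :up : :down) if is_up != @was_up
    @was_up = is_up
  end
end

sim = FixedWatcherSim.new
3.times { sim.check_and_report(zk_up: false, service_up: true) } # zk node is down
sim.check_and_report(zk_up: true, service_up: true)              # zk node came back
```

On the first successful ping after recovery, is_up (true) differs from @was_up (false), so the up status is finally reported instead of being stuck.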
Test
Review
@juchem @jolynch @Jason-Jian @igor47 @darnaut