InVisionApp / go-health

Library for enabling asynchronous health checks in your service
MIT License
750 stars 58 forks source link

Fix global failed status when multiple fatal checks have mixed results #50

Closed endorama closed 6 years ago

endorama commented 6 years ago

Hello, first of all thank you for this library!

I really like the effective and simple implementation.

I found a bug that I suspect is a race condition in the global status when there are multiple fatal checks with mixed order.

Given the presence of multiple fatal checks, some failing and some passing, if the order of execution makes the passing check end last the global status was reported as ok even if other fatal check were in a failed state.

Looking at the code, each goroutine was updating the h.failed "thread-global" variable at the end of it's check cycle. This makes the h.failed variable to have the last value set by a goroutine in order of exectuion time. If a successful fatal test run after a failing one, the h.failed variable was reporting true.

To reproduce the bug I updated the test cases

TestFailed/
  Should_return_false_if_a_fatally_configured_check_hasn't_errored

and

TestState/
  When_a_fatally-configured_check_fails_and_recovers,_state_should_get_updated_accordingly

to use two checkers instead of one, both fatal but one failing and one passing, ordered such that the passing check was executed after the failing one.
( You can still test this at commit 1fa23aa )

This PR adds a little feature and implements a fix for this behaviour.

Feature

Add Fatal field in the State struct.
This information is really helpful when multiple checks (mixed fatal and not fatal) are present and the global status is failed.
With this information exposed is clear which check is making the global status to fail.

The bugfix is based upon the Fatal field exposed in the State struct, this could be reworked but I found the Fatal to be useful there anyway.

Bugfix

Remove the h.failed boolean and rely only on State information to evaluate the condition.
With this implementation the h.Failed() function relies on safeGetStates to reported last known status.

Thank you for looking into this!

codecov[bot] commented 6 years ago

Codecov Report

Merging #50 into master will decrease coverage by 1.16%. The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #50      +/-   ##
==========================================
- Coverage     100%   98.83%   -1.17%     
==========================================
  Files           6        6              
  Lines         343      344       +1     
==========================================
- Hits          343      340       -3     
- Misses          0        4       +4
Impacted Files Coverage Δ
health.go 100% <100%> (ø) :arrow_up:
safe.go 81.81% <0%> (-18.19%) :arrow_down:

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update e3410a5...aefe268. Read the comment docs.

dselans commented 6 years ago

Thanks for your contribution!

On second look, this is pretty obvious... thanks for fixing this bug :)

It's unfortunate that we have to traverse the entire map to determine whether the healthcheck has fatally failed, but it's certainly better than what we have right now.

Thanks again!

endorama commented 6 years ago

Thank you for merging this!