arachnys / cabot

Self-hosted, easily-deployable monitoring and alerts service - like a lightweight PagerDuty
MIT License

All checks green for minutes yet instances and services are still marked as failing? #611

Closed by hartwork 6 years ago

hartwork commented 6 years ago

Hi!

Now that #610 is no longer keeping my HTTP checks red and all my checks have been green for minutes, instances and services are still reported as failing. No matter where I acknowledge, pause, re-run, disable or re-enable, they don't go back to Passing. I did find #9, but that was fixed. How do services and instances normally return to Passing in Cabot?

Thanks, Sebastian

dbuxton commented 6 years ago

Are the update_service jobs getting queued? I haven't observed this behavior, so I guess it might be something to do with jobs getting stuck or a big backlog. I would need more information to diagnose.

hartwork commented 6 years ago

Restarting all Docker containers fixed the symptom. I'm not sure where it got stuck. The observed behavior showed me two things: (1) manually running checks does not update related instances and services (I think that's a bug?), and (2) there is no way to "see" that checks are not running without explicit suspicion or a second instance of Cabot for mutual monitoring. To make it noticeable when checks are not running, a display such as "most recent check run m:ss minutes ago" somewhere could help. The manual-trigger issue seems more important (and easier to fix) to me, though. What do you think?
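To make the "most recent check run m:ss minutes ago" idea concrete, here is a minimal sketch of how such a label could be computed. The function name `format_last_run` and its parameters are hypothetical and are not taken from Cabot's code base; the sketch only assumes that a timestamp of the latest check run is available somewhere.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def format_last_run(last_run: Optional[datetime],
                    now: Optional[datetime] = None) -> str:
    """Render the age of the latest check run as 'm:ss' (hypothetical helper)."""
    if last_run is None:
        return "no check run recorded yet"
    now = now or datetime.now(timezone.utc)
    # Clamp to zero so clock skew never produces a negative age.
    elapsed = max(int((now - last_run).total_seconds()), 0)
    minutes, seconds = divmod(elapsed, 60)
    return f"most recent check run {minutes}:{seconds:02d} minutes ago"


# Example with fixed timestamps so the result is reproducible.
now = datetime(2018, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
label = format_last_run(now - timedelta(minutes=7, seconds=5), now)
print(label)
```

A label like this would grow visibly stale if the check runner got stuck, which is the situation that was otherwise hard to notice here.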

dbuxton commented 6 years ago

Services update asynchronously so that would explain the behaviour you saw.

hartwork commented 6 years ago

I see. I think I now see why you might have wanted it to be asynchronous: that update action doesn't seem to scale well, so I imagine it could take significantly longer in medium-size setups. Is that the reasoning for being asynchronous?

dbuxton commented 6 years ago

Well, it can potentially update a lot of different services/instances, and there's no real benefit to it being synchronous (except potentially in the scenario where you trigger manually).
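The behaviour discussed in this thread can be illustrated with a small, self-contained sketch. This is not Cabot's implementation: it uses Python's standard `queue` and `threading` modules as a stand-in for a real job queue, and the names `run_check_manually`, `worker`, and `service_status` are hypothetical. It shows why a manually triggered check does not change the service status until a worker actually processes the queued update.

```python
import queue
import threading

# Hypothetical in-memory state; a real deployment would use a database.
service_status = {"web": "FAILING"}
update_queue: "queue.Queue[str]" = queue.Queue()


def run_check_manually(service: str) -> None:
    """Pretend the check passed, then only enqueue the service update."""
    update_queue.put(service)  # returns immediately, nothing is updated yet


def worker() -> None:
    """Consume queued updates; if this never runs, statuses stay stale."""
    while True:
        service = update_queue.get()
        try:
            service_status[service] = "PASSING"
        finally:
            update_queue.task_done()


run_check_manually("web")
# No worker is running yet, so the status is unchanged.
status_before_worker = service_status["web"]

threading.Thread(target=worker, daemon=True).start()
update_queue.join()  # wait until the queued update has been processed
status_after_worker = service_status["web"]

print(status_before_worker, "->", status_after_worker)
```

In this sketch, starting the worker plays the role that restarting the containers played above: the queued update is applied only once something is consuming the queue again.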

hartwork commented 6 years ago

Alright, thanks.