Closed: tbg closed this 1 year ago
Monitoring usually cares about restarts. We monitor uptime because up=1 is flaky (a single failed scrape would mark the node as down). Instead, resets (aka drops) in the uptime counter are a good indicator that the process restarted.
We tend to care about two types of alerts:
It's rare to want alerts on node restarts themselves; you care more about problems caused by unexpected restarts (or even planned restarts such as rolling upgrades). Especially in large deployments (where machines/switches/racks/etc. will cause too much noise), you tend to ignore individual restarts/crashes.
That said, we want to know about individual crashes and investigate/fix them. We currently do this by not auto-restarting nodes (except on register). This triggers the NodeDown alert after some amount of time, but ignores restarts due to rolling upgrades.
We could switch to a mechanism that can tell whether a node shut down cleanly or not. This doesn't need to be a timestamp, just a "last exit code" metric. To know the last time a process crashed, just search for the last transition to a non-zero exit code. Grafana can do whatever logic you want to combine this with each run's uptime.
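To illustrate what that would take, here is a rough wrapper sketch (the binary invocation, output path, and metric name are all made up; in practice this logic would live in whatever supervisor or sidecar runs the process):

```go
// Hypothetical wrapper: run the server, capture its exit code, and write it
// somewhere a metrics agent (e.g. node_exporter's textfile collector) could
// scrape as a "last exit code" gauge.
package main

import (
	"fmt"
	"os"
	"os/exec"
)

func main() {
	cmd := exec.Command("cockroach", os.Args[1:]...)
	cmd.Stdout, cmd.Stderr, cmd.Stdin = os.Stdout, os.Stderr, os.Stdin

	exitCode := 0
	if err := cmd.Run(); err != nil {
		if ee, ok := err.(*exec.ExitError); ok {
			exitCode = ee.ExitCode()
		} else {
			exitCode = 1 // e.g. the binary failed to start at all
		}
	}

	// Expose the exit code in Prometheus text format (path and metric name
	// are illustrative only).
	_ = os.WriteFile("/var/lib/node_exporter/cockroach_exit.prom",
		[]byte(fmt.Sprintf("cockroach_last_exit_code %d\n", exitCode)), 0644)

	os.Exit(exitCode)
}
```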
One issue is that any shutdown that takes too long will get killed by the supervisor and result in a non-zero exit code, even if we told cockroach to shut down cleanly. That might be acceptable, though.
I agree that for large deployments unplanned restarts do happen, but many deployments are really small and want to know about crashes (and we definitely want to for our test clusters).
I'm not sure how you would implement your "last exit code" gauge. There's no way to write that exit code in general, so how would that metric work? I'm not eager to run a sidecar process just for that (and then that might crash, too). You could just pretend there's an exit code of "1" whenever you didn't write a clean shutdown marker, but then you're really just persisting a boolean. The "crash-free uptime" seems implementable with the same amount of effort, and it's more useful.
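Concretely, the marker variant amounts to something like this (the marker name and helper functions are hypothetical, not an actual implementation):

```go
// Sketch of a "clean shutdown marker": its absence at startup means the
// previous run did not shut down cleanly.
package cleanshutdown

import (
	"os"
	"path/filepath"
)

const markerName = "CLEAN_SHUTDOWN"

// StartupCheck reports whether the previous run ended cleanly and removes the
// marker, so that a crash during this run leaves it absent.
func StartupCheck(storeDir string) (cleanLastRun bool) {
	marker := filepath.Join(storeDir, markerName)
	if _, err := os.Stat(marker); err == nil {
		cleanLastRun = true
		_ = os.Remove(marker)
	}
	return cleanLastRun
}

// MarkCleanShutdown is called at the end of a graceful drain/stop; if the
// process dies before reaching it, the next startup sees a crash.
func MarkCleanShutdown(storeDir string) error {
	return os.WriteFile(filepath.Join(storeDir, markerName), nil, 0644)
}
```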
I probably won't have time to do this for 2.1. I'm also not convinced about its utility or what the best solution would be.
@piyush-singh @thtruo is this issue still current? It did seem like a good idea at the time, but I don't recall us having a lot of demand for this. Or do we?
We have marked this issue as stale because it has been inactive for 18 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 10 days to keep the issue queue tidy. Thank you for your contribution to CockroachDB!
When looking at monitoring, we use uptime as an indicator of whether nodes crashed. However, sometimes nodes get restarted intentionally (or accidentally, but at least cleanly).
To quickly distinguish the two, we could introduce a metric that reports "crash-free time". This means the following: we persist a `time.Duration` to a store-local key which records the crash-free uptime. This seems easy to do and I personally expect it to save us (and others) time.
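A minimal sketch of the idea (Store and Gauge here are made-up stand-ins, not CockroachDB's actual storage or metrics APIs):

```go
// Rough sketch of the proposed "crash-free uptime" metric. Store represents
// the store-local key and Gauge a metrics gauge; names are illustrative only.
package crashfree

import "time"

type Store interface {
	// LoadCrashFreeUptime returns the value persisted by the previous run, if any.
	LoadCrashFreeUptime() (time.Duration, bool)
	// SaveCrashFreeUptime persists the value to a store-local key.
	SaveCrashFreeUptime(time.Duration) error
}

type Gauge interface{ Update(int64) }

// Record keeps the gauge and the persisted value up to date. If the previous
// run ended cleanly, its value is carried forward; after a crash we start
// counting from zero again, so a drop to zero marks a crash.
func Record(store Store, gauge Gauge, cleanLastRun bool, stop <-chan struct{}) {
	var prev time.Duration
	if cleanLastRun {
		if d, ok := store.LoadCrashFreeUptime(); ok {
			prev = d
		}
	}
	start := time.Now()
	ticker := time.NewTicker(10 * time.Second) // persistence interval chosen arbitrarily
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			d := prev + time.Since(start)
			gauge.Update(d.Nanoseconds())
			_ = store.SaveCrashFreeUptime(d)
		case <-stop:
			return
		}
	}
}
```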
cc @mberhault, mostly because I think you have an informed opinion here. I know that in an ideal world you (or someone) gets notified when crashes occur, but that is a lot more awkward to set up. In the absence of visual event indicators in the UI (and Grafana? Fat chance), the crash-free uptime seems useful.
Jira issue: CRDB-4969