Closed SamBarker closed 2 years ago
+1 to the use-case you mention. I definitely see utility in those.
I'm curious why having the value helps and why the label isn't sufficient?
Labels are great for identifying what state it's in, but the value would allow us to track how long it spent in each. I'm not aware of a way to plot that using just the label.
OK, I had in mind a technique like this (which relies on the 1 value), but I realize that would tell you the amount of time spent in, say, recovery in the last N hours and not the length of time spent doing the last recovery, which is what I think you have in mind. I spent some time looking for a solution, but couldn't find a way to express it with PromQL (it seems strange that I couldn't express 'for how long has this expression most recently yielded 1').
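To make the distinction concrete, here is a minimal Python sketch, outside PromQL, of the two quantities being discussed. It assumes the scraped series has already been fetched as a time-ordered list of (timestamp, value) pairs where the value is 1 while the broker is in the state of interest and 0 otherwise; the function names and the fixed `scrape_interval` parameter are hypothetical and not part of any existing tooling.

```python
from typing import List, Tuple

# (unix timestamp in seconds, 1 if the broker was in the state at that scrape, else 0)
Sample = Tuple[float, int]


def time_in_state(samples: List[Sample], scrape_interval: float) -> float:
    """Approximate total seconds spent in the state across the whole window.

    Each sample with value 1 is counted as one scrape interval in the state.
    """
    return sum(value for _, value in samples) * scrape_interval


def latest_run_duration(samples: List[Sample], scrape_interval: float) -> float:
    """Approximate length in seconds of the most recent contiguous run of 1s."""
    run = 0
    seen_one = False
    # Walk from newest to oldest: skip any trailing 0s, then count 1s
    # until the first 0 that precedes them.
    for _, value in reversed(samples):
        if value == 1:
            run += 1
            seen_one = True
        elif seen_one:
            break
    return run * scrape_interval


# Hypothetical series scraped every 30s: two visits to the state.
samples = [(t * 30.0, v) for t, v in enumerate([1, 1, 0, 0, 1, 1, 1, 0])]
total = time_in_state(samples, 30.0)
latest = latest_run_duration(samples, 30.0)
```

With this sample data the first function counts both visits to the state, while the second counts only the most recent one, which is the number the original request is after.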
Out of interest, once the metric has its value again, what PromQL do you intend to use?
I didn't have a specific query in mind but just the ability to plot the value and thus get an eyeball view of when the state transitions happen
Changing track for a moment, have you considered a state timeline (https://grafana.com/docs/grafana/latest/visualizations/state-timeline/) to illustrate the state of the broker? I think that would give the user a good impression of how long the broker had remained in a state. I think that would work with #740 (untested).
The state timeline looks ideal, and could possibly work with a fixed value in each series. I'll go experiment.
A couple of notes from testing state-timeline: the panel is only available in Grafana 8+, and by default has-installer deploys Grafana v7.3.2. AFAICT there is no way to make a state timeline chart the changes between labels; it needs the value of the metric to do its thing.
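As a rough illustration of why the value matters here, the following Python sketch collapses a series of sampled state values into contiguous segments, which is roughly the information a state-timeline style plot needs. It is only a sketch under assumptions: the samples are taken to be time-ordered (timestamp, state value) pairs, the state codes 2 and 3 are placeholder values, and `to_segments` is a hypothetical helper rather than anything Grafana provides.

```python
from itertools import groupby
from typing import List, Tuple

Sample = Tuple[float, int]            # (timestamp in seconds, numeric state value)
Segment = Tuple[int, float, float]    # (state value, first seen, last seen)


def to_segments(samples: List[Sample]) -> List[Segment]:
    """Collapse consecutive samples with the same state value into segments."""
    segments = []
    for state, group in groupby(samples, key=lambda sample: sample[1]):
        run = list(group)
        # First and last scrape timestamps at which this state was observed.
        segments.append((state, run[0][0], run[-1][0]))
    return segments


# Hypothetical scrapes every 30s: two samples in state 2, then three in state 3.
segments = to_segments([(0.0, 2), (30.0, 2), (60.0, 3), (90.0, 3), (120.0, 3)])
```

If every series carried the same constant value and differed only by label, there would be nothing for a grouping like this to key on within a single series.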
I'm wondering if the LogManager could expose this too? @showuon WDYT?
I can add that in KIP-831 if we think this is important. But that needs another round of voting and maybe discussion again.
The problem with trying to infer it from periodic observations of the broker state is that it will be subject to an inaccuracy of up to 2 * scrape interval.
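A small Python sketch of where the 2 * scrape interval figure comes from, assuming evenly spaced scrapes: the broker may have entered the state up to one interval before it was first observed there, and may have left it up to one interval after it was last observed there. The function name and parameters are hypothetical.

```python
from typing import Tuple


def duration_bounds(first_seen: float, last_seen: float,
                    scrape_interval: float) -> Tuple[float, float]:
    """Return (lower, upper) bounds in seconds on the true time spent in a state.

    first_seen / last_seen are the timestamps of the first and last scrapes
    at which the state was observed.
    """
    observed = last_seen - first_seen
    # True entry lies within one interval before first_seen, and true exit
    # within one interval after last_seen, so up to two intervals are unseen.
    return observed, observed + 2 * scrape_interval


# State observed from t=100s to t=190s with a 30s scrape interval.
lower, upper = duration_bounds(100.0, 190.0, 30.0)
```

For a state lasting many scrape intervals the relative error is small; for a state that comes and goes within a single interval it may not be observed at all.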
I'm wondering whether that will make a big difference? If not, I think inferring from the observations could already give us a rough idea of how long the broker stays in one state. WDYT?
I suppose it comes down to how important it is to the service to have accurate metrics about how long the broker spends in each state. I guess we should apply YAGNI until we know differently and accept that the rough idea given by broker state observations is sufficient. If it turns out to be interesting, we can create a KIP then.
I think the broker state machine is the more important metric anyway. We know recovery is a rough proxy for log recovery, but the important thing for the service is when the broker transitions from recovery to running.
I would still be interested in the detailed timing, but I suspect that we don't really care about the loss of precision that 2 * scrape_interval implies.
…case.
It would allow us to get a rough proxy for how long log recovery takes after a broker failure. Or how long a broker takes to reach quiescence on shutdown.