Keeping the state ordinal as the value turns out to have a valid use …

bf2fc6cc711aee1a0c2a / kas-fleetshard

The kas-fleetshard-operator is responsible for provisioning and managing instances of Kafka on a cluster. The kas-fleetshard-synchronizer synchronizes the state of a fleet shard with the kas-fleet-manager.

Apache License 2.0


Keeping the state ordinal as the value turns out to have a valid use … #747

Closed SamBarker closed 2 years ago

SamBarker commented 2 years ago

…case.

It would allow us to get a rough proxy for how long log recovery takes after a broker failure. Or how long a broker takes to reach quiescence on shutdown.

k-wall commented 2 years ago

+1 to the use-case you mention. I definitely see utility in those.

I'm curious why having the value helps and why the label isn't sufficient?

SamBarker commented 2 years ago

Labels are great for identifying which state the broker is in, but the value would allow us to track how long it spent in each. I'm not aware of a way to plot that using just the label.

k-wall commented 2 years ago

OK, I had in mind a technique like this (which relies on the value being 1), but I realize that would tell you the amount of time spent in, say, recovery in the last N hours, not the length of the most recent recovery, which is what I think you have in mind. I spent some time looking for a solution but couldn't find a way to express it in PromQL (it seems strange that I couldn't express 'for how long has this expression most recently yielded 1').
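To make the distinction concrete, here is a minimal Python sketch over already-scraped samples rather than PromQL. The sample data, the function names, and the 30 s scrape interval are all made up for illustration; the series is assumed to be 1 while the broker is in the state of interest and 0 otherwise.

```python
from typing import List, Tuple

Sample = Tuple[float, int]  # (timestamp in seconds, 0 or 1)


def total_time_in_state(samples: List[Sample], scrape_interval: float) -> float:
    """Approximate total seconds the series was 1 across the whole window."""
    return sum(value for _, value in samples) * scrape_interval


def most_recent_run(samples: List[Sample], scrape_interval: float) -> float:
    """Approximate seconds covered by the latest contiguous run of 1s."""
    run = 0
    seen_one = False
    for _, value in reversed(samples):
        if value == 1:
            run += 1
            seen_one = True
        elif seen_one:
            break  # walked past the start of the latest run
    return run * scrape_interval


# Hypothetical window containing two separate recoveries, scraped every 30 s.
samples = [(0, 0), (30, 1), (60, 1), (90, 0), (120, 0),
           (150, 1), (180, 1), (210, 1), (240, 0)]

print(total_time_in_state(samples, 30.0))  # 150.0 (both recoveries combined)
print(most_recent_run(samples, 30.0))      # 90.0  (only the last recovery)
```

The first function is the "time spent in recovery in the last N hours" figure; the second is the "length of the most recent recovery" figure, which needs to know where the latest run starts and ends.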

Out of interest, once the metric has its value again, what PromQL do you intend to use?

SamBarker commented 2 years ago

I didn't have a specific query in mind, just the ability to plot the value and thus get an eyeball view of when the state transitions happen.
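As a rough illustration of what eyeballing the plotted ordinal amounts to, the sketch below lists the points where consecutive samples differ. The ordinal-to-name mapping and the sample data are assumptions made for the example, not taken from the operator's code.

```python
from typing import List, Tuple

# Assumed ordinal-to-name mapping, for illustration only.
STATE_NAMES = {0: "not_running", 1: "starting", 2: "recovery", 3: "running"}


def transitions(samples: List[Tuple[float, int]]) -> List[Tuple[float, str, str]]:
    """Return (timestamp, from_state, to_state) for each observed change."""
    changes = []
    for (_, prev), (ts, curr) in zip(samples, samples[1:]):
        if curr != prev:
            changes.append(
                (ts, STATE_NAMES.get(prev, str(prev)), STATE_NAMES.get(curr, str(curr)))
            )
    return changes


# Hypothetical scrape of the state ordinal every 30 s.
observed = [(0, 1), (30, 2), (60, 2), (90, 2), (120, 3)]
for change in transitions(observed):
    print(change)
```

Each reported timestamp is the first scrape at which the new state was seen, so the real transition happened at some point in the preceding scrape interval.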

k-wall commented 2 years ago

> I didn't have a specific query in mind, just the ability to plot the value and thus get an eyeball view of when the state transitions happen.

Changing track for a moment, have you considered a state timeline (https://grafana.com/docs/grafana/latest/visualizations/state-timeline/) to illustrate the state of the broker? I think that would give the user a good impression of how long the broker had remained in a state. I think that would work with #740 (untested).

SamBarker commented 2 years ago

The state timeline looks ideal, and could possibly work with a fixed value in each series. I'll go experiment.

SamBarker commented 2 years ago

A couple of notes from testing state-timeline: the panel is only available in Grafana 8+, and by default kas-installer deploys Grafana v7.3.2. AFAICT there is no way to make a state timeline chart the changes between labels; it needs the value of the metric to do its thing.

showuon commented 2 years ago

> I'm wondering if the LogManager could expose this too? @showuon WDYT?

I can add that to KIP-831 if we think this is important, but that would need another round of voting and maybe more discussion.

> The problem with trying to infer it from periodic observations of the broker state is that it will be subject to an inaccuracy of up to 2 * scrape interval.

I'm wondering whether that will make a big difference. If not, I think inferring it from the observations could already give us a rough idea of how long the broker stays in one state. WDYT?
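To show where a figure like 2 * scrape interval can come from, here is a small sketch that brackets the duration of the most recent run of a state using only the scraped samples. The function name, the sample data, and the 30 s interval are hypothetical.

```python
from typing import List, Optional, Tuple


def duration_bounds(
    samples: List[Tuple[float, int]], state: int
) -> Optional[Tuple[float, Optional[float]]]:
    """Bracket the duration of the most recent run of `state`.

    lower: last scrape in the state minus first scrape in the state.
    upper: first scrape after the run minus last scrape before it,
           or None when the run touches the edge of the window.
    """
    idx = [i for i, (_, value) in enumerate(samples) if value == state]
    if not idx:
        return None  # the state was never observed
    last = idx[-1]
    first = last
    while first > 0 and samples[first - 1][1] == state:
        first -= 1
    lower = samples[last][0] - samples[first][0]
    if first == 0 or last == len(samples) - 1:
        return lower, None
    upper = samples[last + 1][0] - samples[first - 1][0]
    return lower, upper


# Hypothetical scrape every 30 s; ordinal 2 is assumed to mean recovery.
observed = [(0, 1), (30, 2), (60, 2), (90, 2), (120, 3)]
print(duration_bounds(observed, 2))  # (60, 120)
```

With a regular scrape interval the two bounds are exactly two intervals apart: the broker could have entered the state just after the scrape at 0 s or just before the one at 30 s, and likewise on exit.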

k-wall commented 2 years ago

I suppose it comes down to how important it is to the service to have accurate metrics about how long the broker spends in each state. I guess we should apply YAGNI until we know otherwise and accept that the rough idea given by broker state observations is sufficient. If it turns out to be interesting, we can create a KIP then.

SamBarker commented 2 years ago

I think the broker state machine is the more important metric anyway. We know recovery is a rough proxy for log recovery, but the important thing for the service is when the broker transitions from recovery to running.

I would still be interested in the detailed timing, but I suspect that we don't really care about the loss of precision that 2 * scrape_interval implies.