Summary

👋🏼 howdy, team!

I've noticed across a couple of clusters that Kibana can end up in a degraded status due to `capacity_estimation`, where the real source is high `runtime` > `drift`, usually `drift_by_type` for `alerting:*` (a.k.a. expensive rules).

What I really feel is a bug (though it could be labelled a feature request instead) is that even when `drift` `p50` is backed up by ~3 minutes, usually with `load.p50: 100`, `runtime` still reports `status: OK`. Can we put some logic in there to flip this to `warn`/`error` at some point?
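To make the field paths concrete, here is a minimal sketch of pulling the values in question out of the Task Manager health API (`GET /api/task_manager/_health`); the `KIBANA_URL` constant and the lack of auth are assumptions for a local dev setup, not part of the clusters described below:

```ts
// Sketch: read the fields this issue is about from the Task Manager health API.
// Assumes a local, unauthenticated Kibana; adjust the URL/headers for real clusters.
const KIBANA_URL = 'http://localhost:5601';

async function fetchTaskManagerHealth(): Promise<void> {
  const res = await fetch(`${KIBANA_URL}/api/task_manager/_health`);
  const health = await res.json();

  const runtime = health.stats?.runtime;
  console.log('overall status         :', health.status);
  console.log('runtime.status         :', runtime?.status); // stays "OK" today
  console.log('runtime drift p50 (ms) :', runtime?.value?.drift?.p50);
  console.log('runtime load p50       :', runtime?.value?.load?.p50);
  console.log('capacity_estimation    :', health.stats?.capacity_estimation?.status);
}

fetchTaskManagerHealth().catch(console.error);
```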
Example
I've dealt with this situation with a couple of users; the most egregious cases have been air-gapped, so I can't share those examples. However, here is a low-to-medium example output in full:
[A]
I wrote an automation to root-cause the problematic plugin, so it reports:
My report automation goes on from there, but pivoting to what's applicable for this GitHub issue, the "Evaluate the Runtime" doc section says:
Theory: Kibana is polling as frequently as it should, but that isn't often enough to keep up with the workload
...
For details on achieving higher throughput by adjusting your scaling strategy, see Scaling guidance.
In our example(s), compared to this doc section, the load is actually `p50: 100` and drift is >1 min. In a recent air-gapped example (not represented below), drift was >3 min:
So overall it makes sense that this drift + load cascades into `capacity_estimation` messages, since that's where the docs point. However, for API response interpretation/usability and for diagnostic automations, it doesn't really make sense that `runtime` never flagged `status: warn` (or something more severe), since the root cause of the problem was inside `runtime` and only cascaded into `capacity_estimation`.
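For diagnostic automations, the workaround today is to ignore `runtime.status` and pivot on `drift_by_type` directly to surface the expensive rule types; a rough sketch of that kind of check (response shape assumed, not my actual tool):

```ts
// Workaround sketch: runtime.status stays OK, so rank drift_by_type ourselves
// to surface the task/rule types driving the drift. Percentile shape is assumed
// from the health API response.
interface Percentiles {
  p50: number;
  p90?: number;
  p95?: number;
  p99?: number;
}

function topDriftOffenders(
  driftByType: Record<string, Percentiles>,
  limit = 5
): Array<{ taskType: string; p50DriftMs: number }> {
  return Object.entries(driftByType)
    .map(([taskType, drift]) => ({ taskType, p50DriftMs: drift.p50 }))
    .sort((a, b) => b.p50DriftMs - a.p50DriftMs)
    .slice(0, limit);
}

// e.g. topDriftOffenders(health.stats.runtime.value.drift_by_type)
//   -> [{ taskType: 'alerting:siem.queryRule', p50DriftMs: 185000 }, ...]
```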
Request
I don't know the right literal values, but some logic like:

IF runtime.drift.p50 > 60000 THEN runtime.status = warn
IF runtime.load.p50 == 100 THEN runtime.status = error
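In TypeScript terms, the requested behaviour would look something like the sketch below; this is not the actual Task Manager status calculator, and the thresholds are placeholders pending real guidance:

```ts
// Sketch of the requested behaviour: derive runtime.status from drift/load
// instead of always returning OK. Thresholds are placeholders, not tuned values.
const DRIFT_P50_WARN_MS = 60_000; // assumption: 1 minute of p50 drift
const LOAD_P50_ERROR = 100;       // assumption: workers fully saturated

type HealthStatus = 'OK' | 'warn' | 'error';

function runtimeStatus(driftP50Ms: number, loadP50: number): HealthStatus {
  if (loadP50 >= LOAD_P50_ERROR) return 'error';
  if (driftP50Ms > DRIFT_P50_WARN_MS) return 'warn';
  return 'OK';
}
```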