elastic / kibana

Your window into the Elastic Stack
https://www.elastic.co/products/kibana

Kibana should prioritize `/api/status` and `/api/stats` #170132

Open · afharo opened this issue 10 months ago

afharo commented 10 months ago

When Kibana is under high enough load, the requests to the APIs /api/status and /api/stats that collect these metrics may never respond in time. Metricbeat applies a timeout equal to the collection interval (default 10s). This means that, when Kibana is loaded (ELU- or memory-wise), these requests may be queued with the rest and miss the timeout.

Since the stability of our platform depends on the responses of these APIs, we need to figure out how to prioritize them over the rest of the traffic so that their responses arrive on time.

elasticmachine commented 10 months ago

Pinging @elastic/kibana-core (Team:Core)

pgayvallet commented 10 months ago

Is event-loop prioritization even something remotely technically doable 🤔? (unless you had another approach in mind?)

afharo commented 10 months ago

@rudolf suggested we may want to reject other requests when we identify that Kibana is struggling (so that we make room for any incoming requests to /api/status and /api/stats). I guess the question is: how do we know we are struggling to the point that we shouldn't accept more requests?

I noticed Hapi offers server.options.load to stop responding when ELU, event-loop delay, heap, or RSS is over a threshold: https://hapi.dev/api/?v=21.3.2#-serveroptionsload
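A minimal sketch (not Kibana code) of what those server-wide load limits look like; the threshold values here are made up for illustration. When any sampled limit is exceeded, hapi responds 503 to every route:

```ts
import Hapi from '@hapi/hapi';

const server = Hapi.server({
  port: 5601,
  load: {
    sampleInterval: 1000,            // sample ops metrics every second (0 = sampling disabled)
    maxEventLoopDelay: 500,          // ms of event-loop delay before shedding load
    maxEventLoopUtilization: 0.98,   // ELU threshold (0..1)
    maxHeapUsedBytes: 1_073_741_824, // ~1 GiB heap
    maxRssBytes: 2_147_483_648,      // ~2 GiB RSS
  },
});
```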

However, it's a configuration for the entire server: I don't think it allows applying it only to specific routes (or exempting specific routes). Looking at their implementation, though, it looks like they add a pre interceptor that runs heavy.check() (where heavy is a monitor created with https://github.com/hapijs/heavy).

We may be able to replicate that while also validating the requested path. We already have our own OpsMetrics, so we can use those instead of the heavy library and avoid running yet another monitor for the same metrics.
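For illustration only, a rough sketch of what that could look like as a plugin-level interceptor reusing the ops metrics observable. The API names follow Kibana core's public server contracts (registerOnPreRouting, getOpsMetrics$), but treat the wiring, the import path, and the 500ms threshold as placeholders rather than a concrete proposal:

```ts
import type { CoreSetup, OpsMetrics, Plugin } from '@kbn/core/server';

const EXEMPT_PATHS = new Set(['/api/status', '/api/stats']);
const MAX_EVENT_LOOP_DELAY_MS = 500; // placeholder "magic number", to be tuned

export class LoadSheddingPlugin implements Plugin {
  public setup(core: CoreSetup) {
    let latest: OpsMetrics | undefined;

    // Reuse the ops metrics we already collect instead of adding a `heavy` monitor.
    core.metrics.getOpsMetrics$().subscribe((metrics) => (latest = metrics));

    core.http.registerOnPreRouting((request, response, toolkit) => {
      const overloaded =
        latest !== undefined &&
        latest.process.event_loop_delay > MAX_EVENT_LOOP_DELAY_MS;

      if (overloaded && !EXEMPT_PATHS.has(request.url.pathname)) {
        // Shed generic traffic so status/stats requests still get event-loop time.
        return response.customError({ statusCode: 503, body: 'Kibana is overloaded' });
      }

      return toolkit.next();
    });
  }

  public start() {}
}
```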

Finding the "magic numbers" would be difficult, though. I'd vote for event-loop delay (since an ELU of 1 doesn't necessarily mean we are struggling to the point where circuit breakers are needed), but high delays have caused many problems in the past, beyond just delaying our capacity to reply in time to those short-timed-out status/stats APIs.

WDYT?

rudolf commented 10 months ago

It's important that the circuit breaker doesn't kick in without us signaling that we want to autoscale. So I think a first iteration needs to use an ELU threshold that's higher than the autoscaling threshold.

Because our metrics are always backwards-looking, we would never be able to guarantee that e.g. /api/status responds in time, so this is a best-effort protection.

jloleysens commented 3 months ago

It seems this work will be a natural candidate for our exploration into circuit breakers (specifically rate limiting) for Kibana requests.

> without us signaling that we want to autoscale...first iteration needs to use an ELU threshold that's higher than the autoscaling threshold

IMO we should leverage the existing mechanism of relying on our actual resource usage to be the signal and should avoid trying to do something custom. Thus this first iteration sounds "good enough" to me.

> respond timely

We are planning on revisiting the use of /api/status to make it even more lightweight, which could help with this: https://github.com/elastic/kibana/issues/184503. That said, nothing will really help if the event loop is tied up.