Improve Core's `elasticsearch` service status computation.

elastic / kibana

Your window into the Elastic Stack

https://www.elastic.co/products/kibana

Other

19.62k stars 8.22k forks source link

Improve Core's `elasticsearch` service status computation. #170294

Open pgayvallet opened 1 year ago

pgayvallet commented 1 year ago

At the moment, the elasticsearch status logic is fairly simple:

start with unavailable
at a fixed interval, perform a call to nodes.info ( from pollEsNodesVersion)
- if the call fails for any reason (e.g. connectivity issues) -> critical
- if at least one node is not compatible -> critical
- otherwise -> available

Meaning that atm the service's status is "red or green": either everything is fine (avalable) or nothing is (critical)

Ideally, the quality of the connection between Kibana and ES should be reflected in this status. For example, I would expect the status to depend of the actual usage of the ES client instead of being exclusively based on health checks. That way, we could think of something like:

repeated timeouts in a given period of time would switch to service's status to degraded
more critical connectivities issues, such as connection refused error, should switch the service's status to critical and/or unavailable

Of course, the major functional challenge would be in putting the cursor at the right level. We probably don't want to switch the service to degraded on a single timeout, but we probably don't want to wait until 100% of our requests to timeout before switching our status. Same question about deciding when things are "fine again".

On the technical side, we also need to find how we can properly listen to all requests to retrieve this information. Maybe this needs to be done at the transport level.

Tech pointers

https://github.com/elastic/kibana/blob/35083504464a4950d5ce5c6464cab387dd98d0d8/packages/core/elasticsearch/core-elasticsearch-server-internal/src/version_check/ensure_es_version.ts#L147-L172

https://github.com/elastic/kibana/blob/12466d8b17d8557ff0b561c346511bd1760da4c1/packages/core/elasticsearch/core-elasticsearch-server-internal/src/status.ts#L15-L71

https://github.com/elastic/kibana/blob/12466d8b17d8557ff0b561c346511bd1760da4c1/packages/core/elasticsearch/core-elasticsearch-client-server-internal/src/create_transport.ts#L26-L86

elasticmachine commented 1 year ago

Pinging @elastic/kibana-core (Team:Core)

afharo commented 1 year ago

Meaning that we're not taking into account any problem that can occur in Kibana->ES communications in this status.

As far as I could tell from the logs today (I may need to test it locally), the polling for version also errors when there's a connection issue, setting the service as critical.

elasticsearch service is now critical: Unable to retrieve version information from Elasticsearch nodes. getaddrinfo ENOTFOUND fdea58af242b4ddb9bd7671aa4a4bcf4.es.us-east-1.aws.internal.elastic.cloud

pgayvallet commented 1 year ago

You're actually right, I totally missed it, but we're switching the service's status to critical if any error occurs during the healthcheck, until an healthcheck effectively succeed.

https://github.com/elastic/kibana/blob/35083504464a4950d5ce5c6464cab387dd98d0d8/packages/core/elasticsearch/core-elasticsearch-server-internal/src/version_check/ensure_es_version.ts#L69-L72

https://github.com/elastic/kibana/blob/12466d8b17d8557ff0b561c346511bd1760da4c1/packages/core/elasticsearch/core-elasticsearch-server-internal/src/status.ts#L36-L41

I updated the issue's description accordingly.

It probably also changes the scope of the issue to discussion more than enhancement 😅

sophiec20 commented 11 months ago

fwiw, please be aware of the big investments made by elasticsearch to its health API in recent months. This may shape how we approach this problem.

pgayvallet commented 10 months ago

I think the first thing we need to do here is defining the concept of "elasticsearch status" from Kibana's standpoint we want.

When the Kibana service status API was designed, the concept of "service status" was supposed to be used to report the actual status of our services. This is especially important for the elasticsearch service, as the idea was basically to report whether or not the service was able function, so to communicate with the ES cluster, and not the status of the cluster itself. It might have been fairly naive, but that's what the thinking was at the time.

Which is why the current status of this service is based simply pinging the cluster at a fixed interval, and reporting:

ok if the cluster responded (and all nodes are matching the expected version)
critical we didn't get a response for any reason (or if any node doesn't match the expected version)

The point is, is was supposed to be reporting the status of our service. If ES is unhealthy for any reason, it wasn't supposed to be our concern, as long as we can reach and communicate with it.

So before discussing the technicals of how we could be improving this service's status reporting, we need to decide if we want to stick to that decision or if we want to revisit it.

Are we only interested in knowing about the overall connectivity status to ES? (ES status "from Kibana standpoint")
- in which case we can focus on improvements using Kibana as the observer (e.g by computing a status from the actual traffic/usage of the elasticsearch client)
Or do we think we should have our status more directly depends on ES's status as reported by ES itself (via it's health API, as suggested, for example)
- and if so, should we take it as the single source of truth, or do we want to also have our own observation of the status?

Having Kibana's "ES status" depends on health information directly reported by ES may be more reliable (and probably even easier), and would also avoid "desyncs" between Kibana nodes (as a status computed from the actual usage would vary from one Kibana node to another in a multi-node setup), but could potentially miss some env/infra related issues (e.g bad networking between Kibana and ES...)

As a side note, regarding the discussions driven by @lukeelmers in parallel regarding the need to more easily surface ES's status to plugins, especially on the browser-side, to get to a better user experience, I still wonder if we're not mixing two things that should remain dissociated here. Maybe we should just keep our "elastic status as seen from Kibana" for our status API, and develop something in parallel if solution teams needs to know about the actual/concrete state of the ES cluster itself?

Especially interested in what @rudolf and @lukeelmers think about it.

lukeelmers commented 10 months ago

Maybe we should just keep our "elastic status as seen from Kibana" for our status API, and develop something in parallel if solution teams needs to know about the actual/concrete state of the ES cluster itself?

Thanks for getting the conversation going @pgayvallet. I'm not sure if there's a need for teams to know the overall state of the ES cluster in the browser, and I tend to agree that's a separate problem. I think the main issue we are trying to solve in the UX is making it so Kibana doesn't appear broken when ES is broken. One thing that would help with this is having messaging in the UI to make it clear to users when an issue they are encountering actually has nothing to do with Kibana, but rather something that's going on with ES.

This means that in practice, the types of issues we'd want to report are likely things that would fall into the category of elastic status as seen from Kibana. However, I wouldn't say that rules out optionally consuming ES health APIs on our side to inform the decision... it just doesn't seem to me we'd want to surface those results as-is directly to the UI.

One way I could see us doing this iteratively is:

Make status service "smarter" in how it analyzes ES health based on what Kibana is able to observe (+ maybe consuming ES health APIs as one of the data points)
Expose service to the browser so UIs can incorporate into error handling/messaging
If this doesn't sufficiently improve UX, we later consider alternate solutions for exposing a more generic "state of ES" to plugins

rudolf commented 10 months ago

I agree that these are fundamentally two different problems even if technically we might decide to solve it from the same service. I tried to formulate the problems below, but I'm still struggling to exactly articulate it.

Help server-side plugins change their behaviour depending on the connection to and/or health of Elasticsearch I can imagine several kinds of behaviour changes such as:
1. If there's no connection to ES, don't bother to try sending tons of requests https://github.com/elastic/kibana/issues/170053 One downside is that a single version check fail would then cause all of Kibana to stop doing work for the next 2500ms. I also don't know how much this really helps. While it's wasteful to send 100 requests that would definitely fail, it's not necessarily a problem that would affect users or Kibana's reliability.
2. Improve quality of service by stopping all "low priority background work" if Elasticsearch appears to be under pressure. If Elasticsearch has high memory pressure or thread pool queue sizes we don't want to bombard it with e.g. telemetry collection requests. Maybe we also want to reduce the rate at which ZDT migrations are processing batches to give ES some room to breath etc. The biggest challenge would probably be identifying all such low priority work across all plugins so that this actually makes a tangible difference.
When users are having a degraded experience, how do we help them understand the root cause of the problem (and can we help them address it?) Providing the ES health in the browser would be low effort, but it's not clear to me how plugins would use this in their errors. Aren't the ES errors already descriptive enough of the problem? If there are shard failures or circuit breaking exceptions would the cluster health help the user? What kind of ES degraded user experiences are most common, do we want to account for ES responding with "success" but having slow response times or are we only concerned about addressing explicit failures? Would an "Elasticsearch health" icon next to the "Help" and "What's new" icons in the header bar be useful?