pgayvallet opened 1 year ago
Pinging @elastic/kibana-core (Team:Core)
Meaning that we're not taking into account any problem that can occur in Kibana->ES communications in this status.
As far as I could tell from the logs today (I may need to test it locally), the polling for version also errors when there's a connection issue, setting the service as `critical`:

> elasticsearch service is now critical: Unable to retrieve version information from Elasticsearch nodes. getaddrinfo ENOTFOUND fdea58af242b4ddb9bd7671aa4a4bcf4.es.us-east-1.aws.internal.elastic.cloud
You're actually right, I totally missed it: we're switching the service's status to `critical` if any error occurs during the healthcheck, until a healthcheck effectively succeeds.
I updated the issue's description accordingly.
It probably also changes the scope of the issue to discussion more than enhancement 😅
FWIW, please be aware of the big investments Elasticsearch has made in its health API in recent months. This may shape how we approach this problem.
I think the first thing we need to do here is define the concept of "elasticsearch status" we want from Kibana's standpoint.
When the Kibana service status API was designed, the concept of "service status" was supposed to be used to report the actual status of our services. This is especially important for the elasticsearch service, as the idea was basically to report whether or not the service was able to function, i.e. to communicate with the ES cluster, and not the status of the cluster itself. It might have been fairly naive, but that's what the thinking was at the time.
Which is why the current status of this service is based on simply pinging the cluster at a fixed interval, and reporting:

- `ok` if the cluster responded (and all nodes are matching the expected version)
- `critical` if we didn't get a response for any reason (or if any node doesn't match the expected version)

The point is, it was supposed to be reporting the status of our service. If ES is unhealthy for any reason, it wasn't supposed to be our concern, as long as we can reach and communicate with it.
So before discussing the technical details of how we could improve this service's status reporting, we need to decide if we want to stick to that decision or if we want to revisit it.
Having Kibana's "ES status" depend on health information directly reported by ES may be more reliable (and probably even easier), and would also avoid "desyncs" between Kibana nodes (as a status computed from actual usage would vary from one Kibana node to another in a multi-node setup), but it could potentially miss some env/infra related issues (e.g. bad networking between Kibana and ES...).
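For illustration, a minimal sketch of what leaning on ES-reported health could look like, assuming we poll the `_health_report` API (ES 8.7+) through the client's raw transport; the mapping to Kibana-style status levels below is purely illustrative, not an existing Kibana API:

```ts
import { Client } from '@elastic/elasticsearch';

type KibanaStatusLevel = 'available' | 'degraded' | 'critical';

// Sketch only: maps the cluster-reported health to a Kibana-style status level.
// Uses the raw transport to call `_health_report`; the response is narrowed
// to the one field we care about here.
async function getClusterReportedStatus(client: Client): Promise<KibanaStatusLevel> {
  try {
    const report = await client.transport.request<{ status: 'green' | 'yellow' | 'red' | 'unknown' }>({
      method: 'GET',
      path: '/_health_report',
    });
    if (report.status === 'green') return 'available';
    if (report.status === 'yellow') return 'degraded';
    return 'critical';
  } catch (e) {
    // Note that this branch conflates "ES says it is unhealthy" with "we cannot reach ES",
    // which is exactly the distinction being discussed here.
    return 'critical';
  }
}
```

Whether Kibana should ever surface that kind of cluster-level health as its own `elasticsearch` service status is precisely the open question.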
As a side note, regarding the discussions driven in parallel by @lukeelmers about the need to more easily surface ES's status to plugins, especially on the browser side, to get to a better user experience: I still wonder if we're not mixing two things that should remain dissociated here. Maybe we should just keep our "elastic status as seen from Kibana" for our status API, and develop something in parallel if solution teams need to know about the actual/concrete state of the ES cluster itself?
Especially interested in what @rudolf and @lukeelmers think about it.
> Maybe we should just keep our "elastic status as seen from Kibana" for our status API, and develop something in parallel if solution teams need to know about the actual/concrete state of the ES cluster itself?
Thanks for getting the conversation going @pgayvallet. I'm not sure if there's a need for teams to know the overall state of the ES cluster in the browser, and I tend to agree that's a separate problem. I think the main issue we are trying to solve in the UX is making it so Kibana doesn't appear broken when ES is broken. One thing that would help with this is having messaging in the UI to make it clear to users when an issue they are encountering actually has nothing to do with Kibana, but rather something that's going on with ES.
This means that in practice, the types of issues we'd want to report are likely things that would fall into the category of "elastic status as seen from Kibana". However, I wouldn't say that rules out optionally consuming ES health APIs on our side to inform the decision... it just doesn't seem to me we'd want to surface those results as-is directly to the UI.
One way I could see us doing this iteratively is:
I agree that these are fundamentally two different problems, even if technically we might decide to solve them from the same service. I tried to formulate the problems below, but I'm still struggling to articulate them exactly.
At the moment, the `elasticsearch` status logic is fairly simple:

- `unavailable` until the first `nodes.info` check (from `pollEsNodesVersion`) completes
- `critical` if the `nodes.info` call fails for any reason
- `critical` if any node's version doesn't match the expected version
- `available` otherwise

Meaning that atm the service's status is "red or green": either everything is fine (`available`) or nothing is (`critical`).
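For context, a simplified sketch of roughly what that mapping does (see the `status.ts` pointer below; the type and field names here are illustrative, not the actual internal ones):

```ts
// Sketch of the current "red or green" mapping: the poll result either exists and is
// compatible (available), exists and is not (critical), or doesn't exist yet (unavailable).
interface NodesVersionCompatibility {
  isCompatible: boolean;
  message?: string;
}

type EsStatusLevel = 'unavailable' | 'critical' | 'available';

function calculateEsStatus(
  lastPollResult: NodesVersionCompatibility | undefined
): { level: EsStatusLevel; summary: string } {
  if (lastPollResult === undefined) {
    // No poll result yet: we don't know anything about ES.
    return { level: 'unavailable', summary: 'Waiting for Elasticsearch' };
  }
  if (!lastPollResult.isCompatible) {
    // Both "the nodes.info call failed" and "a node runs an incompatible version"
    // surface as an incompatible result, hence as critical.
    return { level: 'critical', summary: lastPollResult.message ?? 'Incompatible Elasticsearch nodes' };
  }
  return { level: 'available', summary: 'Elasticsearch is available' };
}
```

There is no intermediate state: a single failing healthcheck is enough to go straight from `available` to `critical`.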
Ideally, the quality of the connection between Kibana and ES should be reflected in this status. For example, I would expect the status to depend on the actual usage of the ES client instead of being exclusively based on health checks. That way, we could think of something like:

- `degraded` if some requests to ES fail or time out
- `critical` and/or `unavailable` if most (or all) requests to ES fail
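To make this concrete, here is one possible (hypothetical) shape for deriving the level from observed request outcomes, using a sliding window over the last N requests; the window size and thresholds are placeholders, not a proposal:

```ts
type StatusLevel = 'available' | 'degraded' | 'critical';

// Sketch only: tracks the outcome of the last N requests made through the ES client
// and derives a status level from the observed failure ratio.
class EsRequestOutcomeTracker {
  private readonly outcomes: boolean[] = [];

  constructor(
    private readonly windowSize = 100,
    private readonly degradedThreshold = 0.1,
    private readonly criticalThreshold = 0.5
  ) {}

  recordSuccess() {
    this.record(true);
  }

  recordFailure() {
    this.record(false);
  }

  currentLevel(): StatusLevel {
    if (this.outcomes.length === 0) {
      return 'available';
    }
    const failures = this.outcomes.filter((ok) => !ok).length;
    const ratio = failures / this.outcomes.length;
    if (ratio >= this.criticalThreshold) return 'critical';
    if (ratio >= this.degradedThreshold) return 'degraded';
    return 'available';
  }

  private record(success: boolean) {
    this.outcomes.push(success);
    if (this.outcomes.length > this.windowSize) {
      this.outcomes.shift();
    }
  }
}
```

Deciding when things are "fine again" then becomes a question of how quickly successful requests push the failure ratio back under the thresholds; a time-based window or a minimum sample size might behave better than a fixed request count.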
Of course, the major functional challenge would be in putting the cursor at the right level. We probably don't want to switch the service to `degraded` on a single timeout, but we probably don't want to wait until 100% of our requests time out before switching our status either. Same question about deciding when things are "fine again".

On the technical side, we also need to find out how we can properly listen to all requests to retrieve this information. Maybe this needs to be done at the `transport` level (see the sketch after the tech pointers).

Tech pointers:
https://github.com/elastic/kibana/blob/35083504464a4950d5ce5c6464cab387dd98d0d8/packages/core/elasticsearch/core-elasticsearch-server-internal/src/version_check/ensure_es_version.ts#L147-L172
https://github.com/elastic/kibana/blob/12466d8b17d8557ff0b561c346511bd1760da4c1/packages/core/elasticsearch/core-elasticsearch-server-internal/src/status.ts#L15-L71
https://github.com/elastic/kibana/blob/12466d8b17d8557ff0b561c346511bd1760da4c1/packages/core/elasticsearch/core-elasticsearch-client-server-internal/src/create_transport.ts#L26-L86
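If the `transport` level turns out to be the right place to hook in, a rough sketch of what it could look like, assuming we keep subclassing the client's `Transport` the way `create_transport.ts` already does (signature simplified; `tracker` is the hypothetical outcome tracker sketched above):

```ts
import { Transport } from '@elastic/elasticsearch';
import type { TransportRequestParams, TransportRequestOptions } from '@elastic/elasticsearch';

interface RequestOutcomeTracker {
  recordSuccess(): void;
  recordFailure(): void;
}

// Sketch only: report the outcome of every request made through the client to a tracker,
// so the elasticsearch service status can be derived from actual usage.
export const createStatusAwareTransport = (tracker: RequestOutcomeTracker) =>
  class StatusAwareTransport extends Transport {
    async request(params: TransportRequestParams, options?: TransportRequestOptions) {
      try {
        const result = await super.request(params, options);
        tracker.recordSuccess();
        return result;
      } catch (e) {
        // Network-level failures (timeouts, connection refused, DNS errors) are the
        // interesting signal; product-level errors (e.g. 404s) probably shouldn't count,
        // so a real implementation would need to filter on the error type here.
        tracker.recordFailure();
        throw e;
      }
    }
  };
```

Passing the tracker in rather than instantiating it inside the transport would let a single Kibana-wide tracker aggregate outcomes across all clients and scoped clients.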