elastic / kibana


False positives while monitoring nodes in Elasticsearch Service #101712

Open crisdarocha opened 3 years ago

crisdarocha commented 3 years ago

Describe the issue:

Stack monitoring includes a "Missing Monitoring Data" alert, which is a good indicator that a node is not responding.

This becomes a problem when one is running a cluster in Elasticsearch Service (ESS). Instances are replaceable in ESS. When an allocator is vacated, a new instance is created in another allocator, with another name, and the existing one is removed.

This triggers the Missing Monitoring Data alert, as the node is "missing".

Cloud instances are volatile, but monitoring treats them as persistent assets.

It would be very good to make monitoring aware of this "instance swap" for ESS deployments.

This probably lives in the interface between Monitoring and Cloud, but starting the discussion here. One option would be to have the orchestration tell monitoring that a node will be replaced.

CC: @jakommo @jalogisch

elasticmachine commented 3 years ago

Pinging @elastic/stack-monitoring (Team:Monitoring)

matschaffer commented 3 years ago

Or at least have a way to default this to off in cloud.

If I put my customer hat on I don't care too much if we have more or fewer nodes than expected at any given point as long as the data is flowing.

matschaffer commented 3 years ago

Alternatively if I did really want to know, I'd probably care most in cases where my node count was below expected for some unreasonable length of time.

Is there a concept of "expected topology"? If there were, we could probably have the orchestration products (ECE/ECK) push that sort of info to the monitoring cluster. Though I could see this making more sense at the orchestration layer rather than in stack monitoring.

jakommo commented 3 years ago

I don't think we have an "expected topology" concept, but I like the idea.

IMO the ask is to add some way (an API?) for the orchestration to inform stack monitoring that the next topology change / node replacement is intended and not a sign of an issue. So yes, the orchestration would make the call for that, but first we need an endpoint it could call.
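
To make the shape of that concrete, here is a minimal sketch of what the monitoring side of such an endpoint could keep, assuming the orchestration calls it just before vacating a node. Everything here (route path, names, TTL) is hypothetical, not an existing Kibana or stack monitoring API:

```typescript
// Hypothetical sketch of a time-boxed suppression registry behind such an
// endpoint. None of these names exist in Kibana today; the route path,
// types, and TTL are all illustrative assumptions.

interface PlannedReplacement {
  clusterUuid: string;
  nodeName: string; // the node being vacated
  expiresAt: number; // epoch ms; the suppression is deliberately time-boxed
}

const plannedReplacements = new Map<string, PlannedReplacement>();

// What the handler for a (hypothetical) endpoint like
// POST /api/monitoring/v1/planned_replacement could do:
export function registerPlannedReplacement(
  clusterUuid: string,
  nodeName: string,
  ttlMs = 30 * 60 * 1000 // assume a 30-minute replacement window
): void {
  plannedReplacements.set(`${clusterUuid}:${nodeName}`, {
    clusterUuid,
    nodeName,
    expiresAt: Date.now() + ttlMs,
  });
}

// The "Missing Monitoring Data" rule would consult this before firing.
export function isReplacementExpected(clusterUuid: string, nodeName: string): boolean {
  const entry = plannedReplacements.get(`${clusterUuid}:${nodeName}`);
  if (!entry) return false;
  if (Date.now() > entry.expiresAt) {
    // The window elapsed without the replacement completing: stop
    // suppressing so the alert fires as usual.
    plannedReplacements.delete(`${clusterUuid}:${nodeName}`);
    return false;
  }
  return true;
}
```

The time box is the important part: if the vacate never finishes, the suppression lapses and the alert still catches the genuinely missing node.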

crisdarocha commented 3 years ago

@matschaffer thanks for chiming in. One use of this would be when a node gets stuck and the orchestration doesn't pick up the problem and fix it. You are working with one node fewer and will only notice when your service starts suffering from the performance drop, which is sub-optimal. But the other side of this is avoiding the false positives.

The expected topology sounds very interesting. "Do I have all the nodes I'm supposed to have, regardless of their names?"
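
A name-agnostic check along these lines could be as simple as the sketch below: fire only when the observed node count has stayed below the expected count for longer than a grace period. It assumes `expectedNodeCount` is pushed by the orchestrator (ECE/ECK); the function and its state are illustrative, not an existing rule:

```typescript
// Hypothetical sketch of a name-agnostic "expected topology" check with a
// grace period. `expectedNodeCount` is assumed to come from the
// orchestrator; nothing here is an existing stack monitoring rule.

interface TopologyState {
  belowSince: number | null; // epoch ms when the count first dropped, or null
}

const state: TopologyState = { belowSince: null };

export function shouldAlertOnTopology(
  observedNodeCount: number,
  expectedNodeCount: number,
  graceMs = 10 * 60 * 1000, // tolerate 10 minutes of churn, e.g. a vacate
  now = Date.now()
): boolean {
  if (observedNodeCount >= expectedNodeCount) {
    state.belowSince = null; // topology satisfied; reset the timer
    return false;
  }
  if (state.belowSince === null) {
    state.belowSince = now; // shortfall just started; open the grace window
  }
  return now - state.belowSince > graceMs;
}
```

The grace period is what absorbs a routine instance swap, where the count dips by one for a few minutes, while a node that stays gone still alerts.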

matschaffer commented 3 years ago

ECK (today) and ECE (soon) should be doing a continual loop to confirm the cluster meets expected topology (cc @anyasabo), so if it's not I'd hope the orchestration system would pick up on a discrepancy soon enough that stack monitoring wouldn't necessarily have to.

But in either cloud, yes, I couldn't care less about the names. Just that I have at least the capacity I'm paying for.

jasonrhodes commented 3 years ago

Do we have any way of introspecting whether a node is ephemeral or persistent, so we could only alert on the non-existence of a persistent node?
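
One possibility (a hypothetical convention, not something that exists today) would be for the orchestrator to tag cloud instances with a custom node attribute, e.g. `node.attr.lifecycle: ephemeral` in elasticsearch.yml. Custom node attributes do surface in the nodes info API, so the rule could filter on them; the attribute name below is an assumption:

```typescript
// Sketch: skip nodes tagged as ephemeral when deciding what to alert on.
// The `lifecycle` attribute is a hypothetical convention, not a standard.
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

export async function persistentNodeNames(): Promise<string[]> {
  const info = await client.nodes.info();
  return Object.values(info.nodes)
    .filter((node) => node.attributes?.lifecycle !== 'ephemeral')
    .map((node) => node.name);
}
```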

anyasabo commented 3 years ago

@henningandersen given what we were discussing about potentially informing ES about the "desired state" of the cluster, I wonder how/if that would relate to Jason's question above? I don't think there's any way for stack monitoring to know "this is a node that we expect to be dead" vs. "yes, I would like to know if this node goes down" at the moment, though.

henningandersen commented 3 years ago

@anyasabo the desired nodes API should help here, in that stack monitoring could then compare the desired nodes to the actual nodes. In fact, part of the document describes a health check. I imagine including more health info in cluster health (or a new API) that could list missing nodes, which stack monitoring could rely upon.

The caveat is that stack monitoring also needs to work outside orchestrated environments, i.e., it will need separate logic for the two cases.
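
For the orchestrated case, a minimal sketch of that comparison, assuming the orchestrator has already published its topology via the desired nodes API (exposed under `_internal/desired_nodes` in recent Elasticsearch versions; response shapes may differ by version):

```typescript
// Sketch: compare the desired topology (published by ECE/ECK) with the
// actual nodes in the cluster. Field names are paraphrased from the
// desired nodes API and may not match every version exactly.
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

export async function missingNodeCount(): Promise<number> {
  // Latest desired topology, as the orchestrator last declared it.
  const desired = (await client.transport.request({
    method: 'GET',
    path: '/_internal/desired_nodes/_latest',
  })) as { nodes: unknown[] };

  // Nodes actually present in the cluster right now.
  const actual = await client.nodes.info();

  // Name-agnostic first pass: only the counts matter.
  return Math.max(0, desired.nodes.length - Object.keys(actual.nodes).length);
}
```

Outside orchestrated environments there is no desired topology to fetch, which is exactly the separate-logic caveat above.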