Open crisdarocha opened 3 years ago
Pinging @elastic/stack-monitoring (Team:Monitoring)
Or at least have a way to default this to off in cloud.
If I put my customer hat on I don't care too much if we have more or fewer nodes than expected at any given point as long as the data is flowing.
Alternatively if I did really want to know, I'd probably care most in cases where my node count was below expected for some unreasonable length of time.
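To make the "below expected for some unreasonable length of time" idea concrete, here is a minimal sketch of that check. Everything in it is an assumption for illustration: `NodeCountWatcher`, `expected`, and `grace_seconds` are hypothetical names, not part of stack monitoring.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NodeCountWatcher:
    expected: int          # hypothetical: node count the deployment should have
    grace_seconds: float   # hypothetical: how long a shortfall is tolerated
    below_since: Optional[float] = None  # when the current shortfall started

    def observe(self, actual: int, now: float) -> bool:
        """Record one observation; return True if the alert should fire."""
        if actual >= self.expected:
            # Extra nodes (e.g. during an instance swap) are not a problem.
            self.below_since = None
            return False
        if self.below_since is None:
            self.below_since = now
        return now - self.below_since >= self.grace_seconds


watcher = NodeCountWatcher(expected=3, grace_seconds=300)
results = [
    watcher.observe(3, now=0),    # healthy
    watcher.observe(4, now=30),   # swap in progress, one extra node
    watcher.observe(2, now=60),   # shortfall starts
    watcher.observe(2, now=360),  # shortfall has lasted 300 seconds
]
```

A short dip during a node replacement never reaches the grace period, so it would not alert; only a sustained shortfall would.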
Is there a concept of "expected topology"? If there were, we could probably have the orchestration products (ECE/ECK) push that sort of info to the monitoring cluster. Though I could see this maybe making more sense at the orchestration layer rather than in stack monitoring.
I don't think we have an "expected topology" concept, but I like the idea.
IMO the ask is to add some way (API?) to stack monitoring that would allow the orchestration to inform stack monitoring that the next topology change / node replacement is intended and not indicating an issue. So yes, the orchestration would make the call for that, but first we need an endpoint it could call.
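As a rough sketch of what such an endpoint could do behind the scenes (no such API exists today; `PlannedChanges`, `announce`, and `should_alert` are hypothetical names), the orchestration would announce a node it is about to replace, and the alert would be suppressed for that node until the announcement expires:

```python
from typing import Dict


class PlannedChanges:
    """Hypothetical registry of node replacements announced by the orchestrator."""

    def __init__(self) -> None:
        self._expires: Dict[str, float] = {}

    def announce(self, node_name: str, now: float, ttl_seconds: float) -> None:
        # Called by the orchestration before it removes or replaces a node.
        self._expires[node_name] = now + ttl_seconds

    def should_alert(self, missing_node: str, now: float) -> bool:
        # Alert unless the node's absence was announced and is still in its window.
        expiry = self._expires.get(missing_node)
        return expiry is None or now >= expiry


planned = PlannedChanges()
planned.announce("instance-0000000001", now=0, ttl_seconds=600)
```

The expiry keeps an announced change from hiding a node that never comes back: once the window passes, a still-missing node alerts as usual.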
@matschaffer thanks for chiming in. One use of this would be when a node gets stuck and the orchestration doesn't pick the problem up and fix it. You are then working with one node fewer and will only notice when your service starts suffering from the performance drop, which is sub-optimal. But the other side is to avoid the false positives.
The expected topology sounds very interesting. "Do I have all the nodes I'm supposed to have, regardless of their names?"
ECK (today) and ECE (soon) should be doing a continual loop to confirm the cluster meets expected topology (cc @anyasabo), so if it's not I'd hope the orchestration system would pick up on a discrepancy soon enough that stack monitoring wouldn't necessarily have to.
But in either cloud, yes, I couldn't care less about the names. Just that I have at least the capacity I'm paying for.
Do we have any way of introspecting whether a node is ephemeral or persistent, so we could only alert on the non-existence of a persistent node?
@henningandersen given what we were discussing about potentially informing ES about the "desired state" of the cluster, I wonder how/if that would relate to Jason's question above? I don't think there's any way for stack monitoring to know "this is a node that we expect to be dead" vs "yes, I would like to know if this node goes down" at the moment, though.
@anyasabo the desired nodes API should help here, in that stack monitoring could then compare the desired nodes to the actual nodes. In fact, part of the document describes a health check. I imagine adding more health info to cluster health, or to a new API, which could then list missing nodes for stack monitoring to rely upon.
The caveat is that stack monitoring also needs to work outside orchestrated environments, i.e., it will need separate logic for the two cases.
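For the orchestrated case, the desired-versus-actual comparison could look roughly like the sketch below. It is not the desired nodes API: the data shapes (a role-to-count mapping for the desired state, `(name, role)` pairs for live nodes) and the function name `missing_capacity` are assumptions made for illustration. Nodes are counted per role, so instance names do not matter.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

Node = Tuple[str, str]  # hypothetical shape: (node name, role or tier)


def missing_capacity(desired: Dict[str, int], actual: Iterable[Node]) -> Dict[str, int]:
    """Return how many nodes are missing per role; empty dict means nothing is missing."""
    have = Counter(role for _name, role in actual)
    return {
        role: want - have[role]
        for role, want in desired.items()
        if have[role] < want
    }


desired = {"hot": 3, "master": 3}

# After an instance swap: one hot node has a new name, but the count is intact.
after_swap = [
    ("instance-0000000000", "hot"),
    ("instance-0000000001", "hot"),
    ("instance-0000000007", "hot"),
    ("tiebreaker-0", "master"),
    ("tiebreaker-1", "master"),
    ("tiebreaker-2", "master"),
]

# A hot node is gone and was not replaced.
degraded = after_swap[1:]
```

Here `missing_capacity(desired, after_swap)` is empty, so a renamed replacement raises nothing, while `missing_capacity(desired, degraded)` reports one missing `hot` node.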
Describe the issue:
Stack monitoring includes a "Missing Monitoring Data" alert, which is a good indicator that a node is not responding.
This becomes a problem when one is running a cluster in Elasticsearch Service (ESS). Instances are replaceable in ESS. When an allocator is vacated, a new instance is created in another allocator, with another name, and the existing one is removed.
This triggers the Missing Monitoring Data alert, as the node is "missing".
Cloud instances are volatile, but monitoring treats them as persistent assets.
It would be very good to make monitoring aware of this "instance swap" for ESS deployments.
This probably lives in the interface between Monitoring and Cloud, but starting the discussion here. One option would be to have the orchestration tell monitoring that a node will be replaced.
CC: @jakommo @jalogisch