Open neptunian opened 2 years ago
Pinging @elastic/infra-monitoring-ui (Team:Infra Monitoring UI)
I like the idea of two separate rules: one focused on the whole cluster and one focused on nodes.
For both of these rules, I feel we will require some concept of alerting only when there was data before and the data has now been missing for a little while. We should gracefully handle the scenario where nodes are taken out of the cluster, something that happens all the time in the field over the lifetime of an Elasticsearch cluster. When customers are working with an Elasticsearch cluster of 40/60/100 nodes, we need to think about how often we should generate the alert so that customers don't get over-warned. A cluster-level rule/alert provides value, but my take is it provides only very limited value given that our architecture requires Metricbeat/Agent running on each node of the cluster.
I would recommend we create a separate missing-data rule for every entity in the system: Kibana, Metricbeat, Filebeat, APM Server, Nodes, and Clusters. As a customer, I would expect to be notified when any of these disappear from the cluster. As for the rule evaluation, we should use an Elasticsearch query to push the missing-entity detection down to Elasticsearch.
The following example is for detecting nodes when they drop out of the cluster or stop reporting. The idea is to query Elasticsearch using a range filter that spans across the last rule execution and the current rule execution. To determine if a node has gone missing or is new/recovered, we need to create two buckets using a `filter` aggregation that represent `lastPeriod` and `currentPeriod` (using a `range` filter); this will give us a document count for each period.

Once we have the document count for each period, we can use a `bucket_script`, named `isNodeMissing`, to evaluate if the node is missing by checking if the document count for the `lastPeriod` is greater than `0` and the `currentPeriod` is less than `1`. To determine if a node is recovered or new, we can use a second `bucket_script`, named `isNodeRecoveredOrNew`, to see if the `lastPeriod` is less than `1` and the `currentPeriod` is greater than `0`. For each of these bucket scripts, we will return either `1` or `0` since a bucket script cannot return a boolean.
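Stripped of the aggregation plumbing, the two bucket scripts reduce to a pair of pure functions. This is only a sketch to show the logic; these helper names are hypothetical, not actual Kibana code:

```typescript
// Mirrors the painless bucket_script logic: bucket scripts must return
// numbers, so we return 1/0 instead of booleans.
function isNodeMissing(lastPeriodCount: number, currentPeriodCount: number): number {
  // Data existed before but none in the current window => missing.
  return lastPeriodCount > 0 && currentPeriodCount < 1 ? 1 : 0;
}

function isNodeRecoveredOrNew(lastPeriodCount: number, currentPeriodCount: number): number {
  // No data before but data now => node is new or has recovered.
  return lastPeriodCount < 1 && currentPeriodCount > 0 ? 1 : 0;
}
```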
With `isNodeMissing` and `isNodeRecoveredOrNew`, we can use a `bucket_selector` to only return the nodes where `isNodeMissing > 0` or `isNodeRecoveredOrNew > 0`. In Kibana, we will need to keep track of only the nodes where `isNodeMissing === 1` in the rule state. If a node recovers, `isNodeRecoveredOrNew === 1`, we need to delete the node from the rule state. Finally, for every node we are tracking in the rule state, from past executions and the current one, we need to trigger a "NO DATA" alert every time the rule executes.
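The bookkeeping described above could be sketched roughly as follows. All names here are hypothetical illustrations, not the actual Kibana alerting framework API:

```typescript
// Shape of one composite bucket after the bucket_selector has filtered
// out unchanged nodes.
interface NodeBucket {
  clusterUuid: string;
  nodeId: string;
  isNodeMissing: number;        // 1 or 0, from the bucket_script
  isNodeRecoveredOrNew: number; // 1 or 0, from the bucket_script
}

// ruleState holds "clusterUuid/nodeId" keys for nodes currently missing.
// Returns the keys that should fire a "NO DATA" alert this execution.
function updateRuleState(ruleState: Set<string>, buckets: NodeBucket[]): string[] {
  for (const bucket of buckets) {
    const key = `${bucket.clusterUuid}/${bucket.nodeId}`;
    if (bucket.isNodeMissing === 1) {
      ruleState.add(key);    // start tracking the missing node
    } else if (bucket.isNodeRecoveredOrNew === 1) {
      ruleState.delete(key); // node came back (or is new); stop tracking
    }
  }
  // Every execution alerts for every node still tracked, including
  // nodes that went missing in past executions.
  return [...ruleState];
}
```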
Along with the missing nodes, we also need to track the timestamp of the previous execution so we can use it to create the range query that covers both periods. For most of the monitoring data, looking at a 5 minute window for each period should be sufficient. This means we would actually query for approximately 10 minutes of data, from the start of the last execution to the end of the current one. In a perfect world, we could simply create two equal-sized buckets, but unfortunately the Kibana Alerting system has some drift, which is why we need to use the timestamp of the last execution rather than assuming it never drifts.
In the example query below, I'm just using a 10 minute time range with two equal 5 minute periods, but in the final implementation the `lastPeriod` bucket should use the last execution time minus the window size, `5m`; the range query should span from the last execution timestamp, minus the window, to now.
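Deriving those bounds from the previous execution timestamp could look like the sketch below, assuming timestamps are epoch milliseconds; the function and type names are hypothetical:

```typescript
const WINDOW_MS = 5 * 60 * 1000; // 5m window per period

interface Range {
  gte: number;
  lte: number;
}

interface PeriodRanges {
  query: Range;         // overall range filter on the search
  lastPeriod: Range;    // bounds for the lastPeriod filter aggregation
  currentPeriod: Range; // bounds for the currentPeriod filter aggregation
}

// Anchoring on the last execution timestamp absorbs scheduler drift:
// however late the current run fires, the two periods stay contiguous.
function buildRanges(lastExecutionMs: number, nowMs: number): PeriodRanges {
  return {
    query: { gte: lastExecutionMs - WINDOW_MS, lte: nowMs },
    lastPeriod: { gte: lastExecutionMs - WINDOW_MS, lte: lastExecutionMs },
    currentPeriod: { gte: lastExecutionMs, lte: nowMs },
  };
}
```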
```
POST .monitoring-es-*/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {
          "range": {
            "timestamp": {
              "gte": "now-10m",
              "lte": "now"
            }
          }
        },
        {
          "term": {
            "type": "node_stats"
          }
        }
      ]
    }
  },
  "aggs": {
    "nodes": {
      "composite": {
        "size": 10000,
        "sources": [
          {
            "cluster": {
              "terms": {
                "field": "cluster_uuid"
              }
            }
          },
          {
            "node": {
              "terms": {
                "field": "node_stats.node_id"
              }
            }
          }
        ]
      },
      "aggs": {
        "lastPeriod": {
          "filter": {
            "range": {
              "timestamp": {
                "gte": "now-10m",
                "lte": "now-5m"
              }
            }
          }
        },
        "currentPeriod": {
          "filter": {
            "range": {
              "timestamp": {
                "gte": "now-5m",
                "lte": "now"
              }
            }
          }
        },
        "isNodeMissing": {
          "bucket_script": {
            "buckets_path": {
              "lastPeriod": "lastPeriod>_count",
              "currentPeriod": "currentPeriod>_count"
            },
            "script": "params.lastPeriod > 0 && params.currentPeriod < 1 ? 1 : 0"
          }
        },
        "isNodeRecoveredOrNew": {
          "bucket_script": {
            "buckets_path": {
              "lastPeriod": "lastPeriod>_count",
              "currentPeriod": "currentPeriod>_count"
            },
            "script": "params.lastPeriod < 1 && params.currentPeriod > 0 ? 1 : 0"
          }
        },
        "evaluation": {
          "bucket_selector": {
            "buckets_path": {
              "isNodeMissing": "isNodeMissing",
              "isNodeRecoveredOrNew": "isNodeRecoveredOrNew"
            },
            "script": "params.isNodeMissing > 0 || params.isNodeRecoveredOrNew > 0"
          }
        }
      }
    }
  }
}
```
This should simplify the Kibana code to just a few parts:
This will also improve the performance of these rules because we only need to query approximately 10 minutes of data instead of looking back 24 hours on every run. It also eliminates the bug where missing nodes appear to recover after 24 hours because they no longer show up in the query.
I'm wondering how these kinds of rules intersect with the planned Health and Topology APIs?
After investigating the slow performance of this rule when created with the default value of looking back 1 day, we found it has some shortcomings. The way this rule works is we query for all data in the range of `now - lookback`. For each cluster and each node, we subtract the last document's timestamp from `now`, and if that value is greater than `duration` then we fire an alert. `duration` and `lookback` are configurable by the user, and when we create an OOTB rule of this type for the user we set the defaults below.

When it alerts, it specifies which node has the issue. The problem with this approach is that once the time range has passed and the data no longer exists, it will no longer report missing data for a node. Some changes we could make:
- `lookback` option if we can track the groups
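For reference, the old per-node evaluation described above amounts to a single comparison; this is a sketch with hypothetical names, assuming epoch-millisecond timestamps:

```typescript
// A node alerts when its most recent document within the lookback window
// is older than the user-configured duration.
function shouldAlert(nowMs: number, lastDocTimestampMs: number, durationMs: number): boolean {
  return nowMs - lastDocTimestampMs > durationMs;
}
```

The shortcoming follows directly: once a node's last document falls outside `lookback`, the node no longer appears in the query results at all, so there is no `lastDocTimestampMs` to compare against and the rule silently stops reporting it.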