elastic / kibana

Your window into the Elastic Stack
https://www.elastic.co/products/kibana

[Stack Monitoring] Improve Missing Monitoring Data rule #126709

Open neptunian opened 2 years ago

neptunian commented 2 years ago

After investigating the slow performance of this rule when created with the default value of looking back 1 day, we found that the rule has some shortcomings. The way this rule works is we query for all data in the range of `now - lookback`. For each cluster, and for each node, we subtract the last document's timestamp from `now`, and if that value is greater than `duration` we fire an alert. Both `duration` and `lookback` are configurable by the user, and when we create an OOTB rule of this type for the user we set the defaults below:

[Screenshot: default rule values]
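To make the shortcoming concrete, here is a minimal sketch of the current evaluation logic as described above. This is illustrative, not the actual Kibana source; the `NodeDoc` shape and `evaluateMissingData` name are assumptions.

```typescript
// Illustrative sketch of the existing rule's evaluation (not the real Kibana code).
interface NodeDoc {
  clusterUuid: string;
  nodeId: string;
  timestamp: number; // epoch ms of the node's most recent document in the lookback window
}

// For each node found in the `now - lookback` query, fire if the gap since
// its last document exceeds `durationMs`. Nodes with no documents at all
// inside the lookback window never appear in the query results, so they
// silently stop alerting -- the shortcoming described above.
function evaluateMissingData(
  latestDocs: NodeDoc[],
  now: number,
  durationMs: number
): NodeDoc[] {
  return latestDocs.filter((doc) => now - doc.timestamp > durationMs);
}
```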

When it alerts, it specifies which node has the issue. The problem with this approach is that once the time range has passed and the data no longer exists, the rule will no longer report missing data on a node. Some changes we could make:

elasticmachine commented 2 years ago

Pinging @elastic/infra-monitoring-ui (Team:Infra Monitoring UI)

ravikesarwani commented 2 years ago

I like the idea of 2 separate rules: one focused on the whole cluster and one focused on the node.

For both of these rules, I feel we will need some concept of alerting only when data existed before and has now been missing for a little while. We should gracefully handle the scenario where nodes are taken out of the cluster, which happens all the time in the field over the lifetime of an Elasticsearch cluster. When customers are running an Elasticsearch cluster of 40/60/100 nodes, we need to think about how often we should generate the alert so that customers don't get overwhelmed. A cluster-level rule/alert provides value, but my take is that it provides only very limited value when our architecture requires Metricbeat/Agent running on each node of the cluster.

simianhacker commented 2 years ago

I would recommend we create a separate missing-data rule for every entity in the system: Kibana, Metricbeat, Filebeat, APM Server, Nodes, and Clusters. As a customer, I would expect to be notified when any of these disappear from the cluster. As for the rule evaluation, we should use an Elasticsearch query to push the missing-entity detection down to Elasticsearch.

The following example detects nodes that drop out of the cluster or stop reporting. The idea is to query Elasticsearch using a range filter that spans the last rule execution and the current rule execution. To determine whether a node has gone missing or is new/recovered, we create two buckets using filter aggregations that represent lastPeriod and currentPeriod (each using a range filter); this gives us a document count for each period.

Once we have the document count for each period, we can use a bucket_script, named isNodeMissing, to evaluate whether the node is missing by checking if the document count for the lastPeriod is greater than 0 and the currentPeriod is less than 1. To determine whether a node is recovered or new, we can use a second bucket_script, named isNodeRecoveredOrNew, to check if the lastPeriod is less than 1 and the currentPeriod is greater than 0. Each of these bucket scripts returns either 1 or 0, since a bucket script cannot return a boolean.

With isNodeMissing and isNodeRecoveredOrNew, we can use a bucket_selector to only return the nodes where isNodeMissing > 0 or isNodeRecoveredOrNew > 0. In Kibana, we will need to keep track of only the nodes where isNodeMissing === 1 in the rule state. If a node recovers, isNodeRecoveredOrNew === 1, we need to delete the node from the rule state. Finally, for every node we are tracking in the rule state, from past executions and the current one, we need to trigger a "NO DATA" alert every time the rule executes.
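The rule-state bookkeeping described above could be sketched like this. The `EvaluationBucket` and `MissingNodeState` shapes are assumptions for illustration, not the real Kibana rule-state types.

```typescript
// Sketch of the rule-state bookkeeping; all type names here are illustrative.
interface NodeKey {
  cluster: string;
  node: string;
}

interface EvaluationBucket {
  key: NodeKey;
  isNodeMissing: number; // 1 or 0, from the bucket_script
  isNodeRecoveredOrNew: number; // 1 or 0, from the bucket_script
}

// Tracked missing nodes, keyed by `${cluster}:${node}`.
type MissingNodeState = Record<string, NodeKey>;

function updateMissingNodes(
  state: MissingNodeState,
  buckets: EvaluationBucket[]
): MissingNodeState {
  const next = { ...state };
  for (const b of buckets) {
    const id = `${b.key.cluster}:${b.key.node}`;
    if (b.isNodeMissing === 1) {
      next[id] = b.key; // start (or keep) tracking this node
    } else if (b.isNodeRecoveredOrNew === 1) {
      delete next[id]; // node is back (or brand new): stop tracking it
    }
  }
  return next;
}
```

Every node still present in the returned state would then get a "NO DATA" alert on this execution.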

Along with the missing nodes, we also need to track the timestamp of the previous execution so we can use it to create the range query that covers both periods. For most of the monitoring data, a 5 minute window for each period should be sufficient. This means we would actually query for approximately 10 minutes of data, from the start of the last execution to the end of the current one. In a perfect world, we could simply create 2 equal-sized buckets, but unfortunately the Kibana Alerting system has some drift, which is why we need to use the timestamp of the last execution rather than assuming it never drifts.
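Anchoring the periods on the previous execution timestamp can be sketched as below; `computePeriods` and the `Periods` shape are hypothetical names for illustration.

```typescript
// Sketch of computing the two query periods from the last execution time.
interface Periods {
  lastPeriod: { gte: number; lte: number }; // epoch ms
  currentPeriod: { gte: number; lte: number };
}

// `windowMs` is the per-period window (e.g. 5 minutes). The lastPeriod is
// anchored on the previous execution timestamp rather than `now - 2 * window`,
// so scheduler drift never opens a gap between the two ranges: the
// currentPeriod always starts exactly where the lastPeriod ends.
function computePeriods(
  lastExecution: number,
  now: number,
  windowMs: number
): Periods {
  return {
    lastPeriod: { gte: lastExecution - windowMs, lte: lastExecution },
    currentPeriod: { gte: lastExecution, lte: now },
  };
}
```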

In the example query below, I'm just using a 10 minute time range with two equal 5 minute periods, but in the final implementation the lastPeriod bucket should use the last execution time minus the window size (5m), and the range query should span from the last execution timestamp, minus the window, to now.

POST .monitoring-es-*/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {
          "range": {
            "timestamp": {
              "gte": "now-10m",
              "lte": "now"
            }
          }
        },
        {
          "term": {
            "type": "node_stats"
          }
        }
      ]
    }
  },
  "aggs": {
    "nodes": {
      "composite": {
        "size": 10000,
        "sources": [
          {
            "cluster": {
              "terms": {
                "field": "cluster_uuid"
              }
            }
          },
          {
            "node": {
              "terms": {
                "field": "node_stats.node_id"
              }
            }
          }
        ]
      },
      "aggs": {
        "lastPeriod": {
          "filter": {
            "range": {
              "timestamp": {
                "gte": "now-10m",
                "lte": "now-5m"
              }
            }
          }
        },
        "currentPeriod": {
          "filter": {
            "range": {
              "timestamp": {
                "gte": "now-5m",
                "lte": "now"
              }
            }
          }
        },
        "isNodeMissing": {
          "bucket_script": {
            "buckets_path": {
              "lastPeriod": "lastPeriod>_count",
              "currentPeriod": "currentPeriod>_count"
            },
            "script": "params.lastPeriod > 0 && params.currentPeriod < 1 ? 1 : 0"
          }
        },
        "isNodeRecoveredOrNew": {
          "bucket_script": {
            "buckets_path": {
              "lastPeriod": "lastPeriod>_count",
              "currentPeriod": "currentPeriod>_count"
            },
            "script": "params.lastPeriod < 1 && params.currentPeriod > 0 ? 1 : 0"
          }
        },
        "evaluation": {
          "bucket_selector": {
            "buckets_path": {
              "isNodeMissing": "isNodeMissing",
              "isNodeRecoveredOrNew": "isNodeRecoveredOrNew"
            },
            "script": "params.isNodeMissing > 0 || params.isNodeRecoveredOrNew > 0"
          }
        }
      }
    }
  }
}
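On the Kibana side, walking the composite aggregation response from a query like the one above might look as follows. The bucket shape mirrors Elasticsearch's composite aggregation response format, but the types and `partitionNodes` helper are illustrative assumptions; a real implementation would also need to page through results with `after_key` when there are more than `size` buckets.

```typescript
// Sketch of partitioning composite agg buckets into missing vs recovered/new
// nodes; type names are illustrative.
interface CompositeBucket {
  key: { cluster: string; node: string };
  isNodeMissing: { value: number }; // bucket_script results arrive as { value }
  isNodeRecoveredOrNew: { value: number };
}

function partitionNodes(buckets: CompositeBucket[]): {
  missing: string[];
  recoveredOrNew: string[];
} {
  const missing: string[] = [];
  const recoveredOrNew: string[] = [];
  for (const b of buckets) {
    const id = `${b.key.cluster}:${b.key.node}`;
    if (b.isNodeMissing.value === 1) missing.push(id);
    else if (b.isNodeRecoveredOrNew.value === 1) recoveredOrNew.push(id);
  }
  return { missing, recoveredOrNew };
}
```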

This should simplify the Kibana code to just a few parts:

This will also improve the performance of these rules because we only need to query approximately 10 minutes of data instead of looking back 24 hours on every run. It also eliminates the bug where, after 24 hours, missing nodes appear to recover because they no longer show up in the query.

miltonhultgren commented 2 years ago

I'm wondering how these kinds of rules intersect with the planned Health and Topology APIs?