[Alerting] Phase 2 of AoA: What metrics answer the first set of questions

Meta: https://github.com/elastic/kibana/issues/120542

Based on the questions from phase 1, determine which metrics need to be collected and how. For example, if the questions above require us to collect data for each individual rule, we need to ensure our indexing strategy will allow us to store data in a fashion that makes it easy to build alerts and charts later (such as needing or not needing to use nested fields).

How many rules are pending to run?

We can query the task manager index to detect "delayed" tasks. A task counts as delayed when either runAt() < now and status = Idle, or retryAt() < now and (status = Running || status = Claimed).
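As a rough illustration of the "delayed" condition, here is a minimal in-memory sketch. The `TaskDoc` shape, the lowercase status values, and the `alerting:` / `actions:` task type prefixes are assumptions made for this example, not the actual task manager mapping; a real implementation would express the same condition as a query against the task manager index.

```typescript
// Hypothetical, simplified task document. Field names are assumptions
// for illustration and may not match the real task manager mapping.
type TaskStatus = "idle" | "claimed" | "running" | "failed";

interface TaskDoc {
  id: string;
  taskType: string; // assumed to be prefixed, e.g. "alerting:..." or "actions:..."
  status: TaskStatus;
  runAt: Date;
  retryAt: Date | null;
}

// A task is delayed when it is idle and past its runAt, or when it is
// running/claimed and past its retryAt.
function isDelayed(task: TaskDoc, now: Date): boolean {
  const nowMs = now.getTime();
  const overdueIdle = task.status === "idle" && task.runAt.getTime() < nowMs;
  const overdueRetry =
    (task.status === "running" || task.status === "claimed") &&
    task.retryAt !== null &&
    task.retryAt.getTime() < nowMs;
  return overdueIdle || overdueRetry;
}

// Count delayed tasks whose type starts with the given prefix.
function countDelayed(tasks: TaskDoc[], now: Date, typePrefix: string): number {
  return tasks.filter(
    (task) => task.taskType.startsWith(typePrefix) && isDelayed(task, now)
  ).length;
}
```

Under these assumptions, `countDelayed(tasks, new Date(), "alerting:")` would give the number of pending rules, and the same call with a different prefix would cover other task types.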

How many actions are pending to run?

Same as above, but restricted to action tasks.

How long are rules waiting to run?

From the data above, calculate the p50 and p99 percentiles of how long overdue the delayed rules are.
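To make the percentile step concrete, the sketch below computes how overdue a task is and derives p50 and p99 with the nearest-rank method. The helper names `overdueMs` and `percentile` are hypothetical, and nearest-rank is only one possible definition; if the values were computed inside Elasticsearch with a percentiles aggregation, the results would be approximate and could differ slightly.

```typescript
// How long past its scheduled time a task is, in milliseconds.
// Tasks that are not yet due report 0.
function overdueMs(scheduledAt: Date, now: Date): number {
  return Math.max(0, now.getTime() - scheduledAt.getTime());
}

// Nearest-rank percentile: the smallest value such that at least p percent
// of the samples are less than or equal to it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) {
    throw new Error("percentile of an empty sample is undefined");
  }
  if (p <= 0 || p > 100) {
    throw new Error("p must be in the range (0, 100]");
  }
  const sorted = [...values].sort((a, b) => a - b);
  // Multiply before dividing so whole-number inputs stay exact.
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[rank - 1];
}
```

For example, given an array `delays` of `overdueMs` results for all delayed rules, `percentile(delays, 50)` and `percentile(delays, 99)` would yield the two values to report.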

How long are actions waiting to run?

Same as above, but for actions.

Did an alert fail to execute? When?

We will add in-memory counters that report total and failed executions across all rule types for each Kibana instance. Because each monitoring document contains the unique Kibana uuid of the instance that wrote it, we can sum the counters across instances when visualizing.
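A minimal sketch of how such counters and the cross-instance sum might look is shown below. The `ExecutionCounters` class, the `MonitoringDoc` shape, and its field names are hypothetical. The sketch also assumes the counters are cumulative and that documents arrive ordered oldest to newest, so only the latest document per Kibana uuid is counted.

```typescript
// Hypothetical shape of the per-instance monitoring document.
interface MonitoringDoc {
  kibanaUuid: string;
  executions: number;
  failures: number;
}

// In-memory counters held by a single Kibana instance.
class ExecutionCounters {
  private executions = 0;
  private failures = 0;

  // Record the outcome of one rule execution, regardless of rule type.
  record(succeeded: boolean): void {
    this.executions += 1;
    if (!succeeded) {
      this.failures += 1;
    }
  }

  // Produce the document this instance would report.
  snapshot(kibanaUuid: string): MonitoringDoc {
    return { kibanaUuid, executions: this.executions, failures: this.failures };
  }
}

// Sum counters across instances, keeping only the last document seen per
// uuid (assumes cumulative counters and oldest-to-newest ordering).
function sumAcrossInstances(docs: MonitoringDoc[]): { executions: number; failures: number } {
  const latestByUuid = new Map<string, MonitoringDoc>();
  for (const doc of docs) {
    latestByUuid.set(doc.kibanaUuid, doc);
  }
  let executions = 0;
  let failures = 0;
  for (const doc of latestByUuid.values()) {
    executions += doc.executions;
    failures += doc.failures;
  }
  return { executions, failures };
}
```

Since these counters live in memory, they would reset when a Kibana instance restarts; a visualization built on them would need to account for that.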
