[Alerting][Event Log] Consider adding `uuid` to active alert spans

elastic / kibana


[Alerting][Event Log] Consider adding `uuid` to active alert spans #101749

Closed ymao1 closed 1 week ago

ymao1 commented 3 years ago

For this issue, we added start/duration/end times to the *-instance actions in the event log and considered adding a uuid to identify unique active spans for an alert. We decided to hold off after reviewing what SIEM and RAC were doing for this and how they are using event.id.

Currently, the lifecycle rule type in the rule registry is doing something similar but storing it in the kibana.rac.alert.uuid field. SIEM is using event.id to store the original source document id when a source document is copied into the signals index. When the signal generated is an aggregate over multiple source documents, the event.id field is not populated.

Given these other usages, do we want to add a uuid field to identify active alert spans? If we do, should we use the event.id field to store it? Or consolidate it with a RAC field?

elasticmachine commented 3 years ago

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

pmuellr commented 3 years ago

I'm hesitant to use event.id for this, since I don't know its purpose, and it seems fairly "global". I was thinking of something in rule. or kibana.alerting; a field in rule would be best, if we can agree on one - but maybe there's no good fit there.

Currently it feels like alerting should be creating the UUID for the new "span" of alerts, and then making it available to the rule registry somehow, for its uses. Not quite sure yet how we'll thread the value through, but you can see the place the changes would go for RAC, around the following code. This is where the rule executor is actually invoked, and that code will be calling scheduleActions() - the alert UUIDs should have been generated by the time the executor has returned, and be made available to the rule registry framework.

https://github.com/elastic/kibana/blob/b58054cf2621b4bd11a2e1c4317d5df926939aca/x-pack/plugins/rule_registry/server/utils/create_lifecycle_executor.ts#L162-L175

pmuellr commented 2 years ago

Taking another peek at this. Looks like RAC creates the UUIDs for lifecycle alerts, here:

https://github.com/elastic/kibana/blob/b58054cf2621b4bd11a2e1c4317d5df926939aca/x-pack/plugins/rule_registry/server/utils/create_lifecycle_executor.ts#L259-L262

It appears the UUIDs are created after running the executor, so I think we can create/manage the UUIDs when scheduleActions() is run (we'd need to deal with unscheduleActions() and any other mutators), and then return that data from a new method on AlertServices, which could be called from the RAC wrapper. For example, something like:

interface AlertServices {
  ...
  getInstances(): Map<string, string> // key: existing alert instance ID; value: new alert instance UUID
}
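A rough sketch of how that bookkeeping could look, assuming a standalone tracker rather than the real AlertServices. The class name `AlertInstanceTracker` is hypothetical, and `scheduleActions()` / `unscheduleActions()` are simplified here to take only the instance ID; only `getInstances()` mirrors the interface proposed above.

```typescript
import { randomUUID } from 'crypto';

// Hypothetical tracker, NOT the actual Kibana AlertServices implementation.
class AlertInstanceTracker {
  // key: existing alert instance ID; value: alert instance UUID
  private readonly scheduled = new Map<string, string>();

  // Simplified: the real scheduleActions() also takes an action group and context.
  scheduleActions(instanceId: string): void {
    // Only mint a UUID the first time an instance is scheduled in this run.
    if (!this.scheduled.has(instanceId)) {
      this.scheduled.set(instanceId, randomUUID());
    }
  }

  // A mutator that has to be handled so stale UUIDs are not reported.
  unscheduleActions(instanceId: string): void {
    this.scheduled.delete(instanceId);
  }

  // Returns a copy so callers (e.g. a RAC-style wrapper) cannot mutate internal state.
  getInstances(): Map<string, string> {
    return new Map(this.scheduled);
  }
}
```

The wrapper would call `getInstances()` once the executor has returned, at which point every scheduled instance has a UUID and every unscheduled one has been dropped.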

pmuellr commented 2 years ago

Happened to remember we had a similar issue open a while back: https://github.com/elastic/kibana/issues/64268

For that one, we realized that some rule types were already using UUIDs as their instance ids, so we thought we should add a new "human readable" value to associate with an instance. I think that ship has sailed at this point, since we now have an "official" UUID - we should continue to aim to make the alert instance ids as human readable as possible. But we may need to revisit that over time; perhaps adding an explicit "description" to these alert instances would make sense later.

gmmorris commented 2 years ago

It's worth noting that without this there is actually no way of using the span as part of a dedup key in connectors such as PagerDuty.

This means that a customer can't set up actions on a rule so that they get a new incident whenever a specific alert ID reappears (so, for instance, get a new incident whenever the CPU exceeds 90% on Host #1, rather than reopen the incident from the last time it exceeded 90%).

This feels like a relatively basic missing feature. What do you think @arisonl & @mikecote ?

mikecote commented 2 years ago

I agree, allowing access to some span ID would make it possible to mimic alerts as data on an external system and create new incidents whenever an alert comes back.

@arisonl should this even become the default dedup key? Instead of {ruleId}:{alertId}, it would become {ruleId}:{spanId}?
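To make the difference concrete, here is a minimal sketch of the two dedup key shapes. The `DedupContext` type and `dedupKey` function are hypothetical and are not part of any Kibana connector API; they only illustrate the {ruleId}:{alertId} versus {ruleId}:{spanId} choice.

```typescript
// Hypothetical context shape, for illustration only.
interface DedupContext {
  ruleId: string;
  alertId: string; // stable, human-readable instance ID, e.g. "host-1"
  spanId?: string; // unique per active span, if one is available
}

// Builds a dedup key; prefers the span ID when requested and present.
function dedupKey(ctx: DedupContext, perSpan: boolean): string {
  if (perSpan && ctx.spanId !== undefined) {
    return `${ctx.ruleId}:${ctx.spanId}`;
  }
  return `${ctx.ruleId}:${ctx.alertId}`;
}
```

With the per-alert key, a second CPU spike on Host #1 maps to the same key and reopens the earlier incident; with the per-span key, each spike gets its own key and therefore its own incident.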

pmuellr commented 1 week ago

We have since added kibana.alert.uuid as a unique identifier for alerts, from the time they are created until they recover.
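The lifecycle described there (one UUID from creation until recovery, a fresh one if the alert later comes back) can be sketched as below. This is an assumption-laden illustration, not how Kibana implements kibana.alert.uuid; `nextSpanState` is a hypothetical helper that carries state between rule executions.

```typescript
import { randomUUID } from 'crypto';

// Hypothetical helper: given the UUIDs tracked after the previous execution
// and the alert IDs active in this execution, compute the new tracked state.
function nextSpanState(
  previous: Map<string, string>,
  activeIds: string[]
): Map<string, string> {
  const next = new Map<string, string>();
  for (const id of activeIds) {
    // Still active: keep the same UUID. Newly active: mint a new one.
    next.set(id, previous.get(id) ?? randomUUID());
  }
  // Recovered alerts are simply absent from `next`, so a later
  // reactivation of the same alert ID starts a new span.
  return next;
}
```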