Closed: ymao1 closed this issue 1 week ago
Pinging @elastic/kibana-alerting-services (Team:Alerting Services)
I'm hesitant to use `event.id` for this, since I don't know its purpose, and it seems fairly "global". I was thinking of something in `rule.` or `kibana.alerting`; a field in `rule` would be best, if we can agree on one in there, but maybe there's no good fit.
Currently it feels like alerting should create the UUID for the new "span" of alerts and then make it available to the rule registry somehow, for its use. Not quite sure yet how we'll thread the value through, but you can see where the changes would go for RAC, around the following code. This is where the rule executor is actually invoked, and that code will be calling `scheduleActions()`. The alert UUIDs should have been generated by the time the executor returns, and be made available to the rule registry framework.
Taking another peek at this. Looks like RAC creates the UUIDs for lifecycle alerts, here:
So it appears the UUIDs are created after running the executor, so I think we can create/manage the UUIDs when `scheduleActions()` is run (we'd need to deal with `unscheduleActions()` or any other mutators), and then arrange to return that data via a new method on `AlertServices`, which could be called from the RAC wrapper. For example, something like:
```ts
interface AlertServices {
  // ...
  getInstances(): Map<string, string>; // key: existing alert instance ID; value: new alert instance UUID
}
```
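To make the idea above concrete, here is a minimal sketch (not actual Kibana code; the `AlertInstanceTracker` class and its method names are assumptions for illustration) of how the framework could assign a UUID per alert instance the first time `scheduleActions()` sees it, drop it on `unscheduleActions()`, and expose the map for the RAC wrapper:

```typescript
import { randomUUID } from 'crypto';

// Hypothetical helper owned by the alerting framework for one rule execution.
class AlertInstanceTracker {
  private readonly uuids = new Map<string, string>();

  // Called from scheduleActions(): assign a UUID on first sight of an instance ID,
  // and return the same UUID on subsequent calls within the execution.
  track(instanceId: string): string {
    let uuid = this.uuids.get(instanceId);
    if (uuid === undefined) {
      uuid = randomUUID();
      this.uuids.set(instanceId, uuid);
    }
    return uuid;
  }

  // Called from unscheduleActions() or other mutators: drop the mapping.
  untrack(instanceId: string): void {
    this.uuids.delete(instanceId);
  }

  // What AlertServices.getInstances() could return for the RAC wrapper;
  // a copy, so callers can't mutate internal state.
  getInstances(): Map<string, string> {
    return new Map(this.uuids);
  }
}
```

By the time the executor returns, the tracker holds every instance that scheduled actions, which matches the requirement that the UUIDs exist before the rule registry framework runs.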
I happened to remember a similar issue we had open a while back: https://github.com/elastic/kibana/issues/64268

For that one, we realized that some rule types were already using UUIDs as their instance IDs, so we thought we should add a new "human readable" field to associate with an instance. I think that ship has sailed at this point, since we now have an "official" UUID; we should continue to aim to keep alert instance IDs human readable. But we may need to revisit that over time; perhaps adding an explicit "description" to these alert instances would make sense later.
It's worth noting that without this there is actually no way of using the span as part of a dedup key in connectors such as PagerDuty.
This means that a customer can't set up actions on a rule so that they get a new incident whenever a specific alert ID reappears (so, for instance, get a new incident whenever the CPU exceeds 90% on Host #1, rather than reopening the incident from the last time it exceeded 90%).
This feels like a relatively basic missing feature. What do you think @arisonl & @mikecote ?
I agree, allowing access to some span ID would make it possible to mimic alerts-as-data on an external system and create new incidents whenever an alert comes back.
@arisonl should this even become the default dedup key? Instead of `{ruleId}:{alertId}` it would become `{ruleId}:{spanId}`?
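To illustrate the difference, here is a minimal sketch (the helper names and the `spanId` parameter are assumptions, not existing Kibana APIs) of the two dedup key shapes:

```typescript
// Legacy shape: the key is stable across recover/reactivate cycles, so a
// connector like PagerDuty reopens the previous incident.
const legacyDedupKey = (ruleId: string, alertId: string): string =>
  `${ruleId}:${alertId}`;

// Span-based shape: each new active span carries a fresh UUID, so the key
// changes whenever the alert comes back, and a new incident is opened.
const spanDedupKey = (ruleId: string, spanId: string): string =>
  `${ruleId}:${spanId}`;
```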
We have since added `kibana.alert.uuid` as a unique identifier of alerts from when they are created until they recover.

For this issue, we added start/duration/end times to the `*-instance` actions in the event log and considered adding a `uuid` to identify unique active spans for an alert. We decided to hold off after reviewing what SIEM and RAC were doing for this and how they are using `event.id`.

Currently, the lifecycle rule type in the rule registry is doing something similar but storing it in the `kibana.rac.alert.uuid` field. SIEM is using `event.id` to store the original source document ID when a source document is copied into the signals index. When the generated signal is an aggregate over multiple source documents, the `event.id` field is not populated.

Given these other usages, do we want to add a `uuid` field to identify active alert spans? If we do, should we use the `event.id` field to store it? Or consolidate it with a RAC field?