Prio 1 = customer impact, needs immediate action
Prio 2 = loss of internal redundancy, needs quick action to avoid a subsequent Prio 1 on another failure
Prio 3 = a certain Prio 1 situation is upcoming, needs fast action to avoid it
Prio 4 = a potential Prio 1 or certain Prio 2 situation is upcoming, needs fast action to avoid it
In my mind, Prio 1-2 should page, while Prio 3-4 should raise daytime alerts.
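To make the split between paging and daytime alerts concrete, here is a minimal sketch of the mapping. The enum member names and the `channel` function are made up for illustration and are not part of any existing tooling.

```python
from enum import IntEnum


class Prio(IntEnum):
    """Priority levels as defined above (member names are hypothetical)."""
    CUSTOMER_IMPACT = 1   # customer impact, needs immediate action
    REDUNDANCY_LOST = 2   # loss of internal redundancy
    PRIO1_CERTAIN = 3     # certain Prio 1 situation is upcoming
    PRIO1_POTENTIAL = 4   # potential Prio 1 or certain Prio 2 is upcoming


def channel(prio: Prio) -> str:
    """Return 'page' for Prio 1-2 and 'daytime' for Prio 3-4."""
    return "page" if prio <= Prio.REDUNDANCY_LOST else "daytime"
```

For example, `channel(Prio(2))` yields `"page"` and `channel(Prio(3))` yields `"daytime"`.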
Prio 3 or higher*: As a Cloud Operator, I want to know if multiple replicas of an HA L3 router think they are in master state, because that indicates a potentially customer-visible network issue (ARP fight or L2 loss between two nodes).
Prio 3 or higher*: As a Cloud Operator, I want to know if an HA L3 router has no replica in master state. That renders the router dysfunctional with customer-visible impact: traffic does not reach the instances, because the upstream router cannot find a MAC address to send the traffic to.
Prio 4 or higher*: As a Cloud Operator, I want to know if the number of HA L3 router replicas stays below the configured number for an extended period.
(*): Both of these can happen temporarily in a healthy system, due to monitoring races as well as propagation delays of state information, so alerting correctly is a bit tricky; hence just Prio 3.
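One way to cope with such transient states is to alert only once a condition has held continuously for a grace period. The following is a minimal sketch under several assumptions: the class `HaRouterCheck`, the state strings `"master"` and `"backup"`, and the default grace period are all hypothetical, and how the per-replica states are collected is left out entirely. The returned priority is the minimum from the stories above; an operator may want to escalate it.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class HaRouterCheck:
    """Debounced check for one HA L3 router (hypothetical helper)."""
    expected_replicas: int
    grace_seconds: float = 120.0  # assumed value, tune per deployment
    _first_seen: Dict[str, float] = field(
        default_factory=dict, init=False, repr=False
    )

    def _condition(self, states: List[str]) -> Optional[Tuple[str, int]]:
        """Map the reported replica states to (condition name, priority)."""
        masters = sum(1 for s in states if s == "master")
        if masters > 1:
            return ("multiple_masters", 3)
        if masters == 0:
            return ("no_master", 3)
        if len(states) < self.expected_replicas:
            return ("missing_replicas", 4)
        return None

    def evaluate(
        self, states: List[str], now: float
    ) -> Optional[Tuple[str, int]]:
        """Return an alert only if the condition outlasted the grace period."""
        cond = self._condition(states)
        # Forget conditions that no longer hold, so their timers restart.
        for name in list(self._first_seen):
            if cond is None or cond[0] != name:
                del self._first_seen[name]
        if cond is None:
            return None
        first = self._first_seen.setdefault(cond[0], now)
        if now - first >= self.grace_seconds:
            return cond
        return None
```

With `grace_seconds=60`, a router reporting two masters at `now=0` produces no alert, but the same observation at `now=61` does. A single healthy observation in between resets the timer.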