L3 router replication status

horazont commented 3 years ago

Prio 1 = customer impact, need immediate action
Prio 2 = loss of internal redundancy, need quick action to avoid subsequent Prio 1 on another failure
Prio 3 = certain Prio 1 situation is upcoming, need fast action to avoid
Prio 4 = potential Prio 1 or certain Prio 2 situation is upcoming, need fast action to avoid

In my mind, Prio 1-2 should page someone, while Prio 3-4 should only raise daytime alerts.

Prio 3 or higher*: As a Cloud Operator, I want to know if multiple replicas of a HA L3 router think they are in master state, because that indicates a potentially customer-visible network issue (ARP fight or L2 loss between two nodes).
Prio 3 or higher*: As a Cloud Operator, I want to know if an HA L3 router has no replica in master state, as that renders the router dysfunctional. This has customer-visible impact: traffic does not reach the instances because the upstream router cannot find the MAC address to send it to.
Prio 4 or higher*: As a Cloud Operator, I want to know if the number of HA L3 router replicas stays below the configured number for an extended period.

(*): Both of these can happen temporarily in a healthy system, due to monitoring races as well as propagation delays of state information, so alerting on them correctly is a bit tricky; hence just Prio 3.
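
To make the three stories above and the footnote about transient states more concrete, here is a minimal sketch of how such a check could be debounced. It is only an illustration under assumptions: the state strings ("master", "backup"), the class name `RouterCheck`, and the `grace_checks` threshold are hypothetical, and how the per-replica states are collected is left open. A condition is only reported once it has been seen in several consecutive polls, so a short monitoring race or propagation delay does not alert.

```python
from dataclasses import dataclass, field

# Priorities taken from the user stories above; the condition names are hypothetical.
CONDITION_PRIO = {
    "multiple_masters": 3,
    "no_master": 3,
    "missing_replicas": 4,
}


@dataclass
class RouterCheck:
    """Debounced health check for one HA L3 router (illustrative only)."""

    expected_replicas: int
    grace_checks: int = 3  # assumed: consecutive bad polls before alerting
    _streaks: dict = field(default_factory=dict, init=False)

    def observe(self, states):
        """Record one poll of replica states and return (condition, prio) alerts."""
        masters = sum(1 for state in states if state == "master")
        conditions = {
            "multiple_masters": masters > 1,
            "no_master": masters == 0,
            "missing_replicas": len(states) < self.expected_replicas,
        }
        alerts = []
        for name, failing in conditions.items():
            if failing:
                self._streaks[name] = self._streaks.get(name, 0) + 1
            else:
                # A single healthy poll resets the streak for this condition.
                self._streaks[name] = 0
            if self._streaks[name] >= self.grace_checks:
                alerts.append((name, CONDITION_PRIO[name]))
        return alerts


check = RouterCheck(expected_replicas=3, grace_checks=2)
first = check.observe(["master", "master", "backup"])   # first bad poll, no alert yet
second = check.observe(["master", "master", "backup"])  # second bad poll, alert fires
```

Whether a poll count or a time window is the better grace period depends on the polling interval; the sketch uses a count only because it is the simplest thing to show.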

JohnGarbutt commented 3 years ago

The other issue I have seen is the master role jumping around between nodes; it’s worth looking for that case as well.
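
A rough sketch of how that flapping case could be detected, assuming the monitoring already knows which node currently holds the master role. The class name `FlapDetector`, the window length, and the move threshold are all hypothetical values chosen for illustration: the detector counts how often the master moves to a different node within a sliding time window.

```python
from collections import deque


class FlapDetector:
    """Counts master moves between nodes within a sliding window (illustrative only)."""

    def __init__(self, window_seconds=600.0, max_moves=3):
        self.window_seconds = window_seconds  # assumed window length
        self.max_moves = max_moves            # assumed threshold for "flapping"
        self._last_master = None
        self._moves = deque()  # timestamps of observed master changes

    def observe(self, timestamp, master_node):
        """Record the current master node; return True if the router is flapping."""
        if master_node is not None:
            if self._last_master is not None and master_node != self._last_master:
                self._moves.append(timestamp)
            self._last_master = master_node
        # Forget moves that have fallen out of the window.
        while self._moves and timestamp - self._moves[0] > self.window_seconds:
            self._moves.popleft()
        return len(self._moves) >= self.max_moves


detector = FlapDetector(window_seconds=100, max_moves=2)
results = [
    detector.observe(0, "node-a"),    # first observation, nothing to compare
    detector.observe(10, "node-b"),   # one move
    detector.observe(20, "node-a"),   # second move within the window
    detector.observe(200, "node-a"),  # both moves have aged out
]
```

Polls without any master (`None`) are ignored here, since that case is already covered by the "no replica in master state" story.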

berendt commented 2 years ago

Perhaps https://github.com/osism/openstack-router-status can be reused in this context.

SovereignCloudStack / standards

L3 router replication status #98