SovereignCloudStack / standards

SCS standards in a machine readable format
https://scs.community/
Creative Commons Attribution Share Alike 4.0 International

L3 router replication status #98

Open horazont opened 3 years ago

horazont commented 3 years ago

Prio 1 = customer impact, need immediate action
Prio 2 = loss of internal redundancy, need quick action to avoid subsequent Prio 1 on another failure
Prio 3 = certain Prio 1 situation is upcoming, need fast action to avoid
Prio 4 = potential Prio 1 or certain Prio 2 situation is upcoming, need fast action to avoid

In my mind, Prio 1-2 should page, while Prio 3-4 should only raise daytime alerts.
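
A rough sketch of that split, in case it helps the discussion. The enum member names and the `alert_channel` function are made up for illustration and are not part of any existing SCS tooling:

```python
from enum import IntEnum


class Prio(IntEnum):
    # Hypothetical names; the numbers follow the Prio 1-4 definitions above.
    CUSTOMER_IMPACT = 1
    REDUNDANCY_LOSS = 2
    CERTAIN_PRIO1_UPCOMING = 3
    POTENTIAL_PRIO1_OR_CERTAIN_PRIO2_UPCOMING = 4


def alert_channel(prio: Prio) -> str:
    """Return 'page' for Prio 1-2 and 'daytime' for Prio 3-4."""
    return "page" if prio <= Prio.REDUNDANCY_LOSS else "daytime"
```

Keeping the mapping in one function means the paging threshold can be moved later without touching the individual checks.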

(*): both of these can happen temporarily in a healthy system due to monitoring races but also due to propagation delays of state information, so it is a bit tricky how to alert correctly; hence just Prio 3.
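
One common way to deal with such transient states is to alert only if the condition persists over several consecutive checks. This is a minimal sketch under that assumption, not something the tooling does today; `persisted` and its parameters are hypothetical names:

```python
from typing import Sequence


def persisted(samples: Sequence[bool], required: int) -> bool:
    """Return True if the last `required` samples all report the bad state.

    `samples` is the history of one check, oldest first, where True means
    "the suspicious state was observed in this monitoring run".
    """
    if required <= 0 or len(samples) < required:
        return False
    return all(samples[-required:])
```

With `required=3`, a single monitoring race or a short propagation delay would not fire, while a state that stays wrong across three runs would.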

JohnGarbutt commented 3 years ago

The other issue I have seen is when the router jumps around between nodes; it’s worth looking for that case as well.
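
A possible way to detect this, sketched under the assumption that the monitoring already records which node hosted the active router at each check. The function names, the `(timestamp, host)` tuple format, and the default thresholds are all hypothetical:

```python
from typing import Iterable, Tuple

Observation = Tuple[float, str]  # (unix timestamp, node hosting the active router)


def count_failovers(observations: Iterable[Observation],
                    window_seconds: float, now: float) -> int:
    """Count how often the active node changed within the last window."""
    recent = sorted(o for o in observations if now - o[0] <= window_seconds)
    changes = 0
    for (_, prev_host), (_, cur_host) in zip(recent, recent[1:]):
        if prev_host != cur_host:
            changes += 1
    return changes


def is_flapping(observations: Iterable[Observation], now: float,
                window_seconds: float = 600.0, max_failovers: int = 2) -> bool:
    """Treat more than `max_failovers` moves inside the window as flapping."""
    return count_failovers(observations, window_seconds, now) > max_failovers
```

A single failover stays below the threshold, so a normal recovery would not be reported as flapping.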

berendt commented 2 years ago

Perhaps https://github.com/osism/openstack-router-status can be reused in this context.