Description

As an ops lead, I want to know when the database is in a critical state during an overload, so that I can take appropriate action.

WHY are we building?

To know when our database might be in an unhealthy state. We assumed we already had these alarms, but have now realized that we do not.

Add regular alarms around our database health metrics such as CPU and memory usage percentage.

Stability, proactivity, awareness.

[ ] Alarms on key health metrics are created for our main database (CPU, memory, etc).

[ ] Test in the staging environment: lower the thresholds and send a burst to trigger the alarms.
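
A rough sketch of how the staging test in the criteria above could be reasoned about, written in plain Python and not tied to any monitoring product. Everything here is an assumption for illustration: the `AlarmRule` shape, the metric names, the 80% and 85% thresholds, the "3 consecutive datapoints" rule, the 0.5 lowering factor, and the sample values are all placeholders to be replaced with whatever the team agrees on.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class AlarmRule:
    """Hypothetical alarm definition; not any real monitoring API."""
    metric: str               # e.g. "cpu_percent"
    threshold: float          # a sample breaches when it exceeds this value
    datapoints_to_alarm: int  # consecutive breaching samples required


# Placeholder production values (assumptions, not agreed thresholds).
PRODUCTION_RULES = [
    AlarmRule("cpu_percent", threshold=80.0, datapoints_to_alarm=3),
    AlarmRule("memory_percent", threshold=85.0, datapoints_to_alarm=3),
]


def lowered_for_staging(rule: AlarmRule, factor: float = 0.5) -> AlarmRule:
    """Return a copy of the rule with its threshold scaled down for testing."""
    return replace(rule, threshold=rule.threshold * factor)


def is_in_alarm(rule: AlarmRule, samples: list[float]) -> bool:
    """True if the samples contain enough consecutive breaching datapoints."""
    streak = 0
    for value in samples:
        streak = streak + 1 if value > rule.threshold else 0
        if streak >= rule.datapoints_to_alarm:
            return True
    return False


# Made-up CPU samples observed during a staging burst.
burst_cpu = [22.0, 35.0, 48.0, 55.0, 61.0, 30.0]

cpu_rule = PRODUCTION_RULES[0]
staging_rule = lowered_for_staging(cpu_rule)

print(is_in_alarm(cpu_rule, burst_cpu))      # False: never exceeds 80
print(is_in_alarm(staging_rule, burst_cpu))  # True: 48, 55, 61 all exceed 40
```

The point of the sketch is that the same burst stays below the production threshold but trips the lowered staging threshold, which is what the staging test is meant to confirm before the real thresholds are restored.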