After the recent outages of our services (VC-AuthN), we need to re-assess whether we are monitoring everything required to proactively prevent this type of situation, and where that monitoring needs to happen.
In particular, the items we want to keep an eye on are:
PVC filling up
Pods in unhealthy state
Throttling
Additionally, we want to assess whether the list of services for which we track uptime/availability is complete, or whether we need to add further endpoints to our Uptime dashboards.
Desired outcomes:
Determine which tool in our monitoring stack (platform and self-hosted) is best suited to monitor each of the items listed above.
Compile a list of what needs to be added/tweaked, and where this action needs to be performed.
Log issues related to these activities, and place them on the backlog for assignment.
Related issues: