Monitoring and Tooling Review

Following recent outages of our services (VC-AuthN), we need to reassess whether we are monitoring everything required to proactively prevent this type of situation, and where that monitoring needs to happen.

In particular, we want to keep an eye on the following items:

PVC filling up
Pods in unhealthy state
Throttling
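
As a rough illustration of what checks for the three items above might evaluate, here is a minimal Python sketch. Everything in it is an assumption rather than part of our current setup: the thresholds, the `Finding` type, and the function names are hypothetical, and "throttling" is interpreted here as CPU throttling of containers. The input numbers would come from whichever monitoring tool ends up being chosen for each item.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical thresholds, chosen only for illustration.
PVC_USAGE_WARN = 0.80   # warn when a PVC is at least 80% full
THROTTLE_WARN = 0.25    # warn when at least 25% of CPU periods were throttled
HEALTHY_PHASES = {"Running", "Succeeded"}


@dataclass
class Finding:
    check: str    # which item triggered: "pvc", "pod" or "throttling"
    target: str   # name of the PVC, pod or container
    detail: str   # human-readable explanation


def check_pvc(name: str, used_bytes: int, capacity_bytes: int) -> Optional[Finding]:
    """Flag a PVC whose usage ratio reaches the warning threshold."""
    if capacity_bytes <= 0:
        return Finding("pvc", name, "capacity unknown")
    ratio = used_bytes / capacity_bytes
    if ratio >= PVC_USAGE_WARN:
        return Finding("pvc", name, f"{ratio:.0%} full")
    return None


def check_pod(name: str, phase: str) -> Optional[Finding]:
    """Flag a pod whose phase is not one of the healthy phases."""
    if phase not in HEALTHY_PHASES:
        return Finding("pod", name, f"phase {phase}")
    return None


def check_throttling(name: str, throttled_periods: int, total_periods: int) -> Optional[Finding]:
    """Flag a container whose share of throttled CPU periods is too high."""
    if total_periods <= 0:
        return None  # no data for this container, nothing to report
    ratio = throttled_periods / total_periods
    if ratio >= THROTTLE_WARN:
        return Finding("throttling", name, f"{ratio:.0%} of CPU periods throttled")
    return None
```

Each function returns `None` when nothing needs attention, so the results can be collected and filtered before deciding where an alert should be raised.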

Additionally, we want to assess whether the list of services for which we track uptime/availability is complete, or whether we need to add further endpoints to our Uptime dashboards.

Desired outcomes:

Determine which tool in our monitoring stack (platform and self-hosted) is best suited to monitor each of the items listed above.
Compile a list of what needs to be added or tweaked, and where each change needs to be made.
Log issues related to these activities, and place them on the backlog for assignment.

Related issues:

https://github.com/bcgov/DITP-DevOps/issues/181
https://github.com/bcgov/DITP-DevOps/issues/182
https://github.com/bcgov/DITP-DevOps/issues/183

Issue: bcgov/DITP-DevOps#184