As a VA.Gov Platform Operator, I need monitoring for CI/CD components and services to help maintain the availability of platform services (and their dependencies).
Hypothesis or Bet
Thoughts: Platform operators should be able to
acknowledge issues more quickly
respond to issues with better information
address issues proactively
Note: Can we tighten these up and measure specific outcomes?
What must be true in order for you to consider this epic complete?
Dashboards exist for the following:
[ ] ArgoCD
[ ] GitHub actions (as-a-service; API quotas? GH service outages?)
[ ] GitHub runners (as-a-service; API quotas? GH service outages?)
[ ] cloud runners
[ ] self-hosted runners
[ ] Jenkins
Note: Multiple services can be displayed on the same dashboard, ie there does not need to be a dashboard for every service, but that service should be represented on a dashboard.
Monitors exist for the following:
[ ] ArgoCD
[ ] GitHub actions (as-a-service)
[ ] GitHub runners
[ ] cloud runners
[ ] self-hosted runners
[ ] Jenkins
Alerts exist for the following:
[ ] ArgoCD
[ ] GitHub actions (as-a-service)
[ ] GitHub runners
[ ] cloud runners
[ ] self-hosted runners
[ ] Jenkins
Note: Alerts should be routed to the appropriate channel/personnel based on the alert's severity.
Product Outline
Monitoring using DataDog
High-Level User Story/ies
As a VA.Gov Platform Operator, I need monitoring for CI/CD components and services to help maintain the availability of platform services (and their dependencies).
Hypothesis or Bet
Thoughts: Platform operators should be able to
Note: Can we tighten these up and measure specific outcomes?
Tech notes
Definition of done
What must be true in order for you to consider this epic complete?
Dashboards exist for the following:
Note: Multiple services can be displayed on the same dashboard, ie there does not need to be a dashboard for every service, but that service should be represented on a dashboard.
Monitors exist for the following:
Alerts exist for the following:
Note: Alerts should be routed to the appropriate channel/personnel based on the alert's severity.