

Centralize Monitoring: Set up monitoring and alerting for platform services (in Datadog) #47025

Open jhouse-solvd opened 1 year ago

jhouse-solvd commented 1 year ago

Problem Statement

Monitors and alerts are spread across multiple systems. This slows the platform's response to incidents and support issues, and it makes monitoring confusing for platform operators. Maintaining multiple monitoring systems also increases administrative overhead.


How might we...

Hypothesis or Bet

This initiative should...

We will know we're done when... ("Definition of Done")

Known Blockers/Dependencies

List any blockers or dependencies for this work to be completed

Projected Launch Date

December 31, 2022

Launch Checklist

Is this service / tool / feature...

... tested?

... documented?

... measurable?

When you're ready to launch...

Required Artifacts





jhouse-solvd commented 1 year ago

After speaking with @ph-One: we should identify the "top-tier" monitors first so we know which are the most important. Monitors, rulesets, alerts, metrics, etc. should be listed in order of importance. Then issues can be created to tackle them one by one.

mchelen-gov commented 1 year ago

Is there internal documentation that needs to be created or updated?

jhouse-solvd commented 1 year ago

> Is there internal documentation that needs to be created or updated?

Updated the definition of done.

jhouse-solvd commented 1 year ago

@ph-One and I were thinking that we may need to re-examine PagerDuty teams and services in light of the recent team restructuring.

jhouse-solvd commented 1 year ago

It might be better to scope this work to focus on Alertmanager rules related to critical monitors, i.e., "devops-critical" and "vsp-engineers-critical" (added to background context above).

jhouse-solvd commented 1 year ago

How this initiative is broken down:

jhouse-solvd commented 1 year ago

@mchelen-gov - Can you add this initiative to the DE product board? It is currently in progress.