department-of-veterans-affairs / va.gov-team

Public resources for building on and in support of VA.gov. Visit complete Knowledge Hub:
https://depo-platform-documentation.scrollhelp.site/index.html

2024 Q3 Platform Product and Platform Support: Discovery on LOE to Implement Actionable Alerts Process (dupe 1) #92546

Open jennb33 opened 1 week ago

jennb33 commented 1 week ago

This ticket duplicates 88370 and was created so the work can be finished in Sprint 10.

Summary

As the Platform environment grows in complexity, it is becoming increasingly important to track all alert errors during the Support process. The Platform Product team did an informal review of what it would take to set up a robust actionable alerts system and determined that the ROI on that effort is not high enough. Our current alerting mechanism notifies us of potential issues but cannot provide actionable insights or automated responses.

The Platform Product team believes that integrating alert handling into the current Support process would be most effective, since that is how our developers have been addressing the alerts in real time (IRT).

When alerts have been firing continually, Tier 1 support usually involves Tier 2, depending on the thresholds set for those alerts. When implementing this "IRT" strategy, it is important that the process is standardized and shared with all resources as they cycle through the Support rotation.

The Platform Product team will analyze the current thresholds to confirm that the alerts are accurate as currently configured, and will adjust and document the alerts as necessary during this analysis. This information would then be shared with the Support team, who will standardize how alerts get fixed during Support Rotation. The Support team would be responsible for updating all resources on Support Tier 2 FE, BE, and DevOps. The Platform Product team would require documentation of any alerts updated during a particular rotation, so that we can track the performance of the Platform after the change.

Tasks

jennb33 commented 6 days ago

9/11/2024 - Talked to Clint. He's in favor of this work and is OK with us doing 2 pts of work in Sprint 10. cc: @alyssagallion @LindseySaari