2024 Q3 Platform Product and Platform Support: Discovery on LOE to Implement Actionable Alerts Process (dupe 1)

This ticket is a duplicate of 88370. It was duplicated so the work can be finished in Sprint 10.

Summary

As the Platform environment grows in complexity, it is increasingly important to track all alert errors during the Support process. Our alerting mechanism notifies us when there are potential issues, but it cannot provide actionable insights or automated responses. The Platform Product team did an informal review of what setting up a robust actionable alerts system would take and determined that the ROI is not high enough to justify the effort.

The Platform Product team believes that integrating alert handling into the current Support process would be most effective, since that is how our developers have already been addressing the alerts in real time (IRT).

When alerts have been firing continually, Tier 1 support usually involves Tier 2, depending on the thresholds set for those alerts. When implementing this "IRT" strategy, it is important that the process is standardized and shared with all resources as they cycle through the Support rotations.

The Platform Product team will analyze the current thresholds to ensure that the alerts are accurate with their current configurations, and will adjust and document the alerts as necessary during this analysis. This information will then be shared with the Support team, who will standardize how alerts get fixed during Support Rotation. The Support team will be responsible for updating all resources on Support Tier 2 (FE, BE, and DevOps). The Platform Product team will require documentation of any alerts updated during a particular rotation, so that we can track the performance of the Platform post-change.

Tasks

[ ] Platform Product team will perform analysis on the current thresholds
[ ] Platform Product team will ensure that the alerts are accurate with their current configurations
[ ] Document why each alert threshold value was chosen. Example: This {{DD ALERT}} was changed from 95% to 40% because production broke when {{DD ALERT}} was around 40%.
[ ] Platform Product team will adjust and document the alerts as necessary while doing this analysis
[ ] Support team will integrate this process into Tier 1 and Tier 2 support
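
One possible way to keep the documentation from the tasks above consistent across rotations is a small structured record for each threshold change. The sketch below is only an illustration: the `ThresholdChange` class, its field names, and the `example-rotation` label are hypothetical and are not part of any existing Platform tooling. It renders the same sentence shape as the example in the task list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ThresholdChange:
    """One documented change to an alert threshold (hypothetical record shape)."""

    alert_name: str       # alert identifier; a placeholder is used below
    old_threshold: float  # previous threshold, in percent
    new_threshold: float  # updated threshold, in percent
    reason: str           # why the new value was chosen
    rotation: str         # hypothetical label for the Support rotation

    def summary(self) -> str:
        """Render the change in the sentence form used in the task example."""
        return (
            f"{self.alert_name} was changed from {self.old_threshold:g}% "
            f"to {self.new_threshold:g}% because {self.reason}"
        )


# Values taken from the example in the task list; the rotation label is made up.
change = ThresholdChange(
    alert_name="{{DD ALERT}}",
    old_threshold=95,
    new_threshold=40,
    reason="production broke when {{DD ALERT}} was around 40%.",
    rotation="example-rotation",
)
print(change.summary())
```

Recording the rotation alongside each change would let the Platform Product team group changes by rotation when tracking Platform performance post-change.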

department-of-veterans-affairs / va.gov-team

2024 Q3 Platform Product and Platform Support: Discovery on LOE to Implement Actionable Alerts Process (dupe 1) #92546
