department-of-veterans-affairs / va.gov-cms

Editor-centered management for Veteran-centered content.
https://prod.cms.va.gov
GNU General Public License v2.0

Discovery: CMS Team Monitoring #15204

Open EWashb opened 1 year ago

EWashb commented 1 year ago

Background

As the CMS Platform team, we are responsible for monitoring the Drupal Platform; product teams, however, are responsible for monitoring their own code and integrations into Drupal. Much like the Platform, we should determine a process that helps product teams understand where their responsibilities lie during outages and other incidents. The health of content build is the responsibility of the CMS Team, but all teams interacting with our API share the responsibility to take that health seriously and pitch in when an issue is caused by their use of the system.

For example, if Facilities Team code breaks content-build, what monitors should they have in place to catch and address their faulty code? Where does the CMS Team chip in to solve issues? The CMS Team should not be responsible for all endpoints; however, we should provide guidance and tooling expectations so that teams can troubleshoot their own code.

User Story or Problem Statement

As a CMS product team, I need to understand the tooling and processes for monitoring and expectations for outages.

As a CMS dev team member, I need appropriate expectations for how much of my time goes to monitoring, triaging, troubleshooting, and solving issues not inherently caused by Drupal.

How might we determine when the CMS Team is responsible, accountable, consulted, or informed for outages that are not caused by a fault in the Drupal Platform?

Problems to Solve:

Description or Additional Context

The Platform Crew is also determining how to approach teams building on the Platform. The Platform Crew provides governance, guidance, and tooling for monitoring, but they are not responsible for all products building on vets-website. There is a line between when a platform is responsible for a fix or failure and when those building on the platform should address issues.

In the past, the CMS Team has operated in sense-and-respond mode for all issues as they come in, but this should be a shared responsibility where appropriate.

Acceptance Criteria

Team

Please check the team(s) that will do this work.

EWashb commented 1 year ago

@ndouglas @BerniXiongA6 This is a bit of an info dump. Let's chat about this at our next leads meeting. I'm interested in chatting about this, especially as the larger platform figures out a similar space for the various VFS teams building on vets-website.