department-of-veterans-affairs / va.gov-team

Public resources for building on and in support of VA.gov. Visit complete Knowledge Hub:
https://depo-platform-documentation.scrollhelp.site/index.html

[VFS Monitoring/Alerting]: Enable Slack Integration for Sentry Alerts #55091

Closed by kjduensing 1 month ago

kjduensing commented 1 year ago

Describe the problem

Monitoring and alerting are a central component of a mature product team. Though it's no fault of Platform teams, we have been significantly impacted by a lack of autonomy and by the unwieldy nature of the only available alerting process in Sentry.

Alerting on VFS teams is currently limited in the following ways:

  1. We don't have the autonomy to set up alerts in the only alerting mechanism (issue ownership) currently working in Sentry; we must rely on platform support. While platform teams are excellent, relying on y'all to approve simple configuration requests like this places unnecessary burden on platform teams and reduces the autonomy with which VFS teams operate. Both of these drawbacks decrease the velocity of product and platform teams.
  2. The only currently working alerting mechanism in Sentry (issue ownership config) is prolific in the alerts it sends, making it hard to pinpoint issues. It also solely relies on email as its alerting system, which is prone to filters, spam blockers, and accidentally getting buried.

Who will benefit

It's clear that all VFS teams will benefit by having control over their alerting system. However, Platform teams will also see a decrease in requests to configure alerts for VFS teams, reducing the burden on platform support.

Describe your idea

The simplest idea we have is to enable Slack integration in the current http://sentry.vfs.va.gov/ Sentry instance (VSP organization). There may be technical or procedural issues with this approach; however, enabling Slack integration will allow VFS teams to configure their own Slack integrations with dedicated channels to proactively handle any incidents that occur.

If Slack integration is unfeasible due to level of effort or a technical/procedural blocker, it looks like there may be some history of using webhooks to send errors to DataDog and then having DataDog take care of the alerting. However, VFS teams do not have write access to DD, so this would require platform support as well.

Finally, email is an acceptable alerting mechanism, but there are drawbacks vs Slack. If we could figure out how to re-enable sending alerts via email from Sentry, this would be a step in the right direction.
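
For illustration only, the webhook fallback mentioned above might look roughly like the sketch below: a small relay that takes an error event and reposts it to a Slack incoming webhook (which accepts a JSON body with a `text` field). Everything else here is an assumption, not something confirmed for the VSP Sentry instance: the event field names (`project`, `level`, `title`, `url`), the severity ordering, and the helper names `should_alert`, `build_slack_message`, and `post_to_slack` are all hypothetical.

```python
import json
import urllib.request

# Hypothetical severity ordering; adjust to whatever levels the events carry.
LEVELS = {"debug": 0, "info": 1, "warning": 2, "error": 3, "fatal": 4}


def should_alert(event: dict, min_level: str = "error") -> bool:
    """Return True when the event is at or above the minimum level.

    Unknown or missing levels are treated as "error" so they are not dropped.
    """
    level = str(event.get("level", "error")).lower()
    return LEVELS.get(level, LEVELS["error"]) >= LEVELS[min_level]


def build_slack_message(event: dict) -> dict:
    """Build a Slack incoming-webhook body ({"text": ...}) from an event.

    The event keys used here are assumed, not a documented payload schema.
    """
    project = event.get("project", "unknown-project")
    level = str(event.get("level", "error")).upper()
    title = event.get("title", "Untitled issue")
    url = event.get("url", "")
    text = f"[{level}] {project}: {title}"
    if url:
        text += f"\n{url}"
    return {"text": text}


def post_to_slack(webhook_url: str, message: dict) -> int:
    """POST the message as JSON to a Slack incoming webhook; return HTTP status."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(message).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

A team-owned relay like this would filter with `should_alert` first, so a dedicated channel only receives events at or above the chosen level rather than every event.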

Provide evidence

https://dsva.slack.com/archives/CBU0KDSB1/p1676059425011579

https://dsva.slack.com/archives/CBU0KDSB1/p1678122264462019

https://dsva.slack.com/archives/CBU0KDSB1/p1678721148949179

https://dsva.slack.com/archives/CBU0KDSB1/p1668707111130899

https://dsva.slack.com/archives/CBU0KDSB1/p1659448501942089

https://github.com/department-of-veterans-affairs/va.gov-cms/issues/4696

https://dsva.slack.com/archives/CBU0KDSB1/p1642535108486300?thread_ts=1642187323.408700&cid=CBU0KDSB1

Platform Mission

Other:

No response

va-albers commented 1 year ago

Including @mchelen-gov for prioritization help

jilladams commented 1 year ago

+1 for this feature request.

For Public Websites, we use Sentry for error monitoring for the Forms product which relies on vets-api: Find a Form error report. We have also configured Alerts: Forms errors warning / critical notifications, but they don't do anything. We don't receive a notification or email. Would be great if we could. (cc @wesrowe )

va-albers commented 1 year ago

Hi @little-oddball & team - how can I check the status on this item? Also who do I need to work with to prioritize this? 🙇

va-albers commented 1 year ago

Hi DevOps team. I'm checking in on the status for this item. We have production issues that we are trying to address (in part) by configuring Sentry/Slack integration. What can we do to prioritize (or get a status update on) this work? I'm hoping this is as simple as someone pressing the big "Enable Sentry integration" button, but please let me know if there is more work involved that we can assist with.

jilladams commented 1 year ago

@mchelen @mchelen-gov Hi Mike - back in November with PTEMS Platform transition, I had heard from Ian Hundere (now deactivated) that Sentry didn't have an owner. In the new world of Platform tech teams, can you help us identify which team owns Sentry?

gia-lexa commented 1 year ago

Out sick but am addressing this on my return.

kjduensing commented 1 year ago

New info per @pjhill (summarized, please correct me if I got anything wrong):

Peter brought this request to the Platform DevOps Community of Practice (CoP). From that, they determined the following needs to be clarified before working on this request:

kjduensing commented 1 year ago

From my perspective, Sentry holds an important place in debugging, error tracing, and alerting. I see Sentry and Datadog performing 2 different functions.

Functionality aside, in addition to invalidating our tooling, processes, and documentation, the Sentry -> Datadog migration might represent actual code changes on our side, which slows down delivery of features.

Also, the Platform's customers are a little bit "migration weary" right now. There are several non-optional migrations (EKS & EVSS sunsetting) many VFS teams are having to implement.

Obviously, I prefer getting Sentry/Slack integration working :). But these are just my thoughts; more user research is needed by the Platform to see if the themes above resonate throughout VFS teams.

gia-lexa commented 1 year ago

Thank you all for the information! To clarify for myself, my immediate next steps:

  1. Determine if Sentry is our long-term choice versus migrating to DataDog
  2. Determine if there is a current owner/admin/subject matter expert (SME) of Sentry (from previous notes, it's possible there still isn't one, but need to confirm either way)

Related to Issue #1, the following follow-up questions come up:

  1. Who would decide whether Platform will continue to use Sentry for the long-term versus migrating to DataDog?
  2. Does determining this answer also need to involve polling each Platform team to discover their preferred tool moving forward and then alerting the decision makers with that info?
     a. Is there an org chart for all the Platform teams?
  3. If it's determined we will definitely migrate, what is the cut-off for still finding value in a Sentry-focused alert? E.g., if we know we're migrating in 6 months, could it still be valuable to build/turn on new Sentry alerts until the migration is in effect?
     a. Possible: in previous comments, it's noted some Sentry alerting exists, so execution steps could be replicated rather than built from scratch. In this case, a smaller build/input effort could be worth investing in for the remaining time we will use Sentry.

Related to Issue #2:

  1. Post in backend or devops channels to determine whether anyone holds or has held this position, whether that SME has since left, or whether there never was such an assignment
  2. Determine if there is existing internal documentation about Sentry integration (should have authors' names attached to it if in Confluence which could shed light on who might have some knowledge)
  3. If there's code available for current alerts (which were noted above as belonging to Public Websites), ping the author if they're still online here

Will add answers/more notes shortly...

gia-lexa commented 1 year ago

Platform team roll call can be found under subtitle Digital Modernization: https://docs.google.com/document/d/1QmqnVRxLbUEw2Bxc9xmMMhdihjX4SkSgCDcXofkktnc/edit?pli=1

Platform VA Docs doesn't list Sentry as currently in use in the Health and Monitoring Rubric (DataDog is listed): https://depo-platform-documentation.scrollhelp.site/analytics-monitoring/track-your-kpis.

Slack is filled with references to Sentry; it's actively used and referenced. Obtained access to VSP's Sentry instance.

gia-lexa commented 1 year ago

This Sentry doc goes over how to add and configure Sentry alerts to Slack and is linked in the threads referenced above: https://docs.sentry.io/product/integrations/notification-incidents/slack/

The history linked in the description of this ticket shows ^these instructions not working.

Tracking down any possible Sentry owners/admins.

gia-lexa commented 1 year ago

Re alerting that's already expected to exist, this doc exists on the vfs side: https://vfs.atlassian.net/wiki/spaces/pilot/pages/590774316/Tracking+application+errors+with+Sentry

Will reach out to authors to see if they are, or know, the Sentry owner/admin/SME.

gia-lexa commented 1 year ago

Determined that existing alerting workflow doesn't currently work: https://vfs.atlassian.net/wiki/spaces/pilot/pages/590774316/Tracking+application+errors+with+Sentry

Author is from content but they may know identity of information owner/s?

Spoke with Kevin and Peter to clarify questions raised in the ticket. Kevin confirmed that the above workflow still needs a Sentry/Slack integration to work; he's attempted using that workflow before and it didn't yield Slack/Sentry messages.

Also worth noting that at least two Slack rooms currently do receive Sentry notifications, but they may be accessing a different Sentry instance, rather than sentry.vfs.va.gov.

Priorities for tomorrow are answering the following questions before diving further into integration details:

  1. Determine and document who owns Sentry (if anyone; according to the epic, this tool may not have an owner anymore).
     a. If there is no owner, determine if there's a process in place for assigning ownership.
     b. Document ownership

  2. Determine and document the mechanisms for Sentry/Slack alerting in place (forsaken and active versions)
     a. Could we piggyback on any, or do any mirror the ask in the epic?
     b. What's different between the ask in the ticket and the existing mechanisms?
     c. Why do some Slack rooms already receive regular Sentry alerts if a Slack integration isn't already activated?
        1. Are they using a different Sentry instance?
        2. Determine how to confirm which instance they're accessing
           a. Check out: #appeals-app-alerts && #appeals-queue-alerts

gia-lexa commented 1 year ago

As DataDog migration is no longer expected to be used as part of Slack/Sentry workflow, the next areas of completion likely include:

  1. Confirm the technical steps necessary to complete the Slack/Sentry integration
  2. Implement it (in tandem with a member of DevOps COP, as they are Sentry's configuration owners)
  3. Document the implementation and ownership

gia-lexa commented 1 year ago

Need to confirm the safety and feasibility of the possible integration plan with DevOps COP.

gia-lexa commented 1 year ago

Hi @kjduensing, we are still in progress on this. We had tabled it briefly while we focused on other issues, but are continuing to work on it this sprint.

jilladams commented 1 year ago

@gia-lexa any updates here? We have active incidents every now and then (example) where this push from Sentry would be really really helpful.

gia-lexa commented 1 year ago

Hi @jilladams, absolutely. Most recently, the proposed implementation plan was approved by DevOps COP—they are the owners of the Slack and Sentry configurations.

We're now working through this in-progress ticket, which is a collaboration with DevOps COP to configure and run a closed test of the proposed implementation. DevOps has assigned that work to a member of their team. And I usually update that ticket's status on Thursdays after I attend the DevOps COP meeting.

A note about the closed test: once configured, it will run for 1-2 weeks while we monitor the output to ensure we're adhering to best practices.

Once we successfully complete that test, we'll flip on a configuration accessible to the appropriate members of vfs as indicated by this upcoming ticket.

(There will also be a final ticket for documentation.)

That's the current overview, I hope it's helpful. Nonetheless, please don't hesitate to let me know if you have any questions about anything I've written here, or if there are any further details that would be helpful for me to share.

gopixelsgo commented 4 months ago

Hi @gia-lexa thanks for the detailed status from a while back! Is it safe to say this ticket can be closed? Doing a bit of cleanup and closing some loops. Thanks!

gopixelsgo commented 2 months ago

Hi @gia-lexa and @little-oddball - just following up on my previous comment. Is it safe to close this ticket? Thanks!

gia-lexa commented 2 months ago

Hi Nicole! Apologies, I missed your first email!

Before I rolled off this team, I was told that a workaround—which included using Datadog—had been implemented and that this ticket would subsequently be closed.

I hope that helps, but please let me know if I can provide you any further details! (Also if it's easier, happy to chat on slack, @gia). Thanks!

gopixelsgo commented 2 months ago

Thank you @gia-lexa! That does help!

gopixelsgo commented 2 months ago

Hi @little-oddball - would you mind closing this ticket if the request is considered complete? Thanks!

gopixelsgo commented 1 month ago

Hi @little-oddball - any update on this item?

little-oddball commented 1 month ago

> Hi @little-oddball - any update on this item?

@gopixelsgo - I wasn't assigned to the issue so haven't been tracking it at all.

little-oddball commented 1 month ago

Sentry is an item Platform is working to deprecate so that aspect of this ticket doesn't make sense to invest in moving forward. As far as DD, it should already be working w/ Slack Integration.

I would suggest closing this ticket.