department-of-veterans-affairs / va.gov-team

Public resources for building on and in support of VA.gov. Visit the complete Knowledge Hub:
https://depo-platform-documentation.scrollhelp.site/index.html

[Discovery] Integrate front-end app errors into OnCall process #410

Closed · minafarzad closed this issue 5 years ago

minafarzad commented 5 years ago

User Story

As a VSP team member, I need to be alerted to application errors in a timely and accessible manner so I can respond to errors more efficiently and effectively.

Goal

Front-end application errors are kept to a reasonable number, and they are integrated into VSP's existing OnCall processes and channels.

Acceptance Criteria

Definition of Done

minafarzad commented 5 years ago

@cvalarida Another helpful question in figuring this out: who needs to be privy to these front-end app errors? It's not just the Triage team, right?

cvalarida commented 5 years ago

@minafarzad Right, yeah. I think the answer to that question is informed by the Triage team's role on the platform as a whole and how we use the alerts, but my gut reaction is to say yes. If they're actionable alerts to be (ideally) tied into our on-call alerting systems, then they should go to the appropriate on-call engineer, not just Triage.

That said, I feel like that's not the whole story. :point_up: I also don't know what that means yet. :sweat_smile:

cvalarida commented 5 years ago

How Sentry alerting works

We make alerting rules based on criteria and set up alert actions.

Example: This API alert rule

See the Sentry alerting rules documentation for more information.

Alert criteria

I couldn't find documentation listing all the alert criteria, so here they are from Sentry itself:

We also have the ability to throttle the alerts per issue.
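To make that concrete, here's a rough sketch of the shape a project alert rule takes when you pull it from (or push it to) Sentry's rules API. The specific condition/action IDs are my best guess based on Sentry's open-source rule classes, not copied from our instance:

```ts
// Rough sketch only: the condition/action IDs are assumptions based on Sentry's
// open-source rule classes and may differ by Sentry version.
interface SentryAlertRule {
  name: string;
  actionMatch: 'all' | 'any' | 'none'; // how the conditions are combined
  frequency: number;                   // throttle: minimum minutes between alerts per issue
  conditions: Array<{ id: string; [key: string]: unknown }>;
  actions: Array<{ id: string; [key: string]: unknown }>;
}

const newIssueRule: SentryAlertRule = {
  name: 'New front-end issue',
  actionMatch: 'all',
  frequency: 30, // at most one alert per issue every 30 minutes
  conditions: [
    // "An issue is first seen"
    { id: 'sentry.rules.conditions.first_seen_event.FirstSeenEventCondition' },
  ],
  actions: [
    // Notify via the project's (legacy) integrations, i.e. what currently posts to Slack
    { id: 'sentry.rules.actions.notify_event.NotifyEventAction' },
  ],
};
```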

Limitations

It doesn't look like we can customize the alert as it gets sent to Slack (or, more specifically, to all legacy integrations for the project). We can set a custom message in the integration itself, but it doesn't look like we can enrich it with issue-specific information beyond what's already in the tags (like @ing certain team members according to rules we set up).

Looks like we can also only send it to a single channel.

Ideas

cvalarida commented 5 years ago

To answer the questions:

What can we alert on? See the Alert criteria in the comment above

What should we alert on? I think it makes the most sense to get an alert when an issue is affecting a certain number of users per hour, throttled to hourly alerts.

What is a reasonable frontend error threshold (considering user impact)? To start with, I think 100 users per hour.

How can we format the alerts? As I mentioned above, we don't really have much say in this unless we make some kind of switchboard bot, but even then, we're limited to what Sentry will put in the Slack alert.

How are engineers/on-call folks currently using Slack to get alerts? Can we just integrate into that? The PagerDuty alert @-mentions the on-call engineer for the team that owns the product throwing the error (according to how we have it set up in PagerDuty).

Because we can't integrate with PagerDuty directly right now, we could maybe do something similar by standing up a bot that watches the #vetsgov-sentry-alerts channel, digests incoming alerts, and directs them to the right team.
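If we went that route, the bot itself wouldn't need to be fancy. Here's a very rough sketch assuming a Bolt-based Slack app; the channel ID, the team map, and the text matching are all made up for illustration:

```ts
import { App } from '@slack/bolt';

const app = new App({
  token: process.env.SLACK_BOT_TOKEN,
  signingSecret: process.env.SLACK_SIGNING_SECRET,
});

// Hypothetical channel ID for #vetsgov-sentry-alerts
const SENTRY_ALERTS_CHANNEL = 'C0XXXXXXX';

// Hypothetical mapping from an app name in the alert text/tags to a Slack user group
const TEAM_MENTIONS: Record<string, string> = {
  'health-care-application': '<!subteam^S0AAAAAAA>',
  'facility-locator': '<!subteam^S0BBBBBBB>',
};

app.message(async ({ message, say }) => {
  if (message.channel !== SENTRY_ALERTS_CHANNEL) return;
  // Ignore thread replies (including our own), so we only react to top-level alerts
  if ('thread_ts' in message && message.thread_ts) return;

  const text = 'text' in message && message.text ? message.text : '';
  const team = Object.keys(TEAM_MENTIONS).find((name) => text.includes(name));

  // Reply in a thread on the Sentry alert, pointing it at the right team
  await say({
    thread_ts: message.ts,
    text: team
      ? `${TEAM_MENTIONS[team]} this one looks like yours.`
      : 'No team tag recognized; leaving this one for the on-call engineer.',
  });
});

(async () => {
  await app.start(Number(process.env.PORT) || 3000);
})();
```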

cvalarida commented 5 years ago

Sentry -> PagerDuty integration status

Looks like it's possible to use the latest version of the sentry-plugins before they release the next tag, but I'm not sure what else that will affect.

Source: https://github.com/getsentry/sentry-plugins/pull/469#issuecomment-492366995

cvalarida commented 5 years ago

I think given all the monitoring we've got for the infrastructure and VA backend dependencies, our goal with the front end application error reporting is to catch application errors specifically. Ideally, we'd be able to take the same approach as we have for our infrastructure alerts and use something like Prometheus to monitor the change over time. This would catch any spikes in errors, indicating something probably happened. I don't think we'll be able to do that with Sentry.

As such, I recommend the following approach (which we should definitely all agree on):

1) Alert to the Slack channel for errors that are affecting more than 100 users per hour
2) Monitor this over a week or two
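For (1), I'd expect the rule to look roughly like this; the condition/action IDs and field names are my assumption based on Sentry's rule classes, not our actual config:

```ts
// Sketch of recommendation (1): alert when an issue affects more than 100 unique users
// in an hour, throttled to at most one alert per issue per hour. IDs are assumptions.
const frontEndUserImpactRule = {
  name: 'FE errors affecting more than 100 users/hour',
  actionMatch: 'all',
  frequency: 60, // throttle: at most one alert per issue per hour
  conditions: [
    {
      id: 'sentry.rules.conditions.event_frequency.EventUniqueUserFrequencyCondition',
      value: 100,     // more than 100 unique users...
      interval: '1h', // ...within one hour
    },
  ],
  actions: [
    // Send to the project's legacy integrations (today, that's #vetsgov-sentry-alerts)
    { id: 'sentry.rules.actions.notify_event.NotifyEventAction' },
  ],
};
```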

omgitsbillryan commented 5 years ago

The "switchboard bot" idea is an interesting one. I don't think we'd want/need that in the short run since the messages that are sent to #vetsgov-sentry-alerts are still noisy. If we can smooth those over, then it would be pretty cool to have the bot @mention the engineer who is oncall according to the PD schedule any time a new message from Sentry came into that channel. For a first pass at this, I don't even think it has to be smart and try to @mention the right person/team, let the #oncall engineer figure that out.

WRT Sentry-PagerDuty integration, it seems pretty alpha. We should have some items in the backlog to integrate it, but I don't think we should rely on that integration at this moment.

Is there any way to make our Sentry Alerting rules "declarative"? Similar to how our Terraform code represents our AWS infrastructure, it would be nice if our Sentry Alerts could live as code somewhere rather than just in the Sentry GUI.
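One low-tech version of "alerts as code" could be to keep the rule definitions in a JSON file in the repo and have a small script push them through Sentry's project rules API. A sketch, where the endpoint path and auth are my assumptions about the Sentry Web API and the file name is made up:

```ts
import { readFileSync } from 'fs';

// Rule definitions checked into the repo (hypothetical file)
type RuleDefinition = { name: string; [key: string]: unknown };
const alertRules: RuleDefinition[] = JSON.parse(
  readFileSync('./sentry-alert-rules.json', 'utf8'),
);

const SENTRY_BASE = process.env.SENTRY_BASE_URL ?? 'https://sentry.example.gov';
const SENTRY_TOKEN = process.env.SENTRY_AUTH_TOKEN;

async function syncRules(org: string, project: string): Promise<void> {
  for (const rule of alertRules) {
    // Create each rule on the project; a real version would diff against existing rules first
    const res = await fetch(`${SENTRY_BASE}/api/0/projects/${org}/${project}/rules/`, {
      method: 'POST',
      headers: {
        Authorization: `Bearer ${SENTRY_TOKEN}`,
        'Content-Type': 'application/json',
      },
      body: JSON.stringify(rule),
    });
    if (!res.ok) {
      throw new Error(`Failed to create rule "${rule.name}": ${res.status}`);
    }
  }
}
```

Not Terraform, but at least the rules would be reviewable in a PR.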

Monitor this over a week or two

I agree with this and think maybe it should be even longer. During this time, I think @kfrz, you, and I should set up a weekly rotation to manually monitor #vetsgov-sentry-alerts, where we respond to every alert in the channel. By "respond", I mean

Then, the on-call person can respond in a Slack thread to the Sentry message indicating what they did, and also make a count of these alerts and put them in a spreadsheet for further analysis.

cvalarida commented 5 years ago

PagerDuty integration

:+1: I hear you. I don't know that the alerts are super actionable right now anyhow, so I'm good with waiting for now.

Sentry alerts triage rotation

I dig it! We'll probably all be involved at some level regardless of whose turn it is, but rotating who's taking point is a great idea.


In other news, I don't know at what point we should consider this ticket closed... :grimacing:

omgitsbillryan commented 5 years ago

Looks to me like most of the boxes on the acceptance criteria can be marked complete. Most of the relevant information has been written in the comments on this GH issue. All that's left is to make a schedule for a few weeks. Such a doc may or may not be a .md in the triage team GitHub space; I can go either way since it's supposed to be a temporary thing.

kfrz commented 5 years ago

:heavy_plus_sign: to making Sentry alerts part of the configuration, though I don't see how this could work because events are quite unique...

Yeah, let's draw up a schedule for the arbiter-of-#vetsgov-sentry-alerts on Wednesday and put that into play? Also, maybe renaming that channel is an easy win... #vsp-sentry-alerts?

We're not going to be able to do any effective handing off beyond our team until we know who's responsible for what.