[Discovery] Integrate front-end app errors into OnCall process

department-of-veterans-affairs / va.gov-team

Public resources for building on and in support of VA.gov. Visit complete Knowledge Hub:

https://depo-platform-documentation.scrollhelp.site/index.html

280 stars 195 forks source link

[Discovery] Integrate front-end app errors into OnCall process #410

Closed minafarzad closed 5 years ago

minafarzad commented 5 years ago

User Story

As a VSP team member, I need to be alerted to application errors in a timely and accessible manner so I can respond to errors more efficiently and effectively.

Goal

There are a reasonable number of front-end application errors and they are integrated into VSP's existing OnCall processes and places

Acceptance Criteria

[ ] Answer the questions: -- what can we alert on? -- what should we alert on? -- what is a reasonable frontend error threshold (considering user impact) -- how can we format the alerts? -- how are engineers/on call folks currently using slack to get alerts? can we just integrate into that?
[ ] Make a recommendation to the Triage team for a solution
[ ] Documentation on the discovery is available in GitHub

Definition of Done

[ ] Attached to a Milestone (when will this be completed?)
[ ] Attached to an Epic (what body of work is this a part of?)
[ ] Labeled with Team (product support, analytics-insights, operations, triage, tools-improvements)
[ ] Labeled with Practice Area (backend, frontend, devops, design, research, product, ia, qa, analytics, call center, research, accessibility, content)
[ ] Labeled with Type (bug, request, discovery, documentation, etc.)

minafarzad commented 5 years ago

@cvalarida another helpful question in figuring this out -- who needs to be privy to these front end app errors? it's not just the triage team, right?

cvalarida commented 5 years ago

@minafarzad Right, yeah. I think the answer to that question is informed by the Triage team's role on the platform as a whole and how we use the alerts, but my gut reaction is to say yes. If they're actionable alerts to be (ideally) tied into our on-call alerting systems, then they should go to the appropriate on-call engineer, not just Triage.

That said, I feel like that's not the whole story. :point_up: I also don't know what that means yet. :sweat_smile:

cvalarida commented 5 years ago

How Sentry alerting works

We make alerting rules based on criteria and set up alert actions.

Example: This API alert rule

Will trigger when an issue appears for the first time
When it does trigger, it'll send a notification to all legacy integrations (Slack)

See the Sentry alerting rules documentation for more information.

Alert criteria

I couldn't find documentation listing all the alert criteria, so here they are from Sentry itself:

An event is seen
An issue is first seen
An issue changes state from resolved to unresolved
An issue changes state from ignored to unresolved
An event's tags match
- E.g. browser.name contains IE
An issue is seen more than times in
An issue is seen by more than users in
An event's
value
- E.g. The message contains EVSSError
An event's level is
- E.g. Greater than or equal to error

We also have the ability to throttle the alerts per issue.

Limitations

It doesn't look like we can customize the alert as it gets sent to Slack (or more specifically, all legacy integrations for the project). We can set a custom message in the integration itself, but it doesn't look like we can customize it with issue-specific information aside from what we already have in the tags. (Like @ing certain team members according to some rules we set up.)

Looks like we can also only send it to a single channel.

Ideas

Make another bot that acts as a switchboard in #vetsgov-sentry-alerts
- We'd write rules for this bot to direct alerts to the appropriate channels when it sees them come up in the all-encompassing alerts channel
- I'm envisioning a system wherein the bot would
- Look at the source tag (or similar on the api)
- Check the code path to see if it belongs to the application or if it's a platform error (not sure if that's possible with these alerts)
- Route the alert to the appropriate team
- Alternatively, the issue ownership could be handled in Sentry via the ownership rules, and the bot would look at that to route the alerts
- Might be better

cvalarida commented 5 years ago

To answer the questions:

What can we alert on? See the Alert criteria in the comment above

What should we alert on? I think it makes the most sense to get an alert when an issue is affecting a certain number of users per hour, throttled to hourly alerts.

What is a reasonable frontend error threshold (considering user impact) To start with, I think 100 users per hour.

How can we format the alerts? As I mentioned above, we don't really have much say say in this unless we make some kind of switchboard bot, but even then, we're limited to what Sentry will put in the Slack alert.

How are engineers/on call folks currently using slack to get alerts? can we just integrate into that? The PagerDuty alert @-mentions the on-call engineer for the team the product that's throwing the error belongs to (according to how we have it set up in PagerDuty).

Because we can't integrate with PagerDuty directly right now, we could maybe do something similar by standing up a bot that watches the #vetsgov-sentry-alerts channel, digests incoming alerts, and directs it to the right team.

I wonder if it would make sense to work on the actual Sentry -> PagerDuty plugin at that point, though, and contribute to the larger open source community while solving our problem

cvalarida commented 5 years ago

Sentry -> PagerDuty integration status

Looks like it's possible to use the latest version of the sentry-plugins before they release the next tag, but I'm not sure what else that will affect.

Source: https://github.com/getsentry/sentry-plugins/pull/469#issuecomment-492366995

cvalarida commented 5 years ago

I think given all the monitoring we've got for the infrastructure and VA backend dependencies, our goal with the front end application error reporting is to catch application errors specifically. Ideally, we'd be able to take the same approach as we have for our infrastructure alerts and use something like Prometheus to monitor the change over time. This would catch any spikes in errors, indicating something probably happened. I don't think we'll be able to do that with Sentry.

As such, I recommend the following approach (which we should definitely all agree on):

1) Alert to the Slack channel for errors that are affecting more than 100 users per hour 2) Monitor this over a week or two

Every day, we'll make a count of these alerts and put them in a spreadsheet for further analysis
This is to determine
- If the alert criteria is "good enough" (Actionable? Noisy? What would this do to our on-call engineers after we get the PagerDuty integration up and running?)
- How to refine it if it's not the right fit
- Any patterns we can take advantage of to weed out unactionable errors
- What kinds of errors we'll miss if we only alert for > 100 users / hour 3) Test out using the latest master from sentry-plugins (as mentioned above) to see if we can get the PagerDuty integration working
This is assuming we have control over what engineers get the page given certain criteria (e.g. source or path)
If we don't have this ability, I'm not sure it makes sense to integrate with PagerDuty (I can probably be convinced otherwise, though) 4) After the monitoring period is over, make a determination whether we can generate meaningful, actionable alerts 5) If so,

omgitsbillryan commented 5 years ago

The "switchboard bot" idea is an interesting one. I don't think we'd want/need that in the short run since the messages that are sent to #vetsgov-sentry-alerts are still noisy. If we can smooth those over, then it would be pretty cool to have the bot @mention the engineer who is oncall according to the PD schedule any time a new message from Sentry came into that channel. For a first pass at this, I don't even think it has to be smart and try to @mention the right person/team, let the #oncall engineer figure that out.

WRT Sentry-PagerDuty integration, it seems pretty alpha. We should have some items in the backlog to integrate it, but I don't that at this moment we should rely on that integration.

Is there any way to make our Sentry Alerting rules "declarative"? Similar to how our Terraform code represents our AWS infrastructure, it would be nice if our Sentry Alerts could live as code somewhere rather than just in the Sentry GUI.

Monitor this over a week or two

I agree with this and think maybe it should be even longer. During this time, I think @kfrz , you, and myself should setup a weekly rotation to manually monitor #vetsgov-sentry-alerts, where we respond to every alert in the channel. By, "respond", I mean

tweaking an alert rule
making code fix
mark as "ignored" or "resolved" in Sentry
create a GH issue for the Sentry issue

Then, the oncall person can respond in a slack-thread to the Sentry message indicating what they did and also,

make a count of these alerts and put them in a spreadsheet for further analysis

cvalarida commented 5 years ago

PagerDuty integration

:+1: I hear you. I don't know that the alerts are super actionable right now anyhow, so I'm good with waiting for now.

Sentry alerts triage rotation

I dig it! We'll probably all be involved at some level regardless of whose turn it is, but rotating who's taking point is a great idea.

In other news, I don't know at what point we should consider this ticket closed... :grimacing:

omgitsbillryan commented 5 years ago

Looks to me that most of the boxes on the acceptance criteria can be marked complete. Most of the relevant information has been written in the comments in this GH issue. All that's left is to make a schedule for a few weeks. Such a doc may or may not be a .md in the triage team github space - I can go either way since it's a supposed temporary thing.

kfrz commented 5 years ago

:heavy_plus_sign: to making Sentry alerts part of the configuration, though I don't see how this could work because events are quite unique...

Yeah, let's draw up a schedule for the arbiter-of-#vetsgov-sentry-alerts on Wednesday and put that into play? Also, maybe renaming that channel is an easy win... #vsp-sentry-alerts?

We're not going to be able to do any effective handing off beyond our team until we know who's responsible for what