minafarzad closed this issue 5 years ago
@cvalarida another helpful question in figuring this out -- who needs to be privy to these front end app errors? it's not just the triage team, right?
@minafarzad Right, yeah. I think the answer to that question is informed by the Triage team's role on the platform as a whole and how we use the alerts, but my gut reaction is to say yes. If they're actionable alerts to be (ideally) tied into our on-call alerting systems, then they should go to the appropriate on-call engineer, not just Triage.
That said, I feel like that's not the whole story. :point_up: I also don't know what that means yet. :sweat_smile:
We make alerting rules based on criteria and set up alert actions.
Example: This API alert rule
See the Sentry alerting rules documentation for more information.
I couldn't find documentation listing all the alert criteria, so here they are from Sentry itself:
- `browser.name` contains `IE`
- `message` contains `EVSSError`
- Level greater than or equal to `error`
We also have the ability to throttle the alerts per issue.
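For reference, a Sentry alert rule is just structured data under the hood. Here's a rough sketch of what a rule like the one above could look like as rule JSON; the condition/action IDs come from Sentry's open-source rule registry and the values are placeholders, so verify everything against our instance before depending on it:

```python
# Sketch of an alert rule as Sentry rule JSON. All values are hypothetical;
# the condition/action IDs mirror Sentry's open-source rule registry and
# should be verified against our Sentry instance.
alert_rule = {
    "name": "Frontend errors affecting many users",
    "actionMatch": "all",  # every condition must match
    "frequency": 60,       # throttle: at most one alert per issue per hour
    "conditions": [
        {
            "id": "sentry.rules.conditions.event_attribute.EventAttributeCondition",
            "attribute": "message",
            "match": "co",      # "contains"
            "value": "EVSSError",
        },
        {
            "id": "sentry.rules.conditions.level.LevelCondition",
            "match": "gte",
            "level": "40",      # numeric code for "error"
        },
    ],
    "actions": [
        {
            # Legacy Slack integration posting to #vetsgov-sentry-alerts
            "id": "sentry.rules.actions.notify_event_service.NotifyEventServiceAction",
            "service": "slack",
        }
    ],
}

def describe(rule):
    """One-line summary, e.g. for logging which rules we've configured."""
    return (f"{rule['name']}: {len(rule['conditions'])} conditions, "
            f"hourly throttle={rule['frequency'] == 60}")

print(describe(alert_rule))
```

Seeing the rule as data also makes the throttling knob explicit: `frequency` is the per-issue minimum interval in minutes.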
It doesn't look like we can customize the alert as it gets sent to Slack (or more specifically, to all legacy integrations for the project). We can set a custom message in the integration itself, but it doesn't look like we can customize it with issue-specific information aside from what we already have in the tags (like @-ing certain team members according to some rules we set up).
Looks like we can also only send it to a single channel: #vetsgov-sentry-alerts (keyed on the `source` tag, or similar on the API).

To answer the questions:
What can we alert on? See the alert criteria in the comment above.

What should we alert on? I think it makes the most sense to get an alert when an issue is affecting a certain number of users per hour, throttled to hourly alerts.

What is a reasonable frontend error threshold (considering user impact)? To start with, I think 100 users per hour.

How can we format the alerts? As I mentioned above, we don't really have much say in this unless we make some kind of switchboard bot, but even then, we're limited to what Sentry will put in the Slack alert.
How are engineers/on-call folks currently using Slack to get alerts? Can we just integrate into that?
The PagerDuty alert @-mentions the on-call engineer for the team that owns the product throwing the error (according to how we have it set up in PagerDuty).
Because we can't integrate with PagerDuty directly right now, we could maybe do something similar by standing up a bot that watches the #vetsgov-sentry-alerts channel, digests incoming alerts, and directs them to the right team.
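The routing core of such a bot could be quite small. A minimal sketch, assuming alerts carry a `source:`-style tag line in their text (that format, and all the team names here, are placeholders, not guaranteed Sentry output):

```python
import re

# Hypothetical routing table: Sentry "source" tag -> on-call Slack handle.
# All names are placeholders; the real mapping would come from PagerDuty.
ROUTES = {
    "claims": "@oncall-claims",
    "facilities": "@oncall-facilities",
}
DEFAULT_ROUTE = "@frontend-triage"

def route_alert(slack_message: str) -> str:
    """Pick who to @-mention for a Sentry alert posted to the channel.

    Assumes the alert text includes a line like `source: claims`; that is
    an assumption about our tag conventions, not guaranteed Sentry output.
    """
    match = re.search(r"source:\s*(\S+)", slack_message)
    team = match.group(1) if match else None
    return ROUTES.get(team, DEFAULT_ROUTE)

print(route_alert("EVSSError ... source: claims"))  # -> @oncall-claims
print(route_alert("TypeError: x is undefined"))     # -> @frontend-triage
```

Anything without a recognizable tag falls back to a default handle, which matches the "let the on-call engineer figure it out" first pass discussed below.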
Looks like it's possible to use the latest version of `sentry-plugins` before they release the next tag, but I'm not sure what else that will affect.
Source: https://github.com/getsentry/sentry-plugins/pull/469#issuecomment-492366995
I think given all the monitoring we've got for the infrastructure and VA backend dependencies, our goal with the front end application error reporting is to catch application errors specifically. Ideally, we'd be able to take the same approach as we have for our infrastructure alerts and use something like Prometheus to monitor the change over time. This would catch any spikes in errors, indicating something probably happened. I don't think we'll be able to do that with Sentry.
As such, I recommend the following approach (which we should definitely all agree on):
1) Alert to the Slack channel for errors that are affecting more than 100 users per hour
2) Monitor this over a week or two
3) Use `master` from `sentry-plugins` (as mentioned above) to see if we can get the PagerDuty integration working, keyed on the `source` (or `path`) tag

The "switchboard bot" idea is an interesting one. I don't think we'd want/need that in the short run since the messages that are sent to #vetsgov-sentry-alerts
are still noisy. If we can smooth those over, then it would be pretty cool to have the bot @mention the engineer who is oncall according to the PD schedule any time a new message from Sentry
came into that channel. For a first pass at this, I don't even think it has to be smart and try to @mention the right person/team, let the #oncall engineer figure that out.
WRT the Sentry-PagerDuty integration, it seems pretty alpha. We should have some items in the backlog to integrate it, but I don't think we should rely on that integration at this moment.
Is there any way to make our Sentry Alerting rules "declarative"? Similar to how our Terraform code represents our AWS infrastructure, it would be nice if our Sentry Alerts could live as code somewhere rather than just in the Sentry GUI.
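One possible shape for this: keep the rules in a checked-in JSON/YAML file and push them through Sentry's Web API. A minimal sketch; the `/projects/{org}/{project}/rules/` path exists in Sentry's API but was lightly documented at the time, so treat the endpoint, and the org/project/token values here, as assumptions to verify:

```python
import json

# Placeholder base URL for our Sentry instance.
SENTRY_API = "https://sentry.example.gov/api/0"

def build_rule_request(org: str, project: str, token: str, rule: dict):
    """Return (url, headers, body) for creating one alert rule.

    The /rules/ endpoint path is an assumption based on Sentry's Web API;
    verify it against our Sentry version before using this for real.
    """
    url = f"{SENTRY_API}/projects/{org}/{project}/rules/"
    headers = {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    }
    return url, headers, json.dumps(rule)

# Rules would live in a version-controlled file, one entry per rule:
rules = [{"name": "High-impact frontend errors", "frequency": 60,
          "conditions": [], "actions": [], "actionMatch": "all"}]

for rule in rules:
    url, headers, body = build_rule_request("vets-gov", "vets-website",
                                            "TOKEN", rule)
    print("POST", url)  # an HTTP client (e.g. requests) would send this
```

Running that script from CI on merge would give us roughly the Terraform-style workflow: review the rule change in a PR, then apply it.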
Monitor this over a week or two
I agree with this and think maybe it should be even longer. During this time, I think @kfrz, you, and myself should set up a weekly rotation to manually monitor #vetsgov-sentry-alerts, where we respond to every alert in the channel. By "respond", I mean the on-call person can:
- reply in a Slack thread to the Sentry message indicating what they did, and
- make a count of these alerts and put them in a spreadsheet for further analysis
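The manual tally could eventually be scripted. A minimal sketch, assuming we can export the channel's alert titles one per line (that input format, and the function names, are hypothetical):

```python
import csv
from collections import Counter
from io import StringIO

def tally_alerts(alert_lines):
    """Count alerts per issue title from a dump of channel messages
    (one alert title per line; the input format is an assumption)."""
    return Counter(line.strip() for line in alert_lines if line.strip())

def to_csv(counts):
    """Render the tally as CSV for pasting into the analysis spreadsheet."""
    out = StringIO()
    writer = csv.writer(out)
    writer.writerow(["issue", "alerts"])
    for issue, n in counts.most_common():
        writer.writerow([issue, n])
    return out.getvalue()

counts = tally_alerts(["EVSSError", "EVSSError", "TypeError: x is undefined"])
print(to_csv(counts))
```

For the first few weeks the by-hand spreadsheet is probably fine; this only becomes worth automating if the alert volume stays high.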
:+1: I hear you. I don't know that the alerts are super actionable right now anyhow, so I'm good with waiting for now.
I dig it! We'll probably all be involved at some level regardless of whose turn it is, but rotating who's taking point is a great idea.
In other news, I don't know at what point we should consider this ticket closed... :grimacing:
Looks to me like most of the boxes in the acceptance criteria can be marked complete. Most of the relevant information has been written in the comments on this GH issue. All that's left is to make a schedule for a few weeks. Such a doc may or may not be a .md in the triage team GitHub space; I can go either way since it's supposed to be a temporary thing.
:heavy_plus_sign: to making Sentry alerts part of the configuration, though I don't see how this could work because events are quite unique...
Yeah, let's draw up a schedule for the arbiter-of-#vetsgov-sentry-alerts on Wednesday and put that into play? Also, maybe renaming that channel is an easy win... #vsp-sentry-alerts?
We're not going to be able to do any effective hand-off beyond our team until we know who's responsible for what.
User Story
As a VSP team member, I need to be alerted to application errors in a timely and accessible manner so I can respond to errors more efficiently and effectively.
Goal
There are a reasonable number of front-end application errors and they are integrated into VSP's existing OnCall processes and places
Acceptance Criteria
Definition of Done