Closed: shanear closed this issue 7 years ago
Here's the Prod Squad survival kit. Everyone assigned to Prod Squad must have the following ready.
AWS Console Access
Prod Access
SSH Access (with a shortcut configured on their laptop)
Know how to quickly verify that the offending application is healthy
Know how to find logs quickly (through awslogs)
Know how to get to Sentry and Grafana
Know the basic setup for each app
For app errors (e.g. 500s), GitHub issues need to be filed for traceability
Non-app errors are usually more serious (revproxy downtime, CSR failure, VA backend failure). For these, they need to read up on the escalation plan and know when to broadcast the problem to the devOps team.
My experience in the past 3-4 weeks with Prod Squad is that we are not preparing our teammates correctly.
In terms of who can be on call, the contract says “Normal Business Hours”. Agreed with Matt to give contractors an opt-out if it's out of hours.
Agree with @ToddStumpf that we can set the rotation up so contractors are only paged during business hours, and have DS on call for extended hours.
I think the best way to learn is by doing it. Whenever there is an issue, Alan and Shane should try not to get involved and let whoever is on call figure it out first. However, some developers have very minimal devops experience. We could set up PagerDuty so that these developers have a more experienced secondary on call who can guide them through.
@ariperez what is considered "Normal Business Hours"?
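A minimal sketch of the rotation rule discussed above (contractors paged only during business hours, everyone else eligible at any time). The 9:00-17:00 Monday-Friday window is an assumption, since the thread has not yet defined "Normal Business Hours", and the names in the example are hypothetical. Times are treated as naive local times.

```python
from datetime import datetime, time

# Assumed window; replace once "Normal Business Hours" is actually defined.
BUSINESS_START = time(9, 0)
BUSINESS_END = time(17, 0)


def is_business_hours(moment: datetime) -> bool:
    """True on Monday-Friday from BUSINESS_START (inclusive) to BUSINESS_END (exclusive)."""
    is_weekday = moment.weekday() < 5  # Monday is 0, Sunday is 6
    return is_weekday and BUSINESS_START <= moment.time() < BUSINESS_END


def eligible_on_call(people, moment: datetime):
    """Return the names that may be paged at `moment`.

    `people` is a list of (name, is_contractor) tuples.
    Contractors are only eligible during business hours.
    """
    in_hours = is_business_hours(moment)
    return [name for name, is_contractor in people if in_hours or not is_contractor]


# Hypothetical roster for illustration only.
roster = [("contractor_a", True), ("ds_b", False)]
print(eligible_on_call(roster, datetime(2024, 1, 2, 22, 0)))
```

Keeping the window in two constants means the answer to the question above only has to be changed in one place.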
The more we can automate, the happier we will be. Any process around incidents that isn't automated should be examined for diminishing returns -- is it better to be handling the process, or fixing the incident?
So, it's good to have the playbooks available in some always-available place -- PD itself uses S3. That may be something we want to consider. The playbooks will undoubtedly have IPs and machine names in them, so we need to keep 'em secure-ish. Is there anyplace better than S3 to put 'em, offhand?
The ultimate form of cheat sheet for plays should be a link to the matching play in the playbook in the body of each PD alert (in the description, the details, or both). If we can keep the playbook next to the Prometheus alerts, that's the easiest way to keep the plays in sync with the alerts -- add an alert? where's the play?
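A rough sketch of what "a link to the matching play in each PD alert" could look like when building a trigger event by hand. The field names follow the PagerDuty Events API v2 event format as I understand it (`routing_key`, `event_action`, `payload`, `links`) and should be checked against the PagerDuty docs; the playbook base URL, the alert name, and the helper function are hypothetical. Nothing is sent over the network here.

```python
import json

# Hypothetical location; the thread has not decided where the playbook lives.
PLAYBOOK_BASE_URL = "https://playbook.example.invalid/plays"


def build_pd_event(routing_key, alert_name, summary, source, severity="critical"):
    """Build a trigger event whose details and links both point at the matching play."""
    play_url = f"{PLAYBOOK_BASE_URL}/{alert_name}"
    return {
        "routing_key": routing_key,
        "event_action": "trigger",
        "payload": {
            "summary": summary,
            "source": source,
            "severity": severity,
            # Shown in the alert details.
            "custom_details": {"play": play_url},
        },
        # Shown as a clickable link on the incident.
        "links": [{"href": play_url, "text": f"Play: {alert_name}"}],
    }


event = build_pd_event(
    routing_key="YOUR_INTEGRATION_KEY",  # placeholder, not a real key
    alert_name="revproxy_down",          # hypothetical alert name
    summary="Revproxy is not responding",
    source="prometheus",
)
print(json.dumps(event, indent=2))
```

Deriving the play URL from the alert name is one way to make "add an alert? where's the play?" a mechanical check: an alert without a page at that URL has no play yet.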
FWIW, PD itself has a pretty good discussion of oncall processes: https://github.com/PagerDuty/incident-response-docs
Changes made to Prod squad doc: https://github.com/department-of-veterans-affairs/appeals-pm/pull/1268
closing
More changes made to the prod squad doc. Closing again. :smile:
Prod squad has been going down for almost a month now, and it looks like there are some ways we can better equip everyone to be ready when they're on the squad.
Let's discuss and make the appropriate improvements.
Action Items
When a new issue comes up in Sentry
1) Create a GitHub issue or match the Sentry error to an existing issue
2) Add the link to Sentry in the issue
3) Add the link to the issue in Sentry
4) Resolve the issue in Sentry
5) Move the issue to the current sprint and assign it to the tech lead of that swim lane (and ping them on Slack in #appeals-engineering)
Add urls and cheat sheet to prod-squad playbook
postpone discussion on how to communicate with support team on issues. Until then, follow your heart ❤️
do recurrences of Sentry errors cause the issue to become unresolved? (testing with this one: https://sentry.ds.va.gov/department-of-veterans-affairs/caseflow-certification/issues/121/)
Pager Duty Schedule: Primary -> 5m -> Secondary -> 30m -> Tech Leads
Resolve as long as you know what you are doing. Post on #appeals-devops if you need help.
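To make the Pager Duty schedule in the action items concrete, here is a small sketch of who would have been paged after an alert has gone unacknowledged for a given number of minutes. It assumes each delay counts from the moment the previous level was paged (so Tech Leads are reached at 5 + 30 = 35 minutes); if the 30m was meant to count from the original alert, the numbers change. The function name is illustrative only.

```python
# (level, minutes to wait after the previous level was paged)
ESCALATION_STEPS = [
    ("Primary", 0),
    ("Secondary", 5),
    ("Tech Leads", 30),
]


def paged_levels(minutes_unacknowledged):
    """Return every level that has been paged after the given unacknowledged time."""
    paged = []
    page_time = 0  # minutes after the alert at which the current level is paged
    for level, delay in ESCALATION_STEPS:
        page_time += delay
        if minutes_unacknowledged >= page_time:
            paged.append(level)
    return paged


for minutes in (0, 5, 34, 35):
    print(minutes, paged_levels(minutes))
```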