department-of-veterans-affairs / caseflow

Caseflow is a web application that enables the tracking and processing of appealed claims at the Board of Veterans' Appeals.

Prod Squad Improvements #1352

Closed shanear closed 7 years ago

shanear commented 7 years ago

Prod squad has been going on for almost a month now, and it looks like there are some ways we can better equip everyone to be ready when they're on the squad.

Let's discuss and make the appropriate improvements.

Action Items

Add URLs and cheat sheet to prod-squad playbook

askldjd commented 7 years ago

Here's the Prod Squad survival kit. Everyone assigned to Prod Squad must have the following ready.

My experience in the past 3-4 weeks with Prod Squad is that we are not preparing our teammates correctly.

ariperez-gov commented 7 years ago

In terms of who can be on call, the contract says “Normal Business Hours”. I agreed with Matt to give contractors an opt-out if it's outside those hours.

Agree with @ToddStumpf that we can set the rotation up so contractors are only paged during business hours, and have DS on call for the extended hours.

anyakhvost commented 7 years ago

I think the best way to learn is by doing. Whenever there is an issue, Alan and Shane should try not to get involved and let whoever is on call figure it out first. However, some developers have very minimal devops experience. We could set up PagerDuty so that these developers have a more experienced secondary on call who can guide them through.

anyakhvost commented 7 years ago

@ariperez what is considered "Normal Business Hours"?

shanear commented 7 years ago

Add URLs and cheat sheet to prod-squad playbook

ToddStumpf commented 7 years ago

The more we can automate, the happier we will be. Any process around incidents that isn't automated should be examined for diminishing returns: is it better to be handling the process, or fixing the incident?

So, it's good to have the playbooks available in some always-available place. PD itself uses S3. That may be something we want to consider. The playbooks will undoubtedly have IPs and machine names, so we need to keep 'em secure-ish. Is there anywhere better than S3 to put 'em, offhand?

The ultimate form of cheat sheet for plays should be a link to the matching play in the playbook in the body of each PD alert (in the description, the details, or both). If we can keep the playbook next to the Prometheus alerts, that's the easiest way to keep the plays next to the alerts: add an alert? Where's the play?
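As a rough illustration of the "play link in every alert" idea, here is a minimal sketch that builds a PagerDuty Events API v2 trigger event carrying the play link in the summary, the details, and the `links` list. The playbook URL, the anchor scheme (lowercased alert name), and the alert name used in the usage note are all assumptions made for illustration; nothing in this thread says where the playbook lives or how plays are named. The function only builds the event body and does not send it.

```python
from urllib.parse import quote

# Hypothetical playbook location; the real URL is not given in this thread.
PLAYBOOK_BASE_URL = "https://example.com/prod-squad-playbook"


def play_url(alert_name: str) -> str:
    """Return the playbook link for an alert (assumes one anchor per alert name)."""
    return f"{PLAYBOOK_BASE_URL}#{quote(alert_name.lower())}"


def build_pd_event(routing_key: str, alert_name: str, summary: str, source: str) -> dict:
    """Build an Events API v2 trigger event that points at the matching play."""
    url = play_url(alert_name)
    return {
        "routing_key": routing_key,
        "event_action": "trigger",
        "payload": {
            # Put the link in the summary so it shows up in the page itself.
            "summary": f"{summary} (play: {url})",
            "source": source,
            "severity": "critical",
            # Repeat it in the details for whoever opens the incident.
            "custom_details": {"alert": alert_name, "play": url},
        },
        "links": [{"href": url, "text": f"Play for {alert_name}"}],
    }
```

For example, `build_pd_event("ROUTING_KEY", "HighMemoryUsage", "Memory above 90%", "app-1")` would produce an event whose link points at the `#highmemoryusage` anchor of the assumed playbook URL. If the alert has no play yet, the dead link makes the gap obvious, which is the point of keeping plays next to alerts.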

FWIW, PD itself has a pretty good discussion of oncall processes: https://github.com/PagerDuty/incident-response-docs

shanear commented 7 years ago

Changes made to Prod squad doc: https://github.com/department-of-veterans-affairs/appeals-pm/pull/1268

closing

nickheiner-usds commented 7 years ago

More changes made to the prod squad doc. Closing again. :smile: