department-of-veterans-affairs / va.gov-team

Public resources for building on and in support of VA.gov. Visit complete Knowledge Hub:
https://depo-platform-documentation.scrollhelp.site/index.html

Outage caused by certificate change during SLA hours #1261

Closed by wunderhund 5 years ago

wunderhund commented 5 years ago

Please read the Triage Rules of Engagement for instructions on what types of issue should be submitted using this template.

Status

UNRESOLVED

Severity Analysis

Estimate of how critical this bug is

User Impact

Description

Proposed Solution

Timeline

Background: the old lower-environment certificates for dev-api.va.gov, staging-api.va.gov, and so on were due to expire on August 30, 2019, so they needed to be replaced with new certificates.

##### July 15
- dev-devel: INC6949035
- dev: INC6948888
- staging-api: INC6949284
- staging-devel: INC6949350
- staging: INC6949112

- David Young from Gateway Ops updated INC6935855 to say `currently awaiting the password from the requester in an encrypted format.`
##### August 8
- Ryan Watson supplied the new dev-api.va.gov cert and its password to the Gateway Ops team.
##### August 12
- David Young from Gateway Ops reports that he has pre-staged the new certificates for dev-api.va.gov on the F5 load balancers that serve the Gateway West and Gateway East TIC internet gateways.
##### August 20
- Frank Hopkins from Gateway Ops posts in the ticket to report that the maintenance is scheduled for 8/21/19 at 5pm ET. He also reports that a meeting invite was sent out, but it's not clear to whom.
##### August 21
- `5:20pm ET`: Zachary Jones from Gateway Ops reports in the ticket that the certificate has been updated successfully.
- `6:03pm ET`: Shawnee Petrosky sends a message to Craig Butler on the Lighthouse API Platform Infrastructure team reporting that a certificate change caused an outage for the Health APIs from 08/21/2019 05:01:59 PM ET to 08/21/2019 05:15:59 PM ET. The error received for dev-api.va.gov was:
  ```
  Our Pingdom monitoring is alarming on an invalid certificate and running dev-api.va.gov through SSL Labs is noting that "This server's certificate chain is incomplete. Grade capped to B."
  ```
- `6:39pm ET`: Craig Butler acknowledges this message and begins troubleshooting.
- `7:41pm ET`: Craig Butler identifies that the problem is likely a certificate change by the Gateway Ops team due to the upcoming expiration of the old dev-api.va.gov certificate.
- `8:07pm ET`: Craig Butler confirms to Shawnee Petrosky that the public-facing dev-api.va.gov certificate was changed by the Gateway Ops team during the period in question and that the ServiceNow ticket documenting the work is INC6935855.
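
The incomplete-chain error reported above is the kind of problem a strict TLS handshake would typically surface right after a certificate swap. What follows is a minimal sketch of such a check using only the Python standard library. It is not the monitoring that any of the teams involved actually run, and the function names `days_until_expiry` and `check_certificate` are hypothetical, chosen for illustration.

```python
import socket
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Return the number of days between `now` and a certificate's notAfter time.

    `not_after` uses the format returned by SSLSocket.getpeercert(),
    for example "Aug 30 00:00:00 2019 GMT". `now` must be timezone-aware.
    """
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).total_seconds() / 86400


def check_certificate(host: str, port: int = 443, timeout: float = 10.0) -> dict:
    """Hypothetical helper: handshake with full verification and report the result."""
    # The default context verifies both the certificate chain and the hostname.
    context = ssl.create_default_context()
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=host) as tls:
                cert = tls.getpeercert()
    except ssl.SSLCertVerificationError as exc:
        # Raised when verification fails, e.g. a missing intermediate certificate.
        return {"host": host, "ok": False, "error": exc.verify_message}
    except OSError as exc:
        # Covers connection failures, timeouts, and other TLS errors.
        return {"host": host, "ok": False, "error": str(exc)}
    remaining = days_until_expiry(cert["notAfter"], datetime.now(timezone.utc))
    return {"host": host, "ok": True, "days_left": remaining}
```

Calling `check_certificate("dev-api.va.gov")` shortly after a maintenance window would give a quick pass/fail signal plus the days left before expiry. Because Python's verification does not fetch missing intermediates on its own, a server that omits part of its chain would be expected to fail here even if some browsers still accept it.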

cvalarida commented 5 years ago

Looks like this was an on-call incident, not a triage incident. (The naming is confusing, I admit.)

@ricetj Is this a thing for the operations team to take over?

ricetj commented 5 years ago

> Looks like this was an on-call incident, not a triage incident. (The naming is confusing, I admit.)
>
> @ricetj Is this a thing for the operations team to take over?

Hello @cvalarida, we are short-handed right now, so I'm not sure if we could address this anytime soon. So on-call incidents that have unresolved issues are not run through triage?

alexpappasoddball commented 5 years ago

@omgitsbillryan Can you take a look at this and provide feedback?