scleary1cs closed 2 weeks ago
Even with PagerDuty (PD), the Azure Alerts for Errors work will still need to take place.
Comms, on duty rotation.
Pulling Azure Alerts for Errors work into a different story.
Putting this back into the backlog because we got positive movement on PagerDuty. If, after a week, there is no progress, I'll put this into progress again.
No progress. Pulling this back in.
Sent a couple of messages in Slack saying that I think we're deciding to use our Slack alerts channel and a Google team calendar at the very least.
Talked with the engineers during engineer block. No one dissented.
We're moving forward with this in #1457.
Spike Goal
Let's pretend we can't use PagerDuty before we want to go live with California. How can we be notified of an issue that PagerDuty would normally notify us about?
The scope is to identify a plausible alternative that works well enough for now.
Timebox
Unknown.
Notes
Our contract (the RFQ specifically) states that product support will be provided during business hours, but we will also respond to emergency outages/issues that occur outside normal business hours.
No central CDC group (OCIO, OPHDST, etc.) has on-call software for others to use.
I've heard, informally, that a lot (most?) of the software run by the CDC is not covered after-hours or by on-call software. Hence there is next to no guidance from the CDC. They just aren't in the practice.
We've been pursuing access to ReportStream's PagerDuty, but others at OPHDST (or some other authority in the CDC) have stopped us from getting access. ReportStream has been grandfathered into it. It isn't clear what the grievance is. Is it because of some security compliance issue with PagerDuty, because someone just doesn't like PagerDuty, doesn't like the on-call class of software in general, doesn't like after-hours support of software, or only likes it if it is absolutely free? We have no idea.
We've been told by our COR that there is nothing in our contract that allows us to bill the CDC for software. That said, after talking with Boris, there is this thing called Other Direct Costs (ODC) in a contract, and given that our contract does state that we should respond after-hours to emergencies, we could try to convince our COR to approve funds for PagerDuty. We're exploring this in parallel.
Our immediate CDC product owners have been working on an SLA that does not include after-hours responses to emergencies. This conflicts with our contract, but if this is what the CDC wants, we can do it. Plus, the CDC sure isn't doing its part to help us respond to after-hours emergencies anyway (see the previous two paragraphs).
Regardless, we are not getting access to this pre-existing PagerDuty account anytime soon. What do we do in the meantime? The sections below outline some thinking on alternatives.
Requirements
Thoughts
Free Tier PagerDuty
This only supports up to 5 users. We would have more than 5 users in an on-call rotation, so this wouldn't work.
Free Tier Splunk On-Call
This only lasts 14 days, and we're going to need it longer than that.
Lots of other PagerDuty SaaS alternatives
They all seem to either support only a subset of our team for free or be free for less than a month.
Solutions that are self-hosted
This requires some effort to set up manually.
We could pick a solution that meets all the on-call requirements above, but it wouldn't be free. There is a cost to hosting. There are some free offerings by cloud providers, but there are additional parts that would still cost money to make the solution work completely. Not only that, but this solution would require additional developer time and effort to maintain, and our time and effort definitely aren't free.
Even so, maybe this is a way around the limitation our COR states is in our contract, and a way around being blocked from the pre-existing PagerDuty account. However, it would require setting up accounts with e-mail and phone/SMS SaaS services that charge money outside of Azure, and therefore we would want to bill the CDC.
Send notification to our alerts Slack channel
This doesn't require a lot of effort. It's easy to set up.
There are a lot of caveats to this working after-hours. This option is nearly one-to-one with not responding to after-hours emergencies, which basically aligns with the SLA as we last saw it.
When we finally migrate to PagerDuty, all we would need to do is change the e-mail address that the Azure alerts are pointing to: they would e-mail our PagerDuty e-mail address instead of the e-mail address associated with the alerts Slack channel.
I'm drawn to this solution because it is easy to start using and requires little effort to migrate away. I propose we create a subsequent backlog task to iron out the details and build out this solution.
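The migration step described above (re-pointing the alerts from the Slack channel's e-mail address to a PagerDuty one) can be sketched roughly as follows. This is only an illustration, not our actual Azure configuration: `AlertRoute`, `repoint_alerts`, the alert names, and both e-mail addresses are hypothetical placeholders, and the real change would be made wherever the Azure alerts define their e-mail recipient.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class AlertRoute:
    """Hypothetical stand-in for one Azure alert and the e-mail address it notifies."""
    alert_name: str
    receiver_email: str


def repoint_alerts(routes, old_email, new_email):
    """Return a new list where every route that notified old_email notifies new_email."""
    return [
        replace(route, receiver_email=new_email)
        if route.receiver_email == old_email
        else route
        for route in routes
    ]


# Placeholder addresses (assumptions); the real Slack and PagerDuty addresses differ.
SLACK_CHANNEL_EMAIL = "alerts-channel@example.com"
PAGERDUTY_EMAIL = "on-call@example.com"

routes = [
    AlertRoute("api-5xx-errors", SLACK_CHANNEL_EMAIL),
    AlertRoute("database-unreachable", SLACK_CHANNEL_EMAIL),
]

# Migrating is a single substitution; the alert definitions themselves do not change.
migrated = repoint_alerts(routes, SLACK_CHANNEL_EMAIL, PAGERDUTY_EMAIL)
for route in migrated:
    print(f"{route.alert_name} -> {route.receiver_email}")
```

The point of the sketch is that the recipient address is the only thing that differs between the interim Slack setup and the eventual PagerDuty setup, which is why migrating away should be cheap.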
Other Solutions or Ideas?
What other solution am I forgetting?