CDCgov / trusted-intermediary

Bringing together healthcare providers by reducing the connection burden.
Apache License 2.0

PagerDuty Alternatives #1372

Closed scleary1cs closed 2 weeks ago

scleary1cs commented 1 month ago

Spike Goal

Let's pretend we can't use PagerDuty before we want to go live with California. How can we be notified of an issue that PagerDuty would normally notify us about?

The scope is to identify a plausible alternative that works well enough for now.

Timebox

Unknown.

Notes

Our contract (the RFQ specifically) states that product support will be provided during business hours, but we will also respond to emergency outages/issues that occur outside normal business hours.

No central CDC group (OCIO, OPHDST, etc.) has on-call software for others to use.

I've heard, informally, that a lot (most?) of the software run by the CDC is not covered after-hours or by on-call software. Hence there is next to no guidance from the CDC. They just aren't in the practice.

We've been pursuing access to ReportStream's PagerDuty, but someone at OPHDST (or some other authority in the CDC) has stopped us from getting access. ReportStream has been grandfathered into it. It isn't clear what the grievance is. Is it because of some security compliance issue with PagerDuty, someone just doesn't like PagerDuty, someone doesn't like the on-call class of software in general, someone doesn't like after-hours support of software, or someone only likes it if it is absolutely free? We have no idea.

We've been told by our COR that there is nothing in our contract that allows us to bill the CDC for software. That said, after talking with Boris, there is this thing called Other Direct Costs (ODC) in a contract, and given that our contract does state that we should respond after-hours to emergencies, we could try to convince our COR to provide funds for PagerDuty. We're exploring this in parallel.

Our immediate CDC product owners have been working on an SLA that does not include after-hours responses to emergencies. This conflicts with our contract, but if this is what the CDC wants, we can do it. Plus, the CDC sure isn't doing its part to help us respond to after-hours emergencies anyway (see the previous two paragraphs).

Regardless, we are not getting access to this pre-existing PagerDuty account anytime soon. What do we do in the meantime? Below is some thinking on alternatives.

Requirements

  1. Free. If we included items that weren't free, we'd just pay for PagerDuty. I feel this is the most important requirement. The subsequent requirements are also important, but can be skimped on depending on the situation.
  2. Does it notify us in a manner that wakes us up in the middle of the night like PagerDuty can?
  3. Does it notify us in a way alarming enough that it gets our attention immediately like PagerDuty can?
  4. Does it support notification escalation? I.e. it will notify a backup if someone doesn't respond soon enough.
  5. Does it support on-call schedule rotations?

Thoughts

Free Tier PagerDuty

This only supports up to 5 users. We would have more than 5 users in an on-call rotation, so this wouldn't work.

Free Tier Splunk On-Call

This only lasts 14 days, and we're going to need it longer than that.

Lots of other PagerDuty SaaS alternatives

They all seem to either support only a subset of our team for free or last less than a month for free.

Solutions that are self-hosted

This requires some effort to set up manually.

We could pick a solution that meets all the on-call requirements above, but it wouldn't be free. There is a cost to hosting. There are some free offerings by cloud providers, but there are additional parts that would still cost money to make the solution work completely. Not only that, but this solution would require additional developer time and effort to maintain and our time and effort definitely aren't free.

Even so, maybe this is a way around the limitation our COR states is in our contract, and a way around being blocked from the pre-existing PagerDuty account. However, it would require setting up accounts with e-mail and phone/SMS SaaS services that charge money outside of Azure, and therefore we would want to bill the CDC.

Send notification to our alerts Slack channel

This doesn't require a lot of effort. It's easy to set up.

  1. It is free to send an e-mail to our Slack channel.
  2. It doesn't exactly notify us in the middle of the night. Even with alerts and notifications kept on throughout the night, I don't think a Slack notification sound will wake someone up. Maybe with enough work, you could change the Slack sound to be an alarming sound that would wake you up.
  3. It could get our attention immediately in the middle of the day depending on how each person sets up Slack notifications and how much they pay attention to Slack notifications. One would hope that the on-call engineers are paying attention to Slack notifications.
  4. It doesn't necessarily support notification escalation, but others will be able to view the notification and act on it if necessary. There's also the other buddy on-call engineer.
  5. It definitely doesn't support on-call rotations. This could be handled through a new Google team calendar with repeating events.

There are a lot of caveats to this working after-hours. This option is nearly one-to-one with not responding to after-hours emergencies, which basically aligns with the SLA as we last saw it.

When we finally migrate to PagerDuty, all we would need to do is change the e-mail address that the Azure alerts are pointing to. We would e-mail our PagerDuty e-mail address instead of the e-mail address associated with the alerts Slack channel.

I'm drawn to this solution because it is easy to start using and requires little effort to migrate away. I propose we create a subsequent backlog task to iron out the details and build out this solution.
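
As a rough sketch of the rotation that the repeating calendar events would encode, the snippet below works out who is on call for a given week. Everything in it is an assumption for illustration: the weekly cadence, the idea that the buddy is simply the next person in the list, and the names `on_call_pair` and `upcoming_shifts` are all hypothetical and not something we have decided on.

```python
from datetime import date, timedelta


def on_call_pair(engineers, rotation_start, day):
    """Return (primary, buddy) for the week that contains `day`.

    Assumes weekly shifts that begin on `rotation_start`, and that the
    buddy is the next engineer in the list (hypothetical convention).
    With fewer than two engineers the buddy equals the primary.
    """
    if not engineers:
        raise ValueError("need at least one engineer")
    if day < rotation_start:
        raise ValueError("day is before the rotation starts")
    weeks_elapsed = (day - rotation_start).days // 7
    primary = engineers[weeks_elapsed % len(engineers)]
    buddy = engineers[(weeks_elapsed + 1) % len(engineers)]
    return primary, buddy


def upcoming_shifts(engineers, rotation_start, weeks):
    """List (shift_start, primary, buddy) for the next `weeks` weeks."""
    shifts = []
    for week in range(weeks):
        shift_start = rotation_start + timedelta(weeks=week)
        primary, buddy = on_call_pair(engineers, rotation_start, shift_start)
        shifts.append((shift_start, primary, buddy))
    return shifts


if __name__ == "__main__":
    # Placeholder names, not the real team.
    team = ["engineer-a", "engineer-b", "engineer-c"]
    for shift_start, primary, buddy in upcoming_shifts(team, date(2024, 1, 1), 4):
        print(shift_start.isoformat(), primary, buddy)
```

The output of something like this could be copied into the Google team calendar by hand. It only covers requirement 5; it does nothing for escalation or for waking anyone up.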

Other Solutions or Ideas?

What other solution am I forgetting?

scleary1cs commented 1 month ago

Even with PD, Azure Alerts for Errors work will need to take place.

scleary1cs commented 1 month ago

Comms, on duty rotation.

scleary1cs commented 1 month ago

Pulling Azure Alerts for Errors work into a different story.

halprin commented 1 month ago

Putting this back into the backlog because we got positive movement on PagerDuty. If, after a week, there is no progress, I'll put this into progress again.

halprin commented 1 month ago

No progress. Pulling this back in.

scleary1cs commented 1 month ago

Feedback link in slack.

halprin commented 1 month ago

Sent a couple of messages in Slack that I think we're deciding to use our Slack alerts channel and a Google team calendar at the very least.

halprin commented 1 month ago

Talked with the engineers during engineer block. No one dissented.

halprin commented 1 month ago

We're moving forward with this in #1457.