Apply SRE principles to build a reliable Platform Service

A placeholder for us to capture concepts from Google's SRE course and share ideas for how they can be applied to the Platform Service.

Question: What should the reliability target be for the OpenShift Platform Service? 99.5% for the Silver cluster and 99.99% for the Gold cluster. DXC provides a 99.5% uptime SLA for the infrastructure. Unrealistic reliability goals are unattractive and unachievable, so NO to 100% uptime!

Error budget - the complement of the reliability target (100% minus the target); it tells how unreliable the service is allowed to be. For example, allowing 0.1% of requests to fail (hardware failures, maintenance activities) gives 0.1% unavailability * 28 days = 40.32 mins per month, which is roughly 1 issue to detect and 1 human to investigate and fix. The error budget should be spent on rolling out new features which may break things, planned downtime, and inevitable failures such as hardware faults and network or power outages. It helps strike a balance between innovation and reliability.
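The arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function names and the two outage durations (12 and 20.5 minutes) are made up for the example and are not measurements from the Platform Service.

```python
def error_budget_minutes(slo: float, window_days: int) -> float:
    """Allowed downtime in minutes for an availability target over a window."""
    window_minutes = window_days * 24 * 60
    return (1.0 - slo) * window_minutes


def remaining_budget_minutes(slo: float, window_days: int, outages_minutes: list[float]) -> float:
    """Budget left after subtracting the downtime already incurred in the window."""
    return error_budget_minutes(slo, window_days) - sum(outages_minutes)


# 99.9% target over a 28-day window, as in the example above.
budget = error_budget_minutes(0.999, 28)
print(f"budget: {budget:.2f} min")  # budget: 40.32 min

# Hypothetical outages of 12 and 20.5 minutes already spent this window.
left = remaining_budget_minutes(0.999, 28, [12.0, 20.5])
print(f"remaining: {left:.2f} min")  # remaining: 7.82 min
```

A negative remainder would mean the budget has been overspent for the window.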

- High velocity -> lower reliability
- User happiness = high velocity and high reliability (hard to achieve)
- Reliability vs increase in $$ (to build extra backups) and decrease in feature velocity (can't release new features)

Plan how to spend the error budget - don't save!!! Use it up but don't overspend. Splurging from time to time is ok if there is room for that in the budget. If the error budget is exhausted but there is a new critical feature that needs to be released, a senior stakeholder (Justin) holds a small number of tokens ("silver bullets") they can dispense at their discretion to authorize the "above the budget" spend. Silver bullets don't roll over and cannot be used to fix performance or latency issues; use them only for very important new features.
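One possible way to express this policy as a release gate is sketched below. The `BudgetState` class and `may_release` function are hypothetical names introduced for illustration; in practice the decision to spend a silver bullet stays with the stakeholder rather than being automatic.

```python
from dataclasses import dataclass


@dataclass
class BudgetState:
    remaining_minutes: float  # error budget left in the current window
    silver_bullets: int       # tokens held by the senior stakeholder


def may_release(state: BudgetState, is_critical_feature: bool) -> tuple[bool, str]:
    """Decide whether a risky release may go ahead under the policy described above."""
    if state.remaining_minutes > 0:
        return True, "within budget"
    # Budget exhausted: only a critical new feature may use a silver bullet.
    if is_critical_feature and state.silver_bullets > 0:
        state.silver_bullets -= 1  # the token is spent and does not come back
        return True, "silver bullet spent"
    return False, "budget exhausted"


state = BudgetState(remaining_minutes=0.0, silver_bullets=1)
print(may_release(state, is_critical_feature=True))  # (True, 'silver bullet spent')
print(may_release(state, is_critical_feature=True))  # (False, 'budget exhausted')
```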

Question: What is our error budget? With 99.99% it is 52 minutes and 36 seconds of downtime per year. With 99.5% it is: daily 7m 12s, weekly 50m 24s, monthly 3h 39m 8s, yearly 1d 19h 49m 44s.
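The per-period figures can be reproduced with a short script. This sketch assumes a mean year of 365.2425 days and truncates fractional seconds, which matches the 99.5% numbers above; with the same assumptions 99.99% gives about 52m 35.7s per year, in line with the roughly 52m 36s quoted. The helper names are illustrative only.

```python
DAYS_PER_YEAR = 365.2425  # assumption: mean Gregorian year

PERIOD_SECONDS = {
    "daily": 24 * 3600,
    "weekly": 7 * 24 * 3600,
    "monthly": DAYS_PER_YEAR / 12 * 24 * 3600,
    "yearly": DAYS_PER_YEAR * 24 * 3600,
}


def allowed_downtime_seconds(slo_percent: float, period: str) -> float:
    """Downtime allowed by an availability target over the given period."""
    return (100.0 - slo_percent) / 100.0 * PERIOD_SECONDS[period]


def format_duration(seconds: float) -> str:
    """Render seconds as e.g. '3h 39m 8s', dropping zero-valued units."""
    total = int(seconds)  # truncate fractional seconds
    days, rem = divmod(total, 86400)
    hours, rem = divmod(rem, 3600)
    minutes, secs = divmod(rem, 60)
    parts = [(days, "d"), (hours, "h"), (minutes, "m"), (secs, "s")]
    return " ".join(f"{value}{unit}" for value, unit in parts if value) or "0s"


for period in PERIOD_SECONDS:
    print(period, format_duration(allowed_downtime_seconds(99.5, period)))
```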

Question: How do we measure the Platform Service reliability? OpenShift API uptime -> to confirm with Steven that:

- this is monitored, and a human gets alerted if it is down or an automation script attempts to restart it.
- get a better understanding of how the availability of the API correlates with the availability of nodes in the cluster -> will the API endpoint be up if at least one app node in the cluster is up?

How to decide on the SLO -> find a balance between service reliability and engineering new features. Keep the error budget burn within the SLO!

Service Level Agreement (SLA) - an external promise to customers, with consequences that we set for when we don't meet it.

Measure reliability quantitatively using SLIs (indicators), which help us figure out if we are within the target SLO.

Question: What is the SLO for the Platform Service? (should be higher than the SLA) 99.995% for the Silver Cluster.

Question: What is the SLI for the Platform Service? The OpenShift API should be available 99.995% of the time.
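To make the SLI/SLO distinction concrete, here is a minimal sketch that treats the SLI as the measured fraction of successful API health probes and compares it with the 99.995% target. The probe data is invented for the example, and how probes are actually collected for the cluster is not covered here.

```python
def availability_sli(probe_results: list[bool]) -> float:
    """Fraction of probes that found the API reachable (the measured SLI)."""
    if not probe_results:
        raise ValueError("no probe results in the window")
    return sum(probe_results) / len(probe_results)


def meets_slo(sli: float, slo: float) -> bool:
    """True when the measured indicator is at or above the objective."""
    return sli >= slo


# Hypothetical window: 100,000 probes, 3 of which failed.
results = [True] * 99_997 + [False] * 3
sli = availability_sli(results)
print(f"SLI = {sli:.5%}, meets 99.995% SLO: {meets_slo(sli, 0.99995)}")
```

With this framing the SLI is what gets measured, and the SLO is the threshold it is checked against.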

Other useful metrics:

- Time to Detect (TTD) - time between when users are impacted and when a support person is notified
- Time to Resolution (TTR) - time between when the support person is notified and when the issue is resolved
- Time to Failure (TTF) - how frequently the failure is expected to occur; often also referred to as Time Between Failures (TBF)

Ways to improve reliability: reduce TTD and TTR, and increase TTF.

Question: What are TTD and TTR for the OpenShift API? TTD is the time between when a health check detects that the API is down and when a human gets notified about it. TTR is the time it takes a support person to fix the issue (could take hours if a simple restart doesn't work and RedHat needs to be involved).
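A rough way to see why these three levers matter is to estimate availability as uptime divided by total time, assuming the service runs for TTF between failures and is down for TTD + TTR each time. This is a simplified model, and the numbers below are placeholders rather than measurements of the OpenShift API.

```python
def expected_availability(ttf_hours: float, ttd_hours: float, ttr_hours: float) -> float:
    """Estimated availability: uptime per failure cycle over the full cycle length."""
    downtime = ttd_hours + ttr_hours
    return ttf_hours / (ttf_hours + downtime)


# Placeholder values: one failure every 30 days, 5 minutes to detect, 2 hours to resolve.
baseline = expected_availability(ttf_hours=720, ttd_hours=5 / 60, ttr_hours=2.0)

# Same failure rate, but resolution cut to 30 minutes (e.g. an automated restart).
faster_fix = expected_availability(ttf_hours=720, ttd_hours=5 / 60, ttr_hours=0.5)

print(f"baseline:   {baseline:.4%}")
print(f"faster fix: {faster_fix:.4%}")
```

Under this model, shortening TTD or TTR and lengthening TTF each push the estimate upward.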

Action Plan:

Define SLO and SLI.

BCDevOps / openshift-tools

Apply SRE principles to build a reliable Platform Service #111