[Retro Item] - Proposal for system maintenance cost accounting

sunil-sadasivan commented 6 years ago

As an action item from the dispatch retro, it's important for us to have a simple but efficient process for planning and understanding the maintenance cost of the systems we build. As a developer, everything I build will have some maintenance cost associated with it. It's important to set up objectives around what would be the ideal maintenance cost, and measure that goal against the result of the actual maintenance cost. Clarity around this allows all stakeholders to understand and better plan for the systems we build.

This is a task to write up a proposal for this process and gather feedback from the team so we can experiment with it.

sunil-sadasivan commented 6 years ago

Background

Whenever building new products, there are trade-off decisions we all make every day on what to spend our time on. Each developer is faced with finding a balance between maintenance and active development. Being optimal with this balance is especially critical in our environment as we're often integrated with external services that are less than perfect. It's important to have a simple yet helpful way to account for the maintenance cost associated with each service/product we build. This is especially key to ensure all stakeholders (product/design/data/devs and customers) are on the same page.

This discussion is inspired by the Embracing Risk and Service Level Objectives chapters in Google's SRE book, extrapolated to engineering as a whole (vs. simply site reliability).

Proposal

When planning a new service or product, we should establish the key indicator metrics (Service Level Indicators) that are critical for the service. Once these are defined, we should establish our objective range (Service Level Objectives) for each of the service's SLIs.

Establishing and keeping track of SLOs allows us to understand our desired and acceptable criteria for the service/product. We can keep track of this in a spreadsheet or GitHub doc/table. Recording the SLOs for each service can be helpful in several ways when we move on to build other services:

This allows us to understand how well our service is actually matching up to our original expectations.
If a service is failing to meet its original SLO, that is clearly documented, which gives all other (often non-technical) stakeholders clarity that this service needs attention and focus.
This provides other engineers a clear understanding of the reliability of another service, so they can understand the risks if they are considering adding that service as a dependency (e.g. if service A has an SLO of 80% uptime and service B wants to integrate with service A, service B would have an SLO of at best 80% uptime).
Recording SLOs gives us more of a map of how best to react to incidents. It's natural to feel pressure and possibly overreact when an incident occurs. If a service has general objectives outlined, we may recognize that the incident is not as critical if the service is still within its SLO.
Measuring actual results against original SLOs can allow us to understand where we are in the balance of maintenance vs building new features/products. This can help us better understand work balance and staffing needs.
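
To make the dependency point above concrete, here is a minimal sketch of how a dependency's SLO caps the SLO of the service that relies on it. It assumes every dependency is a hard, serial dependency and that failures are independent; the function name and the 99% figure for service B are made up for illustration.

```python
from math import prod


def best_case_availability(own: float, dependencies: list[float]) -> float:
    """Upper bound on availability for a service with hard dependencies.

    Assumes (for illustration only) that the service is down whenever any
    dependency is down, and that failures are independent of each other.
    """
    return own * prod(dependencies)


# Service A has an SLO of 80% uptime (the example from the list above).
service_a = 0.80
# Hypothetical: service B would be up 99% of the time on its own.
service_b_own = 0.99

print(round(best_case_availability(service_b_own, [service_a]), 3))  # 0.792
```

Even a perfectly reliable service B (own availability of 1.0) cannot promise more than 80% here, which is the "at best 80%" ceiling described above.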

Potential Examples of SLIs

For services, SLIs often consist of uptime, response latency, and service error rates. In our environment, we could do something as rudimentary as days per month without an incident (which could work well for something like the dispatch cronjobs). It is important to choose SLIs that are clear and simple to understand, and not to choose too many per service.
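
As a rough sketch of the "days per month without an incident" idea, the snippet below counts incident-free days from a list of incident dates. The function name, the sample dates, and the objective of 28 clean days are all hypothetical; where incident dates actually come from (a spreadsheet, an alerting tool) is left open.

```python
import calendar
from datetime import date


def incident_free_days(year: int, month: int, incident_dates: list[date]) -> tuple[int, int]:
    """Return (incident-free days, total days) for the given month.

    Several incidents on the same day count as one bad day.
    """
    days_in_month = calendar.monthrange(year, month)[1]
    bad_days = {d for d in incident_dates if d.year == year and d.month == month}
    return days_in_month - len(bad_days), days_in_month


# Made-up incident log: two incidents on the same day, one later, one in April.
incidents = [date(2018, 3, 5), date(2018, 3, 5), date(2018, 3, 19), date(2018, 4, 2)]

clean, total = incident_free_days(2018, 3, incidents)
print(f"{clean}/{total} incident-free days")  # 29/31 incident-free days

# Hypothetical objective: at least 28 incident-free days per month.
print("SLO met" if clean >= 28 else "SLO missed")  # SLO met
```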

Feel free to read the SLO chapter for a more detailed view on how SLI/SLOs are defined.

Error Budgeting

With defined SLOs for each product and service, we can essentially set up a budget and a prioritization/negotiation framework around maintainability. When an existing service is failing to meet its SLO, we can either re-prioritize to focus on improving the service until it hits its SLO, or lower the SLO.
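
One way the budget could be tracked, sketched under assumptions: treat `1 - SLO` as the allowed share of bad events (failed runs, failed requests) over a period, and report how much of that allowance is left. The function name and the numbers are illustrative, not part of the proposal.

```python
def error_budget_remaining(slo: float, total_events: int, bad_events: int) -> float:
    """Fraction of the error budget still unspent; negative means overspent.

    slo is the target success ratio, e.g. 0.95 for "95% of runs succeed".
    """
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    allowed_bad = (1.0 - slo) * total_events
    if allowed_bad == 0:
        raise ValueError("an SLO of 100% leaves no error budget")
    return 1.0 - bad_events / allowed_bad


# Hypothetical month: 1000 job runs, 95% SLO, so 50 failures are "budgeted".
print(round(error_budget_remaining(0.95, 1000, 20), 2))  # 0.6
print(round(error_budget_remaining(0.95, 1000, 80), 2))  # -0.6
```

A positive result suggests there is room to keep building features; a negative one is the signal to either prioritize the service or renegotiate the SLO.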

joofsh commented 6 years ago

Factors to consider when determining the maintenance cost of a product we build:

External Dependencies

Number of external dependencies
Uptime of each dependency
Number of API endpoints we integrate with
Stability of API endpoints. Do they follow proper versioning conventions? Do they make arbitrary changes to how the endpoint works?
Number of API endpoints we write data to. We ran into a ton of problems with Dispatch because of how complex the data structure was and all the unclear/unknown data validations.
How well the API endpoints are documented

Business Rules

Rate of change of business rules. The business process around certification is relatively stable. The business process around dispatch is actively being adjusted.
Complexity of business rules. Reader has very simple business rules (load a PDF); in contrast, Dispatch has very complex business rules for routing appeals. We could attempt to measure this in code logic pathway permutations (number of conditionals?).
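
A crude sketch of the "number of conditionals" idea: count the branch points in a piece of source code. Python's standard `ast` module is used here purely for illustration; the same approach would need a parser for whatever language the product is written in. The `route` function and its attribute names are hypothetical, and which node types count as a branch is a judgment call.

```python
import ast

# Node types treated as branch points in this sketch (an assumption, not a standard).
BRANCH_NODES = (ast.If, ast.IfExp, ast.For, ast.While, ast.ExceptHandler, ast.BoolOp)


def count_branch_points(source: str) -> int:
    """Count branch-like nodes in Python source as a rough complexity signal."""
    tree = ast.parse(source)
    return sum(isinstance(node, BRANCH_NODES) for node in ast.walk(tree))


# Hypothetical routing rule, only here to exercise the counter.
sample = '''
def route(appeal):
    if appeal.is_remand and appeal.has_decision:
        return "remand"
    elif appeal.is_grant:
        return "grant"
    return "other"
'''

print(count_branch_points(sample))  # 3 (if, elif, and the "and" expression)
```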

Uptime Requirements

Evaluate the "business dependency" that appeals processing has on the product. For example, when Dispatch goes down, the processing of appeals with decisions stops completely. For other products like Reader, we may be able to tolerate more downtime, as an attorney can revert to using the VBMS UI.

evankroske commented 6 years ago

What do you mean by "maintenance"? The SLO system is good for prioritizing reliability work, but it doesn't apply to paying down technical debt or doing other non-feature work.

sunil-sadasivan commented 6 years ago

@evankroske, that's a good point. I should be clear that SLOs help us understand when we need to prioritize reliability and eliminating toil. Using SLOs could lead us to prioritize technical debt, but we can't assume that's always the case. I'll adjust the proposal to be clear here.

By maintenance, I mean: a conscious effort/active focus to ensure a service is running properly. I believe Google defines this as 'Toil'.

department-of-veterans-affairs / caseflow