department-of-veterans-affairs / caseflow

Caseflow is a web application that enables the tracking and processing of appealed claims at the Board of Veterans' Appeals.

[Retro Item] - Proposal for system maintenance cost accounting #3346

Closed sunil-sadasivan closed 6 years ago

sunil-sadasivan commented 6 years ago

As an action item from the dispatch retro, we need a simple but efficient process for planning and understanding the maintenance cost of the systems we build. As a developer, everything I build will have some maintenance cost associated with it. It's important to set objectives for the ideal maintenance cost and to measure the actual maintenance cost against them. Clarity around this allows all stakeholders to understand and better plan for the systems we build.

This is a task to write up a proposal for this process and gather feedback from the team before we experiment with it.

sunil-sadasivan commented 6 years ago

Background

Whenever we build new products, we all make trade-off decisions every day about what to spend our time on. Each developer has to find a balance between maintenance and active development. Getting this balance right is especially critical in our environment, as we're often integrated with external services that are less than perfect. It's important to have a simple yet helpful way to account for the maintenance cost associated with each service/product we build. This is key to ensuring all stakeholders (product/design/data/devs and customers) are on the same page.

This discussion is inspired by the Embracing Risk and Service Level Objectives chapters in Google's SRE book, extrapolated to engineering as a whole (vs. simply site reliability).

Proposal

When planning a new service or product, we should establish the key indicator metrics (Service Level Indicators) that are critical for the service. Once these are defined, we should establish our objective range (Service Level Objectives) for each of the service's SLIs.

Establishing and keeping track of SLOs allows us to understand our desired and acceptable criteria for the service/product. We can keep track of this in a spreadsheet or GitHub doc/table. Having the SLOs for each service on record can also help in multiple ways when we move on to build other services.
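As a rough illustration of what one row of such a table could hold, here is a minimal sketch. The `SLO` class, its field names, and the numbers are all hypothetical, and it assumes a "higher is better" SLI with a minimum acceptable value and a desired target.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    """One row of a hypothetical SLO table (all fields are illustrative)."""
    service: str    # service or product name
    sli: str        # the indicator being measured
    minimum: float  # lowest acceptable value of the SLI
    target: float   # value we would like to hit

    def status(self, measured: float) -> str:
        # Assumes a "higher is better" SLI such as uptime or incident-free days.
        if measured >= self.target:
            return "meeting target"
        if measured >= self.minimum:
            return "acceptable"
        return "below objective"


# Example row; the numbers are made up for illustration.
dispatch_slo = SLO(
    service="dispatch cronjobs",
    sli="incident-free days per month",
    minimum=25,
    target=28,
)

print(dispatch_slo.status(26))  # prints "acceptable"
```

The same fields map directly onto spreadsheet columns, so nothing here requires code; the sketch only shows that a desired value plus an acceptable floor is enough to classify a measurement.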

Potential Examples of SLIs

For services, SLIs often consist of uptime, response latency, and service error rates. In our environment, we could use something as rudimentary as days per month without an incident (which could work well for something like the dispatch cronjobs). It is important to choose SLIs that are clear and simple to understand, and not to choose too many per service.
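To make the "days per month without an incident" idea concrete, here is a minimal sketch. The function name and the incident dates are hypothetical, and it assumes we already have one date per recorded incident from whatever tracker we use.

```python
import calendar
from datetime import date


def incident_free_days(year: int, month: int, incident_dates: list[date]) -> int:
    """Count the days in a month on which no incident was recorded."""
    days_in_month = calendar.monthrange(year, month)[1]
    # Several incidents on the same day count as one bad day;
    # incidents outside the requested month are ignored.
    bad_days = {d for d in incident_dates if d.year == year and d.month == month}
    return days_in_month - len(bad_days)


# Made-up incident log: two incidents on Sept 4, one on Sept 18, one in October.
incidents = [date(2017, 9, 4), date(2017, 9, 4), date(2017, 9, 18), date(2017, 10, 1)]
print(incident_free_days(2017, 9, incidents))  # prints 28
```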

Feel free to read the SLO chapter for a more detailed view on how SLI/SLOs are defined.

Error Budgeting

With defined SLOs for each product and service, we can essentially set up a budget and a prioritization/negotiation framework around maintainability. When an existing service is failing to meet its SLO budget, we can either re-prioritize to focus on improving the service so that it hits its SLO, or lower the SLO.
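A minimal sketch of that budgeting arithmetic, assuming the SLI is incident-free days per period. The function names and numbers are hypothetical; the budget is simply the number of incident days the SLO tolerates.

```python
def error_budget_remaining(days_in_period: int, slo_days: int, incident_days: int) -> int:
    """Incident days we can still absorb before missing the SLO (negative if missed)."""
    budget = days_in_period - slo_days  # incident days the SLO allows
    return budget - incident_days


def next_step(remaining: int) -> str:
    # Mirrors the two options above: improve the service or renegotiate the SLO.
    if remaining < 0:
        return "budget exceeded: prioritize reliability work or renegotiate the SLO"
    return "within budget: continue planned development"


# Made-up example: a 30-day month, an SLO of 28 incident-free days, 3 incident days.
remaining = error_budget_remaining(days_in_period=30, slo_days=28, incident_days=3)
print(remaining, next_step(remaining))
# prints "-1 budget exceeded: prioritize reliability work or renegotiate the SLO"
```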

joofsh commented 6 years ago

Factors to consider when determining the maintenance cost of a product we build:

External Dependencies

Business Rules

Uptime Requirements

evankroske commented 6 years ago

What do you mean by "maintenance"? The SLO system is good for prioritizing reliability work, but it doesn't apply to paying down technical debt or doing other non-feature work.

sunil-sadasivan commented 6 years ago

@evankroske, that's a good point. I should be clear that SLOs help us understand when we need to prioritize reliability/eliminating toil. Using SLOs could lead us to prioritize technical debt, but we can't assume that's always the case. I'll adjust the proposal to make this clear.

By maintenance, I mean a conscious effort/active focus to ensure a service is running properly. I believe Google defines this as 'toil'.