RFC: Implement Dependency Rotation

artsy / README

:wave: - The documentation for being an Artsy Engineer

Creative Commons Attribution 4.0 International

1.1k stars 120 forks source link

RFC: Implement Dependency Rotation #401

Closed kajatiger closed 2 years ago

kajatiger commented 3 years ago

Proposal:

I propose having a dependency update responsibility rotation in each team, so that people know they will have to check for bot PRs and merge them asap/ fix if something breaks with a new version. This could easily be implemented in a slack reminder in the team channel that has a list of people in a certain order. The person who was having the responsibility for last week will then handover explicitely to the next person, by answering to the reminder in slack with a ping to the next person of the list.

Reasoning

People tend to leave dependency bot PRs open and forget to merge them ⏳
A lot of our repositories are very out of date & having security issues because of old dependency versions ⚠️

Exceptions:

✨ None ✨

Additional Context:

Split from another RFC about dependency updating tools: RFC: Automate dependency updates with Depfu

How is this RFC resolved?

All teams propose a list of repositories that they will have to maintain and implement a proper reminder method for the rotation. The teams are free to choose which frequency (e.g. every sprint) fits them best and how and when to handover (e.g. sprint planning/retro). The teams also lay down the rules included in the dependency responsibility. Example questions for rules: Is the person in charge responsible to merge bot PRs or are they supposed to look through deps themselves? Do they write tickets for bigger updates that require fixes? Is there a guideline within the team on which updates can be automerged and what needs a review? etc ...

chr-tatu commented 3 years ago

Nice initiative to upgrade dependencies! Did you know about this page: Project List? For each project/repo there is mentioned a Team. I imagine that's the team owning the project.

kajatiger commented 3 years ago

Oh interesting! The list looks like a good starting point. However I feel like it is a bit out of date. Exchange for example has been changed a lot lately by PX, it is however listed as a platform repo.

mdole commented 3 years ago

I like this idea! A few things I think this RFC could clearly lay out that might help it to be adopted and acted upon:

How do we handle getting up-to-date in repos that have not been updated in a long time?
Are all repositories considered equal? Most teams have "high-priority" and "low-priority" repositories, and keeping small repositories that are rarely used up-to-date may not feel like a good use of time. Take the PX team for example: Volt should probably be kept very up-to-date because it's an important service that is used a lot. However, a tool like Admin Metadata, which is internal-only, probably doesn't warrant the same level of scrutiny and time expenditure.
Different updates might take radically different amounts of time, and we might not know at the beginning of a sprint how much time the dependency-updater-person will have to spend on these updates. E.g. if a new major version of an important dependency is released 2 days into a sprint, they might spend several hours or even a day or two. How should PMs and engineers account for the variability in time required?
Does this RFC cover things like Ruby or Node versions, where updating might be labor-intensive and also when it's not obvious when an update is necessary? For example, if an app is on ruby-version 2.6.7, would we expect the dependency-updater-person to update to 2.6.8? To 2.7.4? To not update it at all since 2.6.8 is still quite recent?

One possible solution that might help address most of the above questions: what if we proposed that the dependency-updater-person spend a fixed amount of time per-sprint on updates, say 2 hours, and that anything over that time would be either ticketed or handed off? That way we know what to expect going into a sprint, and there's a "release valve" for updates that turn out to be really big. Also, it would hopefully allow us to spend at least some time on smaller "lower-priority" repositories - if all of our big repos are up-to-date, we could use that 2 hours to do upgrades on other repos.

Also, because this RFC would have an impact on overall engineering bandwidth, I think it would be worth getting buy-in from leadership. Maybe we can tag Sam & Marisa once the RFC is clearly scoped and ready to resolve!

kajatiger commented 3 years ago

Thanks for your great questions @mdole 😺

How do we handle getting up-to-date in repos that have not been updated in a long time?

This is technical debt indeed that we need to reduce before having a non time consuming routine. My short answer would be: We just need to spend some time on this and prioritize it for a while until everything is in a better maintainable state.

Are all repositories considered equal?

I would say yes, they are all equal, because even seemingly "unimportant" repositories can actually be a security issue for us if they are not porperly updated. Any part of the system is vulnerable and should be treated. There are however unmaintained and inactive repositories that need to be removed/deactivated anyway in order to have a better overview.

Different updates might take radically different amounts of time ...

Yes very true. Nobody should be stuck with updates and unable to do their sprint work. But the thing is that this should be part of sprint work and can also be estimated and groomed and put into tickets if it is a more complex thing. In my experience from previous companies it is like this: The more updates are prioritized ➡️ the more frequent time you spend on updates ➡️ the less time consuming they get. The less care is put into updates ➡️ the more time consuming and blocking and dangerous they get.

This is why having a routine is so important.

Does this RFC cover things like Ruby or Node versions ..?

Yes. I think these are crucial updates and should be part of a healthy development routine as well. I would recommend always developing on the last stable version of a language/ framework since it makes updating to the next one so much easier and also the new versions seem to be more perfomant and stable.

I think in the beginning we should spend more than 2 hours per sprint on this and really get everything up to date as soon as possible and then after being in a state where we are everywhere on the newest version time & effort will decrease anyway. I think this is something that is a very important foundation to the work on the features and building on a weak foundation is not a good idea.

damassi commented 3 years ago

Does this RFC cover things like Ruby or Node versions

Note that this is generally handled by the platform team, and we've run into some fires in the past after upgrading versions. Additionally, version upgrades are subtly tied into our CI tooling, so often upgrading a repo's node version might involve globally updating our shared orb configurations as well.

As major dependency updates and platform version upgrades are non-trivial and intimidating for many product engineers, one suggestion would be to limit the scope of this dependency rotation to patch and minor upgrades for now, get the flow up and running, and then consider a parallel path for major version changes that is more scoped.

With major version changes, often, for many different reasons, it's not in a repos best-interest to always be on the latest major version. We could have patches applied to the lib, or simply don't have the resources or technical know-how to perform major-to-major migration path refactors. Often, it's a technically risky move. And so major version updates are often viewed as a "do we need this?" kind of matter -- where, unless there's some feature that is explicitly needed from the next major version right now, the dependency is locked for application stability reasons.

For each project/repo there is mentioned a Team. I imagine that's the team owning the project

@chr-tatu - That list is indeed a few years out of date and mentions many code owners that have long left, and repos that have been retired.

More generally though, while there are teams attached to the list those teams often have other priorities, and in some ways team/app associations are a bit imprecise as @kajatiger mentions. Meaning: Platform owns Force according to the list, but much of force is UI centric in ways that Platform would be ill-equipped to manage. Each codebase has some similar property, and that's why there are repo maintainers as those individuals are generally the best equipped to manage things holistically. It's just hard to be a maintainer while also having sprint work.

This is all a very long-winded way of suggesting that there might be an alternative way to organize dependency rotations that live outside of a team-based structure and that we could think that through a bit more.

In general though, once the particulars are ironed out, <3 the idea behind this RFC.

joeyAghion commented 3 years ago

We should try to keep the current list of team affiliations up-to-date as teams and systems and projects evolve. They're helpful for all sorts of chores; not just dependencies. The Exchange example that you mention was decided that way at the time because its functionality was split across Purchase and Partner teams, so there wasn't one most natural team affiliation.

I'd like to highlight security patches as the most important subset of dependency updates to manage. This has been a struggle in the past, but there's work in progress within Platform (PLATFORM-3430🔒 ) to make these more visible (with a dashboard) and actionable (by notifying affiliated teams instead of just 1 point-person). Some systems aren't actively developed and I'm not that concerned about updating their dependencies (other than security patches). They may be "mature" rather than automatically "technical debt," and it'd be fine with me to update them only when helpful.

mc-jones commented 3 years ago

Adding detail to the Platform work @joeyAghion mentioned. There was a discussion at the July 1st Platform Practice 🔒 , which touched on a lot of the issues being brought up here. While we assumed the use of Dependabot, our potential outcomes should be solution agnostic, so I don't see an issue with just swapping Depfu in if thats what makes the most sense!

Below was the process we were thinking of moving forward with for security updates (the bullets are my added thoughts based on this RFC).

Each Repo is designated a team to handle vulnerability alerts
- Based on Project List, as you also suggested, but seems there is more discussion to be had here, as mentioned by @damassi
When an alert is raised it will make the Team Lead the Assignee and the team as the Reviewers. The TL would oversee that the alert is addressed according to team process.
- We didn't want to be overly prescriptive here as to how teams should action vulnerability alerts (what works best for one team, may not be the best approach for another). Maybe we can use the rotation approach as the default recommended approach for teams to adopt if there is enough support behind this RFC?
- Our reasoning for the team Lead being the point person is to address the issue @mdole flagged regarding the wide variability in effort needed to update. The hope is that they would be able to work with the team and PM to assess the time necessary to address and build into sprints (or whatever process made they decided on), and action accordingly. There was an acknowledgment in this, also, that some team members may really be gung-ho on updating dependency while others may not be. Our hope this would empower the former and not burden the latter.
If the responsible team member determines that we don't need to follow the recommended action of Dependabot, they can dismiss the notification using one of the following options:

Ultimately, I'm flagging because I think they dovetail nicely, and would argue we want to avoid having separate processes for security updates and non-security updates. Ideally, the process would be the same but with vulnerability updates receiving higher priority.

kajatiger commented 2 years ago

Resolution

We decided to try it in the PX-team.

Level of Support

3: Majority acceptance, with conflicting feedback.

Additional Context:

Most people are in favor of it, but just don't know where to start.

Next Steps

We are experimenting with the dependency update rotation in the PX-team and it is going well so far.