fiaas / fiaas-deploy-daemon

fiaas-deploy-daemon is the core component of the FIAAS platform
https://fiaas.github.io/
Apache License 2.0

RFC: Semantic versioning based release process #163

Closed oyvindio closed 2 years ago

oyvindio commented 2 years ago

Background

The current "continuous delivery"/rolling-release-like workflow for releasing fiaas-deploy-daemon works as follows: successful CI builds of the master branch automatically update the latest tag, and a separate manual promotion stage in the CI tool updates the stable tag. The tags themselves are JSON files in the fiaas/releases repository. Skipper can then be configured to deploy the version of fiaas-deploy-daemon that each of these tags points to. Since https://github.com/fiaas/skipper/pull/119, Skipper also supports specifying the version of fiaas-deploy-daemon as a configuration flag in its helm chart.

This workflow originates from when FIAAS was still an internal project, and was intended to make it easy to roll out new releases of fiaas-deploy-daemon in several clusters. In an open source context, however, I think there are some drawbacks to this model:

Goals

Suggested Process

The release model for fiaas-deploy-daemon should move towards a more typical release process of creating tagged releases, where semantic versioning is used to indicate backwards incompatible changes. For a consistent developer experience, creating a new release can be done the same way as in k8s.
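To illustrate how semantic versioning can signal backwards incompatible changes, here is a minimal sketch that classifies an upgrade between two release tags. The `vMAJOR.MINOR.PATCH` tag shape and the function names `parse_tag` and `classify_upgrade` are assumptions made for this example, not part of any FIAAS tooling. Pre-release suffixes and the special rules for `0.x` versions are deliberately left out.

```python
import re
from typing import NamedTuple


class Version(NamedTuple):
    major: int
    minor: int
    patch: int


# Assumed tag shape: an optional "v" prefix followed by MAJOR.MINOR.PATCH.
_TAG = re.compile(r"^v?(\d+)\.(\d+)\.(\d+)$")


def parse_tag(tag: str) -> Version:
    """Parse a tag such as 'v1.2.3' into a comparable Version tuple."""
    match = _TAG.match(tag)
    if match is None:
        raise ValueError(f"not a semantic version tag: {tag!r}")
    return Version(*(int(part) for part in match.groups()))


def classify_upgrade(current: str, candidate: str) -> str:
    """Return 'major', 'minor', 'patch' or 'not-an-upgrade' (hypothetical helper)."""
    old, new = parse_tag(current), parse_tag(candidate)
    if new <= old:
        return "not-an-upgrade"
    if new.major != old.major:
        # A major bump is how semantic versioning marks incompatible changes.
        return "major"
    if new.minor != old.minor:
        return "minor"
    return "patch"


print(classify_upgrade("v1.4.2", "v2.0.0"))
print(classify_upgrade("v1.4.2", "v1.5.0"))
```

An operator could use a check like this to apply minor and patch releases automatically while holding major releases for manual review.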

Release Process

Versioning

fiaas-deploy-daemon will use semantic versioning:

Skipper

There are a few approaches for how we can handle Skipper when transitioning to a new release model for fiaas-deploy-daemon.

Update Skipper to support the new release model

Skipper could be updated to support the new release model. It would require some modifications;

Deprecate Skipper

From my perspective we don't get a lot of value from using Skipper to deploy fiaas-deploy-daemon. As such one option to simplify the release process changes, as well as the deployment process itself, could be to deprecate skipper together with the change in release model, and switch to providing a helm chart (or similar) for deploying fiaas-deploy-daemon directly.

This approach would have some benefits:

Implementing The Suggested Process

Assuming we move forward with creating a helm chart for deploying fiaas-deploy-daemon and deprecate skipper, the transition to the suggested release process can look like the following.

mortenlj commented 2 years ago

While I see how the suggested process will improve the three points mentioned at the start, I'm not sure this is the best solution. In my view, the suggested process will introduce other problems that aren't properly accounted for, and which I believe might become bigger problems in the long run.

No more continuous deploy of fiaas

This is probably the biggest issue for me, both in principle and in practice.

We built fiaas because we believe that continuous deploy is the best and safest way to deploy software in a fast moving world built on kubernetes and container orchestration. When you believe that, it's hard to see how not doing CD is the right thing for our own software.

In practice, this change will also mean that every operator that uses fiaas needs to do much more work to keep updated. In many cases today, you don't actually have to do anything to keep up to date with the latest changes/features in fiaas. You might need to keep your cluster updated, and get involved when larger issues are discussed, but minor fixes and improvements can be rolled out without you needing to spend any effort. When moving to a model based on strictly versioned helm charts, every bugfix or improvement that is to be deployed to your cluster needs to be manually handled.

Loss of momentum

While the current situation requires coordination across organisations to get stable moved forward, this has the side effect of making organisations engage in the "head" development of fiaas. When moving to a model with release branches and back-porting, I think there is a risk of organisations "settling" for using a release branch with the occasional back-port until there is something they really need that can't be back-ported. This means that for every new feature at the "head" of the development tree (master branch), fewer people/organisations will be involved in designing and implementing it, leading to features that might be a bad fit for other organisations/use cases.



I think there is a description of the original idea behind Skipper somewhere, but I can't find it. I'm guessing it's either lost, or hidden in some Schibsted-internal tooling 😛. What we have now is only part of that idea, implemented more like an MVP than a fully delivered concept. My gut feeling is that if we had implemented the original concept fully then at least some of the problems mentioned at the start here would have been less prominent.

In short, the idea was that in addition to latest and stable we had additional org-specific latest and stable channels for the "leading" orgs (however you want to define that). When FINN wanted to test latest, they would promote it to finn:latest, and deploy that to a suitable place. When they were satisfied they would promote it to finn:stable and deploy that to production. When other leaders promoted to their own stable channels (adevinta:stable and lbc:stable, say), the stable channel would be moved forward automatically according to some sensible algorithm (last common build or something like that). That way smaller orgs could "piggyback" on the efforts of the larger orgs and get a stable that kept moving forward, while the leading orgs would be able to move forwards at their own pace.
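One way to read "last common build" is the newest build that every leading org has already reached on its own stable channel. The sketch below shows that reading only. The function name `shared_stable`, the build identifiers, and the assumption that builds form a single linear history are all made up for the example, and nothing like this exists in Skipper today.

```python
from typing import Dict, List, Optional


def shared_stable(build_order: List[str], org_stable: Dict[str, str]) -> Optional[str]:
    """Pick the build the shared stable channel could point to.

    build_order lists builds from oldest to newest (assumed linear history).
    org_stable maps an org channel, e.g. 'finn:stable', to the build it points at.
    """
    if not org_stable:
        return None
    position = {build: index for index, build in enumerate(build_order)}
    unknown = sorted(b for b in org_stable.values() if b not in position)
    if unknown:
        raise ValueError(f"unknown builds: {unknown}")
    # The org that is furthest behind decides how far stable may move forward.
    oldest = min(position[build] for build in org_stable.values())
    return build_order[oldest]


# Hypothetical build identifiers, oldest first.
builds = ["build-101", "build-102", "build-103", "build-104"]
channels = {
    "finn:stable": "build-104",
    "adevinta:stable": "build-102",
    "lbc:stable": "build-103",
}
print(shared_stable(builds, channels))  # build-102
```

With this rule the shared stable channel never gets ahead of the slowest leading org, which is the "piggyback" property described above. Other rules, such as a majority vote, would be equally plausible.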

It would require work to improve how we treat and work with channels today, and it wouldn't solve all the problems. In particular, it would not solve the problem of deploying backwards incompatible changes, but I feel that is a property of doing CD you have to find a way to live with. In the rare cases where the change A->B is incompatible, the proper solution might involve finding another path from A to B, where each step is compatible with the previous.

Another point I'd like to make is that it would be possible to improve the test suite to a point where you feel confident that when the tests pass, this version is at least good enough for latest. Over at NAIS we have so much confidence in our tests that we always deploy to all clusters (dev and prod) if the tests are green.

oyvindio commented 2 years ago

Yes, the suggested model is not perfect and probably has several drawbacks. I think it might still be an improvement on the current model.

No more continuous deploy of fiaas

I think this might already be the situation: I would not say that we deploy fiaas-deploy-daemon continuously, but rather update it manually via skipper. It might be interesting to hear if this is how other operators work too, or if the auto update feature is used extensively.

Continuous deployment is no doubt an excellent way to work assuming one has good monitoring and end to end control of a system, for example within a single organisation. FIAAS is built to support that workflow. I don't think it is always the best model for everything though. For FIAAS itself for example, with multiple operators and where different operators may also use a different set of features, I don't think it is necessarily a good model based on how it is used in practice. Assuming that most people are updating fiaas-deploy-daemon manually already, I think that a model that uses versioned releases would be more suitable, since it among other things makes it clearer what has changed.

Loss of momentum

There is the possibility that operators may run older releases for some time, but I think that in general the upgrade path would be to move on to the most recent release and not to patch an old release. I see the ability to patch an older release as an exception, not the rule. It could for example be an option for cases where it is necessary to support an old version of Kubernetes for some time (i.e. temporarily), when that version is no longer supported in the most recent release.

In general I think that the release model suggested above could increase momentum, because it makes it easier to e.g. remove deprecated features and stop supporting older Kubernetes versions. These are things that simplify the software and make it easier to change.

org-specific latest and stable channels

I think a setup with operator specific channels might be an improvement on the current model in some aspects, but a setup like that could have other drawbacks: It seems to me more complex, and might lead to operators maintaining their own labels and only using those, which might make the stable channel less "stable". I like the simplicity of a versioned release model, and I think improving how backwards incompatible changes are handled is a significant benefit of the suggested model.

xavileon commented 2 years ago

I support this proposal.

There are pros and cons, as with any change, and I understand most of Morten's points. But I think it is a shift for the better overall. In short:

I think this might already be the situation: I would not say that we deploy fiaas-deploy-daemon continuously, but rather update it manually via skipper. It might be interesting to hear if this is how other operators work too, or if the auto update feature is used extensively.

Adevinta uses this feature extensively as we do believe CD is the best path forward. However, not having skipper is actually a benefit for us as we can leverage our already existing CD process and align with other helm chart deployments.

When you believe that, it's hard to see how not doing CD is the right thing for our own software.

From my view, it's not that we are not doing CD anymore (or that we don't want to), but that CD is moved to an operator concern (instead of a built-in FIAAS feature).

henrik242 commented 2 years ago

In general I think that the release model suggested above could increase momentum, because it makes it easier to e.g. remove deprecated features and stop supporting older Kubernetes versions. These are things that simplify the software and make it easier to change.

Wholeheartedly agree with this. I see there's a lot of concern about updating or adding features to the current version, out of fear of introducing breaking changes for existing users. I guess a (new) tag (current?) that tracks the latest version should be maintained for users that still want to stick to the bleeding edge.
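A moving tag like that would need a rule for which release counts as "latest". A minimal sketch, assuming release tags look like `v1.0.0` and that any other tag should be ignored, is shown below. The function name `latest_release` is hypothetical and not part of the FIAAS release tooling.

```python
import re
from typing import Iterable, Optional

# Assumed release tag shape: "v" followed by MAJOR.MINOR.PATCH.
_RELEASE_TAG = re.compile(r"^v(\d+)\.(\d+)\.(\d+)$")


def latest_release(tags: Iterable[str]) -> Optional[str]:
    """Return the tag with the highest semantic version, or None if there is none."""
    releases = []
    for tag in tags:
        match = _RELEASE_TAG.match(tag)
        if match:
            # Compare numerically so that v1.10.0 sorts after v1.9.3.
            version = tuple(int(part) for part in match.groups())
            releases.append((version, tag))
    if not releases:
        return None
    return max(releases)[1]


print(latest_release(["v1.2.0", "v1.10.0", "latest", "v1.9.3"]))  # v1.10.0
```

Comparing the parsed numbers rather than the raw strings matters here, since a plain string sort would rank `v1.9.3` above `v1.10.0`.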

oyvindio commented 2 years ago

it moves the release process (CD) to be a concern of the operator, which I think is better in a multi-operator project like FIAAS.

This is a good point and I think it summarizes well what I would like to improve with this proposal in terms of process. 👍

xamebax commented 2 years ago

Excited about this RFC and the discussion here. Thanks for spending the time on crafting this, Øyvind!

I guess I'm just adding my pebbles to the pond with my comments below. 🙂

In practice, this change will also mean that every operator that uses fiaas needs to do much more work to keep updated.

Yes, and no. Releasing new features/deprecations in FIAAS until now - correct me if I'm wrong - required a lot of work from maintainers and operators because of coordination. That work will now be gone, freeing up resources and making prioritizing easier.

I think there is a risk of organisations "settling" for using a release branch with the occasional back-port until there is something they really need that can't be back-ported.

That is a valid observation Morten, but is it actually something we need to care about?

mortenlj commented 2 years ago

That is a valid observation Morten, but is it actually something we need to care about?

I'm not sure tbh. 🙂 I believe it will happen, because that's just how things work (either it's a perceived security/stability thing, or just a "haven't got time for this" thing). I'm less certain about it being a problem, but ideally users would be involved in discussions about new features that they might use. If they are staying behind on an old version, that discussion is less likely to take place (because new features aren't interesting for them until they get around to upgrading), which in turn means we might be designing features that won't match their needs when they get there.

birgirst commented 2 years ago

I support this as well, thanks for the writeup Oyvind. 👍

From my view, it's not that we are not doing CD anymore (or that we don't want to), but that CD is moved to an operator concern (instead of a built-in FIAAS feature).

Just to add to this: at one point in the past at Schibsted, we operated FIAAS across multiple multi-tenant clusters, separately from the cluster operators. Operation of FIAAS on those clusters was effectively delegated to us. We were actively working on several features to support the needs at the time, which required us to shorten the feedback loop and get features to users outside the rhythm of regular cluster maintenance, which we had less to do with. At the time, new tenants were also being added, which required quickly bootstrapping FIAAS in their namespaces as part of an automated onboarding process. With hundreds of instances of fiaas-deploy-daemon across multiple clusters, pushing updates was a pain point. Introducing Skipper and release channels for FIAAS helped us support the use cases above and operate the instances at scale. A couple of years ago we changed how we manage FIAAS across our clusters: we now manage FIAAS as part of normal cluster operations, so the need for Skipper is no longer there in the same sense. Ideally, we want to be specific about versions and be in control of when we upgrade.

At this point I think it makes sense to move to the suggested model. It would let us avoid being limited by backwards compatibility and by support for deprecated Kubernetes versions, speed up contributions, and remove the need to synchronise with other operators before cutting a release.

oyvindio commented 2 years ago

Thanks for the feedback, and thanks for also including the historical perspective, Birgir. 👍

It has been some time now and there have been a few comments on this suggestion. As I read the feedback, there are some concerns, but most of it supports implementing the proposed release model. Based on that I would like to start the technical implementation of this proposal. Expect one or more pull requests as soon as there is some available capacity to work on this.

oyvindio commented 2 years ago

I'm closing this as I've merged #180, which implements release tooling to create releases based on semantically versioned git tags, and created release v1.0.0.

If you need to create a release, take a look at the "Creating a release" part of the developer documentation.