Handle Operator Upgrades

pebrc commented 5 years ago

Code upgrades seem fairly straightforward
API/CRD updates
- Look into conversion functions to auto-convert to the new API version (I think this is it)
  - Using conversion functions would save us from writing backward-compatible code everywhere; instead we would convert resources to the latest incarnation
  - Kubebuilder v2 supports conversion via conversion webhooks (k8s > 1.16)
Verify any upgrade strategy chosen works with Operator Lifecycle Manager (operatorhub.io)
[Optional] Limit the impact of control plane upgrades by tying workloads to a specific version of the operator and upgrading workloads group by group, starting with the lowest priority, to canary the new operator version with a limited blast radius

sebgl commented 4 years ago

I'll try to summarise a few things I've learned while working on the migration to CRD v1. This does not answer this issue completely but may give a bit more context.

CRD versioning

Adding a new CRD version

Adding a new CRD version mostly consists of adding it to the supported versions slice in the CRD. Setting storage: true means that version's format will be used as backend storage in etcd. Setting served: true means a user is able to retrieve a resource in that particular version.
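As a rough illustration of the two flags, the sketch below models the versions list of a CRD as plain Python data and checks the one invariant that matters here: exactly one version is flagged as the storage version. The helper names are hypothetical; only the `name`, `served` and `storage` fields mirror the CRD format.

```python
# Hypothetical in-memory model of the versions list of a CRD.
crd_versions = [
    {"name": "v1beta1", "served": True, "storage": False},
    {"name": "v1", "served": True, "storage": True},
]


def storage_version(versions):
    """Return the name of the single version flagged storage: true."""
    stored = [v["name"] for v in versions if v["storage"]]
    if len(stored) != 1:
        raise ValueError(f"expected exactly one storage version, got {stored}")
    return stored[0]


def served_versions(versions):
    """Return the names of the versions a user can request."""
    return [v["name"] for v in versions if v["served"]]
```

Here both versions can be requested by the user, while resources are persisted in the v1 format only.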

In Kubernetes < 1.16, we can only specify a single OpenAPI validation that matches all CRD versions. In Kubernetes 1.16 and above, we can specify the OpenAPI validation per version. Cf. https://github.com/elastic/cloud-on-k8s/issues/2044#issuecomment-545365870.

Version conversions

In theory all CRD versions with served: true are backward-compatible with each other: the user can request the same resource in v1 or in v1beta1. To handle the conversion from one version to another, we can implement a conversion webhook. When a resource is retrieved or written in a version different from its stored version, the webhook is called to convert between the two versions. Conversion webhooks are only enabled by default starting with Kubernetes 1.15, and some users may want to disable any webhook we set. As such, it's hard for us to rely completely on conversion webhooks.

If there is no webhook and we retrieve a resource in a version that does not match its stored version, the APIServer converts the resource to the requested version through a no-op conversion. The no-op conversion keeps the exact same resource payload and only changes its apiVersion field. This is probably a lossy conversion if the newer version does not understand JSON fields of the old version.
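A minimal sketch of what the no-op conversion amounts to, and why it can be lossy. The API group, the field names and the `decode_v1` helper are assumptions made up for the illustration; they are not part of any real schema.

```python
import copy


def noop_convert(resource, target_api_version):
    """Mimic the no-op conversion: same payload, only apiVersion changes."""
    converted = copy.deepcopy(resource)
    converted["apiVersion"] = target_api_version
    return converted


# Hypothetical v1 schema: the only spec field v1 understands.
V1_KNOWN_SPEC_FIELDS = {"nodeCount"}


def decode_v1(resource):
    """Decode a payload as v1, dropping spec fields v1 does not know about."""
    spec = {k: v for k, v in resource["spec"].items() if k in V1_KNOWN_SPEC_FIELDS}
    return {**resource, "spec": spec}


old = {
    "apiVersion": "example.k8s.io/v1beta1",  # hypothetical group
    "kind": "Example",
    "spec": {"legacyField": 3},  # hypothetical field that only exists in v1beta1
}
new = noop_convert(old, "example.k8s.io/v1")
```

The converted payload still carries `legacyField`, but a client decoding it with the v1 schema silently loses that field, which is the lossy case described above.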

In the operator code, we only use a single version (the latest one), retrieved through the mechanism above (conversion webhook or no-op conversion). Any update the operator makes to that resource (including an update with no payload change) automatically changes the stored version of the resource to the CRD's current storage version.

Deprecating a version

Removing support for a particular version can be done by:

- Specifying served: false for that version. Technically the version is still there but cannot be used by the user.
- Removing the version from the CRD list of supported versions. This cannot be done automatically by applying the new CRDs via kubectl apply. It requires programmatically updating the existing CRD status sub-resource first. Cf. https://github.com/elastic/cloud-on-k8s/issues/2196 for more details.

Before removing a version from the CRD, we're supposed to make sure no existing resource is still stored in that version. In case we need to do that in the future, the operator itself could do a no-op update on every resource it manages to make sure they all use the current storage version. It is impossible for the operator to know the stored version of a given resource. If we need to know it, we may want the operator to store it in an annotation.
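The migration described above can be sketched with an in-memory stand-in for the API server storage. `FakeStore` and both helpers are hypothetical and exist only for this illustration; a real implementation would issue updates through the Kubernetes API and then patch the stored versions recorded in the CRD status.

```python
class FakeStore:
    """Hypothetical stand-in for the API server storage layer."""

    def __init__(self, storage_version):
        self.storage_version = storage_version
        self.objects = {}  # name -> (stored_version, payload)

    def write(self, name, payload):
        # Every write persists the object in the current storage version.
        self.objects[name] = (self.storage_version, payload)


def migrate_all(store):
    """Rewrite every object unchanged so it is persisted in the storage version."""
    for name, (_, payload) in list(store.objects.items()):
        store.write(name, payload)  # no-op update


def prunable_stored_versions(store, stored_versions):
    """Versions that no object is stored in anymore and that can be dropped."""
    in_use = {version for version, _ in store.objects.values()}
    return [
        v for v in stored_versions
        if v not in in_use and v != store.storage_version
    ]


store = FakeStore("v1beta1")
store.write("a", {"spec": {}})
store.storage_version = "v1"  # the CRD now stores resources as v1
store.write("b", {"spec": {}})

before = prunable_stored_versions(store, ["v1beta1", "v1"])
migrate_all(store)
after = prunable_stored_versions(store, ["v1beta1", "v1"])
```

Before the no-op updates, resource `a` is still stored as v1beta1 and nothing can be pruned. After them, v1beta1 is no longer in use and can be removed.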

Owner references

Owner references set on some resources (e.g. Secrets) may reference resources with an old apiVersion (e.g. v1beta1, which is deprecated). This is fine.

Dealing with breaking changes in CRD versions

Let's imagine we introduce CRD v2 with a completely different schema. We should make sure the following test succeeds:

1. Create a resource in v2
2. Request it in v1, and update it (no-op) in v1
3. Request it in v2: it should have the exact same content as in step 1

When using conversion webhooks is a valid option, the conversion from v1 to v2 and v2 to v1 can be handled by the webhook directly. Information that only exists in v2 could technically be stored in annotations when converting to v1, and retrieved from there during the v1 -> v2 conversion.
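A minimal sketch of the annotation approach, assuming a hypothetical breaking change where v2 adds a `spec.tiers` field that has no v1 equivalent. The API group, annotation key and field names are made up for the illustration; a real webhook would wrap functions like these in its conversion handler.

```python
import copy
import json

# Hypothetical annotation key used to carry v2-only data through v1.
STASH_ANNOTATION = "example.k8s.io/v2-only-fields"


def v2_to_v1(v2):
    """Convert v2 to v1, stashing the v2-only field in an annotation."""
    v1 = copy.deepcopy(v2)
    v1["apiVersion"] = "example.k8s.io/v1"
    tiers = v1["spec"].pop("tiers", None)
    if tiers is not None:
        annotations = v1["metadata"].setdefault("annotations", {})
        annotations[STASH_ANNOTATION] = json.dumps(tiers)
    return v1


def v1_to_v2(v1):
    """Convert v1 to v2, restoring the stashed v2-only field if present."""
    v2 = copy.deepcopy(v1)
    v2["apiVersion"] = "example.k8s.io/v2"
    annotations = v2["metadata"].get("annotations", {})
    stashed = annotations.pop(STASH_ANNOTATION, None)
    if stashed is not None:
        v2["spec"]["tiers"] = json.loads(stashed)
    if not annotations:
        # Do not leave an empty annotations map behind.
        v2["metadata"].pop("annotations", None)
    return v2


original = {
    "apiVersion": "example.k8s.io/v2",
    "metadata": {"name": "sample"},
    "spec": {"count": 1, "tiers": ["hot", "warm"]},
}
round_tripped = v1_to_v2(v2_to_v1(original))
```

The round trip through v1 gives back the same v2 content, which is the property the test above is meant to check.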

When webhooks are not available, things get much more complicated. I don't see an easy way to handle conversions, except tracking the latest stored version in an annotation of the resource so the operator knows there might be a mismatch.

Another way to deal with breaking changes could also be to simply stop serving the old version:

1. Stop the operator using v1
2. Introduce CRD v2
3. Convert all existing resources from v1 to v2 with a script (effectively running what would otherwise be the conversion webhook)
4. Deprecate v1 (served: false or remove it entirely from the CRD)
5. Start the operator that handles v2. At this point the user cannot manipulate v1 resources anymore since they're not served.

If the operator has RBAC access to the CRD kind, steps 2 to 5 could be done by the new version of the operator itself at startup. I'm not sure yet how this would accommodate Red Hat Operator Lifecycle Manager.
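The ordering of steps 2 to 4 can be sketched on in-memory data. `migrate_to_v2` and the `convert` callback are hypothetical names; in a real setup each step would be a call to the Kubernetes API made by a script or by the operator at startup.

```python
def migrate_to_v2(crd, resources, convert):
    """Introduce v2 as the storage version, convert resources, stop serving v1."""
    # Step 2: introduce v2 and make it the only storage version.
    for version in crd["versions"]:
        version["storage"] = False
    crd["versions"].append({"name": "v2", "served": True, "storage": True})

    # Step 3: convert every existing resource from v1 to v2.
    migrated = [convert(resource) for resource in resources]

    # Step 4: deprecate v1 by no longer serving it.
    for version in crd["versions"]:
        if version["name"] == "v1":
            version["served"] = False
    return migrated


crd = {"versions": [{"name": "v1", "served": True, "storage": True}]}
resources = [{"apiVersion": "example.k8s.io/v1", "spec": {}}]  # hypothetical group
migrated = migrate_to_v2(
    crd,
    resources,
    lambda r: {**r, "apiVersion": "example.k8s.io/v2"},  # placeholder conversion
)
```

The placeholder conversion only rewrites apiVersion; a real one would carry the actual schema changes between v1 and v2.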

Upgrading the operator

It's as simple as applying the newest operator manifest so it replaces the existing one. We currently deploy the operator with a StatefulSet, to ensure two versions never run at the same time.

If we need to deal with breaking changes, we may decide to ignore any resource created with an old version of the operator, using the mechanism already in place.

Thoughts and questions

- Because multi-version OpenAPI schema in CRDs is only available starting k8s 1.16, and conversion webhooks are only available starting k8s 1.15 (if the user is OK using them), I feel like CRD versioning in k8s can only be done the right way starting k8s 1.16. We have to find workarounds for k8s < 1.16.
- Based on the above, is it worth handling multiple versions and implementing conversion webhooks at all? Ensuring the operator works with a single version at a time may be simpler.
- We should double-check how this would work with Red Hat Operator Lifecycle Manager.
- Things are definitely simpler if we don't end up having a backward-incompatible v2 version. Improving the existing v1 is fine as long as it does not break any other v1 resource. Adding new optional fields is OK.

pebrc commented 4 years ago

I think this is ready to close, but it may be worth capturing this summary somewhere else (ADR?) so that we don't bury it.
