pebrc opened 5 years ago
I'll try to summarise a few things I've learned while working on the migration to CRD v1. This does not answer the issue completely, but it may give a bit more context.
Adding a new CRD version mostly consists of adding it to the supported `versions` slice in the CRD. Setting `storage: true` means that version is the one used to persist resources in etcd. Setting `served: true` means a user is able to read and write resources in that particular version.
In Kubernetes < 1.16, we can only specify a single OpenAPI validation schema that applies to all CRD versions. In Kubernetes 1.16 and above, we can specify the OpenAPI validation schema per version. Cf. https://github.com/elastic/cloud-on-k8s/issues/2044#issuecomment-545365870.
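As an illustration, here is a trimmed-down sketch of what the `versions` slice can look like in an `apiextensions.k8s.io/v1` CRD (the schemas below are placeholders, not our real ones):

```yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: elasticsearches.elasticsearch.k8s.elastic.co
spec:
  group: elasticsearch.k8s.elastic.co
  names:
    kind: Elasticsearch
    listKind: ElasticsearchList
    plural: elasticsearches
    singular: elasticsearch
  scope: Namespaced
  versions:
  - name: v1beta1
    served: true    # users can still read and write this version
    storage: false  # but it is not the version persisted in etcd
    schema:
      openAPIV3Schema:  # per-version validation, possible since Kubernetes 1.16
        type: object
        x-kubernetes-preserve-unknown-fields: true
  - name: v1
    served: true
    storage: true   # resources are persisted in etcd in this version
    schema:
      openAPIV3Schema:
        type: object
        x-kubernetes-preserve-unknown-fields: true
```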
In theory, all CRD versions with `served: true` are backward-compatible with each other: the user can request the same resource in either the `v1` or the `v1beta1` version. To deal with the conversion from one version to another, we can implement a conversion webhook. When a resource is read or written in a version different from its stored version, the webhook is called to convert between the requested version and the stored version. Conversion webhooks are only available by default starting with Kubernetes 1.15. Some users may also want to disable any webhook we set up. As such, it's hard for us to rely completely on conversion webhooks.
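A sketch of the corresponding `spec.conversion` stanza in an `apiextensions.k8s.io/v1` CRD, assuming the webhook server is exposed by a Service (the namespace, name and path below are illustrative):

```yaml
spec:
  conversion:
    strategy: Webhook
    webhook:
      conversionReviewVersions: ["v1", "v1beta1"]
      clientConfig:
        caBundle: <base64-encoded CA certificate>
        service:
          namespace: elastic-system
          name: elastic-webhook-server
          path: /convert
          port: 443
```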
If there is no webhook and we retrieve a resource in a version that does not match its stored version, the APIServer converts it through a no-op conversion: the exact same resource payload is returned, with only the `apiVersion` field changed. This conversion is lossy whenever the newer version does not understand JSON fields of the old version.
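This no-op behaviour corresponds to the default conversion strategy of the CRD:

```yaml
spec:
  conversion:
    strategy: None  # default: only the apiVersion field is rewritten, the payload is kept as-is
```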
In the operator code, we only use a single version (the latest one), retrieved through the mechanism above (conversion webhook or no-op conversion). Any update of that resource by the operator (including an update with no payload change) automatically rewrites it in etcd in the current storage version.
Removing support for a particular version can be done by:

- setting `served: false` for that version: technically the version is still there, but it cannot be used by the user anymore;
- removing the version entirely from the CRD: this cannot be done with a simple `kubectl apply`, since it requires programmatically updating the existing CRD status sub-resource first. Cf. https://github.com/elastic/cloud-on-k8s/issues/2196 for more details.
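The state that matters here is the CRD's `status.storedVersions` list, which records every version that resources may still be persisted in. A version cannot be dropped from the CRD while it is still listed there, so the status sub-resource has to be shrunk first, e.g. from:

```yaml
status:
  storedVersions:
  - v1beta1
  - v1
```

to:

```yaml
status:
  storedVersions:
  - v1
```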
Before removing a version from the CRD, we're supposed to make sure no existing resource is still stored in that version. If we need to do that in the future, the operator itself could perform a no-op update on every resource it manages, to make sure they are all rewritten in the current storage version.
It is impossible for the operator to know the stored version of a given resource; if we need to know it, we may want the operator to record it in an annotation.

Owner references set on some resources (e.g. Secrets) may reference resources with an old apiVersion (e.g. `v1beta1`, which is deprecated). This is fine.
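For example, an owner reference on a Secret created before the migration may still look like the following (resource name and uid are illustrative) and does not need to be rewritten:

```yaml
metadata:
  ownerReferences:
  - apiVersion: elasticsearch.k8s.elastic.co/v1beta1  # old version, still resolves to the same CRD
    kind: Elasticsearch
    name: quickstart
    uid: <uid of the Elasticsearch resource>
    controller: true
    blockOwnerDeletion: true
```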
Let's imagine we introduce CRD `v2` with a completely different schema. We should make sure the following test succeeds:

1. Create a resource in `v2`.
2. Retrieve it in `v1`, and update it (no-op) in `v1`.
3. Retrieve it in `v2`: it should have the exact same content as in step 1.

When using conversion webhooks is a valid option, the conversions from v1 to v2 and from v2 to v1 can be handled by the webhook directly. Additional v2-only information could technically be stored in annotations when converting to v1, to be retrieved during the v1 -> v2 conversion.
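A hypothetical sketch of that round-tripping trick (the annotation key and the field it preserves are made up): when converting a v2 resource down to v1, the webhook would stash v2-only fields in an annotation, and read them back when converting to v2:

```yaml
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: quickstart
  annotations:
    # hypothetical annotation carrying a v2-only field through a v1 round-trip
    elasticsearch.k8s.elastic.co/v2-only-fields: '{"someNewV2Field": "value"}'
```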
When webhooks are not available, things get much more complicated. I don't see an easy way to handle conversions, except tracking the latest stored version in an annotation of the resource so the operator knows there might be a mismatch.
Another way to deal with breaking changes could also be to simply stop serving the old version:

1. Stop the running operator.
2. Update the CRD so that both `v1` and `v2` are served.
3. Migrate all existing resources from `v1` to `v2` with a script (effectively running what would otherwise be the conversion webhook).
4. Stop serving `v1` (`served: false`, or remove it entirely from the CRD).
5. Deploy the new operator version, which only deals with `v2`. At this point the user cannot manipulate `v1` resources anymore since they're not served.

If the operator has RBAC access to the CRD kind, steps 2 to 5 could be done by the new version of the operator itself at startup. I'm not sure yet how this would accommodate Red Hat's Operator Lifecycle Manager.
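After step 4, the `versions` slice would look something like this (schemas elided):

```yaml
versions:
- name: v1
  served: false  # still known to the API server, but cannot be read or written anymore
  storage: false
- name: v2
  served: true
  storage: true
```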
Upgrading the operator is as simple as applying the newest operator manifest so it replaces the existing one. We currently deploy the operator as a StatefulSet, to ensure two versions are not running at the same time.
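A trimmed-down sketch of that deployment (image tag illustrative): with a single-replica StatefulSet, the old Pod is fully terminated before its replacement is created, unlike a Deployment's default rolling update:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: elastic-operator
  namespace: elastic-system
spec:
  serviceName: elastic-operator
  replicas: 1  # stable Pod identity: the old Pod must be gone before the new one starts
  selector:
    matchLabels:
      control-plane: elastic-operator
  template:
    metadata:
      labels:
        control-plane: elastic-operator
    spec:
      containers:
      - name: manager
        image: docker.elastic.co/eck/eck-operator:1.0.0
```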
If we need to deal with breaking changes, we may decide to ignore any resource created with an old version of the operator, using the mechanism already in place.
In the short term, I don't think we need a new `v2` version. Improving the existing `v1` is fine as long as it does not break any other `v1` resource. Adding new optional fields is OK.

I think this is ready to close, but it would maybe be worth capturing this summary somewhere else (an ADR?) so that we don't bury it.