Kong / kong-operator

Kong Operator for Kubernetes and OpenShift
https://konghq.com
Apache License 2.0

Operator projects using the removed APIs in k8s 1.22 require changes. #65

Closed camilamacedo86 closed 2 years ago

camilamacedo86 commented 3 years ago

Problem Description

Kubernetes has been deprecating APIs that are removed and no longer available in 1.22. Operator projects using these API versions will not work on Kubernetes 1.22 or on any cluster vendor using this Kubernetes version, such as OpenShift 4.9+. The APIs most likely to affect this project are the v1beta1 versions of the CRD and admission webhook APIs (apiextensions.k8s.io/v1beta1 and admissionregistration.k8s.io/v1beta1), both removed in 1.22.

This project appears to distribute solutions via Red Hat Connect under the package name kong-offline-operator, and based on a check of the published distributions, none of the versions are compatible with k8s 1.22/OCP 4.9.

NOTE: The above findings are only about the manifests shipped inside the distributions; the codebase itself was not checked.

How to solve

It would be great to see new distributions of this project that no longer use these APIs, so that they work on Kubernetes 1.22 and newer, published in the Red Hat Connect collection. OpenShift 4.9, for example, will no longer ship operators that still use the v1beta1 extension APIs.

Due to the number of options available to build Operators, it is hard to provide direct guidance on updating your operator to support Kubernetes 1.22. Recent versions of the Operator SDK (greater than 1.0.0) and Kubebuilder (greater than 3.0.0) scaffold your project with the latest versions of these APIs (this covers everything generated by the tools). See the guides to upgrade your project with Operator SDK (Golang, Ansible, Helm) or with Kubebuilder. For APIs other than the ones mentioned above, you will have to check your code for usage of removed API versions and upgrade to newer APIs. The details of this depend on your codebase.

If this project only needs to migrate the API version for its CRDs and it was built with an Operator SDK version lower than 1.0.0, then you may be able to solve it with an Operator SDK version >= v0.18.x and < 1.0.0:

$ operator-sdk generate crds --crd-version=v1
INFO[0000] Running CRD generator.
INFO[0000] CRD generation complete.

Alternatively, you can try to upgrade your manifests with controller-gen (version >= v0.4.1):

If this project does not use Webhooks:

$ controller-gen crd:trivialVersions=true,preserveUnknownFields=false rbac:roleName=manager-role paths="./..."

If this project is using Webhooks:

  1. Add the sideEffects and admissionReviewVersions markers to your webhook (example with sideEffects=None and admissionReviewVersions={v1,v1beta1}: memcached-operator/api/v1alpha1/memcached_webhook.go).

  2. Run the command:

    $ controller-gen crd:trivialVersions=true,preserveUnknownFields=false rbac:roleName=manager-role webhook paths="./..."
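For step 1 above, a minimal sketch of what those markers look like on a kubebuilder-style webhook file. The two markers that matter for the v1 migration are sideEffects and admissionReviewVersions; all other values here (group, path, resource, webhook name) are hypothetical placeholders, not this project's actual API:

```go
// Sketch only. Real projects attach this marker to the type that implements
// the webhook; the group/path/name values below are invented for illustration.
package v1alpha1

// +kubebuilder:webhook:path=/validate-charts-example-com-v1alpha1-kong,mutating=false,failurePolicy=fail,sideEffects=None,groups=charts.example.com,resources=kongs,verbs=create;update,versions=v1alpha1,name=vkong.kb.io,admissionReviewVersions={v1,v1beta1}
```

With those markers in place, the controller-gen command in step 2 emits admissionregistration.k8s.io/v1 webhook configurations instead of v1beta1 ones.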

For further info and tips see the blog.

Thank you for your attention.

shaneutt commented 3 years ago

Thanks for the report :+1: we the maintainers will talk about this in our next team sync and will update here afterwards.

rainest commented 2 years ago

I believe this should be handled if we release a version of the operator based on KIC 2.0, since KIC 2.0 does upgrade affected resources to the new API versions.

Although we have lagged behind Operator SDK versions, the Helm-based operator we distribute doesn't use the Operator SDK to generate the bulk of those resources. They're all generated (as of KIC 2.0) in https://github.com/Kong/kubernetes-ingress-controller using controller-gen and the like, then copied into https://github.com/Kong/charts/blob/main/charts/kong/, and then copied again into the operator.

The one exception is the kongs.charts.helm.k8s.io CRD, which is specific to the operator and is still v1beta1. We need to figure out how to upgrade it (not sure if it was originally generated from something using operator-sdk or if we should update it manually).

Aside from that, we currently lack build tooling to generate UBI versions of the controller container image. The original image uploaded for the offline operator was a one-off build I did as a proof of concept. On a technical level, we can add support for this to our CI image builds without much effort (it was done, but not actually merged). @mflendrich had raised objections to it on the basis that the license information we're required to include in those images was probably incomplete/out of compliance.

camilamacedo86 commented 2 years ago

Hi, is there any update on this?

Note that we are very close to the release date, and fixing the project does not seem very hard. See how to fix it in the first comment. It would be great to be able to check a new distributed version of your project that is compatible with 4.9.

rainest commented 2 years ago

We understand the process to fix it, it's just that the operator is on the tail end of things we need to update--the CRD updates cascade down from KIC to our Helm chart and finally down to the operator, and we're just about to complete the chart portion of that: https://github.com/Kong/charts/pull/470

The operator update will come after that. The metadata accuracy question is still open, but IMO, given that the older version already in the wild has the same metadata problem, we may as well roll a new version of the operator once the upstream releases are done, to clear the compatibility issue.

camilamacedo86 commented 2 years ago

Hi @rainest,

Why not first provide a version that works on 4.9+ and then afterwards provide another one with the further improvements and changes? At least your users would be able to use it on k8s 1.22+ clusters and OCP 4.9+ as soon as possible.

If you are using v1beta1 only for CRDs, the solution is very trivial. Did you check the fix described in this task's description?

Note that on OCP 4.9+, users cannot install operator versions that do not work on 4.9+ at all.

rainest commented 2 years ago

There are CRDs and webhook definitions within the application itself that would also be affected. Updating those to the latest API versions was part of the KIC 2.0 release, and it would have been difficult to maintain separate versions of those resources for the operator temporarily--easier to have the final release of the CRDs out and update the operator with those.

That is now done, and I'm working on a draft operator upgrade that incorporates both the updated application resources and a v1 operator CRD. I just let Kubernetes upgrade the existing CRD without the assistance of Operator SDK, but that upgrade path appears to be fine per the blog:

In a live cluster, you can invoke Kubernetes’ conversion functionality by applying a crd.v1beta1 and then kubectl get a crd.v1 to view its converted format.
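Concretely, that conversion check looks roughly like this (a sketch, assuming a live cluster on 1.21 or older; the manifest filename is a hypothetical placeholder):

```shell
# Apply the old-format CRD; the API server stores it and converts internally.
kubectl apply -f kong-crd-v1beta1.yaml

# Read it back: the server serves it in apiextensions.k8s.io/v1 format.
kubectl get crd kongs.charts.helm.k8s.io -o yaml > kong-crd-v1.yaml
```

The served v1 manifest can then be cleaned up (status, managedFields) and committed as the new source-of-truth CRD.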

However, per the PR description, we do have an additional problem: we used a <something>.k8s.io group originally and those are now subject to Kubernetes project API review. We probably don't need to be an official API, so it'd probably make most sense to use some new group.

@camilamacedo86 do you know of any smooth process for handling that upgrade? Brief review of other projects suggests that we'll just need to instruct users to handle that manually, i.e. similar to https://cert-manager.io/docs/installation/upgrading/upgrading-0.10-0.11/, starting on 1.21 or older, you'd need to:

  1. Start an outage window (or somehow otherwise do something that accounts for your ingress gateway being temporarily offline).
  2. Save off your existing kongs.charts.helm.k8s.io CRs and copy the specs into new kongs.charts.konghq.com CRs.
  3. Delete all existing kongs.charts.helm.k8s.io CRs and the kongs.charts.helm.k8s.io CRD.
  4. Apply the latest Kong CRDs (KongPlugin, KongConsumer, and such) manually, since Helm doesn't manage those.
  5. Upgrade the operator version.
  6. Deploy the new kongs.charts.konghq.com CRs.
  7. Upgrade your Kubernetes cluster to 1.22.

Assuming that's correct, what should we indicate in the operator metadata? Should that actually go so far as to indicate that 0.9.0 has no skip or replaces versions, i.e. that you're basically installing a new operator that happens to have a manual upgrade path for the old Kong operator's CRs? We should probably bump to 1.0.0 instead if that's the case.

Edit: fielded the CRD portion of the question to #sig-api-machinery chat as well, and they agree that ETL to copy the old CR contents into new CRs with the new group is the best way to go, so no fancy automation to handle that for users. Question on OLM version info for that significant change remains open.

camilamacedo86 commented 2 years ago

Hi @rainest,

However, per the PR description, we do have an additional problem: we used a <something>.k8s.io group originally and those are now subject to Kubernetes project API review. We probably don't need to be an official API, so it'd probably make most sense to use some new group.

Could you please clarify what your problem is? What API do you use that is removed in 1.22/OCP 4.9, and what issues are you facing?

@camilamacedo86 do you know of any smooth process for handling that upgrade?

To integrate your project with OLM we need to create a bundle. The bundle is what you publish. The bundle contains all manifests used by your Operator plus the manifests required by OLM (e.g. the CSV).

PS.: The legacy package manifest format is supported as well. However, I'd recommend you move forward, adopt the bundle format, and take advantage of the SDK tooling to build and test it.

PS.: If you are using an SDK tool version >= 1.0.0 and respecting the default layout, it is very easy. You can simply run make bundle to generate/update the bundle based on all manifests configured for your project. You can also use operator-sdk run bundle to test that everything will work fine with your bundle. See https://sdk.operatorframework.io/docs/olm-integration/quickstart-bundle/.
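Under that default layout, the workflow is roughly the following (a sketch; the version and image names are hypothetical placeholders, and the bundle-build/bundle-push targets assume the SDK-scaffolded Makefile):

```shell
# Regenerate the bundle manifests and metadata from the project config.
make bundle VERSION=0.10.0

# Build and push the bundle image, then test-install it via OLM.
make bundle-build bundle-push BUNDLE_IMG=example.registry/kong-operator-bundle:v0.10.0
operator-sdk run bundle example.registry/kong-operator-bundle:v0.10.0
```

run bundle spins the bundle up on a live cluster with OLM installed, which catches CSV and CRD problems before publishing.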

Assuming that's correct, what should we indicate in the operator metadata? Should that actually go so far as to indicate that 0.9.0 has no skip or replaces versions, i.e. that you're basically installing a new operator that happens to have a manual upgrade path for the old Kong operator's CRs? We should probably bump to 1.0.0 instead if that's the case.

In the CSV you configure how the upgrade graph of your operator should work, to allow your users to subscribe to a channel and let OLM upgrade the versions installed. So, if you do nothing, as you say, then each release is like a brand-new operator, and you lose the lifecycle advantages provided by OLM. To understand how it works see: https://v0-18-z.olm.operatorframework.io/docs/concepts/olm-architecture/operator-catalog/creating-an-update-graph/
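For reference, the upgrade graph is declared per-version in the CSV via the replaces (or skips) field; the package and version names in this sketch are hypothetical:

```yaml
# ClusterServiceVersion fragment (sketch; names/versions are placeholders).
apiVersion: operators.coreos.com/v1alpha1
kind: ClusterServiceVersion
metadata:
  name: kong-offline-operator.v0.10.0
spec:
  version: 0.10.0
  # OLM upgrades installs of v0.9.0 to this version. Omitting `replaces`
  # instead makes this release the head of a fresh upgrade graph, i.e.
  # effectively a "new" operator with no automated upgrade path.
  replaces: kong-offline-operator.v0.9.0
```

This is the field at stake in the question above: shipping the new-group release without replaces/skips means OLM treats it as a fresh install rather than an upgrade.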

Edit: fielded the CRD portion of the question to #sig-api-machinery chat as well, and they agree that ETL to copy the old CR contents into new CRs with the new group is the best way to go, so no fancy automation to handle that for users. Question on OLM version info for that significant change remains open.

Not sure if I can follow your comment here.

However, an upgrade from v1beta1 to v1 of the CRD should not change your CRD's spec definitions. That means your CRs would still have the same content.

If you update your Helm charts to be compatible and then just regenerate the operator using the latest version of the SDK, everything will be scaffolded accordingly, using the latest API versions. That is how you upgrade your project from the old layout to the latest one. See: https://sdk.operatorframework.io/docs/building-operators/helm/migration/

After that, you can just run make bundle to get your whole bundle generated. Then you only need to fill it in with your project's details. See an example: https://github.com/operator-framework/operator-sdk/tree/master/testdata/helm/memcached-operator (the Memcached Helm operator; see the bundle dir)

What details? For example:

  - configure the channels
  - configure the upgrade path
  - configure the description of your project

Please see https://sdk.operatorframework.io/docs/olm-integration/generation/#bundle-format to know more about the bundle layout.

I hope that helps you out.

rainest commented 2 years ago

Could you please clarify what your problem is? What API do you use that is removed in 1.22/OCP 4.9, and what issues are you facing?

The API group we originally used was kongs.charts.helm.k8s.io. That isn't allowed in 1.22 without approval--anything that uses a .k8s.io suffix requires approval upstream:

The CustomResourceDefinition "kongs.charts.helm.k8s.io" is invalid: 
* metadata.annotations[api-approved.kubernetes.io]: Required value: protected groups must have approval annotation "api-approved.kubernetes.io", see https://github.com/kubernetes/enhancements/pull/1111

We don't want our API to be an approved community API, so we'd need to change the group. While the spec wouldn't actually change, you'd need to create new CRs, copying content from existing kongs.charts.helm.k8s.io instances into new kongs.charts.konghq.com instances.

It looks like the cockroachdb operator went through a similar transition, where earlier versions used a CRD with a k8s.io suffix. While their 2.x release line did include replaces info, the first version that uses the new charts.operatorhub.io group does not indicate any replaces versions, whereas subsequent versions do.

So I expect we need to do something similar, i.e. include no skip/replaces version info in our initial release with the new group. OLM won't be able to handle upgrades from 0.8 to later versions; users will have to handle that transition manually.

rainest commented 2 years ago

@camilamacedo86 We've added the new manual upgrade version and submitted https://github.com/k8s-operatorhub/community-operators/pull/300 to the OperatorHub side of things.

I've attempted to push updated operator and controller images to Red Hat Connect, but they're not showing up and running through the certification scan. Do you have access to any internal logs that would indicate why the images aren't propagating through the system? We've reached out to some other contacts we have at Red Hat, but aren't sure who's best able to diagnose issues with Connect.

camilamacedo86 commented 2 years ago

Hi @rainest,

I am happy that you could address the needs. That is very nice. About Red Hat Connect: could you please open a ticket to get help with it? Sorry, I do not have that info or access to its logs.

rainest commented 2 years ago

We've been able to get past the package upload/validation issues and have released 0.10.0 (an additional bump, since we needed to bump the SDK version) on Red Hat Connect, so I'm going ahead and closing this.