Azure / azure-service-operator

Azure Service Operator allows you to create Azure resources using kubectl
https://azure.github.io/azure-service-operator/
MIT License

Upgrade across multiple major versions #4064

Closed hmi12 closed 2 months ago

hmi12 commented 5 months ago

Describe the current behavior: Currently, we're running ASO v2.0.0-beta.2. Is it possible to upgrade directly to the latest version?

matthchr commented 5 months ago

You cannot go directly from beta.2 to the latest GA version, as there were a number of resource management changes between those two versions.

You need to pay close attention to the following changes:

I would recommend you read the other breaking change notices as well just to make sure you're not using the resources impacted.

The recommended upgrade pattern would be to go to every individual ASO version. This isn't strictly required but it's safest and is what we recommend. That way if something goes wrong it's obvious what the old/new versions are and the changes that might be causing the problem. You're more likely to get quality support from us following this pattern.

A (relatively) cautious but still slightly more risky upgrade would be: v2.0.0-beta.2 -> v2.0.0-beta.4 -> v2.0.0-beta.5 -> v2.0.0 -> v2.1.0 -> v2.3.0 -> v2.4.0 -> v2.7.0

This hits all of the versions that contain major changes but skips over some of the minor version releases that don't have major changes.

A risky upgrade that might still work would be: v2.0.0-beta.2 -> v2.0.0-beta.4 -> v2.0.0-beta.5 -> v2.0.0 -> v2.7.0

This hits the minimum versions that you MUST hit to get from where you are to latest.
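As a small illustration of how the two shorter paths relate, the sketch below lists each hop and checks that the minimal path visits a subset of the cautious path's versions in the same order. The version lists are copied from this comment; the function names `hops` and `is_subsequence` are made up for the example.

```python
# Upgrade paths copied from the comment above.
CAUTIOUS = ["v2.0.0-beta.2", "v2.0.0-beta.4", "v2.0.0-beta.5", "v2.0.0",
            "v2.1.0", "v2.3.0", "v2.4.0", "v2.7.0"]
MINIMAL = ["v2.0.0-beta.2", "v2.0.0-beta.4", "v2.0.0-beta.5", "v2.0.0", "v2.7.0"]


def hops(path):
    """Return the (from, to) pair for each upgrade step in a path."""
    return list(zip(path, path[1:]))


def is_subsequence(short, long):
    """True if every version in `short` appears in `long`, in the same order."""
    remaining = iter(long)
    # `in` on an iterator consumes it, so order is enforced.
    return all(version in remaining for version in short)


if __name__ == "__main__":
    for old, new in hops(MINIMAL):
        print(f"{old} -> {new}")
    print("minimal path contained in cautious path:",
          is_subsequence(MINIMAL, CAUTIOUS))
```

Writing the path down as data like this makes it easy to record which hop was in progress if something goes wrong.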

Note that in all cases, when you do the upgrade from v2.0.0+ to v2.4.0+ you must follow the v2.4.0 instructions on beta CRD deprecation and swap your CRDs to the GA versions. I would recommend you just do it one ASO version at a time (the recommended pattern). You don't need to spend lots of time at each ASO version: you can upgrade to a version, ensure the ASO pod launches successfully with no errors, maybe re-apply one of your resources with a simple edit (change tags or similar) to make sure things are working, and then upgrade again to the next version.

hmi12 commented 5 months ago

@matthchr Really appreciate your detailed recommendation. We need to conduct some verification in the testing environment. Or, is the following solution feasible?

  1. Add the "skip-reconcile" annotation to all Azure resources;
  2. Uninstall the old version of ASO from AKS;
  3. Install the latest version of ASO directly;
  4. Finally import the Azure resources using asoctl.

matthchr commented 5 months ago

That should at least in theory also work. Note that the annotation is reconcile-policy.
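For illustration, here is a minimal sketch of setting that annotation on a manifest that has already been parsed into a Python dict. It assumes the full annotation key is `serviceoperator.azure.com/reconcile-policy`; the helper name `with_reconcile_policy` and the resource group `my-rg` are hypothetical.

```python
import copy

# Assumption: the full annotation key carries the serviceoperator.azure.com prefix.
RECONCILE_POLICY = "serviceoperator.azure.com/reconcile-policy"


def with_reconcile_policy(manifest, policy="skip"):
    """Return a copy of a manifest dict with the reconcile-policy annotation set."""
    updated = copy.deepcopy(manifest)
    metadata = updated.setdefault("metadata", {})
    # Treat a missing or null annotations field as an empty mapping.
    annotations = metadata.get("annotations") or {}
    annotations[RECONCILE_POLICY] = policy
    metadata["annotations"] = annotations
    return updated


if __name__ == "__main__":
    # Hypothetical resource, used only to show the shape of the result.
    resource_group = {
        "apiVersion": "resources.azure.com/v1beta20200601",
        "kind": "ResourceGroup",
        "metadata": {"name": "my-rg", "namespace": "default"},
    }
    print(with_reconcile_policy(resource_group)["metadata"]["annotations"])
```

The function copies the input rather than mutating it, so the original manifest stays available for comparison before anything is re-applied.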

You'll need to make sure that you uninstall the CRDs too (which Helm won't do by default but you can do manually once you've deleted all of the instances of the CRDs).

Since asoctl gives you YAML that you still might need to massage a bit (for providing secrets, etc), and you also already have some (beta) YAML whose shape is likely very similar to the GA YAML shape, it's not clear to me if it'll be easier to start completely from scratch with asoctl imported resources or if it'd be easier to just modify your YAMLs locally to move from beta to GA version of CRDs (which if you follow that breaking change documentation should just be the version itself and maybe a few other small things) and then reapply them.

As to which is easier, the full upgrade outlined above or this approach, it probably depends on how many ASO resources you have. If you have hundreds or thousands of resources you'd need to re-import, it'll probably be easier to just do the upgrade, even accounting for the fact that some of those resources may need to be updated due to the breaking changes mentioned above. Most resources will just need their version changed by swapping the v1beta prefix for v1api (for example, v1beta20200601 becomes v1api20200601). On the other hand, if you don't have that many resources, marking them as reconcile-policy: skip, deleting them (in k8s but not Azure) and then re-importing with asoctl might be easier.
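If you go the route of editing your YAML locally, the version swap can be scripted. The sketch below is one way to do it, assuming your manifests use dated beta versions of the form `v1betaYYYYMMDD` in an `*.azure.com` API group; the function name `beta_to_ga` is made up for the example, and it does not handle any of the other resource-specific breaking changes.

```python
import re

# Matches an apiVersion line in an *.azure.com group whose version starts
# with "v1beta" followed by an 8-digit date, for example
# "apiVersion: resources.azure.com/v1beta20200601".
_BETA_API_VERSION = re.compile(
    r"^([ \t]*apiVersion:[ \t]*[\w.-]+\.azure\.com/)v1beta(\d{8})",
    re.MULTILINE,
)


def beta_to_ga(yaml_text):
    """Rewrite v1beta<date> apiVersions to v1api<date>, leaving other lines alone."""
    return _BETA_API_VERSION.sub(r"\1v1api\2", yaml_text)


if __name__ == "__main__":
    sample = (
        "apiVersion: resources.azure.com/v1beta20200601\n"
        "kind: ResourceGroup\n"
        "metadata:\n"
        "  name: my-rg\n"  # hypothetical resource name
    )
    print(beta_to_ga(sample))
```

Working on the text rather than parsing the YAML keeps comments and formatting intact, at the cost of only recognising `apiVersion` when it starts a line.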

matthchr commented 5 months ago

It's also worth noting that while the above is a lot of special cases and gotchas, that's primarily because of the large amount of time between beta.2 and 2.7.0, the fact that the beta CRDs were deprecated, and the fact that in beta.5 we added so many CRDs that we couldn't use Helm to manage them anymore: the chart was too large, so we had to start managing them ourselves.

Once you're into the GA version (2.0.0+), there are technically small breaking changes here and there but none that are going to impact every resource like the beta->GA migration does. I wouldn't expect a hypothetical v2.5.0 -> 2.14.0 to be this complicated.

theunrepentantgeek commented 4 months ago

How did you get on? Did you successfully upgrade - and which route did you take?

hmi12 commented 4 months ago

The upgrade is still pending on our task list. We might test both methods in the test environment, but we haven't started yet. We will update here if there are any new findings.

hmi12 commented 4 months ago

@matthchr @theunrepentantgeek We tried uninstalling ASO v2.0.0-beta.2, including removing the old CRDs, and then installing the latest version, but an error occurred while installing the new CRDs. It seems that the deprecated version is still present in etcd and cannot be manually removed. We might need to use asoctl clean crds to migrate the deprecated CRDs, but the prerequisite is: Ensure the current ASO v2 version in your cluster is beta.5.... Therefore, it seems we have to follow the recommended solution and upgrade through each individual ASO version sequentially.

Error message while installing the latest CRDs: request to convert CR from an invalid group/version: resources.azure.com/v1beta20200601

matthchr commented 3 months ago

That upgrade documentation was definitely written with the "upgrade 1 version at a time" in mind. The reason for the "must be beta.5" is because asoctl clean crds will only remove the beta CRDs if there are other versions "ahead" of them (the GA versions). So it won't work in your case because the CRDs are still old and don't have the new versions yet. BUT: if you've already deleted all of your old Custom Resources and it's just the CRDs that are left, you could just delete the ASO CRDs too and then reinstall them.

Normally deleting CRDs is scary/bad, but if you know there are no instances of the CRs in the cluster it should work. Going 1 version at a time should also work.
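To make the "no instances left" precondition concrete, here is a rough sketch of the check one might run before deleting CRDs. It is not part of ASO or asoctl. It assumes ASO CRD names end in `.azure.com` and that you have already counted the remaining instances per CRD yourself (for example from `kubectl get <crd> --all-namespaces -o name`); the function names and sample counts are hypothetical.

```python
def aso_crds(crd_names, suffix=".azure.com"):
    """Pick out CRD names that look like ASO CRDs.

    Assumption: ASO CRD names end in ".azure.com". Other operators could
    share that suffix, so review the result before acting on it.
    """
    return [name for name in crd_names if name.endswith(suffix)]


def deletable_crds(instance_counts):
    """Split CRDs into (safe, blocked) by whether any custom resources remain.

    instance_counts maps a CRD name to the number of instances still in
    the cluster.
    """
    safe = sorted(n for n, count in instance_counts.items() if count == 0)
    blocked = sorted(n for n, count in instance_counts.items() if count > 0)
    return safe, blocked


if __name__ == "__main__":
    # Hypothetical counts, used only to show the output shape.
    counts = {
        "resourcegroups.resources.azure.com": 0,
        "storageaccounts.storage.azure.com": 2,
    }
    safe, blocked = deletable_crds(counts)
    print("safe to delete:", safe)
    print("still has instances:", blocked)
```

Anything in the blocked list still has custom resources, so those CRDs are not covered by the "no instances" case described above.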

theunrepentantgeek commented 2 months ago

No further response, closing. Feel free to reopen if you have further questions.