Azure / azure-service-operator

Azure Service Operator allows you to create Azure resources using kubectl
https://azure.github.io/azure-service-operator/
MIT License

Upgrade across multiple major versions #4064

Open hmi12 opened 1 month ago

hmi12 commented 1 month ago

Describe the current behavior

Currently, we're running ASO v2.0.0-beta.2. Is it possible to upgrade directly to the latest version?

matthchr commented 1 month ago

You cannot go directly from beta.2 to the latest GA version, as there were a number of resource management changes between those two versions.

You need to pay close attention to the following changes:

I would recommend you read the other breaking change notices as well just to make sure you're not using the resources impacted.

The recommended upgrade pattern would be to step through every individual ASO version in order. This isn't strictly required, but it's safest and is what we recommend. That way, if something goes wrong, it's obvious what the old and new versions are and which changes might be causing the problem. You're also more likely to get quality support from us if you follow this pattern.

A (relatively) cautious but still slightly more risky upgrade would be: v2.0.0-beta.2 -> v2.0.0-beta.4 -> v2.0.0-beta.5 -> v2.0.0 -> v2.1.0 -> v2.3.0 -> v2.4.0 -> v2.7.0

This hits all of the versions that contain major changes but skips over some of the minor version releases that don't have major changes.

A risky upgrade that might still work would be: v2.0.0-beta.2 -> v2.0.0-beta.4 -> v2.0.0-beta.5 -> v2.0.0 -> v2.7.0

This hits the minimum versions that you MUST hit to get from where you are to latest.
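To make the two abbreviated paths above easier to follow, here is a minimal Python sketch that lists the hops still ahead of a given version. The version lists are copied from this comment; the names `CAUTIOUS_PATH`, `MINIMUM_PATH` and `remaining_hops` are hypothetical helpers for illustration and are not part of ASO or asoctl.

```python
# Upgrade paths quoted in this thread, oldest version first.
CAUTIOUS_PATH = [
    "v2.0.0-beta.2", "v2.0.0-beta.4", "v2.0.0-beta.5", "v2.0.0",
    "v2.1.0", "v2.3.0", "v2.4.0", "v2.7.0",
]
MINIMUM_PATH = [
    "v2.0.0-beta.2", "v2.0.0-beta.4", "v2.0.0-beta.5", "v2.0.0", "v2.7.0",
]


def remaining_hops(path, current):
    """Return the (from, to) upgrade steps still ahead of `current` on `path`."""
    if current not in path:
        raise ValueError(f"{current} is not a stop on this path")
    start = path.index(current)
    # Pair each version with its successor, starting at the current version.
    return list(zip(path[start:-1], path[start + 1:]))


if __name__ == "__main__":
    for old, new in remaining_hops(CAUTIOUS_PATH, "v2.0.0-beta.2"):
        print(f"upgrade {old} -> {new}, then check the ASO pod before continuing")
```

Keeping each hop as an explicit pair mirrors the advice above: if something breaks, the old and new versions involved are unambiguous.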

Note that in all cases, when you upgrade from v2.0.0+ to v2.4.0+, you must follow the v2.4.0 instructions on beta CRD deprecation and swap your CRDs to the GA versions.

I would recommend you just do it one ASO version at a time (the recommended pattern). You don't need to spend a lot of time at each ASO version: upgrade to a version, ensure the ASO pod launches successfully with no errors, maybe re-apply one of your resources with a simple edit (change tags or similar) to make sure things are working, and then upgrade again to the next version.

hmi12 commented 1 month ago

@matthchr Really appreciate your detailed recommendation. We need to conduct some verification in the testing environment. Or, is the following solution feasible?

  1. Add the "skip-reconcile" annotation to all Azure resources;
  2. Uninstall the old version of ASO from AKS;
  3. Install the latest version of ASO directly;
  4. Finally import the Azure resources using asoctl.

matthchr commented 1 month ago

That should, at least in theory, also work. Note that the annotation is reconcile-policy (set to skip), not skip-reconcile.

You'll need to make sure that you uninstall the CRDs too (which Helm won't do by default but you can do manually once you've deleted all of the instances of the CRDs).

asoctl gives you YAML that you might still need to massage a bit (to provide secrets, etc.), and you already have some (beta) YAML whose shape is likely very similar to the GA YAML shape. So it's not clear to me whether it'll be easier to start completely from scratch with asoctl-imported resources, or to modify your YAMLs locally to move from the beta to the GA version of the CRDs and then reapply them. If you follow the breaking change documentation, that local modification should just be the version itself and maybe a few other small things.
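As a rough illustration of the "just the version itself" part, the sketch below rewrites the apiVersion of an already-parsed manifest. It assumes that beta versions are named `v1beta<date>` and GA versions `v1api<date>`, and that ASO API groups end in `.azure.com`; check both against the breaking change documentation before relying on them. The function names and the sample manifest are hypothetical, and the sketch does not handle the "few other small things" that some resources need.

```python
def to_ga_api_version(api_version):
    """Swap an assumed ASO beta version prefix (v1beta...) for the GA one (v1api...)."""
    group, _, version = api_version.partition("/")
    # Assumption: only ASO groups (ending in .azure.com) should be touched.
    if not group.endswith(".azure.com") or not version.startswith("v1beta"):
        return api_version
    return f"{group}/v1api{version[len('v1beta'):]}"


def migrate_manifest(manifest):
    """Return a shallow copy of `manifest` with its apiVersion moved to GA."""
    updated = dict(manifest)
    updated["apiVersion"] = to_ga_api_version(manifest["apiVersion"])
    return updated


if __name__ == "__main__":
    # Illustrative manifest only; the group, version and names are placeholders.
    sample = {
        "apiVersion": "resources.azure.com/v1beta20200601",
        "kind": "ResourceGroup",
        "metadata": {"name": "example-rg"},
    }
    print(migrate_manifest(sample)["apiVersion"])
```

Non-ASO manifests pass through unchanged, so the helper can be run over a mixed set of parsed YAML documents.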

As to which is easier, the full upgrade outlined above or this approach, it probably depends on how many ASO resources you have. If you have hundreds or thousands of resources that you'd need to re-import, it'll probably be easier to just do the upgrade, even accounting for the fact that some of those resources may need to be updated due to the breaking changes mentioned above. Most resources will just need their version changed by swapping v1beta for v1api. On the other hand, if you don't have that many resources, marking them as reconcile-policy: skip, deleting them (in Kubernetes but not in Azure) and then re-importing them with asoctl might be easier.
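For the second approach, a minimal Python sketch of the marking step might look like the following. It works on manifests that have already been parsed into dictionaries. The full annotation key `serviceoperator.azure.com/reconcile-policy` is an assumption on my part (this thread only calls it reconcile-policy), so confirm it in the ASO documentation; `with_skip_policy` is a hypothetical helper name.

```python
import copy

# Assumed full annotation key; the thread itself only says "reconcile-policy".
RECONCILE_POLICY = "serviceoperator.azure.com/reconcile-policy"


def with_skip_policy(manifest):
    """Return a copy of `manifest` annotated so the operator skips reconciling it."""
    updated = copy.deepcopy(manifest)
    metadata = updated.setdefault("metadata", {})
    # Parsed YAML may carry `annotations: null`, so fall back to an empty dict.
    annotations = metadata.get("annotations") or {}
    annotations[RECONCILE_POLICY] = "skip"
    metadata["annotations"] = annotations
    return updated


if __name__ == "__main__":
    # Placeholder manifest for illustration.
    sample = {"kind": "ResourceGroup", "metadata": {"name": "example-rg"}}
    print(with_skip_policy(sample)["metadata"]["annotations"])
```

Returning a copy leaves the original manifest untouched, which makes it easy to diff the before and after versions before applying anything to the cluster.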

matthchr commented 1 month ago

It's also worth noting that while the above involves a lot of special cases and gotchas, that's primarily because of the large amount of time between beta.2 and 2.7.0, the fact that the beta CRDs were deprecated, and the fact that in beta.5 we added so many CRDs that the Helm chart became too large for Helm to manage them, so we had to start managing them ourselves.

Once you're into the GA version (2.0.0+), there are technically small breaking changes here and there but none that are going to impact every resource like the beta->GA migration does. I wouldn't expect a hypothetical v2.5.0 -> 2.14.0 to be this complicated.

theunrepentantgeek commented 1 week ago

How did you get on? Did you successfully upgrade - and which route did you take?

hmi12 commented 5 days ago

The upgrade is still pending on our task list. We might test both methods in the test environment, but we haven't started yet. We will update here if we have any new findings.