
(aws-eks): Inconsistent error when updating Managed Nodegroups: "Version and ReleaseVersion updates cannot be combined with other updates" #13602

Open bleish opened 3 years ago

bleish commented 3 years ago

We've recently switched from using self-managed nodes (an auto scaling group) to a managed nodegroup for our EKS clusters. While running various EKS version updates, we've run into this error a few times, but there doesn't seem to be any real consistency to it. The error simply says "Version and ReleaseVersion updates cannot be combined with other updates", and is thrown when updating the managed nodegroup. We are using a custom image, so we have the managed nodegroup configured through a custom launch template instead of the standard options.

Initially, we thought the error was due to us updating both the EKS version and the custom image in the same deploy; however, we ran into an inconsistency that suggests otherwise. We deployed a 1.16 to 1.17 EKS update to two identically configured clusters in two different AWS regions: us-west-2 and eu-west-1. The only changes to both clusters were updating the EKS version and the custom image. However, one (us-west-2) completed the EKS and image upgrade, while the other (eu-west-1) successfully updated the EKS version but failed during the managed nodegroup update with the aforementioned error. It also doesn't seem to be an issue with the region, as we've had this error occur in previous deploys to us-west-2 as well. It leads us to believe that it might be some sort of race condition in what gets updated first. It could also be related to the way the launch template is created and handled by the CDK.

Reproduction Steps

This is our current setup. Deploying a 1.16 cluster, then running an update where the only change is in the eksVersion variable going from 1.16 to 1.17 might cause this error to occur.

// Imports assumed for this excerpt (CDK v1-style modular packages)
import * as cdk from '@aws-cdk/core';
import { Fn } from '@aws-cdk/core';
import * as ec2 from '@aws-cdk/aws-ec2';
import * as eks from '@aws-cdk/aws-eks';

const eksCluster = new eks.Cluster(this, "OurEKSCluster", {
    vpc: ourVpc,
    vpcSubnets: [{ subnetType: ec2.SubnetType.PRIVATE }],
    version: eks.KubernetesVersion.of(eksVersion), // We've experienced this issue with versions 1.16, 1.17, and 1.18
    defaultCapacity: 0,
    clusterName: "OurEKSCluster"
});

const customAmi = ec2.MachineImage.lookup({ name: `custom-ami_${eksVersion}` });

const userData = ec2.UserData.forLinux();
userData.addCommands(
    'set -o xtrace',
    `/etc/eks/bootstrap.sh ${eksCluster.clusterName} --kubelet-extra-args --node-labels=lifecycle=Ec2Spot`
);

const launchTemplate = new ec2.CfnLaunchTemplate(this, 'LaunchTemplate', {
    launchTemplateData: {
        instanceType: ec2.InstanceType.of(ec2.InstanceClass.T3A, ec2.InstanceSize.XLARGE).toString(),
        imageId: customAmi.getImage(this).imageId,
        keyName: "some-key",
        userData: Fn.base64(userData.render()),
        monitoring: {
            enabled: true
        },
        blockDeviceMappings: [{
            deviceName: "/dev/xvda",
            ebs: {
                deleteOnTermination: true,
                volumeSize: 50
            }
        }], 
        tagSpecifications: [{
            resourceType: "instance",
            tags: [
                new cdk.Tag("SomeKey", "SomeValue")
            ]
        }]
    },
});

const managedNodeGroup = eksCluster.addNodegroupCapacity("ManagedNodeGroup", {
    launchTemplateSpec: {
        id: launchTemplate.ref,
        version: launchTemplate.attrLatestVersionNumber
    },
    minSize: 1,
    maxSize: 20,
    capacityType: eks.CapacityType.SPOT
});

What did you expect to happen?

The EKS version is updated successfully, followed by the custom image being applied to the nodes on the managed nodegroup.

What actually happened?

The EKS version is updated, but the error, "Version and ReleaseVersion updates cannot be combined with other updates", is thrown during the managed nodegroup update step.

Environment

Other

In searching for this error, we've only come across one other similar issue. This one was with eksctl, but had the same error message when updating a managed nodegroup: https://github.com/weaveworks/eksctl/issues/2565


This is :bug: Bug Report

iliapolo commented 3 years ago

@bleish You mention that:

Deploying a 1.16 cluster, then running an update where the only change is in the eksVersion variable going from 1.16 to 1.17 might cause this error to occur

This strikes me as odd because in this scenario, the node group should be left untouched, as nothing pertaining to its properties has changed.

Can you please share the CDK output/CloudFormation events from this scenario, as well as the result of cdk diff?

Also, one thing that caught my eye in the code you posted is:

const customAmi = ec2.MachineImage.lookup({ name: `custom-ami_${eksVersion}` });

Are you committing the cdk.context.json? Or could this value depend on execution time, unintentionally changing the launch template without your knowledge?
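
For illustration, one way to take the lookup out of the equation is to pin the AMI IDs explicitly rather than looking them up at synth time. This is only a sketch, and the AMI IDs below are placeholders:

// Sketch only: pin the AMI per region so the launch template changes
// only when these values are edited deliberately. Placeholder AMI IDs.
const customAmi = ec2.MachineImage.genericLinux({
    'us-west-2': 'ami-00000000000000000',
    'eu-west-1': 'ami-11111111111111111',
});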

P.S. I am able to get this error when I update both the launch template and another property of the node group (labels, for example). But this is related to an inherent EKS limitation described here: https://github.com/aws/containers-roadmap/issues/1258

bleish commented 3 years ago

@iliapolo I apologize for the delayed response.

That EKS limitation you mentioned was what I had expected to be the issue. The problem is that we ran the exact same update against two clusters in the same state, and one failed while the other didn't: the launch template and node group were both allowed to be updated in a single deploy for one of the clusters, but not for the other.

It is also strange because, unless something is going on behind the scenes that I don't know about, the version affects the Cluster construct and the Launch Template, but not the Managed NodeGroup. The Launch Template itself IS a property on the Managed NodeGroup (as launchTemplateSpec), so I wonder if the error is thrown because we are updating the Launch Template, which in turn updates the Managed NodeGroup. However, sometimes the error is thrown, and sometimes it isn't.

As for logs, we unfortunately don't have a diff of that particular deploy, and due to the way the clusters were set up, I can't re-create the same conditions to have one pass and the other fail. I do know that the only update we pushed to these clusters was to update the EKS version from 1.16 to 1.17 on the Cluster, and to update the custom AMIs on the Launch Template. Here are the CloudFormation event logs for those two cluster updates:

eksUpdateFailure-EU.txt eksUpdateFailure-NA.txt

The EU cluster failed on the ManagedNodeGroup for the reason we are discussing, while the NA cluster failed for a different reason (40 minutes later). In my experience with this issue in the past, the NA cluster should have failed at the same time as the EU one, but it didn't.

We aren't committing the cdk.context.json, so it is created fresh on every deploy.

iliapolo commented 3 years ago

@NGL321 to follow up.

markussiebert commented 3 years ago

I'm struggling with the same problem and I think I can give you some information on this.

By updating the EKS version from 1.16 to 1.17, you changed the AMI ID:

const customAmi = ec2.MachineImage.lookup({ name: `custom-ami_${eksVersion}` });

This updates the launch template:

const launchTemplate = new ec2.CfnLaunchTemplate(this, 'LaunchTemplate', {
    launchTemplateData: {
        ...
        imageId: customAmi.getImage(this).imageId,
        ...
    },
});

You see, there is a direct property change on the launch template, so a new version will be created... and that, I think, is the problem.

const managedNodeGroup = eksCluster.addNodegroupCapacity("ManagedNodeGroup", {
    launchTemplateSpec: {
        id: launchTemplate.ref,
        version: launchTemplate.attrLatestVersionNumber
    },
    minSize: 1,
    maxSize: 20,
    capacityType: eks.CapacityType.SPOT
});

Here there is no "direct" reference to the launch template version. I think I saw that CloudFormation parallelizes both changes - a custom "DependsOn" for the nodegroup worked in this case.
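
For illustration, the explicit dependency could look something like this (a sketch only, reusing the construct names from the reproduction code above):

// Sketch only: make CloudFormation finish the launch template change
// before it starts updating the nodegroup.
managedNodeGroup.node.addDependency(launchTemplate);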

So I think in your EU case the launch template version was created before the nodegroup update was initiated, while in NA it wasn't.

I could be wrong, but that's the only conclusion I came to while facing the same issue.

Another issue with this:

Even if you accept everything and pin the AMI for the next deployment, CloudFormation will fail, because the rollback created a new version (the update to the EKS 1.17 AMI is one new version, and the CloudFormation rollback to the EKS 1.16 AMI is another). So even if you think you aren't making an AMI update, and the launch template won't change with the deployment, CloudFormation will still update the launch template of the EKS nodegroup.

I created a custom resource to handle this for me, but now I'm struggling with the one-hour custom resource limit set by AWS. So at the moment I am somewhat disappointed with the managed node groups.

christallire commented 8 months ago

latestVersionNumber or versionNumber causes trouble. Use defaultVersionNumber instead, and then manually change the default version in the AWS console.
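
For illustration, that would look roughly like this in the setup from the original report (a sketch only):

// Sketch only: reference the launch template's default version rather than
// its latest version, so the nodegroup only picks up a new launch template
// version when the default version is changed deliberately.
const managedNodeGroup = eksCluster.addNodegroupCapacity("ManagedNodeGroup", {
    launchTemplateSpec: {
        id: launchTemplate.ref,
        version: launchTemplate.attrDefaultVersionNumber,
    },
    minSize: 1,
    maxSize: 20,
    capacityType: eks.CapacityType.SPOT,
});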