aws / aws-cdk

The AWS Cloud Development Kit is a framework for defining cloud infrastructure in code
https://aws.amazon.com/cdk
Apache License 2.0

(aws-eks): Tag Update in EKS-based Stack triggers "Version and ReleaseVersion updates cannot be combined with other updates" #16644

Open adriantaut opened 2 years ago

adriantaut commented 2 years ago

Hello guys,

We have a CDK Stack that, among other resources, bundles together an EKS Cluster and an EKS Managed NodeGroup. For cost/analytics purposes we apply a set of mandatory tags to all the resources of a Stack. If a Tag is added/changed/removed, the change is propagated through the CDK tree to all the resources, as you can see below.

[~] AWS::EKS::Nodegroup awsrefeksCluster/Cluster/NodegroupNodeGroup awsrefeksClusterNodegroupNodeGroup36246D59
 └─ [~] Tags
     └─ [~] .CiCd:CdkLibPpbVersion:
         ├─ [-] 7.40.0
         └─ [+] 7.41.0
.........
[~] AWS::EC2::LaunchTemplate awsrefeksCluster/NodeGroup-LaunchTemplate awsrefeksClusterNodeGroupLaunchTemplate4E8E0D48
 └─ [~] LaunchTemplateData
     └─ [~] .TagSpecifications:
         └─ @@ -24,7 +24,7 @@
            [ ] },
            [ ] {
            [ ]   "Key": "CiCd:CdkLibPpbVersion",
            [-]   "Value": "7.40.0"
            [+]   "Value": "7.41.0"
            [ ] },
            [ ] {
            [ ]   "Key": "CiCd:Product",
            @@ -65,7 +65,7 @@
            [ ] },
            [ ] {
            [ ]   "Key": "CiCd:CdkLibPpbVersion",
            [-]   "Value": "7.40.0"
            [+]   "Value": "7.41.0"
            [ ] },
            [ ] {
            [ ]   "Key": "CiCd:Product",

Basically the only change is one Tag's value, which is applied to all the resources. We are interested in particular in the AWS::EKS::Nodegroup and AWS::EC2::LaunchTemplate, as the Stack update fails with the following error:

AwsrefeksStack-ephemeral-tauta: creating CloudFormation changeset...
14:32:16 | UPDATE_FAILED        | AWS::EKS::Nodegroup                   | awsrefeksClusterNo...pNodeGroup36246D59
Version and ReleaseVersion updates cannot be combined with other updates

Did someone else face this issue before and might be able to advise? Thank you a lot!

What did you expect to happen?

Have the Tags updated without the change triggering a NodeGroup and LaunchTemplate update.

Environment


This is :bug: Bug Report

otaviomacedo commented 2 years ago

OK, so there are two questions here:

  1. Why are the tags being propagated to the whole subtree and how to avoid this? Since you created #16742 and have already done some research (thanks for that!), I think this conversation can keep going on there.

  2. Why did the update fail when the tag changed? This error comes from CloudFormation (and ultimately from the EKS control plane), so I don't have much insight here. I can create a ticket with the EKS team internally, but I would like to be able to reproduce the bug, before doing that. Could you share a short and self-contained example that demonstrates the issue?

github-actions[bot] commented 2 years ago

This issue has not received a response in a while. If you want to keep this issue open, please leave a comment below and auto-close will be canceled.

adriantaut commented 2 years ago

Thanks a lot for your reply, @otaviomacedo !

  1. I agree, it makes more sense.

  2. Sure thing, you can find a snippet below:

import * as cdk from '@aws-cdk/core';
import * as ec2 from '@aws-cdk/aws-ec2';
import { InstanceType, IVpc, SubnetType, SecurityGroup, Peer, Port, CfnLaunchTemplate, ISecurityGroup } from '@aws-cdk/aws-ec2';
import { Cluster, KubernetesVersion, EndpointAccess, Nodegroup } from '@aws-cdk/aws-eks';
import * as iam from '@aws-cdk/aws-iam';

export interface EksClusterProps {
  readonly vpc?: IVpc,
  readonly version?: SUPPORTED_KUBERNETES_VERSION,
  readonly nodeGroup: EksClusterNodeGroupProps,
}
type INSTANCE_METADATA_SERVICE = 'enabled' | 'disabled';
type SUPPORTED_KUBERNETES_VERSION = '1.18' | '1.19';

export interface EksClusterNodeGroupProps {
  /**
   * PPB EKS Optimized AMI ID - it needs to match with the Kubernetes Version of the Cluster
   */
  readonly imageId: string,
  /**
   * List of LDAP groups to grant access in the EKS NodeGroup instances
   */
  readonly ldapGroups?: string[],
  /**
   * Whether to enable/disable the access to Instance Metadata Service; having this enabled makes
   * all the pods to automatically inherit the nodeGroup IAM Role permissions, so if there is no
   * specific usecase, we strongly recommend having the IMDS disabled and use IAM Roles for k8s Service Accounts.
   *
   * @default 'disabled'
   */
  readonly instanceMetadataService?: INSTANCE_METADATA_SERVICE;
  /**
   * Instance Types of the EC2 instances in the NodeGroup
   */
  readonly instanceType?: InstanceType,
  /**
   * Minimum amount of EC2 instances in NodeGroup
   */
  readonly minSize?: number,
  /**
   * Maximum amount of EC2 instances in NodeGroup
   */
  readonly maxSize?: number,
}

export class EksCluster extends cdk.Construct {
  readonly account: string;
  /**
   * EKS Cluster instance
   */
  readonly cluster: Cluster;
  /**
   * EKS Cluster name
   */
  readonly clusterName: string;
  /**
   * EKS Cluster NodeGroup
   */
  readonly nodeGroup: Nodegroup;
  /**
   * EKS Cluster VPC
   */
  readonly vpc: IVpc;
  /**
   * Security group that is applied to both control plane and k8s nodes
   */
  private clusterSG: ISecurityGroup | undefined;

  constructor(scope: cdk.Construct, id: string, props: EksClusterProps) {
    super(scope, id);

    const stack = cdk.Stack.of(this);
    this.account = stack.account;
    this.clusterName = `cluster`;

    // look up the VPC (simplified for this reproduction; props.vpc handling omitted)
    this.vpc = ec2.Vpc.fromLookup(this, 'vpc', {
      vpcId: 'vpc-id'
    });

    // instantiate the EKS Cluster
    this.cluster = this.createCluster(props, this.vpc);

    // add the NodeGroup according to user specifications
    this.nodeGroup = this.addNodeGroup(props.nodeGroup);
  }

  /**
   * Creates the EKS Cluster and provides the access mechanisms through IAM roles and SG rules.
   *
   * @param props EksClusterProps interface options
   * @param vpc the VPC to deploy into
   */
  private createCluster(props: EksClusterProps, vpc: IVpc) {
    // role with system:masters permissions, used to run kubectl commands
    // ToDo: find a way of restricting the usage of this role from Pipelines Deploying k8s applications
    const mastersRoleName = `K8sMastersRole`;
    const mastersRole = new iam.Role(this, 'Masters-Role', {
      roleName: mastersRoleName,
      assumedBy: new iam.AccountPrincipal(this.account),
    });

    // restrict the Control Plane SG; otherwise outbound is widely opened
    const controlPlaneSG = new SecurityGroup(this, 'ControlPlaneSecurityGroup', {
      vpc,
      securityGroupName: `${this.clusterName}-Control-Plane-SG`,
      description: 'EKS Control Plane Security Group - restricted',
      allowAllOutbound: false,
    });

    // cluster creation
    const cluster = new Cluster(this, 'Cluster', {
      clusterName: this.clusterName,
      version: KubernetesVersion.of(props.version ?? '1.19'),
      vpcSubnets: [{ subnetType: SubnetType.PRIVATE }],
      endpointAccess: EndpointAccess.PRIVATE, // no access outside of your VPC,
      vpc,
      mastersRole,
      securityGroup: controlPlaneSG,
    });

    // Security group applied to control plane, nodes and all pods that do not match explicit SecurityGroupPolicy
    this.clusterSG = SecurityGroup.fromSecurityGroupId(this, 'ClusterSecurityGroup', cluster.clusterSecurityGroupId, {
      allowAllOutbound: false,
    });

    // Allow communication between nodes and control plane (see: https://docs.aws.amazon.com/eks/latest/userguide/sec-group-reqs.html)
    controlPlaneSG.connections.allowFrom(this.clusterSG, Port.tcp(443), 'Allow nodes to access apiserver');
    this.clusterSG.connections.allowFrom(controlPlaneSG, Port.tcpRange(1025, 65535), 'Allow management traffic from control plane to nodes');

    // Allow nodes accessing "EKS Lambdas" TODO what is EKS lambdas and why so open
    this.clusterSG.addEgressRule(Peer.anyIpv4(), Port.tcp(443), 'Needed by EKS Lambdas');

    // Allow nodes accessing Amazon2 Linux YUM repos TODO why so open
    this.clusterSG.addEgressRule(Peer.anyIpv4(), Port.tcp(80), 'Amazon2 Yum Repositories');

    // Allow inter-node communication for core components like autoscaler, external-dns, alb-controller that are deployed in a single worker node
    this.clusterSG.addEgressRule(this.clusterSG, Port.allTraffic(), 'Internode Communication');

    return cluster;
  }

  /**
   * Deploys the Compute part of the EKS Cluster through a single NodeGroup for the moment.
   * The NodeGroup needs the PPB EKS Optimized AMI ID, which needs to be set in the dependencies file
   * of the TLA repository, according to the EksClusterProps.version passed by the user.
   *
   * @param props
   */
  private addNodeGroup(props: EksClusterNodeGroupProps): Nodegroup {
    // create a launch template, which will be responsible for bootstrapping the worker nodes
    const launchTemplate = new CfnLaunchTemplate(this, 'NodeGroup-LaunchTemplate', {
      launchTemplateData: {
        metadataOptions: {
          httpEndpoint: 'enabled',
          httpTokens: 'required', // enforce IMDSv2
          httpPutResponseHopLimit: ((props.instanceMetadataService ?? 'disabled') == 'disabled') ? 1 : 2, // hopLimit = 1 effectively blocks IMDS access from pods
        },
        userData: cdk.Fn.base64(this.nodeGroupUserData(props.ldapGroups)),
        imageId: props.imageId,
        instanceType: props?.instanceType?.toString(),
        tagSpecifications: [
          {
            resourceType: 'instance',
            tags: [{ key: 'aws-inspector', value: 'true' }],
          },
        ],
      },
    });

    const nodeGroup = this.cluster.addNodegroupCapacity('NodeGroup', {
      nodegroupName: `${this.clusterName}-NodeGroup`,
      minSize: props?.minSize,
      maxSize: props?.maxSize,
      subnets: { subnetType: SubnetType.PRIVATE },
      launchTemplateSpec: {
        id: launchTemplate.ref,
        version: launchTemplate.attrLatestVersionNumber,
      },
    });

    return nodeGroup;
  }
}

and its usage:

export class AppStack extends cdk.Stack {
    constructor(scope: cdk.Construct, id: string, props?: cdk.StackProps) {
      super(scope, id, props);

      const cluster = new EksCluster(this, 'awsrefeksCluster', {
        ...props,
        version: '1.19',
        nodeGroup: {
          imageId: image.getImage(this).imageId,
          minSize: 1,
          maxSize: 2,
          instanceType: InstanceType.of(InstanceClass.C5, InstanceSize.LARGE),
        },
      });
    }
}

const app = new cdk.App();
cdk.Tags.of(app).add('CiCd:CdkLibPpbVersion', '7.40.0');

new AppStack(app, 'app-stack');

Now, if only that CiCd:CdkLibPpbVersion Tag applied at the CDK App level is changed from 7.40.0 -> 7.41.0, it triggers a change both in the EKS NodeGroup and in the EC2 LaunchTemplate, as you can see in the initial message of this issue. Nothing changes except the tags, but the error surfaces:

AwsrefeksStack-ephemeral-tauta: creating CloudFormation changeset...
14:32:16 | UPDATE_FAILED        | AWS::EKS::Nodegroup                   | awsrefeksClusterNo...pNodeGroup36246D59
Version and ReleaseVersion updates cannot be combined with other updates

Hope this can help. Thank you a lot!

otaviomacedo commented 2 years ago

nodeGroupUserData doesn't exist:

userData: Fn.base64(this.nodeGroupUserData(props.ldapGroups)),

And what is image.getImage?

imageId: image.getImage(this).imageId,
adriantaut commented 2 years ago

Did not include those as I was thinking they are not really relevant for the issue.

The nodeGroupUserData() method prepares the user data for the worker nodes when working with self-managed EKS NodeGroups:

  /**
   * Puts together the EC2 user-data for bootstrapping the EKS Compute through NodeGroup nodes.
   *
   * @param ldapGroups a list of LDAP Groups that will have access to NodeGroup nodes
   */
  private nodeGroupUserData(ldapGroups?: string[]) {
    let loginAccessControlTable = '';
    let sudoersCommand = '';

    if (ldapGroups) {
      let sudoersRaw = '';
      loginAccessControlTable += '\\n#60 - EKS user defined access';
      for (const group of ldapGroups) {
        loginAccessControlTable += '\\n+ : (' + group + ') : ALL';
        sudoersRaw += '%' + group + '      ALL=(ALL)       ALL\\n';
      }
      sudoersCommand += `echo -e '${sudoersRaw}' > /etc/sudoers.d/75-owners`;
    }

    // ToDo: remove (CloudAutomation) once the CloudOps MR is merged and new AMIs are built
    const userData = `MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="==MYBOUNDARY=="

--==MYBOUNDARY==
Content-Type: text/x-shellscript; charset="us-ascii"

#!/bin/bash -e

# configure LDAP access in the NodeGroup workers
sed -i '/+ : qualys : ALL/ a ${loginAccessControlTable}' /etc/security/access.conf
${sudoersCommand}

# register workers with the EKS Cluster
set -ex
B64_CLUSTER_CA=${this.cluster.clusterCertificateAuthorityData}
API_SERVER_URL=${this.cluster.clusterEndpoint}
/etc/eks/bootstrap.sh ${this.clusterName} --b64-cluster-ca $B64_CLUSTER_CA --apiserver-endpoint $API_SERVER_URL

--==MYBOUNDARY==--`;

    return userData;
  }

and for the image.getImage() you can safely replace that with the AMI ID of an EKS-optimized AMI provided by AWS.
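If it helps reproduction: AWS publishes the recommended EKS-optimized AMI IDs as public SSM parameters. A minimal sketch of building that parameter path (the helper name is our own; the path assumes the Amazon Linux 2 variant):

```typescript
// Builds the public SSM parameter path under which AWS publishes the
// recommended EKS-optimized Amazon Linux 2 AMI for a Kubernetes version.
// The helper name is our own; the path format comes from the AWS docs.
function eksOptimizedAmiParameter(kubernetesVersion: string): string {
  return `/aws/service/eks/optimized-ami/${kubernetesVersion}/amazon-linux-2/recommended/image_id`;
}

// In a CDK stack the parameter could then be resolved at deploy time, e.g.:
//   const imageId = ssm.StringParameter.valueForStringParameter(
//     this, eksOptimizedAmiParameter('1.19'));
```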

markussiebert commented 2 years ago

@otaviomacedo there are several issues related to this behaviour. I think this is a CloudFormation limitation...

EKS has no API call to update the nodegroup version and the nodegroup config in one go. There are only:

  * UpdateNodegroupConfig
  * UpdateNodegroupVersion

While one is running, you will get an error that only one update can be performed at a time if you try to start the other one. So from the EKS side everything is fine.

And I think this "error" was introduced by CloudFormation to prevent failures (which update should be done first if both API calls have to be performed? What should be done if something goes wrong?). I myself created a custom resource that performs the nodegroup config update after the version upgrade, to somehow decouple the two API calls (but I'm only 10 minutes away from the custom resource timeout of 1 hour).

But in this example, the Tag change leads to a change of the launch template tag specifications (which creates a new launch template version) and a change of the nodegroup tags themselves. So both API calls have to be performed, and this is what CloudFormation won't do.

It's always the same problem: there are many scenarios where you have to change both, but CloudFormation won't handle this. Also see #13602

It would be nice if this could be handled by CloudFormation. Maybe there should be a property on the CloudFormation nodegroup resource like updateStrategy: configFirst | versionFirst ...
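The updateStrategy idea above could be sketched as a tiny planner over the two real EKS API calls. This is purely illustrative; no such property exists in CloudFormation today:

```typescript
type NodegroupApiCall = 'UpdateNodegroupConfig' | 'UpdateNodegroupVersion';
type UpdateStrategy = 'configFirst' | 'versionFirst';

// EKS only accepts one in-flight update per nodegroup, so a combined change
// would have to be split into an ordered sequence of the two real API calls.
// A hypothetical updateStrategy property would decide the ordering.
function planNodegroupUpdate(
  configChanged: boolean,
  versionChanged: boolean,
  strategy: UpdateStrategy = 'configFirst',
): NodegroupApiCall[] {
  const calls: NodegroupApiCall[] = [];
  if (configChanged) calls.push('UpdateNodegroupConfig');
  if (versionChanged) calls.push('UpdateNodegroupVersion');
  return strategy === 'configFirst' ? calls : calls.reverse();
}
```

A custom resource like the one described above would then execute the planned calls one at a time, waiting for each nodegroup update to finish before issuing the next.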

otaviomacedo commented 2 years ago

Thanks, @markussiebert. I think you're spot on! I've raised the issue internally with the CloudFormation team.

markussiebert commented 2 years ago

@otaviomacedo any news from the cloudformation team?

Hunter-Thompson commented 2 years ago

same issue here

caviliar commented 2 years ago

I have the same problem.

Just added cdk.Tags.of(this).add('foo', 'bar'); to my stack and got the Version and release version updates cannot be combined with other updates error.

Would making the node-group dependsOn the launch-template have any effect of ordering?

markussiebert commented 2 years ago

No, it won't help... I think I analyzed it enough, and AWS could solve it if they wanted to. I suspect AWS thinks this behaviour is near perfect, otherwise they would change it.


damienpuig commented 2 years ago

@otaviomacedo Hi! Any news from the team(s)? Thanks

adriantaut commented 1 year ago

Hello @otaviomacedo , any signs of movement on this one? Cheers!

dbaumgarten commented 1 year ago

Has anyone at least found some workaround for this issue? Would it help to exclude the NodeGroup or the LaunchTemplate from the tagging? Is there a way to do this without having the instances created by the ASG lose their tags?

adriantaut commented 1 year ago

@dbaumgarten this issue was also raised in https://github.com/aws/aws-cdk/issues/16742 ... nope, we don't really have a solution/workaround yet
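One candidate mitigation for the exclusion question above (untested against this particular failure) is the excludeResourceTypes option that CDK's tagging API accepts, which skips tagging for the listed CloudFormation resource types:

```typescript
import * as cdk from '@aws-cdk/core';

declare const app: cdk.App; // your existing App instance

// Apply the mandatory tag everywhere except the two resource types whose
// combined update fails. Whether skipping them is acceptable for the
// cost/analytics reporting is a separate question.
cdk.Tags.of(app).add('CiCd:CdkLibPpbVersion', '7.41.0', {
  excludeResourceTypes: ['AWS::EKS::Nodegroup', 'AWS::EC2::LaunchTemplate'],
});
```

Note that instances launched by the nodegroup would then no longer carry the tag via the launch template's tag specifications.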

dbaumgarten commented 1 year ago

Is there at least a way I can (semi-)manually change the tags of my resources, in a way that works fine with future runs of CDK? I am currently in a situation where I would need to change some tags and can't do this without completely re-creating the whole stack...

Maybe something like:

  1. Manually change the tags of the nodegroup
  2. Change the cdk-script and run it
markussiebert commented 1 year ago

I had created a custom resource lambda. Maybe I can publish it. It handles the nodegroup updates for us...


andrewnitu commented 1 year ago

I had created a custom resource lambda. Maybe I can publish it. It handles the nodegroup updates for us...

@markussiebert did you end up publishing it? Would be a useful workaround for now...

ThomasSteinbach commented 1 year ago

AWS CDK seems to reveal many flaws of CloudFormation. Since it could take years until AWS fixes all these bugs (or builds a better base for turning the actual state into the target state), as a workaround we removed the tags from the NodeGroup (and also from the problematic CfnUserGroup) with the following aspect on our Stack construct:

@jsii.implements(aws_cdk.IAspect)
class TagRemoverAspect:
    def visit(self, node):
        # By default we exclude the following resource types from tagging:
        # * Elasticache::UserGroup - has the error: "extraneous key [tags] is not ..."
        if isinstance(node, aws_cdk.aws_elasticache.CfnUserGroup):
            node.add_deletion_override("Properties.Tags")
        if isinstance(node, aws_cdk.aws_eks.Nodegroup):
            node.node.default_child.add_deletion_override("Properties.Tags")

aws_cdk.Aspects.of(self).add(TagRemoverAspect())
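For TypeScript users, a roughly equivalent (untested) aspect might look like this:

```typescript
import * as cdk from '@aws-cdk/core';
import * as eks from '@aws-cdk/aws-eks';

// Strips the Tags property from Nodegroups so tag-only changes no longer
// touch the nodegroup resource (sketch of the Python aspect above).
class TagRemoverAspect implements cdk.IAspect {
  public visit(node: cdk.IConstruct): void {
    if (node instanceof eks.Nodegroup) {
      const cfnNodegroup = node.node.defaultChild as eks.CfnNodegroup;
      cfnNodegroup.addDeletionOverride('Properties.Tags');
    }
  }
}

// Register on the stack (or any scope):
//   cdk.Aspects.of(stack).add(new TagRemoverAspect());
```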
portswigger-tim commented 8 months ago

I worked around this by allowing a static launch template version number (set to the current version) to be passed in to my EKS stack, and then removing it to update to the latest version.

Two CDK runs, but it worked and gives me an escape hatch in the future.
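The escape hatch described above could look roughly like this (the prop name and helper are our invention):

```typescript
// Callers may pin the launch template version; when the prop is absent we
// fall back to the latest version (e.g. launchTemplate.attrLatestVersionNumber).
interface NodeGroupVersionProps {
  readonly launchTemplateVersion?: string;
}

function resolveLaunchTemplateVersion(
  props: NodeGroupVersionProps,
  latestVersion: string,
): string {
  return props.launchTemplateVersion ?? latestVersion;
}
```

During a tag-only deploy, pinning keeps the launchTemplateSpec version stable so the nodegroup sees no version change; a second deploy without the prop then picks up the latest version.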

christallire commented 8 months ago

latestVersionNumber or versionNumber causes trouble. Use defaultVersionNumber and then manually change it in the AWS console.