eksctl-io / eksctl

The official CLI for Amazon EKS
https://eksctl.io

[Bug] User defined multipart User Data scripts do not work properly - breaks nodeadm functionality #7895

Open bradwatsonaws opened 3 months ago

bradwatsonaws commented 3 months ago

What were you trying to accomplish?

I am trying to create a managed node group with my own multipart user data script as part of an overrideBootstrapCommand. This multipart user data script should run a mix of bash commands and also satisfy the requirements for nodeadm node initialization.

What happened?

When eksctl creates the launch template and takes the user data script defined by the user, it appears to add its own multipart boundaries, which prevents the user-defined multipart user data script from working as expected. The node group is created with a launch template as usual, but the nodes are unable to join the cluster: nodeadm defaults to reading its configuration from IMDS, and the boundaries eksctl adds to the multipart user data prevent nodeadm from finding a configuration there.
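
A quick way to confirm the double wrapping is to decode the user data stored in the launch template that eksctl created. This is a minimal sketch using the AWS CLI; the launch template ID is a placeholder:

# Decode the user data of the eksctl-generated launch template.
# lt-0123456789abcdef0 is a placeholder for the template eksctl created.
aws ec2 describe-launch-template-versions \
  --launch-template-id lt-0123456789abcdef0 \
  --versions '$Latest' \
  --query 'LaunchTemplateVersions[0].LaunchTemplateData.UserData' \
  --output text | base64 -d    # use base64 -D on older macOS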

Example user-defined multipart user data script passed into overrideBootstrapCommand:

MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="BOUNDARY"

--BOUNDARY
Content-Type: application/node.eks.aws

---
apiVersion: node.eks.aws/v1alpha1
kind: NodeConfig
spec:
  cluster:
    name: rhel-eks
    apiServerEndpoint: https://myclusterapi.gr7.us-gov-east-1.eks.amazonaws.com
    certificateAuthority: mysuperlongcertificatexyzabc
    cidr: 10.100.0.0/16

--BOUNDARY
Content-Type: text/x-shellscript;

#!/bin/bash
set -ex
systemctl enable kubelet.service
systemctl disable nm-cloud-setup.timer
systemctl disable nm-cloud-setup.service
reboot

--BOUNDARY--

Resulting user data script created by eksctl in the node group launch template:

MIME-Version: 1.0
Content-Type: multipart/mixed; boundary=478b56b7f407b2f8102862b68821d558cacbdf7575b0163bf3b5b98566a8

--478b56b7f407b2f8102862b68821d558cacbdf7575b0163bf3b5b98566a8
Content-Type: text/x-shellscript
Content-Type: charset="us-ascii"

MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="BOUNDARY"

--BOUNDARY
Content-Type: application/node.eks.aws

---
apiVersion: node.eks.aws/v1alpha1
kind: NodeConfig
spec:
  cluster:
    name: rhel-eks
    apiServerEndpoint: https://myclusterapi.gr7.us-gov-east-1.eks.amazonaws.com
    certificateAuthority: mysuperlongcertificatexyzabc
    cidr: 10.100.0.0/16

--BOUNDARY
Content-Type: text/x-shellscript;

#!/bin/bash
set -ex
systemctl enable kubelet.service
systemctl disable nm-cloud-setup.timer
systemctl disable nm-cloud-setup.service
reboot

--BOUNDARY--

--478b56b7f407b2f8102862b68821d558cacbdf7575b0163bf3b5b98566a8--

As you can see, eksctl generates its own multipart document with its own uniquely generated boundaries, which prevents the user-defined boundaries from being respected.
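
To see the envelope exactly as nodeadm does, a minimal sketch run on an affected node fetches the user data from IMDS, using the IMDSv2 token flow since the launch template sets HttpTokens: required:

# Fetch instance user data the way nodeadm reads it (IMDSv2 token flow).
TOKEN=$(curl -sX PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 300")
curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
  "http://169.254.169.254/latest/user-data"
# At the outer (eksctl-generated) boundary there is only a single
# text/x-shellscript part, whose body happens to be the entire user-supplied
# MIME document; no application/node.eks.aws part exists at the top level,
# so nodeadm finds no NodeConfig.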

How to reproduce it?

A zsh script with parameters passed in that match the variables defined at the top of the script:

#!/bin/zsh

EKS_CLUSTER=$1
AMI_ID=$2
MANAGED_NODE_GROUP=$3
AWS_REGION=$4
KEY_PAIR=$5
INSTANCE_TYPE=$6
MIN_SIZE=$7
DESIRED_SIZE=$8
MAX_SIZE=$9
API_ENDPOINT=${10}
CIDR=${11}
CERTIFICATE=${12}
DATE_TIME=$(date +'%Y%m%d%H%M')

cat > managednodegroup-$DATE_TIME.yaml << EOF
---
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: $EKS_CLUSTER
  region: $AWS_REGION

managedNodeGroups:
  - name: $MANAGED_NODE_GROUP
    minSize: $MIN_SIZE
    desiredCapacity: $DESIRED_SIZE
    maxSize: $MAX_SIZE
    ami: $AMI_ID
    amiFamily: AmazonLinux2023
    instanceType: $INSTANCE_TYPE
    labels:
      role: worker
    tags:
      nodegroup-name: $MANAGED_NODE_GROUP
    privateNetworking: true

    overrideBootstrapCommand: |
      MIME-Version: 1.0
      Content-Type: multipart/mixed; boundary="BOUNDARY"

      --BOUNDARY
      Content-Type: application/node.eks.aws

      ---
      apiVersion: node.eks.aws/v1alpha1
      kind: NodeConfig
      spec:
        cluster:
          name: $EKS_CLUSTER
          apiServerEndpoint: $API_ENDPOINT
          certificateAuthority: $CERTIFICATE
          cidr: $CIDR

      --BOUNDARY
      Content-Type: text/x-shellscript;

      #!/bin/bash
      set -ex
      systemctl enable kubelet.service
      systemctl disable nm-cloud-setup.timer
      systemctl disable nm-cloud-setup.service
      reboot

      --BOUNDARY--
EOF

eksctl create nodegroup --config-file=managednodegroup-$DATE_TIME.yaml --cfn-disable-rollback
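
For reference, an invocation of the script might look like the following; the file name and every argument value are placeholders that must match your environment:

# Hypothetical invocation of the reproduction script above; all twelve
# positional arguments are placeholders.
./managednodegroup.zsh rhel-eks ami-0123456789abcdef0 my-nodegroup \
  us-gov-east-1 my-keypair t3.medium 2 2 2 \
  https://example.gr7.us-gov-east-1.eks.amazonaws.com 10.100.0.0/16 \
  <base64-encoded-certificate-authority>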

Logs

2024-07-18 08:51:16 [ℹ] will use version 1.29 for new nodegroup(s) based on control plane version
2024-07-18 08:51:18 [ℹ] nodegroup "rhel-eks-nodeadmn-new" will use "ami-095c7b500f70da3d0" [AmazonLinux2/1.29]
2024-07-18 08:51:18 [ℹ] 2 existing nodegroup(s) (rhel-eks-github,rhel-eks-nodeadm) will be excluded
2024-07-18 08:51:18 [ℹ] 1 nodegroup (rhel-eks-nodeadmn-new) was included (based on the include/exclude rules)
2024-07-18 08:51:18 [ℹ] will create a CloudFormation stack for each of 1 managed nodegroups in cluster "rhel-eks"
2024-07-18 08:51:19 [ℹ] 2 sequential tasks: { fix cluster compatibility, 1 task: { 1 task: { create managed nodegroup "rhel-eks-nodeadmn-new" } } }
2024-07-18 08:51:19 [ℹ] checking cluster stack for missing resources
2024-07-18 08:51:19 [ℹ] cluster stack has all required resources
2024-07-18 08:51:19 [ℹ] building managed nodegroup stack "eksctl-rhel-eks-nodegroup-rhel-eks-nodeadmn-new"
2024-07-18 08:51:20 [ℹ] deploying stack "eksctl-rhel-eks-nodegroup-rhel-eks-nodeadmn-new"
2024-07-18 08:51:20 [ℹ] waiting for CloudFormation stack "eksctl-rhel-eks-nodegroup-rhel-eks-nodeadmn-new"
2024-07-18 08:51:50 [ℹ] waiting for CloudFormation stack "eksctl-rhel-eks-nodegroup-rhel-eks-nodeadmn-new"
2024-07-18 08:52:42 [ℹ] waiting for CloudFormation stack "eksctl-rhel-eks-nodegroup-rhel-eks-nodeadmn-new"
2024-07-18 08:54:03 [ℹ] waiting for CloudFormation stack "eksctl-rhel-eks-nodegroup-rhel-eks-nodeadmn-new"
2024-07-18 08:55:08 [ℹ] waiting for CloudFormation stack "eksctl-rhel-eks-nodegroup-rhel-eks-nodeadmn-new"
2024-07-18 08:56:09 [ℹ] waiting for CloudFormation stack "eksctl-rhel-eks-nodegroup-rhel-eks-nodeadmn-new"
2024-07-18 08:56:59 [ℹ] waiting for CloudFormation stack "eksctl-rhel-eks-nodegroup-rhel-eks-nodeadmn-new"
2024-07-18 08:58:12 [ℹ] waiting for CloudFormation stack "eksctl-rhel-eks-nodegroup-rhel-eks-nodeadmn-new"
2024-07-18 08:59:49 [ℹ] waiting for CloudFormation stack "eksctl-rhel-eks-nodegroup-rhel-eks-nodeadmn-new"
2024-07-18 09:00:26 [ℹ] waiting for CloudFormation stack "eksctl-rhel-eks-nodegroup-rhel-eks-nodeadmn-new"
2024-07-18 09:01:30 [ℹ] waiting for CloudFormation stack "eksctl-rhel-eks-nodegroup-rhel-eks-nodeadmn-new"
2024-07-18 09:02:30 [ℹ] waiting for CloudFormation stack "eksctl-rhel-eks-nodegroup-rhel-eks-nodeadmn-new"
2024-07-18 09:03:54 [ℹ] waiting for CloudFormation stack "eksctl-rhel-eks-nodegroup-rhel-eks-nodeadmn-new"
2024-07-18 09:05:07 [ℹ] waiting for CloudFormation stack "eksctl-rhel-eks-nodegroup-rhel-eks-nodeadmn-new"
2024-07-18 09:06:54 [ℹ] waiting for CloudFormation stack "eksctl-rhel-eks-nodegroup-rhel-eks-nodeadmn-new"
2024-07-18 09:08:02 [ℹ] waiting for CloudFormation stack "eksctl-rhel-eks-nodegroup-rhel-eks-nodeadmn-new"
2024-07-18 09:09:12 [ℹ] waiting for CloudFormation stack "eksctl-rhel-eks-nodegroup-rhel-eks-nodeadmn-new"
2024-07-18 09:10:27 [ℹ] waiting for CloudFormation stack "eksctl-rhel-eks-nodegroup-rhel-eks-nodeadmn-new"
2024-07-18 09:12:07 [ℹ] waiting for CloudFormation stack "eksctl-rhel-eks-nodegroup-rhel-eks-nodeadmn-new"
2024-07-18 09:13:38 [ℹ] waiting for CloudFormation stack "eksctl-rhel-eks-nodegroup-rhel-eks-nodeadmn-new"
2024-07-18 09:14:53 [ℹ] waiting for CloudFormation stack "eksctl-rhel-eks-nodegroup-rhel-eks-nodeadmn-new"
2024-07-18 09:14:53 [ℹ] 1 error(s) occurred and nodegroups haven't been created properly, you may wish to check CloudFormation console
2024-07-18 09:14:53 [ℹ] to cleanup resources, run 'eksctl delete nodegroup --region=us-gov-east-1 --cluster=rhel-eks --name=' for each of the failed nodegroup
2024-07-18 09:14:53 [✖] waiter state transitioned to Failure
Error: failed to create nodegroups for cluster "rhel-eks"

Anything else we need to know?

OS: macOS
Authentication: SSO through AWS CLI and Okta

Versions

$ eksctl info
0.187.0

github-actions[bot] commented 3 months ago

Hello bradwatsonaws :wave: Thank you for opening an issue in eksctl project. The team will review the issue and aim to respond within 1-5 business days. Meanwhile, please read about the Contribution and Code of Conduct guidelines here. You can find out more information about eksctl on our website

bradwatsonaws commented 3 months ago

By taking the CloudFormation template that eksctl generates and deploying it directly, so the user data is not wrapped in a uniquely generated BOUNDARY, I was able to deploy a nodegroup successfully. All unique values below were scrubbed.


AWSTemplateFormatVersion: '2010-09-09'
Description: 'EKS Managed Nodes (SSH access: false)'
Mappings:
  ServicePrincipalPartitionMap:
    aws:
      EC2: ec2.amazonaws.com
      EKS: eks.amazonaws.com
      EKSFargatePods: eks-fargate-pods.amazonaws.com
    aws-cn:
      EC2: ec2.amazonaws.com.cn
      EKS: eks.amazonaws.com
      EKSFargatePods: eks-fargate-pods.amazonaws.com
    aws-iso:
      EC2: ec2.c2s.ic.gov
      EKS: eks.amazonaws.com
      EKSFargatePods: eks-fargate-pods.amazonaws.com
    aws-iso-b:
      EC2: ec2.sc2s.sgov.gov
      EKS: eks.amazonaws.com
      EKSFargatePods: eks-fargate-pods.amazonaws.com
    aws-us-gov:
      EC2: ec2.amazonaws.com
      EKS: eks.amazonaws.com
      EKSFargatePods: eks-fargate-pods.amazonaws.com
Resources:
  LaunchTemplate:
    Type: AWS::EC2::LaunchTemplate
    Properties:
      LaunchTemplateData:
        BlockDeviceMappings:
          - DeviceName: /dev/sda1
            Ebs:
              Encrypted: false
              Iops: 3000
              Throughput: 125
              VolumeSize: 80
              VolumeType: gp3
        ImageId: ami-0b2e91234574a54c0
        MetadataOptions:
          HttpPutResponseHopLimit: 2
          HttpTokens: required
        SecurityGroupIds:
          - !ImportValue 'eksctl-rhel-eks-cluster::ClusterSecurityGroupId'
        TagSpecifications:
          - ResourceType: instance
            Tags:
              - Key: Name
                Value: rhel-eks-rhel-eks-cfn-Node
              - Key: alpha.eksctl.io/nodegroup-type
                Value: managed
              - Key: nodegroup-name
                Value: rhel-eks-cfn
              - Key: alpha.eksctl.io/nodegroup-name
                Value: rhel-eks-cfn
          - ResourceType: volume
            Tags:
              - Key: Name
                Value: rhel-eks-rhel-eks-cfn-Node
              - Key: alpha.eksctl.io/nodegroup-type
                Value: managed
              - Key: nodegroup-name
                Value: rhel-eks-cfn
              - Key: alpha.eksctl.io/nodegroup-name
                Value: rhel-eks-cfn
          - ResourceType: network-interface
            Tags:
              - Key: Name
                Value: rhel-eks-rhel-eks-cfn-Node
              - Key: alpha.eksctl.io/nodegroup-type
                Value: managed
              - Key: nodegroup-name
                Value: rhel-eks-cfn
              - Key: alpha.eksctl.io/nodegroup-name
                Value: rhel-eks-cfn
        UserData:
          Fn::Base64: !Sub |
            MIME-Version: 1.0
            Content-Type: multipart/mixed; boundary="BOUNDARY"

            --BOUNDARY
            Content-Type: application/node.eks.aws

            ---
            apiVersion: node.eks.aws/v1alpha1
            kind: NodeConfig
            spec:
              cluster:
                name: rhel-eks
                apiServerEndpoint: https://5B3FABCDE05F2D983E65079309B80C06.gr7.us-gov-east-1.eks.amazonaws.com
                certificateAuthority: LS0tLS1CRULS0tLS0K
                cidr: 10.100.0.0/16

            --BOUNDARY
            Content-Type: text/x-shellscript;

            #!/bin/bash
            set -ex
            systemctl enable kubelet.service
            systemctl disable nm-cloud-setup.timer
            systemctl disable nm-cloud-setup.service
            reboot

            --BOUNDARY--
      LaunchTemplateName: !Sub '${AWS::StackName}'
  ManagedNodeGroup:
    Type: AWS::EKS::Nodegroup
    Properties:
      ClusterName: rhel-eks
      InstanceTypes:
        - t3.medium
      Labels:
        alpha.eksctl.io/cluster-name: rhel-eks
        alpha.eksctl.io/nodegroup-name: rhel-eks-cfn
        role: worker
      LaunchTemplate:
        Id: !Ref 'LaunchTemplate'
      NodeRole: !GetAtt 'NodeInstanceRole.Arn'
      NodegroupName: rhel-eks-cfn
      ScalingConfig:
        DesiredSize: 2
        MaxSize: 2
        MinSize: 2
      Subnets:
        - subnet-0f034415c5b1237f0
        - subnet-0bdba07340be1232f
        - subnet-05c651fa62a123b2c
      Tags:
        alpha.eksctl.io/nodegroup-name: rhel-eks-cfn
        alpha.eksctl.io/nodegroup-type: managed
        nodegroup-name: rhel-eks-cfn
  NodeInstanceRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Statement:
          - Action:
              - sts:AssumeRole
            Effect: Allow
            Principal:
              Service:
                - !FindInMap
                  - ServicePrincipalPartitionMap
                  - !Ref 'AWS::Partition'
                  - EC2
        Version: '2012-10-17'
      ManagedPolicyArns:
        - !Sub 'arn:${AWS::Partition}:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly'
        - !Sub 'arn:${AWS::Partition}:iam::aws:policy/AmazonEKSWorkerNodePolicy'
        - !Sub 'arn:${AWS::Partition}:iam::aws:policy/AmazonEKS_CNI_Policy'
        - !Sub 'arn:${AWS::Partition}:iam::aws:policy/AmazonSSMManagedInstanceCore'
      Path: /
      Tags:
        - Key: Name
          Value: !Sub '${AWS::StackName}/NodeInstanceRole'
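
Assuming the template above is saved locally (the file and stack names below are placeholders), it can be deployed directly with the AWS CLI, bypassing eksctl's user data wrapping:

# Deploy the scrubbed template directly; CAPABILITY_IAM is required because
# the template creates the NodeInstanceRole IAM role.
aws cloudformation deploy \
  --region us-gov-east-1 \
  --stack-name rhel-eks-nodegroup-cfn \
  --template-file nodegroup.yaml \
  --capabilities CAPABILITY_IAM
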
github-actions[bot] commented 2 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] commented 2 months ago

This issue was closed because it has been stalled for 5 days with no activity.