aws-samples / cdk-eks-karpenter

CDK construct for installing and configuring Karpenter on EKS clusters
Apache License 2.0
34 stars 14 forks source link

Karpenter Creates a new role, instead of using custom EC2 node class role #157

Closed Anathema-Odious closed 3 months ago

Anathema-Odious commented 5 months ago

We are using this library with version 1.0.3, and we are facing an issue, where karpenter fails to scale up. On looking at the error messages, this is the error that we got: {"level":"ERROR","time":"2024-01-12T11:35:33.053Z","logger":"controller","message":"Reconciler error","commit":"a70b39e","controller":"nodeclass","controllerGroup":"karpenter.k8s.aws","controllerKind":"EC2NodeClass","EC2NodeClass":{"name":"test-node-class-spec"},"namespace":"","name":"test-node-class-spec","reconcileID":"3d054015-5f0a-4ec2-8f1c-51359aaba044","error":"creating instance profile, adding role \"xxxx-xxxx-test\" to instance profile \"xxxxxx-test_10180655167987829993\", AccessDenied: User: arn:aws:sts::xxxxxxxxxxx:assumed-role/xxxxxxxxxxxxxxxxxxxxxxxxxxx-VURLFdcMJOnq/1705059316694610615 is not authorized to perform: iam:PassRole on resource: arn:aws:iam::xxxxxxxxxxx:role/xxxx-xxxx-test because no identity-based policy allows the iam:PassRole action\n\tstatus code: 403, request id: 6b879c29-3f3c-4084-9fb2-1177b0b1c751"}

xxxx-xxxx-test is the custom that we provide in role field for test-node-class-spec .

On deeper inspection of the policy, a new node role is getting created which has the permission for iam:PassRole, instead of using the custom role.

Below is the construct that we use in our cdk(python) block, for our node class definition

test_node_class_spec = {
    "subnetSelectorTerms": [
        {"id": subnet.subnet_id} for subnet in vpc.private_subnets
    ],
    "securityGroupSelectorTerms": [
        {"tags": {"aws:eks:cluster-name": eks_cluster.cluster_name}}
    ],
    "amiFamily": "Custom",
    "amiSelectorTerms": [{"id": worker_nodes.get("general_nodes").get("ami_id")}],
    "blockDeviceMappings": [
        {
            "deviceName": "/dev/sda1",  # '/dev/xvda' - for amazon linux
            "ebs": {
                "volumeSize": "1000Gi",
                "volumeType": "gp3",
                "deleteOnTermination": True,
                "encrypted": True,
                "kmsKeyID": key.key_id,
            },
        }
    ],
    "userData": user_data,
    "metadataOptions": {
        "httpEndpoint": "enabled",
        "httpProtocolIPv6": "disabled",
        "httpPutResponseHopLimit": 2,
        "httpTokens": "required",
    },  # optional, configures IMDS for the instance
    "role": role.role_name, # This will resolve to custom role, created by our cdk deployment
}

Expected behavior: the library should re-use the provided custom role, instead of defining a new node role.

This was previously resolved, but has crept up again in the version change, probably when the support for Karpenter Beta release was added.

andskli commented 5 months ago

Hi @Anathema-Odious! Thanks for filing this issue. Could you share your more (redact info as needed) of the CDK details surrounding the node class spec? For example, where does role.role_name come from and resolve to, and I would also be interested in understanding what the created manifests k8s looks like to better understand the issue so if you can share that it would be helpful.

badaldavda8 commented 5 months ago

Hello @andskli The role.role_name is a custom worker role which we need to access different other services with a custom name as per our organization naming standards.

We have observed that even after adding it in the NodeClass, it creates a new role even though I have passed the role in EC2NodeClass. I did not add the role in nodeRole as before in karpenter definition since it created InstanceProfile which conflicts with EC2NodeClass Role which also creates Instance Profile and therefore doesn't work.

https://github.com/aws-samples/cdk-eks-karpenter/blob/main/src/index.ts#L628-L630

We will need to remove either one of the two implementations of InstanceProfile. In case EC2NodeClass already has Role, we will need to see how that can be managed.

Anathema-Odious commented 5 months ago

created manifests k8s looks

Hi @andskli, I am not sure how to get the generated k8s manifest. But I think the issue that @badaldavda8 pointed out above more or less summarizes this. I will share the manifests as soon as I get my hands on them.

andskli commented 5 months ago

The addEC2NodeClass method just takes Record<string, any> and creates the k8s manifest based on your input, and does nothing else. To my understanding, when you then pass spec.role Karpenter will try to create an InstanceProfile for this role to attach to the owned nodes, see Karpenter docs.

I think the workaround here would be to create your custom Role and corresponding InstanceProfile outside of the construct, and specify spec.instanceProfile instead of spec.role in the NodeClass.

This should have the construct sort out the aws-auth ConfigMap with your custom Role, and add the IAM policy statement (i.e. PassRole) from L630 to that Role.

badaldavda8 commented 5 months ago

Sure, we will try this out.

But I am still thinking if it will modify the IAM Policy statement since we will not be passing any nodeRole at all. It will create the nodeRole and then add it in the PassRole since we are not adding nodeRole in the Props.

andskli commented 5 months ago

You are right and I should've mentioned in my previous reply in the workaround section that you should also pass that role as the nodeRole

Anathema-Odious commented 5 months ago

@andskli, following your above recommendation to pass in the same nodeRole, while instantiating the Karpenter object, as the one used in the spec.role for the node_class_spec, worked. I believe it forced the library to use the same node role for using the iam:PassRole action.

Below is a snapshot of the cdk diff output. image

I didn't need to create an instance profile for this to work.

andskli commented 5 months ago

@Anathema-Odious glad to hear it worked out.

I believe this is a documentation issue for now, as the construct allows free-form NodeClass spec's. Given beta graduation of Karpenter API's, it may make sense to expose more of Karpenter's functionality through this constructs APIs in the future, but is yet to be explored.