aws-samples / cdk-eks-karpenter

CDK construct for installing and configuring Karpenter on EKS clusters

Stack removal fails #166

Open lucavb opened 4 months ago

lucavb commented 4 months ago

Hey,

we have been using cdk-eks-karpenter for a while now and have been running into problems when removing stacks in which Karpenter was installed through this package. CloudFormation triggers the delete on the CustomResource that applied the YAML manifests to the cluster, and that delete fails or times out. By that point the EKS console shows that all nodes have already been removed and the cluster only exists on paper (I can no longer connect to it with kubectl). After about an hour the CustomResource times out and the CloudFormation delete fails.

We have put together the following minimal example in which the error still occurs: we do nothing more than create a cluster in our pre-existing VPC and then install Karpenter using this package.

import { CONFIG } from '@/src/config';
import { vpcName } from '@/src/utils';
import { KubectlV28Layer } from '@aws-cdk/lambda-layer-kubectl-v28';
import { Stack, StackProps } from 'aws-cdk-lib';
import { InstanceClass, InstanceSize, InstanceType, IVpc, Vpc } from 'aws-cdk-lib/aws-ec2';
import { Cluster, KubernetesVersion } from 'aws-cdk-lib/aws-eks';
import { ManagedPolicy } from 'aws-cdk-lib/aws-iam';
import { Karpenter } from 'cdk-eks-karpenter';
import { Construct } from 'constructs';

export class NodeAutoscaling extends Construct {
    constructor(
        scope: Construct,
        id: string,
        {
            cluster,
            subnetIds,
        }: {
            cluster: Cluster;
            subnetIds: string[];
        },
    ) {
        super(scope, id);

        const karpenter = new Karpenter(this, 'Karpenter', {
            cluster,
            namespace: 'karpenter',
            version: 'v0.34.1',
        });

        const nodeClass = karpenter.addEC2NodeClass('nodeclass', {
            amiFamily: 'AL2',
            subnetSelectorTerms: subnetIds.map((subnetId) => ({ id: subnetId })),
            securityGroupSelectorTerms: [
                {
                    tags: {
                        'aws:eks:cluster-name': cluster.clusterName,
                    },
                },
            ],
            role: karpenter.nodeRole.roleName,
        });

        karpenter.addNodePool('nodepool', {
            template: {
                spec: {
                    nodeClassRef: {
                        apiVersion: 'karpenter.k8s.aws/v1beta1',
                        kind: 'EC2NodeClass',
                        name: nodeClass.name,
                    },
                    requirements: [
                        {
                            key: 'karpenter.sh/capacity-type',
                            operator: 'In',
                            values: ['on-demand'],
                        },
                        {
                            key: 'karpenter.k8s.aws/instance-category',
                            operator: 'In',
                            values: ['m'],
                        },
                        {
                            key: 'karpenter.k8s.aws/instance-generation',
                            operator: 'In',
                            values: ['5', '6', '7'],
                        },
                        {
                            key: 'kubernetes.io/arch',
                            operator: 'In',
                            values: ['amd64'],
                        },
                    ],
                },
            },
        });

        karpenter.addManagedPolicyToKarpenterRole(
            ManagedPolicy.fromAwsManagedPolicyName('AmazonSSMManagedInstanceCore'),
        );
    }
}

export class EksCluster extends Construct {
    public readonly cluster: Cluster;

    constructor(
        scope: Construct,
        id: string,
        {
            environment,
            instanceName,
            vpc,
        }: {
            environment: string;
            instanceName: string;
            vpc: IVpc;
        },
    ) {
        super(scope, id);

        const kubectlLayer = new KubectlV28Layer(this, 'KubectlLayer');

        this.cluster = new Cluster(this, 'Cluster', {
            clusterName: `eks-example-${instanceName}-${environment}`,
            defaultCapacity: 3,
            defaultCapacityInstance: InstanceType.of(InstanceClass.M5, InstanceSize.LARGE),
            kubectlLayer,
            outputConfigCommand: true,
            outputMastersRoleArn: true,
            version: KubernetesVersion.V1_28,
            vpc,
        });

        new NodeAutoscaling(this, 'NodeAutoscaling', {
            cluster: this.cluster,
            subnetIds: vpc.privateSubnets.map(({ subnetId }) => subnetId), // the landing zone creates the subnets in the following pattern <vpcId>-<private|public>-<AZ>
        });
    }
}

export class MinBrokenEks extends Stack {
    constructor(scope: Construct, id: string, props: StackProps) {
        super(scope, id, props);

        const vpc = Vpc.fromLookup(this, 'Vpc', { vpcName: vpcName(CONFIG.environment) });

        this.configureClusterAndRoles({ vpc });
    }

    private configureClusterAndRoles({ vpc }: { vpc: IVpc }) {
        const cluster = new EksCluster(this, 'EksCluster', {
            environment: CONFIG.environment,
            instanceName: CONFIG.instanceName,
            vpc,
        });

        return cluster;
    }
}
andskli commented 3 months ago

Hi @lucavb, thanks for reporting this! Could you clarify which resource fails to delete here? Is it the Helm resource that installs Karpenter?

lucavb commented 3 months ago

Hey @andskli, it seems to be the CustomResource that installs either the EC2NodeClass or the NodePool. As I said, the cluster is basically without nodes at that point, and the custom resource that should remove those two resources simply times out. Does that help?

Edit: I have recreated my example in our account; here are the failing resources.

The resource that could not be removed:

Screenshot 2024-03-19 at 20 51 10

And the lambda that times out:

Screenshot 2024-03-19 at 20 53 48
ltamrazov commented 3 months ago

I'm not sure if this is related, but we also just ran into an issue deleting the stack. In our case it failed while trying to delete the NodeRole and a NodeClass. In CloudFormation, the event error message points to the instance profile:

Karpenter Node Role:

Resource handler returned message: "Cannot delete entity, must remove roles from instance profile first.

A node class that we provisioned using karpenter.addNodeClass:

CloudFormation did not receive a response from your Custom Resource. Please check your logs for requestId [0a56d113-0455-4c30-bca2-9b64cb2be7fa]. If you are using the Python cfn-response module, you may need to update your Lambda function code so that CloudFormation can attach the updated version.

All we did in this case was add Karpenter to an existing stack, provision a node class to test, and then try to tear it down.

Screenshot 2024-03-22 at 12 33 09 PM Screenshot 2024-03-22 at 12 32 57 PM
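
For anyone hitting the same IAM error: it means the node role is still referenced by at least one instance profile. Newer Karpenter releases can create instance profiles for the node role at runtime, outside of CloudFormation, so the role delete then fails even though the stack itself never created those profiles. A possible out-of-band cleanup, sketched below with the AWS SDK for JavaScript v3 (the function name and the example role name are placeholders, not part of this package), is to detach the role from any remaining instance profiles and then retry the stack delete:

import {
    IAMClient,
    ListInstanceProfilesForRoleCommand,
    RemoveRoleFromInstanceProfileCommand,
} from '@aws-sdk/client-iam';

const iam = new IAMClient({});

// Detach the given role from every instance profile that still references it,
// which is what the "must remove roles from instance profile first" error asks for.
export async function detachRoleFromInstanceProfiles(roleName: string): Promise<void> {
    const { InstanceProfiles = [] } = await iam.send(
        new ListInstanceProfilesForRoleCommand({ RoleName: roleName }),
    );

    for (const profile of InstanceProfiles) {
        await iam.send(
            new RemoveRoleFromInstanceProfileCommand({
                InstanceProfileName: profile.InstanceProfileName!,
                RoleName: roleName,
            }),
        );
    }
}

// Example call with a placeholder role name:
// detachRoleFromInstanceProfiles('KarpenterNodeRole-example').catch(console.error);

Once the role is no longer attached to any instance profile, retrying the delete should let CloudFormation remove the NodeRole.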

lucavb commented 2 weeks ago

@andskli is there any update on this?

andskli commented 1 week ago

I have not had much time to look at this yet. I did a quick check-in and am able to reproduce it using your example, so thanks for that, @lucavb.

Leaving the following as a note to my future self, or to anyone willing to pick this issue up in the next few weeks, as I won't be able to (summer holiday):

What seems to happen is that the EC2NodeClass does not get deleted because of the finalizer applied to the resource. I am not sure exactly how to solve this; perhaps we can make use of dependencies between the NodePool and the EC2NodeClass in a clever way, or perhaps we can work on getting a force-delete / remove-finalizer option into the upstream CDK resource that addEC2NodeClass() and addNodePool() use under the hood.
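
To make the dependency idea concrete, here is a minimal sketch using upstream CDK primitives instead of the construct's helpers (the function name is invented and the spec contents are placeholders based on the example above). If the NodePool manifest depends on the EC2NodeClass manifest, CloudFormation creates it afterwards and deletes it first, which should give Karpenter a chance to clear the EC2NodeClass finalizer once the NodePool and its nodes are gone:

import { Cluster } from 'aws-cdk-lib/aws-eks';

// Invented helper: applies both Karpenter resources as plain manifests so the
// deletion order between them is explicit.
export function addKarpenterResourcesWithOrdering(cluster: Cluster): void {
    const nodeClassManifest = cluster.addManifest('EC2NodeClass', {
        apiVersion: 'karpenter.k8s.aws/v1beta1',
        kind: 'EC2NodeClass',
        metadata: { name: 'nodeclass' },
        spec: { /* amiFamily, subnetSelectorTerms, securityGroupSelectorTerms, role, ... */ },
    });

    const nodePoolManifest = cluster.addManifest('NodePool', {
        apiVersion: 'karpenter.sh/v1beta1',
        kind: 'NodePool',
        metadata: { name: 'nodepool' },
        spec: { /* template.spec.nodeClassRef pointing at 'nodeclass', requirements, ... */ },
    });

    // CloudFormation deletes dependents first: the NodePool is removed before the
    // EC2NodeClass, so Karpenter can drop the finalizer on the EC2NodeClass in time.
    nodePoolManifest.node.addDependency(nodeClassManifest);
}

Whether this is enough on its own is an open question, since the Helm release running the Karpenter controller also has to outlive both manifests during deletion, but it would at least take the NodePool/EC2NodeClass ordering out of the equation.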