aws-quickstart / cdk-eks-blueprints

AWS Quick Start Team
Apache License 2.0

karpenter: example doesn't work #1038

Closed NinoSkopac closed 1 week ago

NinoSkopac commented 2 weeks ago

Describe the bug

Example from https://aws-quickstart.github.io/cdk-eks-blueprints/addons/karpenter/#usage doesn't work.

Expected Behavior

Karpenter deploys successfully (it did before the v1beta changes)

Current Behavior

Deployment fails with the errors shown below.

Reproduction Steps

run the code from https://aws-quickstart.github.io/cdk-eks-blueprints/addons/karpenter/#usage

import * as cdk from 'aws-cdk-lib';
import * as blueprints from '@aws-quickstart/eks-blueprints';
import { KubernetesVersion } from 'aws-cdk-lib/aws-eks';

const app = new cdk.App();
const region = 'us-east-1';
const clusterName = 'foobar-k8';

const karpenterAddOn = new blueprints.addons.KarpenterAddOn({
    version: 'v0.33.1',
    nodePoolSpec: {
        labels: {
            type: "karpenter-test"
        },
        annotations: {
            "eks-blueprints/owner": "young"
        },
        taints: [{
            key: "workload",
            value: "test",
            effect: "NoSchedule",
        }],
        requirements: [
            {
                key: 'node.kubernetes.io/instance-type',
                operator: 'In',
                values: [
                    't3.medium', 't4g.medium'
                ]
            },
            { key: 'topology.kubernetes.io/zone', operator: 'In', values: ['us-east-1a','us-east-1b', 'us-east-1c']},
            { key: 'kubernetes.io/arch', operator: 'In', values: ['amd64','arm64']},
            { key: 'karpenter.sh/capacity-type', operator: 'In', values: ['on-demand', 'spot']},
        ],
        disruption: {
            consolidationPolicy: "WhenEmpty",
            consolidateAfter: "30s",
            expireAfter: "20m",
            // offending part:
            budgets: [{nodes: "10%"}]
        },
    },
    ec2NodeClassSpec: {
        amiFamily: "AL2",
        subnetSelectorTerms: [{ tags: { "karpenter.sh/discovery": `${clusterName}` }}],
        securityGroupSelectorTerms: [{ tags: { "karpenter.sh/discovery": `${clusterName}` }}], 
    },
    interruptionHandling: true,
    podIdentity: true, // Recommended; otherwise set to false (the default) to use IRSA
});
const addOns: Array<blueprints.ClusterAddOn> = [
    new blueprints.addons.ArgoCDAddOn(),
    new blueprints.addons.CalicoOperatorAddOn(),
    new blueprints.addons.MetricsServerAddOn(),
    new blueprints.addons.CoreDnsAddOn(),
    new blueprints.addons.AwsLoadBalancerControllerAddOn(),
    new blueprints.addons.VpcCniAddOn(),
    new blueprints.addons.KubeProxyAddOn(),
    karpenterAddOn,
    new blueprints.addons.CloudWatchLogsAddon({
        namespace: 'aws-for-fluent-bit',
        createNamespace: true,
        serviceAccountName: 'aws-fluent-bit-for-cw-sa',
        logGroupPrefix: `/aws/eks/${clusterName}`,
        logRetentionDays: 7
    })
];
// const platformTeam = new TeamPlatform(process.env.AWS_ACCOUNT_ID);
const stack = blueprints.EksBlueprint.builder()
    .account(process.env.AWS_ACCOUNT_ID as string)
    .region(region)
    .addOns(...addOns)
    .useDefaultSecretEncryption(true) // set to false to turn secret encryption off (non-production/demo cases)
    .version(KubernetesVersion.V1_28)
    .build(app, clusterName);

Offending part:

        disruption: {
            consolidationPolicy: "WhenEmpty",
            consolidateAfter: "30s",
            expireAfter: "20m",
            budgets: [{nodes: "10%"}]
        },

Error:

You cannot set disruption budgets for this version of Karpenter. Please upgrade to 0.34.0 or higher. at KarpenterAddOn.versionFeatureChecksForError
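
Per that error, budgets apparently require Karpenter 0.34.0 or higher, so an alternative to dropping them is bumping the add-on version. This is a rough sketch only - the exact chart version string below is an assumption, and the rest of the spec stays as in the example above:

const karpenterAddOn = new blueprints.addons.KarpenterAddOn({
    version: 'v0.34.1', // assumed to be >= 0.34.0 so the budgets version check passes
    nodePoolSpec: {
        // ...labels, annotations, taints, requirements as in the example above...
        disruption: {
            consolidationPolicy: "WhenEmpty",
            consolidateAfter: "30s",
            expireAfter: "20m",
            budgets: [{ nodes: "10%" }], // accepted from 0.34.0 onward per the version check
        },
    },
    // ...ec2NodeClassSpec, interruptionHandling, podIdentity as in the example above...
});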

After commenting budgets out, another error:

        disruption: {
            consolidationPolicy: "WhenEmpty",
            consolidateAfter: "30s",
            expireAfter: "20m",
            // budgets: [{nodes: "10%"}]
        },
Release "karpenter" does not exist. Installing it now.
Error: Unable to continue with install: ServiceAccount "blueprints-addon-karpenter" in namespace "karpenter" exists and cannot be imported into the current release: invalid ownership metadata; label validation error: missing key "app.kubernetes.io/managed-by": must be set to "Helm"; annotation validation error: missing key "meta.helm.sh/release-name": must be set to "karpenter"; annotation validation error: missing key "meta.helm.sh/release-namespace": must be set to "karpenter"

Possible Solution

No response

Additional Information/Context

No response

CDK CLI Version

2.148.0 (build e5740c0)

EKS Blueprints Version

1.15.1

Node.js Version

v22.4.1

Environment details (OS name and version, etc.)

Darwin

Other information

No response

NinoSkopac commented 2 weeks ago

LOL the cause is podIdentity: true

AKA it deploys after I comment it out

What I saw on https://karpenter.sh/docs/getting-started/getting-started-with-karpenter/ made me try it - per those docs, Pod Identity should work.

So it seems like Pod Identity is the problem?

NinoSkopac commented 2 weeks ago

OK guys, the key is to provision EksPodIdentityAgentAddOn first. Then you can ditch IRSA. The example still won't work verbatim - just comment out budgets :D
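
For anyone following along, a minimal sketch of that ordering (the only assumption is that the Pod Identity agent add-on has to be registered before the Karpenter add-on; everything else stays as in the original example):

const addOns: Array<blueprints.ClusterAddOn> = [
    new blueprints.addons.EksPodIdentityAgentAddOn(), // provision the agent first
    karpenterAddOn,                                   // can then use podIdentity: true
    // ...remaining add-ons from the original example...
];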

NinoSkopac commented 2 weeks ago

Well, it can't schedule pods.

{"level":"ERROR","time":"2024-07-10T13:03:07.883Z","logger":"controller","message":"failed listing instance types for default-nodepool","commit":"490ef94","controller":"disruption","error":"no subnets found"}

{"level":"ERROR","time":"2024-07-10T13:07:07.466Z","logger":"controller","message":"could not schedule pod","commit":"490ef94","controller":"provisioner","Pod":{"name":"inflate-65f46bdc4-drb6c","namespace":"default"},"error":"all available instance types exceed limits for nodepool: \"default-nodepool\""}

{"level":"ERROR","time":"2024-07-10T13:07:11.220Z","logger":"controller","message":"skipping, unable to resolve instance types","commit":"490ef94","controller":"provisioner","NodePool":{"name":"default-nodepool"},"error":"no subnets found"}

shapirov103 commented 2 weeks ago

@NinoSkopac we are looking into this issue. Pod identity support was released in 1.15; however, we need to make sure the examples work as expected and pods are schedulable.

NinoSkopac commented 2 weeks ago

Not working with either IRSA or Pod Identity.

NinoSkopac commented 2 weeks ago

Thanks for taking the time to look into it. Hopefully it can be resolved with urgency, because I can only get this working on EKS 1.27, which enters extended support in 2 weeks, and I'm not going to pay $480/mo instead of $80/mo for an EKS cluster.

zjaco13 commented 2 weeks ago

@NinoSkopac, I am attempting to reproduce the issue where you could not get the pods scheduling. Could you please share the steps you took to get from the 'Release "karpenter" does not exist. Installing it now.' error to the pod scheduling issue?

vumdao commented 2 weeks ago

NinoSkopac commented 2 weeks ago

@zjaco13 You can either comment out podIdentity: true or provision EksPodIdentityAgentAddOn before the Karpenter add-on. The former makes Karpenter use IRSA, the latter Pod Identity. According to the devs in this issue, should I use IRSA, since Pod Identity is untested? Thanks for looking into it.

@vumdao I'm creating a cluster from scratch, so I don't think I need installCRDs - that's only for upgrading from a lower version of Karpenter, right? Thanks for pointing me to that obscure chapter of the troubleshooting page. Is that something that could fix my last error, which is that Karpenter can't schedule pods?

vumdao commented 2 weeks ago

@NinoSkopac I thought you upgraded Karpenter.

no subnets found = check whether the subnets in your cluster are properly tagged with karpenter.sh/discovery: ${clusterName}
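
If the subnets are missing that tag, one way to add it from the same CDK app is plain resource tagging. A minimal sketch, assuming you have a handle on the cluster's VPC construct (for example a VPC you create yourself and hand to the blueprint via a resource provider):

import { Tags } from 'aws-cdk-lib';
import { IVpc } from 'aws-cdk-lib/aws-ec2';

// Tag the private subnets so Karpenter's
// subnetSelectorTerms: [{ tags: { "karpenter.sh/discovery": clusterName } }]
// can discover them.
function tagSubnetsForKarpenter(vpc: IVpc, clusterName: string): void {
    for (const subnet of vpc.privateSubnets) {
        Tags.of(subnet).add('karpenter.sh/discovery', clusterName);
    }
}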

zjaco13 commented 2 weeks ago

@NinoSkopac, I was able to successfully schedule the pods and scale the cluster using this karpenter addon spec.

new blueprints.addons.KarpenterAddOn({
        version: 'v0.33.1',
        nodePoolSpec: {
            labels: {
                type: "karpenter-test"
            },
            annotations: {
                "eks-blueprints/owner": "young",
            },
            taints: [{
                key: "workload",
                value: "test",
                effect: "NoSchedule",
            }],
            requirements: [
                { key: 'node.kubernetes.io/instance-type', operator: 'In', values: ['m5.large'] },
                { key: 'topology.kubernetes.io/zone', operator: 'In', values: ['us-west-2a', 'us-west-2b', 'us-west-2c'] },
                { key: 'kubernetes.io/arch', operator: 'In', values: ['amd64', 'arm64'] },
                { key: 'karpenter.sh/capacity-type', operator: 'In', values: ['on-demand'] },
            ],
            disruption: {
                consolidationPolicy: "WhenEmpty",
                consolidateAfter: "30s",
                expireAfter: "20m",
                //budgets: [{ nodes: "10%" }]
            }
        },
        ec2NodeClassSpec: {
            amiFamily: "AL2",
            subnetSelectorTerms: [{ tags: { "Name": `${clusterName}/${clusterName}-vpc/PrivateSubnet*` }}],
            securityGroupSelectorTerms: [{ tags: { "aws:eks:cluster-name": `${clusterName}` }}],
        },
        interruptionHandling: true,
        podIdentity: false, // set to false (the default) to use IRSA
    });

NinoSkopac commented 2 weeks ago

I'll give it a go, thank you.

I believe it's this that made you able to run it:

subnetSelectorTerms: [{ tags: { "Name": `${clusterName}/${clusterName}-vpc/PrivateSubnet*` }}],
securityGroupSelectorTerms: [{ tags: { "aws:eks:cluster-name": `${clusterName}` }}],

I didn't try this exact combo; I tried something very similar.

NinoSkopac commented 1 week ago

Tested and works!

Thank you!

Could you please do the same for EKS v1.30 and Karpenter 0.37.0? It would be great.