aws / aws-cdk

The AWS Cloud Development Kit is a framework for defining cloud infrastructure in code
https://aws.amazon.com/cdk
Apache License 2.0

aws-eks: cdk should validate cluster version and kubectl layer version #24580

Open trondhindenes opened 1 year ago

trondhindenes commented 1 year ago

Describe the bug

Ever since we upgraded from Kubernetes 1.21 to newer versions, we're getting lots of weird errors related to what I believe are kubectl layer incompatibilities, like

3:40:15 PM | UPDATE_FAILED        | Custom::AWSCDK-EKS-KubernetesResource | clusterAwsAuthmanifestB57F2A94
Received response status [FAILED] from custom resource. Message returned: Error: b'configmap/aws-auth configured\nerror: error retrieving RESTMappings to prune: invalid resource extensions/v1bet
a1, Kind=Ingress, Namespaced=true: no matches for kind "Ingress" in version "extensions/v1beta1"\n'

It would be much better if cdk validated the kubectl layer version against the intended Kubernetes version at synth time, so that these issues didn't occur.

Expected Behavior

cdk should error out, informing me that the selected cluster version doesn't match the configured layer

Current Behavior

No validation occurs, which leads to lots of errors when trying to change the cluster later

Reproduction Steps

  • create cluster version 1.23
  • make a change, such as add a node group
  • witness the layer error described above

Possible Solution

No response

Additional Information/Context

No response

CDK CLI Version

2.67.0

Framework Version

2.66.1

Node.js Version

v18.14.2

OS

Ubuntu

Language

Python

Language Version

3.9

Other information

No response

trondhindenes commented 1 year ago

From another issue, it looks like the library in some cases prints a warning:

You created a cluster with Kubernetes Version 1.23 without specifying the kubectlLayer property

But I've never seen that warning. Was it removed in a newer version maybe? IMHO it needs to be easy to build rock-solid clusters with cdk.

pahud commented 1 year ago

According to the document:

The version of kubectl used must be compatible with the Kubernetes version of the cluster. kubectl is supported within one minor version (older or newer) of Kubernetes (see Kubernetes version skew policy). Only version 1.20 of kubectl is available in aws-cdk-lib. If you need a different version, you will need to use one of the @aws-cdk/lambda-layer-kubectl-vXY packages.

But I agree with you we probably should implement a check to avoid potential error like that.

I am making this a p2 feature request and any PR would be appreciated!
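As a sketch of what keeping the two versions in lockstep looks like in practice (TypeScript; the construct IDs here are illustrative, and the surrounding stack is assumed):

```typescript
import { Stack } from 'aws-cdk-lib';
import * as eks from 'aws-cdk-lib/aws-eks';
import { KubectlV25Layer } from '@aws-cdk/lambda-layer-kubectl-v25';

// Inside a Stack: the kubectl layer's minor version should track
// the cluster's Kubernetes version (both are 1.25 here).
declare const stack: Stack;
const cluster = new eks.Cluster(stack, 'Cluster', {
  version: eks.KubernetesVersion.V1_25,
  kubectlLayer: new KubectlV25Layer(stack, 'KubectlLayer'),
});
```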

ShankarDhandapani commented 1 year ago

@pahud Regarding your reply above:

When I try to use the @aws-cdk/lambda-layer-kubectl-v25 package with @aws-quickstart/eks-blueprints in GenericClusterProvider, passing it via the kubectlLayer property, I get an error: Type 'typeof KubectlV25Layer' is missing the following properties from type 'ILayerVersion': layerVersionArn, addPermission, stack, env, and 2 more. Below is the code:

...
import { KubectlV25Layer } from "@aws-cdk/lambda-layer-kubectl-v25";
....
.....
.....
const clusterProvider = new EksBlueprint.GenericClusterProvider({
      version: this.props.version,
      kubectlLayer: KubectlV25Layer,
      vpcSubnets: [{ subnetType: ec2.SubnetType.PRIVATE_WITH_EGRESS }],
      managedNodeGroups: [
        {
          id: `${id}-nodegroup`,
          minSize: 1,
          maxSize: 2,
          instanceTypes: config.InstanceTypes.map(
            (instance_type) => new ec2.InstanceType(instance_type)
          ),
        },
      ],
    });
.....
.....

CC: @menakakarichiyappakumar

baizele commented 1 year ago

@ShankarDhandapani looks like you need to instantiate it like:

const kubectl = new KubectlV25Layer(this, 'KubectlLayer');

jesseadams commented 1 year ago

I am currently struggling with the same issue.

Kilowhisky commented 1 year ago

This solution does not seem to apply to v2 of the AWS CDK.

pahud commented 1 year ago

We probably can add the validation here

https://github.com/aws/aws-cdk/blob/cc4ce12803b756b174445f8493d4239c57f78f97/packages/aws-cdk-lib/aws-eks/lib/cluster.ts#L1473-L1475

I guess the challenge is that lambda.ILayerVersion does not expose the kubectl version it bundles, so there is nothing to compare against.
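If the layer did carry its kubectl version (today lambda.ILayerVersion does not, so the inputs below are purely hypothetical), the synth-time check could be a plain minor-version skew comparison:

```typescript
// Hypothetical synth-time validation following the Kubernetes version
// skew policy: kubectl may be at most one minor version away from the
// cluster. Version strings look like "1.25".
function parseMinor(version: string): number {
  const match = /^(\d+)\.(\d+)/.exec(version);
  if (!match) {
    throw new Error(`Unparseable Kubernetes version: ${version}`);
  }
  return Number(match[2]);
}

function validateKubectlSkew(clusterVersion: string, kubectlVersion: string): void {
  const skew = Math.abs(parseMinor(clusterVersion) - parseMinor(kubectlVersion));
  if (skew > 1) {
    throw new Error(
      `kubectl ${kubectlVersion} is ${skew} minor versions away from cluster ` +
      `${clusterVersion}; use the matching @aws-cdk/lambda-layer-kubectl-vXY package.`
    );
  }
}
```

With a check like this, synthesizing a 1.29 cluster against the default 1.20 layer would fail fast instead of surfacing as a prune error at deploy time.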

ravi-vk8679 commented 1 year ago


Thanks for starting this thread. I was running into the same issue, but I was able to fix it following the suggestions posted here.

I am using CDK v2 and I see that my kubectl version is at its latest, but I don't know why cdk is not validating the kubectl version. Is anyone working on fixing this? Any idea when this issue will be fixed so that it picks the right kubectlLayer based on the Kubernetes version provided?

I imported the KubectlLambdaLayer package from here.

import { KubectlV26Layer } from '@aws-cdk/lambda-layer-kubectl-v26';

kubectlLayer: new KubectlV26Layer(this, 'KubectlLayer'),

graydenshand commented 5 months ago

I've seen this error several times while attempting to update resources created with cluster.add_manifest().

It appears CloudFormation is attempting to use a mismatched API version from what is actually deployed, e.g. batch/v1beta1 rather than batch/v1.

Full error response

Received response status [FAILED] from custom resource. Message returned: Error: b'serviceaccount/user created\nerror: error retrieving RESTMappings to prune: invalid resource batch/v1beta1, Kind=CronJob, Namespaced=true: no matches for kind "CronJob" in version "batch/v1beta1"\n' Logs: /aws/lambda/Application-awscdka-Handler886CB40B-q8TSqd5FvHp8 at invokeUserFunction (/var/task/framework.js:2:6) at process.processTicksAndRejections (node:internal/process/task_queues:95:5) at async onEvent (/var/task/framework.js:1:369) at async Runtime.handler (/var/task/cfn-response.js:1:1573) (RequestId: 1ffd3898-6f7f-49a7-b97d-83518c0dc5fe)

When it occurs, it leaves the stack in an UPDATE_ROLLBACK_FAILED state and there is no way to stabilize the stack again. I've had to destroy and recreate my entire cluster every time.

Running Kubernetes 1.29.

benjamin-at-greensky commented 4 months ago

I've deployed a 1.29 EKS cluster via cdk, specifying kubectlLayer as KubectlV29Layer() when creating the cluster, and I am having the same issue as @graydenshand: the only way to get changes applied is to destroy and deploy again. This blocks just about any management of the cluster.

From the lambda kubectl layer logs:

[ERROR] Exception: b'service/serviceXYZ configured\nerror: error retrieving RESTMappings to prune: invalid resource batch/v1beta1, Kind=CronJob, Namespaced=true: no matches for kind "CronJob" in version "batch/v1beta1"\n' Traceback (most recent call last): File "/var/task/index.py", line 14, in handler return apply_handler(event, context) File "/var/task/apply/__init__.py", line 69, in apply_handler kubectl('apply', manifest_file, *kubectl_opts) File "/var/task/apply/__init__.py", line 91, in kubectl raise Exception(output)

kriscoleman commented 3 months ago

We are experiencing the same problem

To make matters worse for us, it appears that KubectlV29 was never released in the Go CDK lib from cdklabs/awscdk-kubectl-go, leaving us with few options to resolve this gracefully.

https://github.com/cdklabs/awscdk-kubectl-go/commits/kubectl.29

pahud commented 3 months ago

@graydenshand @benjamin-at-greensky

Are you able to reproduce this issue for us? For example, after initially creating a 1.29 cluster with the kubectl v29 layer, what change triggers this error?

pahud commented 3 months ago

@kriscoleman Can you create a new issue and provide your CDK in Go code snippet in the issue description?

benjamin-at-greensky commented 3 months ago

@pahud I have been able to reproduce this by deploying a fresh EKS cluster with kubectlLayer set to v29 and then redeploying a helm chart with updated values.

import { KubectlV29Layer } from '@aws-cdk/lambda-layer-kubectl-v29';

const clusterProps: GsEksClusterProps = {
...
    kubectlLayer: new KubectlV29Layer(this, 'KubectlLayer'),
...
}

this.cluster = new eks.Cluster(this, 'EksCluster', {
...
    kubectlLayer: clusterProps.kubectlLayer
...
});

After this, I make an update to the cdk code that deploys a helm chart (for example, I was redeploying one with some annotations on an ingress). I then receive this error when running cdk deploy:

10:06:27 AM | UPDATE_FAILED | Custom::AWSCDK-EKS-KubernetesResource | Clustermanifestrep...63A40109 Received response status [FAILED] from custom resource. Message returned: Error: b'configmap/start-override configured\nerror: error retrieving RESTMappings to prune: invalid resource bat ch/v1beta1, Kind=CronJob, Namespaced=true: no matches for kind "CronJob" in version "batch/v1beta1"\n'

I have no CronJobs deployed to the cluster:

$ kubectl get cronjob -A
No resources found

$ kubectl api-resources | grep cronjob
cronjobs                          cj           batch/v1                          true         CronJob

It is worth mentioning the helm chart I'm deploying has no references to batch/v1beta1 anywhere.

dilshanonline commented 3 months ago

I had the same issue, and defining

from aws_cdk.lambda_layer_kubectl_v28 import KubectlV28Layer

    cluster = eks.Cluster(
        self,
        'EksCluster',
        version=eks.KubernetesVersion.V1_28,
        kubectl_layer=KubectlV28Layer(self, "KubectlLayer"),
    )

solved my issue.

tchcxp commented 1 month ago

I am using @aws-cdk/lambda-layer-kubectl-v30 and KubernetesVersion.V1_30, and I got the same issue as @graydenshand and @benjamin-at-greensky mentioned above when updating resources. The only workaround is to delete and re-create the application and related resources, which is not viable in a production environment.

4:02:20 PM | UPDATE_FAILED        | Custom::AWSCDK-EKS-KubernetesResource    | ImportedClusterman...aDployment5DA7DFEB
Received response status [FAILED] from custom resource. Message returned: Error: b'deployment.apps/********** configured\nerror: error retrieving RESTMappings to prune: invalid resource batch/v1beta1, Kind=CronJob, Namespaced=true: no matches for kind "CronJob" in version "batch/v1beta1"\n'

Can someone please look into this issue? It's been a while, and it is effectively blocking us from using EKS at the moment.

tchcxp commented 3 weeks ago

I had the same issue and defining,

from aws_cdk.lambda_layer_kubectl_v28 import KubectlV28Layer

    cluster = eks.Cluster(
        self,
        'EksCluster',
        version=eks.KubernetesVersion.V1_28,
        kubectl_layer=KubectlV28Layer(self, "KubectlLayer"),
    )

solved my issue.

I tried to create a new cluster in version 1.28 and use KubectlV28Layer, but still got the same error.

kkandori commented 2 weeks ago

@tchcxp The issue occurs due to the kubectlLayer, specifically the kubectl version in the handler lambda. It seems the cluster is imported, which leads to the following error:

UPDATE_FAILED | Custom::AWSCDK-EKS-KubernetesResource | ImportedClusterman...aDployment5DA7DFEB
Received response status [FAILED] from custom resource. Message returned: Error: b'deployment.apps/********** configured\nerror: error retrieving RESTMappings to prune: invalid resource batch/v1beta1, Kind=CronJob, Namespaced=true: no matches for kind "CronJob" in version "batch/v1beta1"\n'

If you don't specify the layer, it will default to kubectl 1.20.

To resolve this, you need to set the kubectl layer again:

eks.Cluster.fromClusterAttributes(this, 'ImportedCluster', {
    clusterName: clusterName, 
    kubectlRoleArn: kubectlRoleArn,
    blah: blah,
    kubectlLayer: new KubectlV28Layer(this, 'kubectl-v28-layer'),  // <-- set the layer explicitly on the imported cluster
});

This should address the issue.