aws / aws-cdk

The AWS Cloud Development Kit is a framework for defining cloud infrastructure in code
https://aws.amazon.com/cdk
Apache License 2.0

aws-eks: KubernetesManifest Overwrite option invalid now that ServerSideApply is defaulted #31697

Open diranged opened 1 week ago

diranged commented 1 week ago

Describe the bug

At some point recently, the kubectl CLI appears to have started using server-side apply (--server-side) as the default behavior. The problem is that, per https://github.com/kubernetes/kubernetes/issues/44165, kubectl apply -f ... no longer works the way you'd expect. With server-side apply, it seems that Kubernetes refuses to update a resource that already exists and reports an AlreadyExists error back to the Lambda function.

Regression Issue

Last Known Working CDK Version

unknown

Expected Behavior

I would expect that kubectl apply -f ... just works ... (which is configured by setting overwrite: true on the KubernetesManifest resource)... but instead it's failing.

Current Behavior

Here are the logs from the Lambda function trying to run kubectl apply -f on a resource that happens to already exist in the cluster:

[INFO]  2024-10-05T18:12:44.871Z    bc961513-6915-4749-a32f-a787912469b1    Running command: ['kubectl', 'apply', '--kubeconfig', '/tmp/kubeconfig', '-f', '/tmp/manifest.yaml']
[INFO]  2024-10-05T18:12:44.871Z    bc961513-6915-4749-a32f-a787912469b1    manifest written to: /tmp/manifest.yaml
[INFO]  2024-10-05T18:12:42.741Z    bc961513-6915-4749-a32f-a787912469b1    Running command: ['aws', 'eks', 'update-kubeconfig', '--role-arn', 'arn:aws:iam::...:role/...-3bVbyuZfXXf4', '--name', '....', '--kubeconfig', '/tmp/kubeconfig']
[INFO]  2024-10-05T18:12:42.740Z    bc961513-6915-4749-a32f-a787912469b1    {"RequestType": "Create", "ServiceToken": "arn:aws:lambda:us-west-2:...:function:INFRA...-oNQV5X2TDIjj", "ResponseURL": "...", "StackId": "arn:aws:cloudformation:us-west-2:...:stack/...-ContinuousDeploymentNestedStackContinuousDeploymentNes-8QONVDS1QSK3/4fb3cbe0-8345-11ef-afa7-067d0aea149f", "RequestId": "ec9880d5-f9bf-498d-9780-12134c81f17d", "LogicalResourceId": "ArgoCDSystemPostHelmResources073D6BA8", "ResourceType": "Custom::AWSCDK-EKS-KubernetesResource", "ResourceProperties": {"ServiceToken": "arn:aws:lambda:us-west-2:...:function:...-oNQV5X2TDIjj", "Overwrite": "true", "PruneLabel": "aws.cdk.eks/prune-c8c67f619695e93fb41d90faa4dabab90eb2bca3ea", "ClusterName": "...", "Manifest": "[{\"apiVersion\":\"argoproj.io/v1alpha1\",\"kind\":\"AppProject\",\"metadata\":{\"name\":\"default\",\"namespace\":\"argocd-system\",\"labels\":{\"aws.cdk.eks/prune-c8c67f619695e93fb41d90faa4dabab90eb2bca3ea\":\"\"}},\"spec\":{\"clusterResourceWhitelist\":[{\"group\":\"*\",\"kind\":\"*\"}],\"destinations\":[{\"namespace\":\"*\",\"server\":\"https://kubernetes.default.svc\"}],\"sourceRepos\":[\"*\"]}}]", "RoleArn": "arn:aws:iam::...:role/...-3bVbyuZfXXf4"}}
{
  "RequestType": "Create",
  "ServiceToken": "arn:aws:lambda:us-west-2:...:function:...-oNQV5X2TDIjj",
  "ResponseURL": "...",
  "StackId": "arn:aws:cloudformation:us-west-2:...:stack/...-ContinuousDeploymentNestedStackContinuousDeploymentNes-8QONVDS1QSK3/4fb3cbe0-8345-11ef-afa7-067d0aea149f",
  "RequestId": "ec9880d5-f9bf-498d-9780-12134c81f17d",
  "LogicalResourceId": "ArgoCDSystemPostHelmResources073D6BA8",
  "ResourceType": "Custom::AWSCDK-EKS-KubernetesResource",
  "ResourceProperties": {
    "ServiceToken": "arn:aws:lambda:us-west-2:...:function:...-oNQV5X2TDIjj",
    "Overwrite": "true",
    "PruneLabel": "aws.cdk.eks/prune-c8c67f619695e93fb41d90faa4dabab90eb2bca3ea",
    "ClusterName": "...",
    "Manifest": "[{\"apiVersion\":\"argoproj.io/v1alpha1\",\"kind\":\"AppProject\",\"metadata\":{\"name\":\"default\",\"namespace\":\"argocd-system\",\"labels\":{\"aws.cdk.eks/prune-c8c67f619695e93fb41d90faa4dabab90eb2bca3ea\":\"\"}},\"spec\":{\"clusterResourceWhitelist\":[{\"group\":\"*\",\"kind\":\"*\"}],\"destinations\":[{\"namespace\":\"*\",\"server\":\"https://kubernetes.default.svc\"}],\"sourceRepos\":[\"*\"]}}]",
    "RoleArn": "arn:aws:iam::...:role/...-3bVbyuZfXXf4"
  }
}
[ERROR] Exception: b'Error from server (AlreadyExists): error when creating "/tmp/manifest.yaml": appprojects.argoproj.io "default" already exists\n'
Traceback (most recent call last):
  File "/var/task/index.py", line 14, in handler
    return apply_handler(event, context)
  File "/var/task/apply/__init__.py", line 60, in apply_handler
    kubectl('apply', manifest_file, *kubectl_opts)
  File "/var/task/apply/__init__.py", line 91, in kubectl
    raise Exception(output)

If we go and look at the Kubernetes Audit logs, we can see that ArgoCD first creates this default resource, and then the next Create call fails with a 409 and is a server-side-apply call:

Here's the first create call (made by ArgoCD, and uncontrollable by us)

{
  "kind": "Event",
  "apiVersion": "audit.k8s.io/v1",
  "level": "Metadata",
  "auditID": "21c5b6a7-eeee-4676-b7e7-2764d16517f1",
  "stage": "ResponseComplete",
  "requestURI": "/apis/argoproj.io/v1alpha1/namespaces/argocd-system/appprojects",
  "verb": "create",
  "user": {
    "username": "system:serviceaccount:argocd-system:argocd-server",
    "uid": "ec844b9d-8ce1-4325-905a-479df42f0aed",
    "groups": [
      "system:serviceaccounts",
      "system:serviceaccounts:argocd-system",
      "system:authenticated"
    ],
    "extra": {
...
    }
  },
  "sourceIPs": [
    "..."
  ],
  "userAgent": "argocd-server/v0.0.0 (linux/arm64) kubernetes/$Format",
  "objectRef": {
    "resource": "appprojects",
    "namespace": "argocd-system",
    "name": "default",
    "apiGroup": "argoproj.io",
    "apiVersion": "v1alpha1"
  },
  "responseStatus": {
    "metadata": {},
    "code": 201
  },
  "requestReceivedTimestamp": "2024-10-05T18:12:49.590120Z",
  "stageTimestamp": "2024-10-05T18:12:49.594167Z",
  "annotations": {
    "authorization.k8s.io/decision": "allow",
    "authorization.k8s.io/reason": "RBAC: allowed by RoleBinding \"argocd-system-server/argocd-system\" of Role \"argocd-system-server\" to ServiceAccount \"argocd-server/argocd-system\""
  }
}

Then we see a second call, this time via kubectl... note the kubectl-client-side-apply in the requestURI path:

{
  "kind": "Event",
  "apiVersion": "audit.k8s.io/v1",
  "level": "Metadata",
  "auditID": "64c914a4-e184-4079-a6b6-8c5630e473fd",
  "stage": "ResponseComplete",
  "requestURI": "/apis/argoproj.io/v1alpha1/namespaces/argocd-system/appprojects?fieldManager=kubectl-client-side-apply&fieldValidation=Strict",
  "verb": "create",
  "user": {
    "username": "arn:aws:sts::...:assumed-role/.../EKSGetTokenAuth",
    "uid": "aws-iam-authenticator:...:...",
    "groups": [
      "system:authenticated"
    ],
    "extra": {
...
    }
  },
  "sourceIPs": [
    "..."
  ],
  "userAgent": "kubectl/v1.28.3 (linux/amd64) kubernetes/a8a1abc",
  "objectRef": {
    "resource": "appprojects",
    "namespace": "argocd-system",
    "name": "default",
    "apiGroup": "argoproj.io",
    "apiVersion": "v1alpha1"
  },
  "responseStatus": {
    "metadata": {},
    "status": "Failure",
    "message": "appprojects.argoproj.io \"default\" already exists",
    "reason": "AlreadyExists",
    "details": {
      "name": "default",
      "group": "argoproj.io",
      "kind": "appprojects"
    },
    "code": 409
  },
  "requestReceivedTimestamp": "2024-10-05T18:12:49.605439Z",
  "stageTimestamp": "2024-10-05T18:12:49.614296Z",
  "annotations": {
    "authorization.k8s.io/decision": "allow",
    "authorization.k8s.io/reason": "EKS Access Policy: allowed by ClusterRoleBinding \"arn:aws:iam::...:role/...+arn:aws:eks::aws:cluster-access-policy/AmazonEKSClusterAdminPolicy\" of ClusterRole \"arn:aws:eks::aws:cluster-access-policy/AmazonEKSClusterAdminPolicy\" to User \"...\""
  }
}
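
To put a number on how close the two create calls are, the sketch below subtracts the requestReceivedTimestamp values from the two audit events above. The helper name toMicros is made up for this illustration. JavaScript's Date only keeps milliseconds, so the microsecond digits are parsed separately.

```typescript
// Parse an RFC 3339 UTC timestamp with exactly six fractional digits
// into microseconds since the Unix epoch.
function toMicros(ts: string): number {
  const m = /^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})\.(\d{6})Z$/.exec(ts);
  if (!m) {
    throw new Error(`unsupported timestamp: ${ts}`);
  }
  // Whole seconds via Date.parse (milliseconds), fraction added by hand.
  const wholeSecondsMs = Date.parse(`${m[1]}Z`);
  return wholeSecondsMs * 1000 + Number(m[2]);
}

// Values copied from the two audit events above.
const argoCreate = "2024-10-05T18:12:49.590120Z";
const kubectlCreate = "2024-10-05T18:12:49.605439Z";

const gapMicros = toMicros(kubectlCreate) - toMicros(argoCreate);
console.log(`gap between create calls: ${gapMicros / 1000} ms`);
```

The kubectl create arrives roughly 15 ms after the ArgoCD create.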

Reproduction Steps

N/A

Possible Solution

I think that when overwrite: true is set, the --server-side=false flag should also be applied to the command.

Additional Information/Context

No response

CDK CLI Version

2.161.1

Framework Version

No response

Node.js Version

18

OS

linux

Language

TypeScript

Language Version

No response

Other information

No response

ashishdhingra commented 1 week ago

@diranged Good morning. Could you please confirm whether this is a CDK issue and share minimal code to reproduce it? Or was this issue originally intended for the https://github.com/kubernetes/kubectl/ repo?

Thanks, Ashish

diranged commented 1 week ago

Honestly I think this is a CDK issue because the overwrite: true flag is no longer behaving the way the user expects...

pahud commented 1 week ago

Hi @diranged,

Can you share a minimal CDK app that we can test and reproduce this issue in our account?

And please let us know which version was working and which version is broken now with exactly the same code.

Thanks.

diranged commented 1 week ago

@pahud, I can try - but I don't know when I'll have time to get that done... but I will note that rolling back to the V27 Lambda Kubectl function resolves the issue.

pahud commented 1 week ago

The error message "appprojects.argoproj.io "default" already exists" indicates that you're trying to create an Argo CD AppProject resource named "default" in the "argocd-system" namespace, but a resource with the same name already exists in your cluster. [1]

This situation typically occurs when:

You've previously created an AppProject named "default" in the same namespace.

The "default" AppProject was automatically created during the Argo CD installation process.

Argo CD typically creates a "default" AppProject during its initial setup, which is why you're encountering this error when trying to apply your manifest.

To resolve this issue:

Update instead of create: If you want to modify the existing "default" AppProject, you can use kubectl apply with the --force flag:

kubectl apply -f your-manifest.yaml --force

It looks like there's already a default AppProject and you are installing another one with the same name. The issue you mentioned, https://github.com/kubernetes/kubernetes/issues/44165, is from 2017, and I am not sure this is related to CDK.

I am not an ArgoCD expert, but I hope this helps.

diranged commented 1 week ago

@pahud, So this code has been in place and untouched (other than updates to aws-cdk-lib, aws-cdk and the @aws-cdk/lambda-layer-kubectl-v28 typescript libraries) for 2 years now. It started breaking after some recent update (though it's hard right now for me to pinpoint it, because we don't run integration tests 100% of the time). Yes, ArgoCD auto-creates the default AppProject object on startup - it's a foot-gun being discussed at https://github.com/argoproj/argo-cd/issues/11058 ... however, this behavior has been in place for several years.

Given the following code:

const cdk8sPostChart = new cdk8s.Chart(new cdk8s.App(), 'PostManifestBuilder', {
  namespace: this.namespace,
});
new AppProject(cdk8sPostChart, 'DefaultProject', {
  metadata: { name: 'default' },
  spec: props.defaultProjectSpec ?? DEFAULT_PROJECT_SPEC,
});
new KubernetesManifest(this, 'PostHelmResources', {
  cluster: this.cluster,
  overwrite: true, // the argo controller creates a default 'appProject' on startup, we overwrite it
  prune: true,
  manifest: cdk8sPostChart.toJson(),
}).node.addDependency(helmChart);

One would expect that regardless of whether the default object already exists or not, it would be overwritten via kubectl apply -f manifest.yaml... but it seems that in some cases that does not happen. I've tried to replicate this with a local kind environment using pure kubectl commands and for some reason I cannot .... which leads me to believe that there's actually a race-condition happening that is made worse by the Server Side Apply setup. The first create call is at 2024-10-05T18:12:49.590120Z and the second one comes in a hair later at 2024-10-05T18:12:49.605439Z.
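
The race hypothesis above can be illustrated with a small simulation. It assumes that a client-side apply reads the object first and only issues a create when the read finds nothing, which would match the verb "create" seen in the audit log. FakeApiServer, applyOnce and applyWithRetry are hypothetical names for this sketch, not Kubernetes or CDK APIs, and the retry is only one possible mitigation shown for contrast, not something the CDK handler is known to do.

```typescript
type WriteResult = "created" | "updated" | "conflict";

// Hypothetical in-memory stand-in for the API server.
class FakeApiServer {
  private owners = new Map<string, string>();

  exists(name: string): boolean {
    return this.owners.has(name);
  }

  ownerOf(name: string): string | undefined {
    return this.owners.get(name);
  }

  // Mirrors a POST: a taken name yields a 409-style conflict.
  create(name: string, owner: string): WriteResult {
    if (this.owners.has(name)) {
      return "conflict";
    }
    this.owners.set(name, owner);
    return "created";
  }

  // Mirrors an update of an existing object.
  update(name: string, owner: string): WriteResult {
    this.owners.set(name, owner);
    return "updated";
  }
}

// Read, then create or update. `between` lets a second actor run in
// the window between the read and the write.
function applyOnce(
  api: FakeApiServer,
  name: string,
  between: () => void = () => {},
): WriteResult {
  const found = api.exists(name);
  between();
  return found ? api.update(name, "cdk") : api.create(name, "cdk");
}

// Same flow, but a conflict triggers one fresh read-and-write.
function applyWithRetry(
  api: FakeApiServer,
  name: string,
  between: () => void,
): WriteResult {
  const first = applyOnce(api, name, between);
  return first === "conflict" ? applyOnce(api, name) : first;
}

// ArgoCD creates "default" inside the window: the apply hits a conflict.
const racy = new FakeApiServer();
const racyResult = applyOnce(racy, "default", () => {
  racy.create("default", "argocd");
});

// With one retry, the second attempt sees the object and updates it.
const retried = new FakeApiServer();
const retriedResult = applyWithRetry(retried, "default", () => {
  retried.create("default", "argocd");
});

console.log(racyResult, retriedResult);
```

In this model the failure only appears when the other writer lands between the read and the create, which would also explain why a sequential local reproduction with plain kubectl commands does not trigger it.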

Do you have specific objections to any of the following?

a) exposing --server-side=<bool> as an option on the KubernetesManifest resource
b) using --server-side=false when overwrite == true
c) using --force-conflicts=true when overwrite == true
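
For comparison, here is a rough sketch of how each of the three options would change the arguments passed to kubectl apply. Everything in it is hypothetical: the ApplyStrategy names, the serverSide option and the kubectlApplyArgs helper do not exist in aws-cdk-lib, and the real handler is Python rather than TypeScript. One detail is not an assumption: kubectl only accepts --force-conflicts together with --server-side, so option (c) has to turn server-side apply on explicitly.

```typescript
// Hypothetical names for the three options discussed above.
type ApplyStrategy =
  | "expose-flag" // (a)
  | "client-side-on-overwrite" // (b)
  | "force-conflicts-on-overwrite"; // (c)

interface ApplyOptions {
  overwrite: boolean;
  // Hypothetical new property for option (a); not part of KubernetesManifest today.
  serverSide?: boolean;
}

function kubectlApplyArgs(
  strategy: ApplyStrategy,
  opts: ApplyOptions,
  manifestPath: string,
): string[] {
  const args = ["apply", "-f", manifestPath];
  switch (strategy) {
    case "expose-flag":
      // Pass the user's choice through; omit the flag when unset.
      if (opts.serverSide !== undefined) {
        args.push(`--server-side=${opts.serverSide}`);
      }
      break;
    case "client-side-on-overwrite":
      if (opts.overwrite) {
        args.push("--server-side=false");
      }
      break;
    case "force-conflicts-on-overwrite":
      // --force-conflicts is only valid with server-side apply.
      if (opts.overwrite) {
        args.push("--server-side=true", "--force-conflicts=true");
      }
      break;
  }
  return args;
}

const optionB = kubectlApplyArgs(
  "client-side-on-overwrite",
  { overwrite: true },
  "/tmp/manifest.yaml",
);
console.log(optionB.join(" "));
```

Options (b) and (c) tie the behavior to the existing overwrite property, while (a) leaves the decision to the user and keeps overwrite unchanged.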