
Cilium Helm install on EKS fails due to label issues #30092

Open strongjz opened 6 months ago

strongjz commented 6 months ago

What happened?

  1. Deploy EKS cluster via eksctl
  2. Deploy cilium with helm chart via cilium install
  3. Restart coredns or run connectivity test
  4. Pods get stuck in ContainerCreating status (an approximate command sketch follows this list)
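
For reference, roughly the commands involved; cluster name, region, and versions below are placeholders, not the exact values used here:

```sh
# Approximate reproduction sequence (assumes eksctl, cilium-cli, and kubectl are installed).
eksctl create cluster --name my-cluster --region us-east-2

# Install Cilium via the cilium-cli (which renders the Helm chart).
cilium install --version 1.14.5

# Restart CoreDNS, or run the connectivity test.
kubectl -n kube-system rollout restart deployment/coredns
cilium connectivity test

# Pods then get stuck in ContainerCreating.
kubectl get pods -A | grep ContainerCreating
```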

Adding the following labels entry to the Cilium ConfigMap fixes the issue:

labels: "k8s:io.kubernetes\\.pod\\.namespace k8s:k8s-app k8s:app k8s:name"

Documentation on configuring identity-relevant labels:

https://docs.cilium.io/en/stable/operations/performance/scalability/identity-relevant-labels/#configuring-identity-relevant-labels
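
As a sketch of that workaround (assuming the default cilium-config ConfigMap in kube-system; adjust the label list to whatever is identity-relevant in your environment):

```yaml
# Sketch only: add a "labels" key to the cilium-config ConfigMap, then restart
# the Cilium agents so the new identity-relevant label filter takes effect.
apiVersion: v1
kind: ConfigMap
metadata:
  name: cilium-config
  namespace: kube-system
data:
  labels: "k8s:io.kubernetes\\.pod\\.namespace k8s:k8s-app k8s:app k8s:name"
```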

The regex to check the labels should take the AWS Cluster ARN into account for a default install.

Cilium Version

v1.14.5

Kernel Version

amazon-eks-node-1.27-v20231230 ami-012689cd52612e266

5.10.201-191.748.amzn2.x86_64 #1 SMP Mon Nov 27 18:28:14 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Kubernetes Version

EKS Kubernetes Worker AMI with AmazonLinux2 image (k8s: 1.27.7, containerd: 1.7.*)

Sysdump

The cluster was deleted before this information could be collected.

Relevant log output

```
2024-01-03T19:20:56.293734186Z level=warning msg="Key allocation attempt failed" attempt=5 error="unable to allocate ID 17211 for key [k8s:io.cilium.k8s.namespace.labels.kubernetes.io/metadata.name=cilium-test k8s:io.cilium.k8s.policy.cluster=arn:aws:eks:us-east-2:123456789111:cluster/strongjz-test k8s:io.kubernetes.pod.namespace=cilium-test]: CiliumIdentity.cilium.io \"17211\" is invalid: metadata.labels: Invalid value: \"arn:aws:eks:us-east-2:123456789111:cluster/strongjz-test\": a valid label must be an empty string or consist of alphanumeric characters, '-', '_' or '.', and must start and end with an alphanumeric character (e.g. 'MyValue',  or 'my_value',  or '12345', regex used for validation is '(([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])?')" key="[k8s:io.cilium.k8s.namespace.labels.kubernetes.io/metadata.name=cilium-test k8s:io.cilium.k8s.policy.cluster=arn:aws:eks:us-east-2:123456789111:cluster/strongjz-test k8s:io.kubernetes.pod.namespace=cilium-test]" subsys=allocator
```

Anything else?

Looks like the cluster name is set here: https://github.com/cilium/cilium/blob/main/pkg/identity/numericidentity.go#L261

Cilium should either strip the ARN (arn:aws:eks:us-east-2:123456789111:cluster/strongjz-test) and keep only the cluster name, or the validation regex should be updated to allow the ':' character.
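
For illustration, a minimal Go sketch of the first option. This is not the actual Cilium code path: SanitizeClusterName is a hypothetical helper, and the validation pattern is the Kubernetes label-value regex quoted in the error above.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Kubernetes label-value pattern, as quoted in the allocator error above.
var labelValueRegex = regexp.MustCompile(`^(([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])?$`)

// SanitizeClusterName is a hypothetical helper: if the configured cluster name
// is an EKS ARN such as "arn:aws:eks:us-east-2:123456789111:cluster/strongjz-test",
// keep only the trailing cluster name so it is a valid Kubernetes label value.
func SanitizeClusterName(name string) (string, error) {
	if strings.HasPrefix(name, "arn:aws:eks:") {
		if idx := strings.LastIndex(name, "/"); idx >= 0 {
			name = name[idx+1:]
		}
	}
	if !labelValueRegex.MatchString(name) {
		return "", fmt.Errorf("cluster name %q is not a valid label value", name)
	}
	return name, nil
}

func main() {
	name, err := SanitizeClusterName("arn:aws:eks:us-east-2:123456789111:cluster/strongjz-test")
	fmt.Println(name, err) // strongjz-test <nil>
}
```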

squeed commented 5 months ago

Thanks for the report. Certainly, restricting the set of identity-relevant labels by default is not possible; we can't know in advance what labels are relevant for security.

I'm a bit confused about where the colons are sneaking in (I don't have all of this code in my head). Is this a cluster name with colons, or is there something else going on?

squeed commented 5 months ago

(as an aside, we do create clusters with eksctl as part of CI, so there is something more going on)

jcrowthe commented 4 months ago

I ran into this issue on a brand new EKS cluster. It was a test cluster launched directly in the AWS console.

After running the appropriate command to generate a kubeconfig file (aws eks update-kubeconfig --name my-cluster) my kubeconfig file contained the following:

```yaml
apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: ...
    server: https://...gr7.us-west-2.eks.amazonaws.com
  name: arn:aws:eks:us-west-2:0123456789:cluster/my-cluster
contexts:
- context:
    cluster: arn:aws:eks:us-west-2:0123456789:cluster/my-cluster
    user: arn:aws:eks:us-west-2:0123456789:cluster/my-cluster
  name: arn:aws:eks:us-west-2:0123456789:cluster/my-cluster
current-context: arn:aws:eks:us-west-2:0123456789:cluster/my-cluster
kind: Config
preferences: {}
users:
- name: arn:aws:eks:us-west-2:0123456789:cluster/my-cluster
  user:
    exec:
...
```

The cilium-cli appears to use the contents of the kubeconfig file in the configuration it installs on the cluster. On this brand-new EKS cluster, my CoreDNS pods were stuck in ContainerCreating, and after checking the cilium-agent logs, I found that the identity was not being created because the labels contained characters that are not valid in Kubernetes label values (i.e. the subject of this ticket).
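
A quick way to confirm what Cilium was installed with (a sketch; the ConfigMap name and key assume a default install) is to check the configured cluster name:

```sh
# Show the cluster name the cilium-cli wrote into the agent configuration.
kubectl -n kube-system get configmap cilium-config -o jsonpath='{.data.cluster-name}'

# Or via the cilium-cli:
cilium config view | grep cluster-name
```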

To fix this, I did the following:

  1. Modify my kubeconfig. Replace all arn:aws:eks:us-west-2:0123456789:cluster/my-cluster with just my-cluster
  2. cilium upgrade ...
  3. Ensure the cilium-operator and all cilium daemonset pods are restarted (rough commands below).
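
Roughly, assuming the kubeconfig has already been edited to use the short cluster name and a cilium-cli version that accepts Helm --set values (adjust the flags to match your original install):

```sh
# Re-render the install; cluster.name is a regular Helm value, so it can also
# be passed explicitly instead of relying on kubeconfig auto-detection.
cilium upgrade --set cluster.name=my-cluster

# Restart the operator and the agents so they pick up the new configuration.
kubectl -n kube-system rollout restart deployment/cilium-operator
kubectl -n kube-system rollout restart daemonset/cilium
```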

After this, Cilium started creating the identities properly:

level=info msg="Successful endpoint creation" ciliumEndpointName=kube-system/coredns-86bd649884-r42hh ...

@squeed does CI use the aws-cli command to generate the kubeconfig, or by some other way? I'm guessing this is the source of the discrepancy.

mikee commented 2 months ago

I have run into this too on a brand-new EKS cluster. The Cilium CLI auto-detects the cluster name from the kubeconfig, and the aws eks kubeconfig setup writes cluster names in the ARN format. Just overriding the cluster name in my values was enough to move past the issue: cilium install --values values.yaml

values.yaml:

```yaml
cluster:
  id: 0
  name: disconnected-cluster
```

Azahorscak commented 2 months ago

Same here, thanks for the answer mikee. I had colons in the cluster name and needed to uninstall and reinstall like so:

cilium install --set cluster.name=my-cluster
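
For installs managed directly with Helm rather than the cilium-cli, the equivalent is to set the same value on the chart. A sketch, assuming the chart was installed into kube-system as release "cilium" from a Helm repo added as "cilium":

```sh
# Keep existing values and only override the cluster name.
helm upgrade cilium cilium/cilium -n kube-system --reuse-values --set cluster.name=my-cluster
```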