Azure / karpenter-provider-azure

AKS Karpenter Provider
Apache License 2.0

Workloads that rely on workload identity don’t function on nodes created by Karpenter #563

Open nfsouzaj opened 2 weeks ago

nfsouzaj commented 2 weeks ago

Version

Karpenter Version: v0.0.0

Kubernetes Version: v1.29.7

Hi, the requested command for getting the app_version (helm ls -A --all -o json | jq '.[] | select(.name=="karpenter") | .app_version' -r) returns nothing, as the field is empty. These are the two Karpenter occurrences:

    {
      "name": "aks-managed-karpenter-overlay",
      "namespace": "kube-system",
      "revision": "1950",
      "updated": "2024-11-04 17:26:31.707192908 +0000 UTC",
      "status": "deployed",
      "chart": "karpenter-overlay-addon-0.1.0-f334e0b6b9b9c329f88ac4f4578acedf1d519021",
      "app_version": ""
    },
    {
      "name": "aks-managed-karpenter-overlay-base",
      "namespace": "kube-system",
      "revision": "1951",
      "updated": "2024-11-04 17:25:58.10704853 +0000 UTC",
      "status": "deployed",
      "chart": "karpenter-overlay-base-addon-0.1.0-53202eafdc89edaceb7b54487c6b40a51d91e65e",
      "app_version": ""
    },
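Note that the filter in the requested command selects only a release named exactly "karpenter", and neither release listed above has that name, which may also explain the empty output. A minimal sketch that lists every release whose name contains "karpenter", together with its chart and app_version fields, assuming helm and jq are installed (the function name list_karpenter_releases is hypothetical):

```shell
# Hypothetical helper: list Helm releases whose name contains "karpenter".
# Assumes helm and jq are on PATH and the current kube context points at the cluster.
list_karpenter_releases() {
  helm ls -A --all -o json |
    jq -r '.[] | select(.name | contains("karpenter")) | [.name, .chart, .app_version] | @tsv'
}
```

Running list_karpenter_releases prints one tab-separated line per matching release. With the releases shown above, the third column would be empty because app_version is "", so the chart column is the only version hint available from Helm.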

Expected Behavior

Pods that rely on workload identity to communicate with Azure PaaS services such as Azure Key Vault (AKV) and DNS Zones should work properly on nodes created by Karpenter.

Actual Behavior

We use workload identity to enable External Secrets to communicate with Azure Key Vault and External DNS to connect with DNS Zones. After migrating one of my environments to use Karpenter-provisioned nodes, I encountered issues: External Secrets could no longer connect to AKV, and External DNS couldn’t reach the DNS zone.

After hours of troubleshooting, I suspected that Karpenter might be related to the issue. I switched back to regular nodes, and everything immediately started working again. Same user identity, same cluster, just a regular node provisioned by the cloud provider.

I discovered that nodes created without Karpenter have a specific label injected: kubernetes.azure.com/kubelet-identity-client-id, while Karpenter nodes lack this label.
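The label difference can be checked with something like the sketch below. It assumes kubectl access to the cluster and that Karpenter-provisioned nodes carry the karpenter.sh/nodepool label; the variable and function names are made up for illustration.

```shell
# Label reported as missing on Karpenter-provisioned nodes (taken from the issue text).
KUBELET_ID_LABEL="kubernetes.azure.com/kubelet-identity-client-id"

# Show all nodes with the label value as an extra column; an empty cell means
# the label is absent. karpenter.sh/nodepool is assumed to mark Karpenter nodes.
show_identity_label() {
  kubectl get nodes -L "$KUBELET_ID_LABEL" -L karpenter.sh/nodepool
}

# List only the nodes that lack the label, using a "!key" label selector.
# The "!" is single-quoted so interactive shells do not apply history expansion.
nodes_missing_identity_label() {
  kubectl get nodes -l '!'"$KUBELET_ID_LABEL" -o name
}
```

Calling show_identity_label gives a side-by-side view of both node types, and nodes_missing_identity_label prints just the node names without the label.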

Steps to Reproduce the Problem

Resource Specs and Logs

I am getting the following error: time="2024-11-04T17:08:12Z" level=fatal msg="Failed to do run once: WorkloadIdentityCredential: unable to resolve an endpoint: server response error:\n context deadline exceeded"

Community Note