Azure / karpenter-provider-azure

AKS Karpenter Provider

Nodes created by Karpenter are unable to pull images from a private Azure Container Registry (ACR), resulting in a 401 Unauthorized error #411

Open ATymus opened 4 months ago

ATymus commented 4 months ago

Version

Karpenter Version: v0.5.0

Kubernetes Version: v1.29.4

Expected Behavior

The expected behavior is that the nodes can access the private ACR using the configured managed identity.

Actual Behavior

Nodes created by Karpenter and regular Kubernetes nodes both have the same managed identity configured. This managed identity has been granted both the AcrPull and AcrPush roles on the ACR. However, while pods on regular Kubernetes nodes can successfully pull images from the private ACR, pods on nodes created by Karpenter fail with the following error: 401 Unauthorized. (screenshot omitted)

Steps to Reproduce the Problem

az aks update -n aks-dev -g rg-dev --attach-acr myregistry
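
As a sanity check after attaching, the Azure CLI ships a validation command that runs a canary pod to exercise node-to-registry authentication; a quick check along these lines (using the names above; note the canary may land on a regular node rather than a Karpenter-provisioned one):

# Validate that cluster nodes can reach and authenticate to the attached registry.
az aks check-acr -n aks-dev -g rg-dev --acr myregistry.azurecr.io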

Resource Specs and Logs

Events:
  Type     Reason                  Age                    From               Message
  ----     ------                  ----                   ----               -------
  Warning  FailedScheduling        39m                    default-scheduler  0/1 nodes are available: 1 Insufficient cpu. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod.
  Normal   Nominated               39m                    karpenter          Pod should schedule on: nodeclaim/general-purpose-zfxqd
  Normal   Scheduled               37m                    default-scheduler  Successfully assigned default/test-779d54dfd-djk7d to aks-general-purpose-zfxqd
  Warning  FailedCreatePodSandBox  37m                    kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "1d4cfc55733b95293627a57ffb7a20de269debdf1b2afc2116aeb103042afeb4": plugin type="cilium-cni" failed (add): failed to invoke delegated plugin ADD for IPAM: http request failed: Post "http://localhost:10090/network/requestipconfigs": dial tcp 127.0.0.1:10090: connect: connection refused; failed to request IP address from CNS
  Normal   SandboxChanged          36m (x5 over 37m)      kubelet            Pod sandbox changed, it will be killed and re-created.
  Normal   Pulling                 36m (x3 over 36m)      kubelet            Pulling image "myregistry.azurecr.io/test-image:latest"
  Warning  Failed                  36m (x3 over 36m)      kubelet            Failed to pull image "myregistry.azurecr.io/test-image:latest": failed to pull and unpack image "myregistry.azurecr.io/test-image:latest": failed to resolve reference "myregistry.azurecr.io/test-image:latest": failed to authorize: failed to fetch anonymous token: unexpected status from GET request to https://myregistry.azurecr.io/oauth2/token?scope=repository%3Atest-image%3Apull&service=myregistry.azurecr.io: 401 Unauthorized
  Warning  Failed                  36m (x3 over 36m)      kubelet            Error: ErrImagePull
  Warning  Failed                  35m (x5 over 36m)      kubelet            Error: ImagePullBackOff
  Normal   BackOff                 2m29s (x152 over 36m)  kubelet            Back-off pulling image "myregistry.azurecr.io/test-image:latest"


danielhamelberg commented 4 months ago

@ATymus I recommend enabling the debug log level in Karpenter, redeploying, and sharing more resource specs and logs:

kubectl describe pod <pod-name> -n <namespace>
kubectl describe node <node-name>
az aks show --resource-group <resource-group> --name <aks-cluster> --query "identity"
az role assignment list --assignee <managed-identity-id> --scope <acr-id>

Also double-check the secret the pod is using to access the ACR.

ATymus commented 4 months ago

I have debug mode enabled in Karpenter, but there are no errors related to this problem.

kubectl describe pod <pod-name> -n <namespace>

Events:
  Type     Reason          Age                  From     Message
  ----     ------          ----                 ----     -------
  Normal   SandboxChanged  60m (x5 over 60m)    kubelet  Pod sandbox changed, it will be killed and re-created.
  Normal   Pulling         59m (x3 over 60m)    kubelet  Pulling image "myregistry.azurecr.io/test-image:latest"
  Warning  Failed          59m (x3 over 60m)    kubelet  Failed to pull image "myregistry.azurecr.io/test-image:latest": failed to pull and unpack image "myregistry.azurecr.io/test-image:latest": failed to resolve reference "myregistry.azurecr.io/test-image:latest": failed to authorize: failed to fetch anonymous token: unexpected status from GET request to https://myregistry.azurecr.io/oauth2/token?scope=repository%3Atest-image%3Apull&service=myregistry.azurecr.io: 401 Unauthorized
  Warning  Failed          59m (x3 over 60m)    kubelet  Error: ErrImagePull
  Warning  Failed          58m (x5 over 60m)    kubelet  Error: ImagePullBackOff
  Normal   BackOff         45s (x261 over 60m)  kubelet  Back-off pulling image "myregistry.azurecr.io/test-image:latest"

kubectl describe node <node-name>

Events:
  Type    Reason                   Age                From                    Message
  ----    ------                   ----               ----                    -------
  Normal  Starting                 59m                kubelet                 Starting kubelet.
  Normal  NodeHasSufficientMemory  59m (x8 over 59m)  kubelet                 Node aks-general-purpose-jgmcf status is now: NodeHasSufficientMemory
  Normal  NodeHasNoDiskPressure    59m (x8 over 59m)  kubelet                 Node aks-general-purpose-jgmcf status is now: NodeHasNoDiskPressure
  Normal  NodeHasSufficientPID     59m (x7 over 59m)  kubelet                 Node aks-general-purpose-jgmcf status is now: NodeHasSufficientPID
  Normal  NodeAllocatableEnforced  59m                kubelet                 Updated Node Allocatable limit across pods
  Normal  RegisteredNode           59m                node-controller         Node aks-general-purpose-jgmcf event: Registered Node aks-general-purpose-jgmcf in Controller
  Normal  CreatedNNC               59m (x2 over 59m)  dnc-rc/node-reconciler  Created NodeNetworkConfig aks-general-purpose-jgmcf
  Normal  Unconsolidatable         10m (x4 over 57m)  karpenter               Can't replace with a cheaper node

az aks show --resource-group <resource-group> --name <aks-cluster> --query "identity"

> {
>   "delegatedResources": null,
>   "principalId": "26***34",
>   "tenantId": "91***9",
>   "type": "SystemAssigned",
>   "userAssignedIdentities": null
> }
> 

az role assignment list --assignee <managed-identity-id> --scope <acr-id>


>  [
>   {
>     "condition": null,
>     "conditionVersion": null,
>     "createdBy": "a5***71",
>     "createdOn": "2024-05-24T07:52:02.216406+00:00",
>     "delegatedManagedIdentityResourceId": null,
>     "description": "",
>     "id": "/subscriptions/3a***53/resourceGroups/test-dev/providers/Microsoft.ContainerRegistry/registries/myregistry/providers/Microsoft.Authorization/roleAssignments/61***1b",
>     "name": "61***1b",
>     "principalId": "3b***1f",
>     "principalName": "49***39",
>     "principalType": "ServicePrincipal",
>     "resourceGroup": "test-dev",
>     "roleDefinitionId": "/subscriptions/3a***53/providers/Microsoft.Authorization/roleDefinitions/83***ec",
>     "roleDefinitionName": "AcrPush",
>     "scope": "/subscriptions/3a***53/resourceGroups/test-dev/providers/Microsoft.ContainerRegistry/registries/myregistry",
>     "type": "Microsoft.Authorization/roleAssignments",
>     "updatedBy": "a5***71",
>     "updatedOn": "2024-05-24T07:52:02.216406+00:00"
>   },
>   {
>     "condition": null,
>     "conditionVersion": null,
>     "createdBy": "a5***71",
>     "createdOn": "2024-05-24T07:52:02.577232+00:00",
>     "delegatedManagedIdentityResourceId": null,
>     "description": "",
>     "id": "/subscriptions/3a***53/resourceGroups/test-dev/providers/Microsoft.ContainerRegistry/registries/myregistry/providers/Microsoft.Authorization/roleAssignments/85***d5",
>     "name": "85***5",
>     "principalId": "3b***1f",
>     "principalName": "49***39",
>     "principalType": "ServicePrincipal",
>     "resourceGroup": "test-dev",
>     "roleDefinitionId": "/subscriptions/3a***53/providers/Microsoft.Authorization/roleDefinitions/7f***8d",
>     "roleDefinitionName": "AcrPull",
>     "scope": "/subscriptions/3a***53/resourceGroups/test-dev/providers/Microsoft.ContainerRegistry/registries/myregistry",
>     "type": "Microsoft.Authorization/roleAssignments",
>     "updatedBy": "a5***71",
>     "updatedOn": "2024-05-24T07:52:02.577232+00:00"
>   }
> ]
JoeyC-Dev commented 3 months ago

@danielhamelberg Looks like the issue is reproducible. Neither granting the permission manually nor using the --attach-acr method works. (screenshot omitted)

The permission is definitely there. (screenshot omitted)

Using an image pull secret should work as a workaround, but it relies on a password-like credential and should not become the intended way.
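
For anyone who needs the stopgap anyway, a minimal sketch of that pull-secret workaround (it enables the registry's admin user, which is exactly the password-like credential being cautioned against; variable names follow the setup below):

# Workaround sketch: pull via an image pull secret instead of the managed identity.
az acr update -n ${acr} --admin-enabled true
acrPassword=$(az acr credential show -n ${acr} --query "passwords[0].value" -o tsv)
kubectl create secret docker-registry acr-secret \
  --docker-server=${acr}.azurecr.io \
  --docker-username=${acr} \
  --docker-password=${acrPassword}
# Then reference the secret from the pod spec via spec.imagePullSecrets[0].name: acr-secret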

Here is the full demo setup for the issue. (Please execute the commands one by one; I did not include the command to grant "Azure Kubernetes Service RBAC Cluster Admin" to the logged-in user.)

ranNum=$(echo $RANDOM)
rG=aks-auto-${ranNum}
aks=aks-auto-${ranNum}
acr=acrauto${ranNum}
location=southeastasia

az extension add --name aks-preview

az group create -n ${rG} -l ${location} -o none

# Specify "Standard_D8pds_v5" as this is the one in my sub can be created among 3 availability zones
az aks create -n ${aks} -g ${rG} --node-vm-size Standard_D8pds_v5 \
--sku automatic --no-ssh-key

az acr create --resource-group ${rG} --name ${acr} --sku Basic
az acr login --name ${acr}
docker pull nginx
docker tag nginx ${acr}.azurecr.io/nginx
docker push ${acr}.azurecr.io/nginx

kubeletObjID=$(az aks show -n ${aks} -g ${rG} --query identityProfile.kubeletidentity.objectId -o tsv)

acrResID=$(az resource show -n ${acr} -g ${rG} \
--namespace Microsoft.ContainerRegistry --resource-type registries --query id -o tsv)

az role assignment create --assignee-object-id ${kubeletObjID} \
--assignee-principal-type ServicePrincipal --role "AcrPull" --scope ${acrResID}

# Grant your own user "Azure Kubernetes Service RBAC Cluster Admin"; I skip the CLI command for that here.

az aks get-credentials -n ${aks} -g ${rG} 

# Deploy test Pod
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
  - name: nginx
    image: ${acr}.azurecr.io/nginx
    imagePullPolicy: IfNotPresent
EOF

# Wait 3 minutes for the new node to be provisioned, then check the result
sleep 180;
kubectl describe po nginx
# Result: ErrImagePull

kubectl delete po nginx

# Try the `--attach-acr` method, which is the intended approach
az aks update -n ${aks} -g ${rG} --attach-acr ${acr}

# Deploy Pod again
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
  - name: nginx
    image: ${acr}.azurecr.io/nginx
    imagePullPolicy: IfNotPresent
EOF

# Wait 3 minutes for the new node to be provisioned, then check the result
sleep 180;
kubectl describe po nginx

# Still failed

Debug: Go inside the node:

root [ / ]# crictl pull acrauto2462.azurecr.io/nginx
E0807 18:18:39.889144   21228 remote_image.go:180] "PullImage from image service failed" err="rpc error: code = Unknown desc = failed to pull and unpack image \"acrauto2462.azurecr.io/nginx:latest\": failed to resolve reference \"acrauto2462.azurecr.io/nginx:latest\": failed to authorize: failed to fetch anonymous token: unexpected status from GET request to https://acrauto2462.azurecr.io/oauth2/token?scope=repository%3Anginx%3Apull&service=acrauto2462.azurecr.io: 401 Unauthorized" image="acrauto2462.azurecr.io/nginx"
FATA[0000] pulling image: failed to pull and unpack image "acrauto2462.azurecr.io/nginx:latest": failed to resolve reference "acrauto2462.azurecr.io/nginx:latest": failed to authorize: failed to fetch anonymous token: unexpected status from GET request to https://acrauto2462.azurecr.io/oauth2/token?scope=repository%3Anginx%3Apull&service=acrauto2462.azurecr.io: 401 Unauthorized 

We can see that pulling with crictl fails in the same way. I also tried the Ubuntu SKU, but it does not work either.
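
One way to narrow this down further from inside the node is to run the managed-identity token exchange by hand: if the identity can mint an ACR refresh token, the role assignment is fine and the gap is in kubelet's credential plumbing. A sketch, assuming jq is present on the node and reusing the registry from this repro:

# Fetch an AAD access token from IMDS using the node's managed identity...
aadToken=$(curl -s -H "Metadata: true" \
  "http://169.254.169.254/metadata/identity/oauth2/token?api-version=2018-02-01&resource=https%3A%2F%2Fmanagement.azure.com%2F" \
  | jq -r .access_token)

# ...then exchange it for an ACR refresh token at the registry's oauth2/exchange endpoint.
curl -s -X POST "https://acrauto2462.azurecr.io/oauth2/exchange" \
  -d "grant_type=access_token&service=acrauto2462.azurecr.io&access_token=${aadToken}" | jq .
# A refresh_token in the response means the identity and role assignment work;
# a 401 here would instead point at the identity or the role assignment itself.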

At this point, I realized something. So I avoided the node created by Karpenter and used the system nodepool instead:

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: nginx-test
spec:
  nodeSelector:
    kubernetes.azure.com/agentpool: nodepool1
  tolerations:
    - key: CriticalAddonsOnly
      operator: Exists
  containers:
  - name: nginx
    image: ${acr}.azurecr.io/nginx
    imagePullPolicy: IfNotPresent
EOF

(screenshot omitted) I don't know why there is the error below. I also tried busybox, but it still cannot be created because of the same error (it looks like deploying to the system nodepool is somehow not supposed to happen). But the point here is: the image can now be pulled, only on the system nodepool. (The exec format error below is likely an architecture mismatch: Standard_D8pds_v5 is an ARM64 size, so the system nodepool is ARM64, while the nginx image pushed earlier was presumably pulled on an amd64 machine.)

kubectl logs nginx-test
exec /docker-entrypoint.sh: exec format error
comtalyst commented 3 months ago

We have just discovered a suspect: the DisableKubeletCloudCredentialProviders feature gate for kubelet defaults to true beginning with 1.29. It looks like we haven't made an appropriate response to that yet.
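
For anyone who wants to confirm the suspect on an affected node, the gate should be visible in the kubelet invocation or its flags file; a rough check (the /etc/default/kubelet path is the usual AKS location but is an assumption here):

# Look for the feature gate in the running kubelet's command line...
ps -ef | grep "[k]ubelet" | grep -o "DisableKubeletCloudCredentialProviders=[a-z]*"

# ...or in the kubelet flags file commonly used on AKS nodes.
grep -o "DisableKubeletCloudCredentialProviders=[a-z]*" /etc/default/kubelet
# No explicit entry means the 1.29 default (true) applies: the in-tree credential
# provider is disabled and, without out-of-tree wiring, pulls fall back to anonymous.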

This overall issue also seems not to be present on 1.28 (at least in my reproduction attempt), further backing that claim.

Will give updates on the potential fix for this.

Bryce-Soghigian commented 3 months ago

https://github.com/Azure/karpenter-provider-azure/blob/main/pkg/providers/imagefamily/bootstrap/aksbootstrap.go#L425

// CredentialProviderURL returns the URL for OOT credential provider,
// or an empty string if OOT provider is not to be used
func CredentialProviderURL(kubernetesVersion, arch string) string {
	minorVersion := semver.MustParse(kubernetesVersion).Minor
	if minorVersion < 30 {
		return ""
	}
	// ... (remainder of function elided)
}

From this code, it looks like we default to not including the out-of-tree (OOT) provider settings when CredentialProviderURL returns an empty string.

Feature                                 Default  Stage  Since  Until
DisableKubeletCloudCredentialProviders  false    Alpha  1.23   1.28
DisableKubeletCloudCredentialProviders  true     Beta   1.29   -

We don't have the logic conditionally enabled for 1.29, so authenticated pulls will not work on that specific Kubernetes version.

This can be easily fixed by either:

A) Switching 1.29 to use the out-of-tree credential provider (have Karpenter pass in the rest of the required OOT provider kubelet flags).
B) Defaulting the feature gate to false for 1.29 clusters.

I believe it's best we use option A.
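
For context, option A roughly amounts to wiring up kubelet's out-of-tree image credential provider. A minimal sketch of what that wiring looks like, written as the bootstrap shell would lay it down (the file paths and the acr-credential-provider binary name follow cloud-provider-azure conventions and are assumptions here, not necessarily what Karpenter generates):

# Sketch: configure kubelet's out-of-tree image credential provider (option A).
# Paths, binary name, and azure.json location are assumptions for illustration.
cat <<'EOF' > /var/lib/kubelet/credential-provider-config.yaml
apiVersion: kubelet.config.k8s.io/v1
kind: CredentialProviderConfig
providers:
  - name: acr-credential-provider          # binary expected in the bin dir below
    matchImages:
      - "*.azurecr.io"                     # send all ACR pulls through this provider
    defaultCacheDuration: "10m"
    apiVersion: credentialprovider.kubelet.k8s.io/v1
    args:
      - /etc/kubernetes/azure.json         # cloud config carrying the kubelet identity
EOF

# kubelet must then be started with the matching flags:
#   --image-credential-provider-config=/var/lib/kubelet/credential-provider-config.yaml
#   --image-credential-provider-bin-dir=/var/lib/kubelet/credential-provider/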

Bryce-Soghigian commented 2 months ago

Merged in the fix; it still needs to be released, so keeping this open for tracking.

vikas-rajvanshy commented 2 months ago

@Bryce-Soghigian - do you know when this will be released? I am trying to triangulate if I should wait or go back to standard node pools for now.

Bryce-Soghigian commented 2 months ago

@vikas-rajvanshy It's in the current release that's rolling out. I believe you can track it via the AKS release tracker https://releases.aks.azure.com/. The fix is part of the v20240827 release.