ATymus opened this issue 4 months ago
@ATymus I recommend enabling the debug log level in Karpenter, redeploying, and sharing more Resource Specs and Logs:
kubectl describe pod <pod-name> -n <namespace>
kubectl describe node <node-name>
az aks show --resource-group <resource-group> --name <aks-cluster> --query "identity"
az role assignment list --assignee <managed-identity-id> --scope <acr-id>
Also double-check the secret the pod is using to access the ACR.
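For example (names are placeholders), you can list which pull secret the pod references and decode it:
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.imagePullSecrets}'
kubectl get secret <secret-name> -n <namespace> -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d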
I have debug mode enabled in Karpenter, but there are no errors related to this problem.
kubectl describe pod <pod-name> -n <namespace>
Events:
  Type     Reason          Age                  From     Message
  ----     ------          ---                  ----     -------
  Normal   SandboxChanged  60m (x5 over 60m)    kubelet  Pod sandbox changed, it will be killed and re-created.
  Normal   Pulling         59m (x3 over 60m)    kubelet  Pulling image "myregistry.azurecr.io/test-image:latest"
  Warning  Failed          59m (x3 over 60m)    kubelet  Failed to pull image "myregistry.azurecr.io/test-image:latest": failed to pull and unpack image "myregistry.azurecr.io/test-image:latest": failed to resolve reference "myregistry.azurecr.io/test-image:latest": failed to authorize: failed to fetch anonymous token: unexpected status from GET request to https://myregistry.azurecr.io/oauth2/token?scope=repository%3Atest-image%3Apull&service=myregistry.azurecr.io: 401 Unauthorized
  Warning  Failed          59m (x3 over 60m)    kubelet  Error: ErrImagePull
  Warning  Failed          58m (x5 over 60m)    kubelet  Error: ImagePullBackOff
  Normal   BackOff         45s (x261 over 60m)  kubelet  Back-off pulling image "myregistry.azurecr.io/test-image:latest"
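The "failed to fetch anonymous token" wording means containerd presented no credentials at all, so the registry rejected the anonymous token request. The same 401 should be reproducible from any machine against a private ACR (registry name taken from the log above):
curl -si "https://myregistry.azurecr.io/oauth2/token?scope=repository%3Atest-image%3Apull&service=myregistry.azurecr.io" | head -n 1
# Expected for an unauthenticated request to a private registry: HTTP/1.1 401 Unauthorized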
kubectl describe node <node-name>
Events:
  Type    Reason                   Age                From                    Message
  ----    ------                   ---                ----                    -------
  Normal  Starting                 59m                kubelet                 Starting kubelet.
  Normal  NodeHasSufficientMemory  59m (x8 over 59m)  kubelet                 Node aks-general-purpose-jgmcf status is now: NodeHasSufficientMemory
  Normal  NodeHasNoDiskPressure    59m (x8 over 59m)  kubelet                 Node aks-general-purpose-jgmcf status is now: NodeHasNoDiskPressure
  Normal  NodeHasSufficientPID     59m (x7 over 59m)  kubelet                 Node aks-general-purpose-jgmcf status is now: NodeHasSufficientPID
  Normal  NodeAllocatableEnforced  59m                kubelet                 Updated Node Allocatable limit across pods
  Normal  RegisteredNode           59m                node-controller         Node aks-general-purpose-jgmcf event: Registered Node aks-general-purpose-jgmcf in Controller
  Normal  CreatedNNC               59m (x2 over 59m)  dnc-rc/node-reconciler  Created NodeNetworkConfig aks-general-purpose-jgmcf
  Normal  Unconsolidatable         10m (x4 over 57m)  karpenter               Can't replace with a cheaper node
az aks show --resource-group <resource-group> --name <aks-cluster> --query "identity"
> {
> "delegatedResources": null,
> "principalId": "26***34",
> "tenantId": "91***9",
> "type": "SystemAssigned",
> "userAssignedIdentities": null
> }
az role assignment list --assignee <managed-identity-id> --scope <acr-id>
> [
> {
> "condition": null,
> "conditionVersion": null,
> "createdBy": "a5***71",
> "createdOn": "2024-05-24T07:52:02.216406+00:00",
> "delegatedManagedIdentityResourceId": null,
> "description": "",
> "id": "/subscriptions/3a***53/resourceGroups/test-dev/providers/Microsoft.ContainerRegistry/registries/myregistry/providers/Microsoft.Authorization/roleAssignments/61***1b",
> "name": "61***1b",
> "principalId": "3b***1f",
> "principalName": "49***39",
> "principalType": "ServicePrincipal",
> "resourceGroup": "test-dev",
> "roleDefinitionId": "/subscriptions/3a***53/providers/Microsoft.Authorization/roleDefinitions/83***ec",
> "roleDefinitionName": "AcrPush",
> "scope": "/subscriptions/3a***53/resourceGroups/test-dev/providers/Microsoft.ContainerRegistry/registries/myregistry",
> "type": "Microsoft.Authorization/roleAssignments",
> "updatedBy": "a5***71",
> "updatedOn": "2024-05-24T07:52:02.216406+00:00"
> },
> {
> "condition": null,
> "conditionVersion": null,
> "createdBy": "a5***71",
> "createdOn": "2024-05-24T07:52:02.577232+00:00",
> "delegatedManagedIdentityResourceId": null,
> "description": "",
> "id": "/subscriptions/3a***53/resourceGroups/test-dev/providers/Microsoft.ContainerRegistry/registries/myregistry/providers/Microsoft.Authorization/roleAssignments/85***d5",
> "name": "85***5",
> "principalId": "3b***1f",
> "principalName": "49***39",
> "principalType": "ServicePrincipal",
> "resourceGroup": "test-dev",
> "roleDefinitionId": "/subscriptions/3a***53/providers/Microsoft.Authorization/roleDefinitions/7f***8d",
> "roleDefinitionName": "AcrPull",
> "scope": "/subscriptions/3a***53/resourceGroups/test-dev/providers/Microsoft.ContainerRegistry/registries/myregistry",
> "type": "Microsoft.Authorization/roleAssignments",
> "updatedBy": "a5***71",
> "updatedOn": "2024-05-24T07:52:02.577232+00:00"
> }
> ]
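Note that the principalId in these role assignments (3b***1f) is not the cluster's system-assigned identity shown above (26***34); image pulls are performed by the kubelet identity, so it is worth cross-checking that the two match:
az aks show --resource-group <resource-group> --name <aks-cluster> --query identityProfile.kubeletidentity.objectId -o tsv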
@danielhamelberg Looks like the issue is reproducible. Neither granting the permission manually nor using the --attach-acr method works.
Permission is definitely there.
Using an image pull secret should work as a workaround, but it relies on a password-like credential and should not become the intended approach.
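A minimal sketch of that workaround, assuming the ACR admin user is enabled (which is exactly the password-like credential being warned about; names are illustrative):
az acr update -n myregistry --admin-enabled true
acrUser=$(az acr credential show -n myregistry --query username -o tsv)
acrPass=$(az acr credential show -n myregistry --query 'passwords[0].value' -o tsv)
kubectl create secret docker-registry acr-pull-secret \
  --docker-server=myregistry.azurecr.io \
  --docker-username=${acrUser} \
  --docker-password=${acrPass}
# Then reference the secret from the pod spec under spec.imagePullSecrets.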
Here is the full demo setup for the issue. (Please execute the commands one by one; note that I did not include the command to grant "Azure Kubernetes Service RBAC Cluster Admin" to the logged-in user.)
ranNum=$(echo $RANDOM)
rG=aks-auto-${ranNum}
aks=aks-auto-${ranNum}
acr=acrauto${ranNum}
location=southeastasia
az extension add --name aks-preview
az group create -n ${rG} -l ${location} -o none
# Specify "Standard_D8pds_v5" as this is the one in my sub can be created among 3 availability zones
az aks create -n ${aks} -g ${rG} --node-vm-size Standard_D8pds_v5 \
--sku automatic --no-ssh-key
az acr create --resource-group ${rG} --name ${acr} --sku Basic
az acr login --name ${acr}
docker pull nginx
docker tag nginx ${acr}.azurecr.io/nginx
docker push ${acr}.azurecr.io/nginx
kubeletObjID=$(az aks show -n ${aks} -g ${rG} --query identityProfile.kubeletidentity.objectId -o tsv)
acrResID=$(az resource show -n ${acr} -g ${rG} \
--namespace Microsoft.ContainerRegistry --resource-type registries --query id -o tsv)
az role assignment create --assignee-object-id ${kubeletObjID} \
--assignee-principal-type ServicePrincipal --role "AcrPull" --scope ${acrResID}
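# (Sanity check, not part of the original repro: confirm the assignment landed on the kubelet identity)
az role assignment list --assignee ${kubeletObjID} --scope ${acrResID} -o table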
# Grant your own user "Azure Kubernetes Service RBAC Cluster Admin"; I skip the CLI command for that here.
az aks get-credentials -n ${aks} -g ${rG}
# Deploy test Pod
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
name: nginx
spec:
containers:
- name: nginx
image: ${acr}.azurecr.io/nginx
imagePullPolicy: IfNotPresent
EOF
# Wait 3 minutes for the new node to be provisioned, then check the result
sleep 180;
kubectl describe po nginx
# Result: ErrImagePull
kubectl delete po nginx
# Try the `--attach-acr` method, which is the intended approach
az aks update -n ${aks} -g ${rG} --attach-acr ${acr}
# Deploy Pod again
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
name: nginx
spec:
containers:
- name: nginx
image: ${acr}.azurecr.io/nginx
imagePullPolicy: IfNotPresent
EOF
# Wait 3 minutes for the new node to be provisioned, then check the result
sleep 180;
kubectl describe po nginx
# Still failed
Debug: go inside the node and try pulling directly:
root [ / ]# crictl pull acrauto2462.azurecr.io/nginx
E0807 18:18:39.889144 21228 remote_image.go:180] "PullImage from image service failed" err="rpc error: code = Unknown desc = failed to pull and unpack image \"acrauto2462.azurecr.io/nginx:latest\": failed to resolve reference \"acrauto2462.azurecr.io/nginx:latest\": failed to authorize: failed to fetch anonymous token: unexpected status from GET request to https://acrauto2462.azurecr.io/oauth2/token?scope=repository%3Anginx%3Apull&service=acrauto2462.azurecr.io: 401 Unauthorized" image="acrauto2462.azurecr.io/nginx"
FATA[0000] pulling image: failed to pull and unpack image "acrauto2462.azurecr.io/nginx:latest": failed to resolve reference "acrauto2462.azurecr.io/nginx:latest": failed to authorize: failed to fetch anonymous token: unexpected status from GET request to https://acrauto2462.azurecr.io/oauth2/token?scope=repository%3Anginx%3Apull&service=acrauto2462.azurecr.io: 401 Unauthorized
We can see that pulling the image with crictl directly also fails.
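One way to see what the kubelet is (or is not) configured with is to inspect its flags on the node; the grep targets are the standard kubelet credential-provider flags, though the exact wiring on AKS images is an assumption on my part:
# Run on the affected node
ps -ef | grep '[k]ubelet' | tr ' ' '\n' | grep -E 'image-credential-provider|feature-gates'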
I also tried the Ubuntu SKU, but it does not work either.
At this point I realized something, so I avoided the node created by Karpenter and used the system nodepool instead:
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
name: nginx-test
spec:
nodeSelector:
kubernetes.azure.com/agentpool: nodepool1
tolerations:
- key: CriticalAddonsOnly
operator: Exists
containers:
- name: nginx
image: ${acr}.azurecr.io/nginx
imagePullPolicy: IfNotPresent
EOF
I don't know why the error below occurs. I also tried busybox, but it still fails the same way (it looks like deploying to the system nodepool is not really intended somehow; the exec format error is likely an architecture mismatch, i.e. an amd64 image landing on an Arm64 Dpds_v5 node). But the point here is: the image can now be pulled, just only on the system nodepool.
kubectl logs nginx-test
exec /docker-entrypoint.sh: exec format error
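If the architecture-mismatch suspicion above is right, a quick check (reusing names from the repro script) would compare the pushed image's architecture with the nodes':
docker image inspect --format '{{.Architecture}}' ${acr}.azurecr.io/nginx   # e.g. amd64
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.nodeInfo.architecture}{"\n"}{end}'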
We have just discovered a suspect: the DisableKubeletCloudCredentialProviders feature gate for the kubelet is set to true by default beginning with 1.29, and it looks like we haven't made an appropriate response to that yet.
This issue also does not appear on 1.28 (from my reproduction attempt, at least), further backing that claim.
Will give updates on the potential fix for this.
// CredentialProviderURL returns the URL for the OOT credential provider,
// or an empty string if the OOT provider is not to be used
func CredentialProviderURL(kubernetesVersion, arch string) string {
	minorVersion := semver.MustParse(kubernetesVersion).Minor
	if minorVersion < 30 {
		return ""
	}
	// ... (truncated: for 1.30+ the function goes on to build and return the provider download URL)
Looking at this code for the out-of-tree (OOT) provider: when CredentialProviderURL returns an empty string, we default to not including the OOT provider settings at all.
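For context, enabling the OOT provider also means the kubelet must be started with the standard credential-provider flags; the paths below are illustrative assumptions, not the exact AKS wiring:
# Standard kubelet flags for an out-of-tree credential provider (paths are assumptions)
--image-credential-provider-config=/var/lib/kubelet/credential-provider-config.yaml
--image-credential-provider-bin-dir=/var/lib/kubelet/credential-provider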
Feature | Default | Stage | Since | Until
---|---|---|---|---
DisableKubeletCloudCredentialProviders | false | Alpha | 1.23 | 1.28
DisableKubeletCloudCredentialProviders | true | Beta | 1.29 | –
We don't have that logic conditionally enabled for 1.29, so authenticated pulls will not work on that specific Kubernetes version.
This can be easily fixed by either:
A) Switching 1.29 to use the out-of-tree credential provider (have Karpenter pass in the rest of the required OOT provider kubelet flags), or
B) Adding a default of false for the feature gate on 1.29 clusters.
I believe it's best we use option A.
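For reference, option B would roughly amount to passing the gate explicitly to the kubelet on 1.29 nodes (a sketch, not the merged fix):
--feature-gates=DisableKubeletCloudCredentialProviders=false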
Merged in the fix; it still needs to be released, so I'm keeping this open for tracking.
@Bryce-Soghigian - do you know when this will be released? I am trying to triangulate if I should wait or go back to standard node pools for now.
@vikas-rajvanshy It's in the current release that's rolling out. I believe you can track it via the AKS release tracker: https://releases.aks.azure.com/. The fix is part of the v20240827 release.
Version
Karpenter Version: v0.5.0
Kubernetes Version: v1.29.4
Expected Behavior
The expected behavior is that the nodes can access the private ACR using the configured managed identity.
Actual Behavior
Nodes created by Karpenter and regular Kubernetes nodes both have the same managed identity configured. This managed identity has been granted both AcrPull and AcrPush roles on the ACR. However, while pods on regular Kubernetes nodes can successfully pull images from the private ACR, pods on nodes created by Karpenter fail with the following error: 401 Unauthorized
Steps to Reproduce the Problem
az aks update -n aks-dev -g rg-dev --attach-acr myregistry