Azure / acr

Azure Container Registry samples, troubleshooting tips and references
https://aka.ms/acr

Azure AD latency when attaching ACR - day later replication not completed #695

Closed mloskot closed 1 year ago

mloskot commented 1 year ago

I followed https://learn.microsoft.com/en-us/azure/aks/cluster-container-registry-integration to give my AKS cluster access to my private ACR. Everything seemed to work fine:

image

apart from the Azure replication process not completing. I'm still seeing Identity not found for the two AKS cluster identities that I assigned roles to on my ACR:

image

https://learn.microsoft.com/en-us/azure/aks/cluster-container-registry-integration says:

There is a latency issue with Azure Active Directory groups when attaching ACR (...) there may be a delay before the RBAC group takes effect.

I understand it, but it has been more than 12h since creating the role assignments.

Question: Is it typical to wait that long?


I attempted to troubleshoot the problem following https://learn.microsoft.com/en-us/azure/role-based-access-control/troubleshooting#symptom---role-assignments-with-identity-not-found

$ az role assignment list --scope /subscriptions/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx/resourceGroups/my-acr-contoso/providers/Microsoft.ContainerRegistry/registries/my-contoso
[
  {
    "condition": null,
    "conditionVersion": null,
    "createdBy": "236564d1-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
    "createdOn": "2023-05-22T20:58:53.717018+00:00",
    "delegatedManagedIdentityResourceId": null,
    "description": "",
    "id": "/subscriptions/4629e9b5-xxxx-xxxx-xxxx-xxxxxxxxxxxx/resourceGroups/my-acr-contoso/providers/Microsoft.ContainerRegistry/registries/my-contoso/providers/Microsoft.Authorization/roleAssignments/fa76709d-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
    "name": "fa76709d-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
    "principalId": "2f850a88-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
    "principalName": "",
    "principalType": "ServicePrincipal",
    "resourceGroup": "my-acr-contoso",
    "roleDefinitionId": "/subscriptions/4629e9b5-xxxx-xxxx-xxxx-xxxxxxxxxxxx/providers/Microsoft.Authorization/roleDefinitions/7f951dda-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
    "roleDefinitionName": "AcrPull",
    "scope": "/subscriptions/4629e9b5-xxxx-xxxx-xxxx-xxxxxxxxxxxx/resourceGroups/my-acr-contoso/providers/Microsoft.ContainerRegistry/registries/my-contoso",
    "type": "Microsoft.Authorization/roleAssignments",
    "updatedBy": "236564d1-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
    "updatedOn": "2023-05-22T20:58:53.717018+00:00"
  },
  {
    "condition": null,
    "conditionVersion": null,
    "createdBy": "236564d1-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
    "createdOn": "2023-05-22T20:58:53.770669+00:00",
    "delegatedManagedIdentityResourceId": null,
    "description": "",
    "id": "/subscriptions/4629e9b5-xxxx-xxxx-xxxx-xxxxxxxxxxxx/resourceGroups/my-acr-contoso/providers/Microsoft.ContainerRegistry/registries/my-contoso/providers/Microsoft.Authorization/roleAssignments/f188bf19-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
    "name": "f188bf19-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
    "principalId": "7e49d45b-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
    "principalName": "",
    "principalType": "ServicePrincipal",
    "resourceGroup": "my-acr-contoso",
    "roleDefinitionId": "/subscriptions/4629e9b5-xxxx-xxxx-xxxx-xxxxxxxxxxxx/providers/Microsoft.Authorization/roleDefinitions/7f951dda-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
    "roleDefinitionName": "AcrPull",
    "scope": "/subscriptions/4629e9b5-xxxx-xxxx-xxxx-xxxxxxxxxxxx/resourceGroups/my-acr-contoso/providers/Microsoft.ContainerRegistry/registries/my-contoso",
    "type": "Microsoft.Authorization/roleAssignments",
    "updatedBy": "236564d1-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
    "updatedOn": "2023-05-22T20:58:53.770669+00:00"
  }
]

Question: Does this empty principalName indicate I should keep waiting?


I also attempted the troubleshooting according to https://learn.microsoft.com/en-us/troubleshoot/azure/azure-kubernetes/cannot-pull-image-from-acr-to-aks-cluster

$ az aks check-acr --resource-group my-aks-contoso-dev --name aks-contoso-uks-dev-aks --acr my-contoso.azurecr.io
Merged "aks-contoso-uks-dev-aks" as current context in C:\Users\mateuszl\AppData\Local\Temp\tmpgrz7ae2s
[2023-05-23T11:15:33Z] Checking host name resolution (my-contoso.azurecr.io): SUCCEEDED
[2023-05-23T11:15:33Z] Canonical name for ACR (my-contoso.azurecr.io): r0509uks.uksouth.cloudapp.azure.com.
[2023-05-23T11:15:33Z] ACR location: uksouth
[2023-05-23T11:15:33Z] Checking managed identity...
[2023-05-23T11:15:33Z] Kubelet managed identity client ID: 7e49d45b-xxxx-xxxx-xxxx-xxxxxxxxxxxx
[2023-05-23T11:15:33Z] Validating managed identity existance: SUCCEEDED
[2023-05-23T11:15:35Z] Validating image pull permission: FAILED
[2023-05-23T11:15:35Z] ACR my-contoso.azurecr.io rejected token exchange: ACR token exchange endpoint returned error status: 401. body:

$ az role assignment list --scope /subscriptions/4629e9b5-xxxx-xxxx-xxxx-xxxxxxxxxxxx/resourceGroups/my-acr-contoso/providers/Microsoft.ContainerRegistry/registries/my-contoso --output table
Principal   Role    Scope
----------- ------- ---------------------------------------------------------------------------------------------------------------------------------------------
            AcrPull /subscriptions/4629e9b5-xxxx-xxxx-xxxx-xxxxxxxxxxxx/resourceGroups/my-acr-contoso/providers/Microsoft.ContainerRegistry/registries/my-contoso
            AcrPull /subscriptions/4629e9b5-xxxx-xxxx-xxxx-xxxxxxxxxxxx/resourceGroups/my-acr-contoso/providers/Microsoft.ContainerRegistry/registries/my-contoso

Question: Does this empty Principal also indicate I should keep waiting?

mloskot commented 1 year ago

Interestingly, when I go to portal.azure.com > my container registry > Access control (IAM) > Check access > Find > Managed identity > User-assigned managed identity and select the aks-*-agentpool identity that corresponds to the "principalId": "2f850a88-xxxx-xxxx-xxxx-xxxxxxxxxxxx" above, I get a very different result from the az role assignment list command above:

image

mloskot commented 1 year ago

Solved!

I had assigned the role using the Client ID of the aks-*-agentpool managed identity of my AKS clusters instead of its Object (principal) ID:

resource "azurerm_role_assignment" "aks_acr_pull_allowed" {
  # Mistake: this was set to the Client ID of the AKS managed identity
  # instead of its Object (principal) ID.
  principal_id         = ...
  role_definition_name = "AcrPull"
  ...
}

As soon as I corrected my Terraform code and applied it, my ACR showed the expected identities and my AKS clusters could pull images from my ACR.
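
For anyone hitting the same symptom, here is a minimal sketch of how the two IDs could be told apart with the Azure CLI. It is not part of the fix above, which was made in Terraform. The function names (kubelet_object_id, kubelet_client_id, classify_principal_id) are hypothetical helpers introduced only for illustration; the sketch assumes a logged-in az session and a cluster whose kubelet identity is exposed under identityProfile.kubeletidentity.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, arguments are supplied by the caller.

# Print the Object (principal) ID of the cluster's kubelet managed identity.
# Usage: kubelet_object_id <resource-group> <cluster-name>
kubelet_object_id() {
  az aks show --resource-group "$1" --name "$2" \
    --query "identityProfile.kubeletidentity.objectId" --output tsv
}

# Print the Client ID of the same identity.
# Usage: kubelet_client_id <resource-group> <cluster-name>
kubelet_client_id() {
  az aks show --resource-group "$1" --name "$2" \
    --query "identityProfile.kubeletidentity.clientId" --output tsv
}

# Report which of the two IDs a role assignment's principalId matches.
# Prints "object-id", "client-id" or "unknown".
# Usage: classify_principal_id <principalId> <resource-group> <cluster-name>
classify_principal_id() {
  if [ "$1" = "$(kubelet_object_id "$2" "$3")" ]; then
    echo "object-id"
  elif [ "$1" = "$(kubelet_client_id "$2" "$3")" ]; then
    echo "client-id"
  else
    echo "unknown"
  fi
}
```

Called as classify_principal_id <principalId from az role assignment list> <resource-group> <cluster-name>, a result of client-id would point to the mistake described here. In the output above, the principalId 7e49d45b-xxxx-xxxx-xxxx-xxxxxxxxxxxx of the second role assignment has the same visible prefix as the Kubelet managed identity client ID reported by az aks check-acr, which is consistent with that diagnosis.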

Apologies for the false issue report.

OTOH, this could be added to the catalogue of issues in the troubleshooting guide :)

I owe huge thanks to @alexeldeib for his great help via #provider-azure channel on Kubernetes Slack.