Azure / AML-Kubernetes

AzureML customer managed k8s compute samples

Cannot Attach AKS Cluster with User Assigned Workspace Identity #279

Closed jroskens-mgm closed 1 year ago

jroskens-mgm commented 1 year ago

I am unable to find a working method of attaching an AKS cluster when an ML workspace was provisioned with a user-assigned identity.

I followed the documentation under User-assigned managed identity to assign the appropriate roles over the Key Vault, Storage Account, ACR and App Insights resources I created ahead of time. I then assigned the “Reader” role to the identity over the AKS cluster scope as mentioned under the Prerequisite section. However, following this documentation does not result in a successful attach.

I originally attempted this by terraforming all resources, but switched over to using the CLI after opening a support case so it was easier to share the repro steps.

  1. Create an AKS cluster
    
    # Create resource group
    az group create --name rg-aks --location westus

    # Create VNET
    VNET_ID=$(az network vnet create --name vnet-aks --resource-group rg-aks --location westus --address-prefix 10.0.0.0/20 --subnet-name subnet-aks --subnet-prefixes 10.0.0.0/24 --query newVNet.id -otsv)
    SUBNET_ID=$(az network vnet subnet show -g rg-aks -n subnet-aks --vnet-name vnet-aks --query id -otsv)

    # Create AKS Control Plane Identity
    AKS_PRINCIPAL_ID=$(az identity create -g rg-aks -n identity-aks --query principalId -otsv)
    AKS_IDENTITY_ID=$(az identity show -g rg-aks -n identity-aks --query id -otsv)

    # Create Kubelet Identity
    KUBELET_ID=$(az identity create -g rg-aks -n identity-kubelet --query id -otsv)

    # Hack to avoid "Cannot find user or service principal in graph database" which can happen if you try to assign roles immediately after creating the identity
    sleep 30

    # Assign Managed Identity Operator role to AKS Control Plane Identity over Kubelet identity
    az role assignment create --assignee $AKS_PRINCIPAL_ID --role "Managed Identity Operator" --scope "$KUBELET_ID"

    # Assign Network Contributor to AKS Control Plane / Cluster Identity for the subnet AKS will be assigned.
    az role assignment create --assignee $AKS_PRINCIPAL_ID --role "Network Contributor" --scope "$VNET_ID"

    # Create the AKS cluster
    az aks create \
        --resource-group rg-aks \
        --name aks-ml-cluster \
        --network-plugin kubenet \
        --vnet-subnet-id $SUBNET_ID \
        --docker-bridge-address 172.17.0.1/16 \
        --dns-service-ip 10.2.0.10 \
        --service-cidr 10.2.0.0/24 \
        --enable-managed-identity \
        --assign-identity $AKS_IDENTITY_ID \
        --assign-kubelet-identity $KUBELET_ID \
        --node-count 1 \
        --generate-ssh-keys

    # Install the k8s-extension
    az k8s-extension create --name aml-extension \
        --extension-type Microsoft.AzureML.Kubernetes \
        --scope cluster \
        --cluster-name aks-ml-cluster \
        --resource-group rg-aks \
        --config enableTraining=True \
        enableInference=True \
        enableTraining=False \
        allowInsecureConnections=True \
        inferenceRouterServiceType=loadBalancer \
        inferenceRouterHA=false \
        internalLoadBalancerProvider=azure \
        --cluster-type managedClusters


  2. Create Resource Group, Key Vault, Storage Account, ACR and App Insights resources

```
# Create resource group for the ML workspace
az group create --name rg-ml-workspace --location westus

# Create Workspace Identity
WORKSPACE_PRINCIPAL_ID=$(az identity create -g rg-ml-workspace -n identity-ml-workspace --query principalId -otsv)
WORKSPACE_IDENTITY_ID=$(az identity show -g rg-ml-workspace -n identity-ml-workspace --query id -otsv)

# Create ACR
rnd=$((10000 + $RANDOM % 99999))
ACR_ID=$(az acr create -n acrml${rnd} -g rg-ml-workspace --sku Standard --admin-enabled --query id -otsv)

# Create App Insights
APP_INSIGHTS_ID=$(az monitor app-insights component create --app app-insights-ml --location westus -g rg-ml-workspace --retention-time 30 --query id -otsv)

# Create Storage Account
rnd=$((10000 + $RANDOM % 99999))
STORAGE_ID=$(az storage account create -n samlworkspace${rnd} -g rg-ml-workspace -l westus --sku Standard_LRS --allow-blob-public-access false --query id -otsv)

# Create Key Vault
rnd=$((10000 + $RANDOM % 99999))
KEYVAULT_ID=$(az keyvault create --name "keyvaultml${rnd}" --resource-group rg-ml-workspace --location westus --enable-rbac-authorization true --query id -otsv)
```

  3. Create User Identity for ML Workspace and assign roles from User-assigned managed identity
    # Contributor over Resource Group
    az role assignment create --assignee "$WORKSPACE_PRINCIPAL_ID" --role "Contributor" --resource-group rg-ml-workspace
    # Assign Contributor over Storage Account
    az role assignment create --assignee "$WORKSPACE_PRINCIPAL_ID" --role "Contributor" --scope "$STORAGE_ID"
    # Assign "Storage Blob Data Contributor" over Storage Account
    az role assignment create --assignee "$WORKSPACE_PRINCIPAL_ID" --role "Storage Blob Data Contributor" --scope "$STORAGE_ID"
    # Assign "Key Vault Administrator" over Key Vault
    az role assignment create --assignee "$WORKSPACE_PRINCIPAL_ID" --role "Key Vault Administrator" --scope "$KEYVAULT_ID"
    # Assign Contributor over ACR
    az role assignment create --assignee "$WORKSPACE_PRINCIPAL_ID" --role "Contributor" --scope "$ACR_ID"
  4. Create the ML workspace in the portal, selecting the resources and identity created in steps 1-3 and following User-assigned managed identity. (There doesn’t seem to be a documented way to accomplish this with the CLI.)
  5. Assign the “Reader” role to the ML user identity from step 3 over the AKS cluster scope, as mentioned under the Prerequisite section.
    AKS_ID=$(az aks show -g rg-aks -n aks-ml-cluster --query id -otsv)
    az role assignment create --assignee "$WORKSPACE_PRINCIPAL_ID" --role "Reader" --scope "$AKS_ID"
  6. Attach the cluster as a compute resource.
    az ml compute attach --resource-group rg-ml-workspace --workspace-name ml-workspace --type Kubernetes \
    --name ml-inference \
    --resource-id "$AKS_ID"

    The above immediately fails with the error:

    (BadRequest) AKS role check failed for user assigned identity. Please check the role assignment.
    Code: BadRequest
    Message: AKS role check failed for user assigned identity. Please check the role assignment.

    I eventually got past the "AKS role check failed" error by assigning both the "Reader" and "Azure Kubernetes Service Cluster Admin Role" roles. I added the admin role because that's what I observed Azure doing automatically for its MSI when attaching. However, this still results in a failure, although a different one.

(BadRequest) Azure Machine Learning extension is not installed in this cluster /subscriptions/<Subscription ID>/resourcegroups/rg-aks/providers/Microsoft.ContainerService/managedClusters/aks-ml-cluster.
Code: BadRequest
Message: Azure Machine Learning extension is not installed in this cluster /subscriptions/<Subscription ID>/resourcegroups/rg-aks/providers/Microsoft.ContainerService/managedClusters/aks-ml-cluster.

These issues only seem to occur when bringing your own identity to an ML workspace. If you simply create an ML workspace and allow it to create its own Managed System Identity, then the cluster can be attached without any issues.
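The `sleep 30` in step 1 works around directory replication delay after a new identity is created. One possible alternative, sketched here, is to pass the identity's object ID with `--assignee-object-id` and `--assignee-principal-type ServicePrincipal`, which the Azure CLI offers to avoid the directory lookup, and to retry a few times if the assignment still fails. `assign_role_with_retry`, `RETRY_ATTEMPTS` and `RETRY_DELAY` are hypothetical names introduced for this illustration, and the default retry count and delay are arbitrary.

```shell
# Hypothetical helper (not part of the Azure CLI): assign a role to a freshly
# created managed identity, retrying on failure instead of sleeping up front.
# Usage: assign_role_with_retry <principal-object-id> <role> <scope>
assign_role_with_retry() {
  local principal_id="$1" role="$2" scope="$3"
  # Assumed defaults; tune for your environment.
  local attempts="${RETRY_ATTEMPTS:-5}" delay="${RETRY_DELAY:-10}"
  local i=1
  while [ "$i" -le "$attempts" ]; do
    if az role assignment create \
        --assignee-object-id "$principal_id" \
        --assignee-principal-type ServicePrincipal \
        --role "$role" \
        --scope "$scope" >/dev/null; then
      return 0
    fi
    # Only wait if another attempt will follow.
    if [ "$i" -lt "$attempts" ]; then
      echo "role assignment attempt $i/$attempts failed; retrying in ${delay}s" >&2
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}
```

With the variables from step 1, a call might look like `assign_role_with_retry "$AKS_PRINCIPAL_ID" "Network Contributor" "$VNET_ID"`.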

siyuZL commented 1 year ago

@jroskens-mgm I tried the steps you gave and successfully attached the cluster. Can you check the user assigned identity for the workspace? Also, judging by the last error block, I think your extension did not install successfully; you can try re-installing the extension.

jroskens-mgm commented 1 year ago

@siyuZL I deleted both resource groups and recreated everything following the steps exactly as I have them here. The result was almost the same as before. The only difference being that it took several minutes before I received the error instead of failing immediately.

(BadRequest) AKS role check failed for user assigned identity. Please check the role assignment.
Code: BadRequest
Message: AKS role check failed for user assigned identity. Please check the role assignment.

When you tried to recreate this issue, did you select the identity that is created in step 3? [image]

If I skip this step, and create an ML workspace with an MSI, then I can attach the AKS cluster without any issue. This is also why I believe the extension is installed successfully (in addition to it showing as succeeded).

  "extensionType": "microsoft.azureml.kubernetes",
  "id": "/subscriptions/<subscriptionid>/resourceGroups/rg-aks/providers/Microsoft.ContainerService/managedClusters/aks-ml-cluster/providers/Microsoft.KubernetesConfiguration/extensions/aml-extension",
  "identity": null,
  "isSystemExtension": false,
  "name": "aml-extension",
  "packageUri": null,
  "plan": null,
  "provisioningState": "Succeeded",
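The same check can be reduced to a single value by querying only `provisioningState`. A minimal sketch, assuming the `k8s-extension` CLI extension is installed and using the resource names from the steps above; `extension_ready` is a hypothetical helper name, not an Azure CLI command.

```shell
# Hypothetical helper: succeed only if the extension reports "Succeeded".
# Usage: extension_ready <resource-group> <cluster-name> <extension-name>
extension_ready() {
  local state
  # Ask the CLI for just the provisioning state as plain text.
  state=$(az k8s-extension show \
    --resource-group "$1" \
    --cluster-name "$2" \
    --cluster-type managedClusters \
    --name "$3" \
    --query provisioningState -o tsv) || return 1
  echo "provisioningState: $state"
  [ "$state" = "Succeeded" ]
}
```

For the cluster in this thread the call would be `extension_ready rg-aks aks-ml-cluster aml-extension`. A "Succeeded" state only says the extension deployed; it does not rule out the role problems discussed here.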

But, in the interest of science I created another ML workspace without specifying a user identity and attempted to attach the same AKS cluster that failed to attach above.

az ml workspace create -n ml-workspace-msi -g rg-ml-workspace \
    --set storage_account="$STORAGE_ID" \
        key_vault="$KEYVAULT_ID" \
        application_insights="$APP_INSIGHTS_ID" \
        container_registry="$ACR_ID"
AKS_ID=$(az aks show -g rg-aks -n aks-ml-cluster --query id -otsv)
az ml compute attach --resource-group rg-ml-workspace --workspace-name ml-workspace-msi --type Kubernetes \
    --name ml-inference \
    --resource-id "$AKS_ID"

The cluster was attached without any issues.

{
  "id": "/subscriptions/<subscription id>/resourceGroups/rg-ml-workspace/providers/Microsoft.MachineLearningServices/workspaces/ml-workspace-msi/computes/ml-inference",
  "location": "westus",
  "name": "ml-inference",
  "namespace": "default",
  "properties": {
    "default_instance_type": "defaultinstancetype",
    "extension_instance_release_train": "stable",
    "extension_principal_id": "031dcc21-9ee1-4004-b660-25c211f3ca34",
    "instance_types": {
      "defaultinstancetype": {
        "resources": {
          "limits": {
            "cpu": "2",
            "memory": "2Gi",
            "nvidia.com/gpu": null
          },
          "requests": {
            "cpu": "0.1",
            "memory": "500Mi",
            "nvidia.com/gpu": null
          }
        }
      }
    },
    "namespace": "default"
  },
  "provisioning_state": "Succeeded",
  "resourceGroup": "rg-ml-workspace",
  "resource_id": "/subscriptions/<subscription id>/resourcegroups/rg-aks/providers/Microsoft.ContainerService/managedClusters/aks-ml-cluster",
  "type": "kubernetes"
}

In the interest of clarity, I want to state that this does not resolve the issue. Attaching an AKS cluster does work correctly when the ML workspace is configured to use a managed system identity. However, I am unable to attach an AKS cluster to an ML workspace configured with a user assigned identity.

siyuZL commented 1 year ago

Hi @jroskens-mgm, I checked this scenario twice more and found a solution. Can you try giving your workspace's user assigned identity these roles?

  1. Grant the Kubernetes Extension Contributor role on the "aks cluster" or "resource group".

  2. Grant the Azure Kubernetes Service Cluster Admin Role on the "aks cluster".

jroskens-mgm commented 1 year ago

@siyuZL - Still seeing the same error. I recreated everything from scratch, including the AKS cluster, and applied those two roles.

az ml compute attach --resource-group rg-ml-workspace --workspace-name ml-workspace --type Kubernetes \
    --name ml-inference \
    --resource-id "$AKS_ID"
(BadRequest) AKS role check failed for user assigned identity. Please check the role assignment.
Code: BadRequest
Message: AKS role check failed for user assigned identity. Please check the role assignment.

Here are the roles currently assigned to the identity of the ML workspace.

# Get the principal ID of the workspace's user assigned identity
WORKSPACE_PRINCIPAL_ID=$(az ml workspace show --resource-group rg-ml-workspace --name ml-workspace --query "(identity.user_assigned_identities.*.principal_id)[0]" -otsv)

# Display assigned roles for the Workspace's assigned user
az role assignment list --all --assignee "$WORKSPACE_PRINCIPAL_ID" --query "[].{roleDefinitionName:roleDefinitionName, scope:scope}" -o table
RoleDefinitionName                           Scope
-------------------------------------------  -------------------------------------------------------------------------------------------------------------------------------------------------
Contributor                                  /subscriptions/<subscription id>/resourceGroups/rg-ml-workspace
Contributor                                  /subscriptions/<subscription id>/resourceGroups/rg-ml-workspace/providers/Microsoft.Storage/storageAccounts/samlworkspace20092
Storage Blob Data Contributor                /subscriptions/<subscription id>/resourceGroups/rg-ml-workspace/providers/Microsoft.Storage/storageAccounts/samlworkspace20092
Key Vault Administrator                      /subscriptions/<subscription id>/resourceGroups/rg-ml-workspace/providers/Microsoft.KeyVault/vaults/keyvaultml37041
Contributor                                  /subscriptions/<subscription id>/resourceGroups/rg-ml-workspace/providers/Microsoft.ContainerRegistry/registries/acrml37396
Reader                                       /subscriptions/<subscription id>/resourcegroups/rg-aks/providers/Microsoft.ContainerService/managedClusters/aks-ml-cluster
Kubernetes Extension Contributor             /subscriptions/<subscription id>/resourcegroups/rg-aks/providers/Microsoft.ContainerService/managedClusters/aks-ml-cluster
Azure Kubernetes Service Cluster Admin Role  /subscriptions/<subscription id>/resourcegroups/rg-aks/providers/Microsoft.ContainerService/managedClusters/aks-ml-cluster
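A listing like the one above can also be checked from a script rather than by eye. The sketch below reads tab-separated `role<TAB>scope` lines and tests whether a given role is assigned at exactly a given scope. It compares case-insensitively because the scopes in the table mix `resourceGroups` and `resourcegroups`. `has_role_at_scope` is a hypothetical helper name introduced for this illustration.

```shell
# Hypothetical helper: read "role<TAB>scope" lines on stdin and succeed if
# the given role is assigned at exactly the given scope.
# Usage: ... | has_role_at_scope <role-name> <scope-id>
has_role_at_scope() {
  awk -F '\t' -v role="$1" -v scope="$2" '
    # Azure resource IDs can differ in letter case, so normalise both sides.
    tolower($1) == tolower(role) && tolower($2) == tolower(scope) { found = 1 }
    END { exit !found }
  '
}
```

One way to feed it is `az role assignment list --all --assignee "$WORKSPACE_PRINCIPAL_ID" --query "[].[roleDefinitionName, scope]" -o tsv | has_role_at_scope "Reader" "$AKS_ID"`. A role that is present only at a narrower or wider scope than the one asked for is reported as missing.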

jroskens-mgm commented 1 year ago

I got it to work. The Workspace Identity must be granted the "Kubernetes Extension Contributor" role on the AKS cluster's resource group. Granting it on the cluster alone isn't enough.

It seems these are the minimum roles and scopes that must be added to the ML Workspace's User Assigned Identity in order to attach a cluster successfully.

RoleDefinitionName                           Scope
-------------------------------------------  -------------------------------------------------------------------------------------------------------------------------------------------------
Reader                                       AKS Cluster
Azure Kubernetes Service Cluster Admin Role  AKS Cluster
Kubernetes Extension Contributor             Resource Group of AKS Cluster

Glad I have it working now. I have to ask though, is this an undocumented requirement or a bug? The Prerequisite documentation lists only the Reader role as required. Needing Cluster Admin is much more than that.
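The three role assignments in the table above could be scripted roughly as follows. This is a sketch based only on the roles and scopes reported in this thread, not an official procedure; `grant_attach_roles` is a hypothetical helper name, and the principal ID is that of the workspace's user assigned identity.

```shell
# Hypothetical helper: grant the role assignments listed in the table above.
# Usage: grant_attach_roles <workspace-identity-principal-id> <aks-rg> <aks-name>
grant_attach_roles() {
  local principal_id="$1" aks_rg="$2" aks_name="$3"
  local aks_id rg_id
  aks_id=$(az aks show -g "$aks_rg" -n "$aks_name" --query id -o tsv) || return 1
  rg_id=$(az group show -n "$aks_rg" --query id -o tsv) || return 1

  # Reader and Cluster Admin are scoped to the cluster itself.
  az role assignment create --assignee "$principal_id" \
    --role "Reader" --scope "$aks_id" || return 1
  az role assignment create --assignee "$principal_id" \
    --role "Azure Kubernetes Service Cluster Admin Role" --scope "$aks_id" || return 1

  # Kubernetes Extension Contributor is scoped to the cluster's resource group.
  az role assignment create --assignee "$principal_id" \
    --role "Kubernetes Extension Contributor" --scope "$rg_id" || return 1
}
```

With the names used earlier, the call would be `grant_attach_roles "$WORKSPACE_PRINCIPAL_ID" rg-aks aks-ml-cluster`, followed by the `az ml compute attach` command from the original steps.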

siyuZL commented 1 year ago

This is an undocumented requirement. We will update the document to fix it. Thank you for verifying, @jroskens-mgm!

jiaochenlu commented 1 year ago

Thanks siyu for the support. The document has been updated; please refer to attach-to-workspace-with-user-assigned-managed-identity.