Azure / AML-Kubernetes

AzureML customer managed k8s compute samples
MIT License

Registering Microsoft.AzureML.Kubernetes fails with Authorization error #280

Closed JarkoDubbeldam closed 1 year ago

JarkoDubbeldam commented 1 year ago

I am trying to connect a brand new AKS cluster (https://learn.microsoft.com/en-us/azure/aks/learn/quick-kubernetes-deploy-cli) to Azure ML. However, the step where I have to install the ML extension into AKS fails with an error I can't find anywhere in the troubleshooting guides. If this is the wrong repository for the error, I'm sorry.

PS H:\> az k8s-extension create --name azureml --extension-type Microsoft.AzureML.Kubernetes --config enableTraining=True enableInference=True inferenceRouterServiceType=LoadBalancer allowInsecureConnections=True inferenceLoadBalancerHA=False --cluster-type managedClusters --cluster-name myAKSCluster --resource-group hosting-pocs --scope cluster
Troubleshooting: https://aka.ms/arcmltsg
SSL is not enabled. Allowing insecure connections to the deployed services.
(ExtensionOperationFailed) The extension operation failed with the following error:  Request failed to https://management.azure.com/subscriptions/aa6994bb-1b63-4ec8-bc7c-63cc99c5c56a/resourceGroups/hosting-pocs/providers/Microsoft.ContainerService/managedclusters/myAKSCluster/extensionaddons/azureml?api-version=2021-03-01. Error code: Unauthorized. Reason: Unauthorized.{"error":{"code":"InvalidAuthenticationToken","message":"The received access token is not valid: at least one of the claims 'puid' or 'altsecid' or 'oid' should be present. If you are accessing as application please make sure service principal is properly created in the tenant."}}.

It's unclear to me what authorization is an issue here.

Version info:

PS H:\> az version
{
  "azure-cli": "2.38.0",
  "azure-cli-core": "2.38.0",
  "azure-cli-telemetry": "1.0.6",
  "extensions": {
    "aks-preview": "0.5.29",
    "containerapp": "0.3.7",
    "datafactory": "0.5.0",
    "k8s-extension": "1.4.0"
  }
}

jiaochenlu commented 1 year ago

Hi @JarkoDubbeldam, are you using a service principal for your AKS cluster to access other Azure Active Directory (Azure AD) resources? We do not currently support service principals with AKS; for more detail, see the limitations of the azureml-extension.

JarkoDubbeldam commented 1 year ago

I created the cluster with the --enable-managed-identity flag in the Azure CLI, so as far as I can tell that should be in the clear.
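For anyone wanting to verify the identity type of an existing cluster rather than rely on the creation flag, here is a minimal sketch. It assumes the JSON printed by `az aks show -g <resource-group> -n <cluster-name> -o json` contains an `identity` block with a `type` field; the function name `uses_managed_identity` is hypothetical and not part of any Azure tooling.

```python
import json


def uses_managed_identity(cluster_json: str) -> bool:
    """Return True if the cluster JSON reports a managed identity.

    Assumption: the input is the JSON output of `az aks show ... -o json`,
    where `identity.type` is "SystemAssigned", "UserAssigned", or "None".
    """
    cluster = json.loads(cluster_json)
    # The identity block may be missing or null on service-principal clusters.
    identity = cluster.get("identity") or {}
    identity_type = (identity.get("type") or "None").lower()
    return identity_type in {"systemassigned", "userassigned"}
```

Saving the `az aks show` output to a file and passing its contents to this function gives a quick yes/no answer before installing the extension.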

alipek commented 1 year ago

Hi @jiaochenlu, I have the same issue. I created the AKS cluster with the Python SDK using the code below, then installed the extension with the Azure CLI.

from azure.core.polling import LROPoller
from azure.mgmt.containerservice import ContainerServiceClient


def create_cluster(
        subscription_id: str,
        resource_group: str,
        cluster_name: str,
        location: str,
        app_id: str,      # not used below: the cluster gets a system-assigned identity
        app_secret: str,  # not used below
):
    # _manage_client_factory is a helper that is not shown in this post.
    client: ContainerServiceClient = _manage_client_factory(subscription_id)

    mc_models = client.managed_clusters.models
    poller: LROPoller = client.managed_clusters.begin_create_or_update(
        resource_group,
        cluster_name,
        parameters=mc_models.ManagedCluster(
            identity=mc_models.ManagedClusterIdentity(
                type=mc_models.ResourceIdentityType.system_assigned
            ),
            location=location,
            dns_prefix=cluster_name,
            agent_pool_profiles=[
                # System pool: one regular-priority node.
                mc_models.ManagedClusterAgentPoolProfile(
                    name="default1",
                    count=1,
                    vm_size=mc_models.ContainerServiceVMSizeTypes.STANDARD_B2_S,
                    mode=mc_models.AgentPoolMode.SYSTEM,
                    scale_set_priority=mc_models.ScaleSetPriority.REGULAR,
                ),
                # User pool: spot GPU nodes, starting at zero.
                mc_models.ManagedClusterAgentPoolProfile(
                    name="gpuproc1",
                    count=0,
                    vm_size='Standard_NC4as_T4_v3',
                    mode=mc_models.AgentPoolMode.USER,
                    scale_set_priority=mc_models.ScaleSetPriority.SPOT,
                )
            ],
            # addon_profiles=[
            #     mc_models.ManagedClusterAddonProfile(
            #         enabled=True,
            #         config=
            #     ),
            # ],
        ),
    )

    # Block until the create/update operation finishes.
    managed_cluster = poller.result()

$ az k8s-extension create --cluster-name [cluster-name] --cluster-type managedClusters --resource-group [resource-group-name] --scope cluster --extension-type Microsoft.AzureML.Kubernetes  --name azure-ml --config   enableTraining=False enableInference=True inferenceRouterServiceType=LoadBalancer allowInsecureConnections=True inferenceLoadBalancerHA=False --cluster-type managedClusters

Troubleshooting: https://aka.ms/arcmltsg
SSL is not enabled. Allowing insecure connections to the deployed services.
'Extensions' cannot be used because 'Microsoft.KubernetesConfiguration' provider has not been registered.More details for registering this provider can be found here - https://aka.ms/RegisterKubernetesConfigurationProvider
(ExtensionOperationFailed) The extension operation failed with the following error:  Request failed to https://management.azure.com/subscriptions/[subscription_id]/resourceGroups/smarteye-ml/providers/Microsoft.ContainerService/managedclusters/[cluster-name]/extensionaddons/azure-ml?api-version=2021-03-01. Error code: Unauthorized. Reason: Unauthorized.{"error":{"code":"InvalidAuthenticationToken","message":"The received access token is not valid: at least one of the claims 'puid' or 'altsecid' or 'oid' should be present. If you are accessing as application please make sure service principal is properly created in the tenant."}}.
Code: ExtensionOperationFailed
Message: The extension operation failed with the following error:  Request failed to https://management.azure.com/subscriptions/[subscription_id]/resourceGroups/smarteye-ml/providers/Microsoft.ContainerService/managedclusters/ne-aks-mlflow-sandbox/extensionaddons/azure-ml?api-version=2021-03-01. Error code: Unauthorized. Reason: Unauthorized.{"error":{"code":"InvalidAuthenticationToken","message":"The received access token is not valid: at least one of the claims 'puid' or 'altsecid' or 'oid' should be present. If you are accessing as application please make sure service principal is properly created in the tenant."}}.

xinyuezhang1 commented 1 year ago

@JarkoDubbeldam @alipek It seems that this issue occurred before the k8s-extension was installed. We have contacted the relevant team to investigate the cause of the error, and I will follow up here.

NarayanThiru commented 1 year ago

This happens if the Microsoft.KubernetesConfiguration resource provider is not registered for the subscription. Please register this resource provider in your subscription and confirm that the registration status changes to 'Registered'.
Then create a new AKS cluster and install the extension.
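The check-then-register step described here could be scripted roughly as follows. This is a sketch, not an official procedure: it assumes the Azure CLI is on the PATH and logged in, and that `az provider show` returns JSON with a `registrationState` field. The helper names `is_registered` and `ensure_registered` are hypothetical.

```python
import json
import subprocess

NAMESPACE = "Microsoft.KubernetesConfiguration"


def is_registered(provider_json: str) -> bool:
    """Return True if `az provider show` JSON reports the provider as Registered."""
    provider = json.loads(provider_json)
    return provider.get("registrationState") == "Registered"


def ensure_registered(namespace: str = NAMESPACE) -> bool:
    """Return True if already registered; otherwise start registration and return False.

    Assumption: `az` is callable as-is. On Windows the CLI is typically
    installed as `az.cmd`, so the executable name may need adjusting.
    """
    shown = subprocess.run(
        ["az", "provider", "show", "--namespace", namespace, "-o", "json"],
        check=True, capture_output=True, text=True,
    )
    if is_registered(shown.stdout):
        return True
    # Registration is asynchronous; re-run the check until it reports Registered.
    subprocess.run(["az", "provider", "register", "--namespace", namespace], check=True)
    return False
```

Calling `ensure_registered()` before creating the cluster returns `False` the first time if registration had to be started; calling it again later confirms when the state has changed to Registered.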

xinyuezhang1 commented 1 year ago

Thanks for the solution provided by @NarayanThiru.

@JarkoDubbeldam @alipek Hi, have you been able to create the k8s-extension successfully now? Can we consider this issue mitigated?

alipek commented 1 year ago

Thanks @NarayanThiru, this works now that I have registered the resource provider.

JarkoDubbeldam commented 1 year ago

That provider was already registered in my subscription. I did recreate the entire cluster just now, and for some reason it works now. Not sure what changed exactly. I guess this can stay closed. Below is my complete script, for completeness' sake:

az provider show -n Microsoft.KubernetesConfiguration -o table
az group create -n aks-poc --location westeurope
az aks create -g aks-poc -n myAKSCluster --enable-managed-identity `
    --node-count 1 --enable-addons monitoring --enable-msi-auth-for-monitoring  `
    --generate-ssh-keys
az k8s-extension create --name azureml --extension-type Microsoft.AzureML.Kubernetes `
    --config enableTraining=True enableInference=True inferenceRouterServiceType=LoadBalancer `
    allowInsecureConnections=True inferenceLoadBalancerHA=False --cluster-type managedClusters `
    --cluster-name myAKSCluster --resource-group aks-poc --scope cluster
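For readers who drive this from Python instead of PowerShell, the final `az k8s-extension create` call can be assembled as an argument list, which avoids shell-specific line continuations and quoting. This is only a sketch: `build_extension_command` is a hypothetical helper, and it uses only the flags and config keys that appear in the script above.

```python
from typing import Dict, List


def build_extension_command(
    cluster_name: str,
    resource_group: str,
    config: Dict[str, object],
    name: str = "azureml",
) -> List[str]:
    """Build the argument list for `az k8s-extension create` on an AKS cluster."""
    cmd = [
        "az", "k8s-extension", "create",
        "--name", name,
        "--extension-type", "Microsoft.AzureML.Kubernetes",
        "--cluster-type", "managedClusters",
        "--cluster-name", cluster_name,
        "--resource-group", resource_group,
        "--scope", "cluster",
        "--config",
    ]
    # --config takes space-separated key=value pairs, so it is placed last.
    cmd.extend(f"{key}={value}" for key, value in config.items())
    return cmd
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`, for example with `build_extension_command("myAKSCluster", "aks-poc", {"enableTraining": True, "enableInference": True})`.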