Azure / azure-cli-extensions

Public Repository for Extensions of Azure CLI.
https://docs.microsoft.com/en-us/cli/azure
MIT License
384 stars 1.25k forks source link

az k8s-extension create --extension-type Microsoft.AzureML.Kubernetes leads to ExtensionOperationFailed when attempting to deploy Inference model #5477

Open RoFz opened 2 years ago

RoFz commented 2 years ago

Describe the bug

Command Name az k8s-extension create Extension Name: k8s-extension. Version: 1.3.5.

Errors:

(ExtensionOperationFailed) The extension operation failed with the following error:  Error: {Helm installation from path [] for release [azuremlsecure] failed with the following error: err [release azuremlsecure failed, and has been uninstalled due to atomic being set: failed pre-install: pod healthcheck failed]} occurred while doing the operation : {Installing the extension} on the config.
Code: ExtensionOperationFailed
Message: The extension operation failed with the following error:  Error: {Helm installation from path [] for release [azuremlsecure] failed with the following error: err [release azuremlsecure failed, and has been uninstalled due to atomic being set: failed pre-install: pod healthcheck failed]} occurred while doing the operation : {Installing the extension} on the config.

To Reproduce:

Steps to reproduce the behavior. Note that argument values have been redacted, as they may contain sensitive information.

Expected Behavior

Error level = 0 (success) and a similar output:

{ "aksAssignedIdentity": { "principalId": "<obj_id>", "tenantId": null, "type": null }, "autoUpgradeMinorVersion": true, "configurationProtectedSettings": { "scoringFe.sslCert": "", "scoringFe.sslKey": "", "sslCertPemFile": "", "sslKeyPemFile": "" }, "configurationSettings": { "allowInsecureConnections": "False", "clusterId": "/subscriptions/<sub_id>/resourceGroups/<rg_name>/providers/Microsoft.ContainerService/managedClusters/<cluster_name>", "clusterPurpose": "DevTest", "cluster_name": "/subscriptions/<sub_id>/resourceGroups/<rg_name>/providers/Microsoft.ContainerService/managedClusters/<cluster_name>", "cluster_name_friendly": "<cluster_name>", "domain": "westeurope.cloudapp.azure.com", "enableInference": "True", "enableTraining": "True", "inferenceRouterHA": "false", "inferenceRouterServiceType": "clusterIP", "jobSchedulerLocation": "westeurope", "location": "westeurope", "nginxIngress.enabled": "true", "prometheus.prometheusSpec.externalLabels.cluster_name": "/subscriptions/<sub_id>/resourceGroups/<rg_name>/providers/Microsoft.ContainerService/managedClusters/<cluster_name>", "relayserver.enabled": "false", "servicebus.enabled": "false", "sslCname": "azureml.<domain>.com" }, "customLocationSettings": null, "errorInfo": null, "extensionType": "microsoft.azureml.kubernetes", "id": "/subscriptions/<sub_id>/resourceGroups/<rg_name>/providers/Microsoft.ContainerService/managedClusters/<cluster_name>/providers/Microsoft.KubernetesConfiguration/extensions/azuremlsecure", "identity": null, "installedVersion": null, "name": "azuremlsecure", "packageUri": null, "provisioningState": "Succeeded", "releaseTrain": "stable", "resourceGroup": "<rg_name>", "scope": { "cluster": { "releaseNamespace": "azureml" }, "namespace": null }, "statuses": [], "systemData": { "createdAt": "2022-10-21T10:33:39.527695+00:00", "createdBy": null, "createdByType": null, "lastModifiedAt": "2022-10-21T10:33:39.527695+00:00", "lastModifiedBy": null, "lastModifiedByType": null }, "type": "Microsoft.KubernetesConfiguration/extensions", "version": "1.1.12" }

Environment Summary

Linux-5.15.0-1017-azure-x86_64-with-glibc2.31, Ubuntu 20.04.5 LTS
Python 3.10.5
Installer: DEB

azure-cli 2.40.0 *

Extensions:
k8s-extension 1.3.5
ml 2.10.0

Dependencies:
msal 1.18.0b1
azure-mgmt-resource 21.1.0b1

Additional Context

none

How can one get access to the underlying helm chart of this cli wrapper to properly troubleshoot this?

ghost commented 2 years ago

Thank you for your feedback. This has been routed to the support team for assistance.

yonzhan commented 2 years ago

route to CXP team

ghost commented 2 years ago

Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @Azure/aks-pm.

Issue Details
## Describe the bug **Command Name** `az k8s-extension create Extension Name: k8s-extension. Version: 1.3.5.` **Errors:** ``` (ExtensionOperationFailed) The extension operation failed with the following error: Error: {Helm installation from path [] for release [azuremlsecure] failed with the following error: err [release azuremlsecure failed, and has been uninstalled due to atomic being set: failed pre-install: pod healthcheck failed]} occurred while doing the operation : {Installing the extension} on the config. Code: ExtensionOperationFailed Message: The extension operation failed with the following error: Error: {Helm installation from path [] for release [azuremlsecure] failed with the following error: err [release azuremlsecure failed, and has been uninstalled due to atomic being set: failed pre-install: pod healthcheck failed]} occurred while doing the operation : {Installing the extension} on the config. ``` ## To Reproduce: Steps to reproduce the behavior. Note that argument values have been redacted, as they may contain sensitive information. - allowInsecureConnections=False `az k8s-extension create \ --name azuremlsecure \ --extension-type Microsoft.AzureML.Kubernetes \ --config enableTraining=True \ enableInference=True \ allowInsecureConnections=False \ inferenceRouterServiceType=clusterIP \ sslCname=azureml..com \ nodeSelector.azureml=true \ --config-protected sslCertPemFile=azureml-cert.pem \ sslKeyPemFile=azureml-key.pem \ --cluster-type managedClusters \ --cluster-name $aksname \ --resource-group $aksrgname \ --scope cluster` - allowInsecureConnections=True `az k8s-extension create \ --name azuremlinsecure \ --extension-type Microsoft.AzureML.Kubernetes \ --config enableTraining=True \ enableInference=True \ allowInsecureConnections=True \ inferenceRouterServiceType=clusterIP \ nodeSelector.azureml=true \ --cluster-type managedClusters \ --cluster-name $aksname \ --resource-group $aksrgname \ --scope cluster` - also tried with inferenceRouterServiceType=loadbalancer to no avail - with enableInference=False, deployment works, but this does not address the business requirement ## Expected Behavior Error level = 0 (success) and a similar output: ` { "aksAssignedIdentity": { "principalId": "", "tenantId": null, "type": null }, "autoUpgradeMinorVersion": true, "configurationProtectedSettings": { "scoringFe.sslCert": "", "scoringFe.sslKey": "", "sslCertPemFile": "", "sslKeyPemFile": "" }, "configurationSettings": { "allowInsecureConnections": "False", "clusterId": "/subscriptions//resourceGroups//providers/Microsoft.ContainerService/managedClusters/", "clusterPurpose": "DevTest", "cluster_name": "/subscriptions//resourceGroups//providers/Microsoft.ContainerService/managedClusters/", "cluster_name_friendly": "", "domain": "westeurope.cloudapp.azure.com", "enableInference": "True", "enableTraining": "True", "inferenceRouterHA": "false", "inferenceRouterServiceType": "clusterIP", "jobSchedulerLocation": "westeurope", "location": "westeurope", "nginxIngress.enabled": "true", "prometheus.prometheusSpec.externalLabels.cluster_name": "/subscriptions//resourceGroups//providers/Microsoft.ContainerService/managedClusters/", "relayserver.enabled": "false", "servicebus.enabled": "false", "sslCname": "azureml..com" }, "customLocationSettings": null, "errorInfo": null, "extensionType": "microsoft.azureml.kubernetes", "id": "/subscriptions//resourceGroups//providers/Microsoft.ContainerService/managedClusters//providers/Microsoft.KubernetesConfiguration/extensions/azuremlsecure", "identity": null, "installedVersion": null, "name": "azuremlsecure", "packageUri": null, "provisioningState": "Succeeded", "releaseTrain": "stable", "resourceGroup": "", "scope": { "cluster": { "releaseNamespace": "azureml" }, "namespace": null }, "statuses": [], "systemData": { "createdAt": "2022-10-21T10:33:39.527695+00:00", "createdBy": null, "createdByType": null, "lastModifiedAt": "2022-10-21T10:33:39.527695+00:00", "lastModifiedBy": null, "lastModifiedByType": null }, "type": "Microsoft.KubernetesConfiguration/extensions", "version": "1.1.12" } ` ## Environment Summary ``` Linux-5.15.0-1017-azure-x86_64-with-glibc2.31, Ubuntu 20.04.5 LTS Python 3.10.5 Installer: DEB azure-cli 2.40.0 * Extensions: k8s-extension 1.3.5 ml 2.10.0 Dependencies: msal 1.18.0b1 azure-mgmt-resource 21.1.0b1 ``` ## Additional Context none How can one get access to the underlying helm chart of this cli wrapper to properly troubleshoot this?
Author: RoFz
Assignees: -
Labels: `extension/aks-preview`, `customer-reported`, `AKS`, `Service Attention`, `CXP Attention`
Milestone: -
navba-MSFT commented 2 years ago

@RoFz I guess you have raised a support ticket too for the same issue and working with our Support Professional.

To assist you further, I am adding Service Team to look into this.

@Azure/aks-pm Could you please look into this and provide an update ?