Azure / azure-sdk-for-python

This repository is for active development of the Azure SDK for Python. For consumers of the SDK we recommend visiting our public developer docs at https://docs.microsoft.com/python/azure/ or our versioned developer docs at https://azure.github.io/azure-sdk-for-python.

Unable to re-create ML endpoint that was previously deleted (on Kubernetes cluster) #35109

Open aadadeyST opened 3 months ago

aadadeyST commented 3 months ago

Describe the bug

When I create an endpoint using the Python SDK (ml_client.online_endpoints.begin_create_or_update(endpoint).wait(120)), I receive the following error (details have been scrubbed):

azure.core.exceptions.HttpResponseError: (BadRequest) The request is invalid.
Code: BadRequest
Message: The request is invalid.
Exception Details:    (InferencingClientCallFailed) {{"errors":{{"":["OnlineEndpoint foo-bar already exists in cluster, in workspace xxxxxxxx of resource group xxxxxxxx. Please notice endpoint name must be unique per cluster."]}},"title":"One or more validation errors occurred."}}
  Code: InferencingClientCallFailed
  Message: {{"errors":{{"":["OnlineEndpoint foo-bar already exists in cluster, in workspace xxxxxxxx of resource group xxxxxxxx. Please notice endpoint name must be unique per cluster."]}},"title":"One or more validation errors occurred."}}
Additional Information:Type: ComponentName
Info: {
    "value": "managementfrontend"
}Type: Correlation
Info: {
    "value": {
        "operation": "bc1d6e888318fe996d47200abfd92d9a",
        "request": "f8ffa169b6cf3774"
    }
}Type: Environment
Info: {
    "value": "eastus"
}Type: Location
Info: {
    "value": "eastus"
}Type: Time
Info: {
    "value": "2024-04-08T16:29:22.2971502+00:00"
}
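For anyone triaging a similar failure, the fields that matter in the error above are the conflicting endpoint name and the correlation IDs. The following is only a rough sketch of pulling them out of the error text with regular expressions. The helper name `summarize_error` is hypothetical, the sample string is an abbreviated copy of the scrubbed output above, and the patterns assume the message keeps this exact wording and layout.

```python
import re


def summarize_error(message: str) -> dict:
    """Extract the endpoint name and correlation IDs from the error text.

    Hypothetical helper: the patterns are based only on the scrubbed
    output quoted in this issue and may not match other error shapes.
    """
    endpoint = re.search(r"OnlineEndpoint (\S+) already exists in cluster", message)
    operation = re.search(r'"operation":\s*"([0-9a-f]+)"', message)
    request = re.search(r'"request":\s*"([0-9a-f]+)"', message)
    return {
        "name_conflict": endpoint is not None,
        "endpoint": endpoint.group(1) if endpoint else None,
        "operation_id": operation.group(1) if operation else None,
        "request_id": request.group(1) if request else None,
    }


# Abbreviated copy of the error text from this issue.
sample = '''(InferencingClientCallFailed) {{"errors":{{"":["OnlineEndpoint foo-bar already exists in cluster, in workspace xxxxxxxx of resource group xxxxxxxx."]}}}}
Type: Correlation
Info: {
    "value": {
        "operation": "bc1d6e888318fe996d47200abfd92d9a",
        "request": "f8ffa169b6cf3774"
    }
}'''

summary = summarize_error(sample)
print(summary)
```

With the SDK installed, the same text should be available as `str(exc)` when catching `azure.core.exceptions.HttpResponseError` around the create call.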

I had previously set up endpoint foo-bar on a Kubernetes cluster (we'll call it aks-one). I then detached aks-one and created a new cluster (aks-two). I tried to recreate the endpoint on aks-two, but I received an error that it couldn't find aks-one, which had been detached. So I deleted the foo-bar endpoint (using the ML Studio UI), but when I ran the Python code to create the endpoint, it gave me the above error. I've checked the Kubernetes service and deleted all of the workloads, services, and configurations related to the previous foo-bar deployment, but that didn't change anything. I also recreated aks-one and tried to create the endpoint there, but I still received the same error message.

Running az ml online-endpoint list against the workspace/resource group returns an empty list.
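The same check can be made from the Python SDK, which helps confirm that the CLI and the SDK agree that the workspace has no endpoints. This is a minimal sketch, not a verified reproduction: the helper names `connect` and `list_endpoint_names` are hypothetical, and the subscription, resource group, and workspace values are placeholders the caller must supply.

```python
def connect(subscription_id, resource_group, workspace):
    """Build an MLClient for the workspace (requires azure-ai-ml and azure-identity)."""
    # Imported here so the listing helper below can be read and tested
    # without the Azure packages installed.
    from azure.ai.ml import MLClient
    from azure.identity import DefaultAzureCredential

    return MLClient(
        credential=DefaultAzureCredential(),
        subscription_id=subscription_id,
        resource_group_name=resource_group,
        workspace_name=workspace,
    )


def list_endpoint_names(ml_client):
    """Return the sorted names of all online endpoints visible in the workspace."""
    return sorted(endpoint.name for endpoint in ml_client.online_endpoints.list())
```

Usage would look like `"foo-bar" in list_endpoint_names(connect(sub_id, rg, ws))`. In the situation described here it would be expected to return False, since the CLI listing is empty.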

To Reproduce

Steps to reproduce the behavior:

  1. Create an endpoint on a Kubernetes cluster and deploy an inference script
  2. Delete/detach the Kubernetes cluster
  3. Create a new Kubernetes cluster with a different name
  4. Delete the endpoint created in step 1
  5. Attempt to create a new endpoint with the same name as the previous one
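Steps 4 and 5 above can be sketched with the SDK as follows. This is an illustration under assumptions rather than the exact code from this report: the helper names are hypothetical, `KubernetesOnlineEndpoint` with `auth_mode="key"` is assumed because the issue does not say how the endpoint object was built, and steps 1 through 3 (cluster setup, detach, and re-attach) are left out.

```python
def build_endpoint(name, compute):
    """Describe an endpoint on an attached Kubernetes compute (assumed entity type)."""
    # Imported lazily so the rest of the sketch does not require azure-ai-ml.
    from azure.ai.ml.entities import KubernetesOnlineEndpoint

    # auth_mode="key" is an assumption; the issue does not state the auth mode.
    return KubernetesOnlineEndpoint(name=name, compute=compute, auth_mode="key")


def delete_then_recreate(ml_client, endpoint, timeout=120):
    """Step 4: delete the endpoint by name. Step 5: create it again."""
    # Wait for the delete to finish (up to `timeout` seconds) before recreating.
    ml_client.online_endpoints.begin_delete(name=endpoint.name).wait(timeout)
    # In the scenario reported here, this is the call that fails with
    # "OnlineEndpoint foo-bar already exists in cluster".
    return ml_client.online_endpoints.begin_create_or_update(endpoint).result()
```

A call such as `delete_then_recreate(ml_client, build_endpoint("foo-bar", "aks-two"))` would exercise the failing path, where "aks-two" stands for the name of the newly attached compute.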

Expected behavior

I should be able to create the endpoint with the same name as the previously deleted one. When I create an endpoint, delete it, and recreate it, all on the same Kubernetes cluster, I have no issues. This only happens when two clusters are involved.

Screenshots

There are currently no endpoints in my workspace (screenshot omitted).

github-actions[bot] commented 3 months ago

Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @Azure/azure-ml-sdk @azureml-github.

kristapratico commented 3 months ago

Thanks for the detailed issue. @azureml-github will take a look and get back to you as soon as possible.