aws / amazon-sagemaker-examples

Example 📓 Jupyter notebooks that demonstrate how to build, train, and deploy machine learning models using 🧠 Amazon SageMaker.
https://sagemaker-examples.readthedocs.io
Apache License 2.0

SageMaker endpoint can't be deleted if it is stuck in Creating state due to resource limit #3432

Open rbavery opened 2 years ago

rbavery commented 2 years ago

Please address this issue. If a SageMaker endpoint deployment hits a resource limit, it gets stuck in the Creating state indefinitely and there is no option to delete it: https://stackoverflow.com/questions/65678237/sagemaker-endpoint-stuck-at-creating
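
For reference, a minimal boto3 sketch of the failure mode (the endpoint name here is hypothetical):

```python
import boto3

sm = boto3.client("sagemaker")

# The status stays "Creating" indefinitely once creation hits a resource limit.
desc = sm.describe_endpoint(EndpointName="my-endpoint")  # hypothetical name
print(desc["EndpointStatus"], desc.get("FailureReason", ""))

# This is the delete call that is refused while the endpoint is still Creating.
sm.delete_endpoint(EndpointName="my-endpoint")
```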

Rebecasarai commented 2 years ago

Hi! This is happening to me right now

jonasdebeukelaer commented 2 years ago

Currently experiencing a version of this too.

In my case it was due to the Docker host platform being wrong, so the Python server could not start properly.

chiayiffg commented 2 years ago

I was trying to create a SageMaker endpoint using the Terraform CLI and pressed Ctrl+C halfway through creation because I found some errors in my Terraform script. Now my endpoint is stuck in Creating status and I have no way of deleting it.

Anyone found a solution to this?

jainalphin99 commented 2 years ago

Hi, I am facing the same issue. Has anyone found a solution to this? Thanks.

chiayiffg commented 2 years ago

Hi @jainalphin99, I went back to the AWS console after a day and the endpoint status had automatically changed from Creating to Failed, after which I could delete it. Maybe try deleting it tomorrow and work with a new endpoint for now.

asosnovsky-sumologic commented 1 year ago

This is so annoying when it happens; it kills 20 minutes or more just for cases where you misconfigured the Docker host.

zionsofer commented 1 year ago

For us it happened because of a missing model or endpoint configuration, which caused it to get stuck on Creating, failing only after 1 hour! There needs to be some way to force-delete endpoints; it can't be that misconfigurations block you for this long.

We also saw this happen while using serverless endpoints with correct configurations and everything. Not sure what's going on, but it stops us from using SageMaker in our CD pipeline to start endpoints on demand.

david-waterworth commented 1 year ago

I'm seeing the same thing; in the logs it keeps trying and failing to install the same package over and over.

chiayiffg commented 1 year ago

I experienced the same as @asosnovsky-sumologic and @zionsofer: whenever I accidentally forget to install a certain package or misconfigure an AWS role my code needs, I have to wait close to 20 minutes for the status to become Failed before I can redeploy.

@david-waterworth From what I observe, when you first deploy the endpoint, SageMaker will call your /ping endpoint; even if something is failing, the /ping health check will continue for 20 minutes (in my experience) before the endpoint gives up and returns a Failed status.
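
For context, SageMaker's container contract polls GET /ping on port 8080 and expects HTTP 200 once the container is healthy. A minimal sketch of such a handler (Flask is just an illustrative choice here):

```python
from flask import Flask

app = Flask(__name__)

@app.route("/ping", methods=["GET"])
def ping():
    # SageMaker polls this route; anything other than a 200 keeps the
    # endpoint in Creating until the platform-side timeout kicks in.
    return "", 200

if __name__ == "__main__":
    # SageMaker sends health checks and inference traffic to port 8080.
    app.run(host="0.0.0.0", port=8080)
```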

To unblock yourself, I suggest working with a new endpoint under a different name first, because waiting 20 minutes to be able to redeploy a change is quite time-consuming and demotivating.

rbp15 commented 1 year ago

Agreed. Why can't a signal be sent to the server to terminate the running processes and move it into a Failed state? It's currently trying the same thing over and over again.

asosnovsky-sumologic commented 1 year ago

@chiayiffg that is what I do, but when you try to iterate like this, you end up creating something like 10 endpoints that you then have to remember to delete after some time.
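
A rough cleanup sketch for those leftovers, assuming you give the throwaway endpoints a common name prefix (the prefix here is hypothetical):

```python
import boto3

sm = boto3.client("sagemaker")

# Find endpoints whose names contain the throwaway prefix and delete
# the ones that have reached a deletable state.
for ep in sm.list_endpoints(NameContains="debug-")["Endpoints"]:
    if ep["EndpointStatus"] != "Creating":  # stuck ones still can't be deleted
        print("deleting", ep["EndpointName"])
        sm.delete_endpoint(EndpointName=ep["EndpointName"])
```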

david-waterworth commented 1 year ago

@chiayiffg I think that happens for certain kinds of failures; in my case, because it gets stuck in an infinite loop building the container, I don't think it ever reaches the health ping. I think there's another timeout that's longer than 20 minutes (maybe an hour). I've logged an issue for my specific problem, so hopefully it'll be addressed one day.

grraffe commented 1 year ago

+1 because it can also get stuck when the endpoint uses Graviton2 without an arm64 Docker image.

ylhsieh commented 1 year ago

+1. Having no way to interrupt endpoint creation is a real waste of time. It feels like a way to get more money from users by making them wait for the timeout to finish.

cceyda commented 1 year ago

+1, currently stuck at Creating for 20+ minutes because it can't find a pip package (due to a typo). Can't stop it, can't delete it... the more I use SageMaker, the more I'm shocked by how much obvious basic functionality is missing.

em-eman commented 1 year ago

It is quite annoying, and I have to delete the endpoint config to make it fail before the 20 minutes are up.
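
Concretely, that workaround is a single call (untested sketch; the config name is hypothetical, and pulling the config out from under a creating endpoint is a hack):

```python
import boto3

sm = boto3.client("sagemaker")

# Deleting the endpoint config that the stuck endpoint references seems
# to make creation fail well before the ~20 minute timeout.
sm.delete_endpoint_config(EndpointConfigName="my-endpoint-config")
```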

Rhuax commented 1 year ago

Just got into the same situation... I think this is a critical feature to implement!

bjmrevilla commented 1 year ago

Would really love for this to be added. Currently stuck in Creating status. :(

danb27 commented 1 year ago

+1

RahulJana commented 1 year ago

I tried creating a SageMaker endpoint from a notebook instance and it was stuck in that state for ~30 minutes, so I sent a keyboard interrupt in the notebook cell to stop the process. Now it is stuck in Creating status.
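
Note the interrupt only kills the local wait; creation keeps running on the AWS side. A sketch of what the notebook is effectively doing (boto3 waiter; the endpoint name is hypothetical):

```python
import boto3

sm = boto3.client("sagemaker")

# Blocks until the endpoint is InService, or errors out once it hits a
# terminal Failed state. Ctrl+C here never cancels the server-side creation.
waiter = sm.get_waiter("endpoint_in_service")
waiter.wait(EndpointName="my-endpoint", WaiterConfig={"Delay": 30, "MaxAttempts": 60})
```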

nisalupendra commented 1 year ago

Still the same issue. Couldn't agree more with @ylhsieh; probably an attempt at scraping that extra penny from the customer. Otherwise I don't see why there can't be a kill switch: this doesn't depend on anything except the model and the model config, which are both static resources.

bcarsley commented 1 year ago

Yup, same here! Really sucks -- I assume the timeout on the Creating => Failed transition is partly based on the container_startup_health_check_timeout parameter (at least in my case, since I'm trying to deploy a fine-tuned LLaMA as a HuggingFacePredictor)... still, they should do something about it! Been waiting here for about 45 minutes now, smh.

bcarsley commented 1 year ago

It does seem like deleting the model that the deployment is based on frees up the resources slightly faster than waiting for Failed: deleting the model that a deployment in Creating status is working off of throws an ARN error, which then lets you delete the deployment (at least in my case)... a very risky and probably inadvisable workaround!

Update: the key is actually to remove both the CloudWatch logs and the model itself... still a hacky workaround, but it does successfully cut short failing runs (AWS seems to monitor health via CloudWatch log streams, so deleting them seems to expedite the whole process)... they still need to fix this!
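
Roughly, that workaround looks like this (untested sketch; the names are hypothetical, and as said above, it's hacky and risky):

```python
import boto3

sm = boto3.client("sagemaker")
logs = boto3.client("logs")

# 1. Delete the model the stuck deployment is based on.
sm.delete_model(ModelName="my-model")

# 2. Delete the endpoint's CloudWatch log group; endpoint logs live
#    under /aws/sagemaker/Endpoints/<endpoint-name>.
logs.delete_log_group(logGroupName="/aws/sagemaker/Endpoints/my-endpoint")
```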

davidshhh commented 11 months ago

It doesn't seem like SageMaker is made for production systems, honestly.

orcaman commented 11 months ago

Same issue. Can the SageMaker team fix this? It's really annoying.

jankrepl commented 10 months ago

+1

I created a model, an endpoint config, and an endpoint via terraform apply, and since the endpoint was taking forever to create I simply ran terraform destroy on everything. However, it is impossible to delete the endpoint (removing the model and the endpoint config went through).

This happened to me multiple times in the past.

MarkoMilos commented 9 months ago

How did you solve this problem?

I have a SageMaker endpoint that is InService with an InferenceComponent stuck in the Creating state. To delete the endpoint (and avoid charges for the underlying instance) I need to delete the InferenceComponent, but I can't while it is in the Creating state. It has been like that for almost 2 days now; clearly bugged. In the meantime I'm being charged almost 2 USD per hour.

I can't delete it via the CLI, the AWS Console, or SageMaker Studio... -.-

See below

aws sagemaker list-endpoints

{
    "Endpoints": [
        {
            "EndpointName": "llama2-endpoint",
            "EndpointArn": "arn:aws:sagemaker:eu-central-1:592532275118:endpoint/llama2-endpoint",
            "CreationTime": "2023-12-19T16:03:34.976000+01:00",
            "LastModifiedTime": "2023-12-19T16:05:22.443000+01:00",
            "EndpointStatus": "InService"
        }
    ]
}

aws sagemaker list-inference-components

{
    "InferenceComponents": [
        {
            "CreationTime": "2023-12-19T16:22:21.345000+01:00",
            "InferenceComponentArn": "arn:aws:sagemaker:eu-central-1:592532275118:inference-component/llama2-7b-20231219-152221",
            "InferenceComponentName": "llama2-7b-20231219-152221",
            "EndpointArn": "arn:aws:sagemaker:eu-central-1:592532275118:endpoint/llama2-endpoint",
            "EndpointName": "llama2-endpoint",
            "VariantName": "variant-1",
            "InferenceComponentStatus": "Creating",
            "LastModifiedTime": "2023-12-19T16:22:22.333000+01:00"
        }
    ]
}

If I try to delete it with aws sagemaker delete-inference-component --inference-component-name "llama2-7b-20231219-152221", the error returned is:

An error occurred (ValidationException) when calling the DeleteInferenceComponent operation: Cannot delete inference component "arn:aws:sagemaker:eu-central-1:592532275118:inference-component/llama2-7b-20231219-152221" while it is in state "CREATE_IN_PROGRESS".
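
The only automatable path here seems to be polling until the component finally leaves the Creating state and deleting it then; a sketch with boto3 (using the component name from the listing above), though in my case the status never changed:

```python
import time

import boto3

sm = boto3.client("sagemaker")
name = "llama2-7b-20231219-152221"

# DeleteInferenceComponent is rejected while the component is in
# CREATE_IN_PROGRESS, so wait for the status to change first.
while sm.describe_inference_component(InferenceComponentName=name)["InferenceComponentStatus"] == "Creating":
    time.sleep(60)

sm.delete_inference_component(InferenceComponentName=name)
sm.delete_endpoint(EndpointName="llama2-endpoint")
```
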
sagar1001 commented 9 months ago

Same issue for me as well. Unable to delete it due to status CREATE_IN_PROGRESS. Is there any option to set the inference component status to Failed, or to delete it from the CLI?

Any solutions? Please suggest.

Seven2Nine commented 9 months ago

+1. Having no way to interrupt endpoint creation is a real waste of time. It feels like a way to get more money from users by making them wait for the timeout to finish.

You are right. A year later, AWS still hasn't fixed this issue.

zdev24 commented 9 months ago

I had a similar issue.

MarkoMilos commented 9 months ago

I've upgraded my AWS support plan to Premium in order to find out what is going on and how to delete the endpoint.

After a video call, I got confirmation from AWS technical support that this is indeed a known issue on their side. They do not have a clear ETA for when the issue will be fixed.

Apparently, they will refund me for the generated cost, but not while the resources are running and the issue is unfixed (for which they have no ETA, which does not give me much hope, since this issue was opened in May 2022 and is still active).

Meanwhile, until they fix the issue, I'm being charged 50 USD per day. So far, SageMaker makes me feel like a beta tester who has to pay for testing a clearly bugged service...

zdev24 commented 9 months ago

@MarkoMilos I created a ticket and received guidance to delete the Domain in SageMaker. They explained that it is not the endpoint that incurs the cost, but the Domain with its running Jupyter Studio app. I need to wait one more day to see if it continues to incur costs, but you can try it.

Another note: you can create a case in Support Center and choose Billing (instead of Technical) to get support without upgrading the support plan. I still have the Basic plan but got a response within hours.

sagar1001 commented 9 months ago

My issue got fixed yesterday. AWS unblocked and failed the IC from their end and asked me to delete the IC and the endpoint. This time it worked, and they processed the billing adjustment.

ziadbadwy commented 7 months ago

I faced the same problem at model deployment and customized the timeout via container_startup_health_check_timeout:

from sagemaker.huggingface import HuggingFaceModel

# llm_model is a HuggingFaceModel built earlier; 300 s = 5 minutes for the container to start.
llm = llm_model.deploy(initial_instance_count=1, instance_type=instance_type,
                       container_startup_health_check_timeout=300)

komaldedhia commented 6 months ago

+1. It is 2024, but it's the same issue as 2 years ago.

mrseeker commented 6 months ago

bump, issue still exists.

ziadbadwy commented 6 months ago

Bump, the issue still exists. It takes, I think, about 30 minutes to 2 hours.

harshh2002 commented 5 months ago

yet another month and the issue persists

staghado commented 3 months ago

still exists

MFaiqKhan commented 3 months ago

It is July 2024 and this bug still exists. Unable to delete the endpoint in the Creating phase.

RahulJana commented 2 months ago

This bug still exists. :( Try again after 20 minutes; hopefully the status will have changed from Creating to Failed. Then just remove the endpoint.
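
To automate the waiting, a sketch of that advice (the endpoint name is hypothetical):

```python
import time

import boto3

sm = boto3.client("sagemaker")
name = "my-endpoint"  # hypothetical

# Wait out the platform timeout: once the status flips from Creating
# to Failed, the delete call is finally accepted.
while sm.describe_endpoint(EndpointName=name)["EndpointStatus"] == "Creating":
    time.sleep(60)

sm.delete_endpoint(EndpointName=name)
```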

vivasvan-patel-ev commented 1 month ago

Same here!

sanikakulkarniss commented 1 month ago

Mine failed in about 27 minutes. It kept trying to reinstall dependencies till then.