aws / amazon-sagemaker-examples

Example 📓 Jupyter notebooks that demonstrate how to build, train, and deploy machine learning models using 🧠 Amazon SageMaker.
https://sagemaker-examples.readthedocs.io
Apache License 2.0
10.14k stars 6.78k forks

Sagemaker Notebook Instances stuck at Stopping and Pending phases #1094

Open SpicySyntax opened 4 years ago

SpicySyntax commented 4 years ago

I am working with AWS SageMaker to automate model training in regulated and semi-regulated environments. After making some changes today, I broke the instances (the IAM roles were misconfigured, which I have since fixed). However, my notebook instances are now stuck in the Pending and Stopping states respectively. I have waited a few hours and nothing has changed.

I have tried all of the commands below:

aws sagemaker start-notebook-instance --region us-east-1 --notebook-instance-name sagemaker-notebook-instance-187631372219
aws sagemaker stop-notebook-instance --region us-east-1 --notebook-instance-name sagemaker-notebook-instance-187631372219
aws sagemaker delete-notebook-instance --region us-east-1 --notebook-instance-name sagemaker-notebook-instance-187631372219

Unfortunately I am not able to get these to change state.

For Stopping State:

An error occurred (ValidationException) when calling the StartNotebookInstance operation: Status (Stopping) not in ([Stopped, Failed]). Unable to transition to (Pending) for Notebook Instance (arn:aws:sagemaker:us-east-1:187631372219:notebook-instance/sagemaker-notebook-instance-187631372219)

For Pending State:

An error occurred (ValidationException) when calling the StopNotebookInstance operation: Status (Pending) not in ([InService]). Unable to transition to (Stopping) for Notebook Instance (arn:aws:sagemaker:us-east-1:424869984157:notebook-instance/sagemaker-notebook-instance).

How can I stop and delete these instances so I can deploy the fixed versions?
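For reference, the two error messages above spell out which states each call accepts. A minimal Python sketch of those preconditions follows; the names `ALLOWED_FROM`, `can_call`, and `allowed_operations` are hypothetical and only encode what the two messages report, not the full SageMaker state machine (the message for DeleteNotebookInstance is not shown above, so it is left out).

```python
# Source states each operation accepts, copied from the two
# ValidationException messages quoted above. Anything not quoted there
# (for example DeleteNotebookInstance) is deliberately omitted.
ALLOWED_FROM = {
    "StartNotebookInstance": {"Stopped", "Failed"},
    "StopNotebookInstance": {"InService"},
}


def can_call(operation, status):
    """Return True if `operation` is accepted while the instance is in `status`."""
    return status in ALLOWED_FROM[operation]


def allowed_operations(status):
    """List the operations (of the two above) accepted in `status`."""
    return sorted(op for op, states in ALLOWED_FROM.items() if status in states)


print(allowed_operations("Pending"))   # []
print(allowed_operations("Stopping"))  # []
```

Neither call is accepted in Pending or Stopping, which matches the behaviour described: both instances sit in a transitional state that start and stop refuse to act on.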

ManuelRios18 commented 4 years ago

I also have a SageMaker notebook that has been stuck in Pending status for more than 3 hours now. I don't have any copy of the code inside the instance. Do you know how I can get the code?

csmcallister commented 4 years ago

This might be related to #207, where the root cause was that the notebook's EC2 instance (an ml.p2.xlarge in their case) wasn't available. However, the Pending status resolved for them after an undisclosed amount of time.

Dave-Vedant commented 4 years ago

It happened to me too; the only advice is to wait and watch. I also searched for a solution, and in my case it resolved itself after 3 minutes. First, check that you are in the correct region (only the correct region gives you access to and visibility of the notebook). Sometimes the response time is longer due to network issues. For example, in the SageMaker logs each task completes within seconds, but you are only informed more than a minute later. Why? Just response time. Please check again; it may have resolved by now. Thank you.

moro-no-kimi commented 4 years ago

Just had our ml.t2.instance start after being stuck in the Pending state for just under 2 hours. It seems that this should be a very easy problem to mitigate; I hope SageMaker releases a feature to force-stop an instance in this state.

TheAustinator commented 3 years ago

Did anyone find the secret on this one? I'm having this problem with SageMaker Studio apps.

grossamit commented 2 years ago

Having the same "Stopping" behaviour on a SageMaker notebook instance for almost 20 hours now :-( Any way to revive it?

madhulika189 commented 1 year ago

Hi, did anyone ever resolve this? I have the same issue, and it happens intermittently. I have a try/except/else/finally block, where the finally clause contains:

finally:
    def get_notebook_name():
        log_path = '/opt/ml/metadata/resource-metadata.json'
        with open(log_path, 'r') as logs:
            _logs = json.load(logs)
        return _logs['ResourceName']
    client = boto3.client('sagemaker')
    client.stop_notebook_instance(NotebookInstanceName=get_notebook_name())

Most of the time, the notebook gets killed. But sometimes, randomly, it won't get killed. When I look at the logs, they say:

An error occurred (ValidationException) when calling the StopNotebookInstance operation: Status (Pending) not in ([InService])

I'm unable to find a resolution for this.
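One thing that might avoid this particular ValidationException is to check the status before calling stop, since the message says the call is only accepted from InService. Below is a minimal sketch, assuming the instance does eventually leave Pending; `stop_when_in_service` and its parameters are hypothetical names, while `describe_notebook_instance` and `stop_notebook_instance` are the boto3 SageMaker client calls. It does not help with an instance that is genuinely stuck; it only avoids calling stop too early.

```python
import time


def stop_when_in_service(client, name, poll_seconds=15,
                         timeout_seconds=900, sleep=time.sleep):
    """Wait for the instance to reach InService, then request a stop.

    Returns True if a stop was requested, or False if the instance was
    already stopping, stopped, failed, or being deleted. Raises TimeoutError
    if the instance is still in another state (for example Pending) once
    timeout_seconds have been spent waiting.
    """
    waited = 0
    while True:
        response = client.describe_notebook_instance(NotebookInstanceName=name)
        status = response["NotebookInstanceStatus"]
        if status == "InService":
            # The only state from which StopNotebookInstance is accepted.
            client.stop_notebook_instance(NotebookInstanceName=name)
            return True
        if status in ("Stopping", "Stopped", "Failed", "Deleting"):
            # Nothing to stop; calling stop here would raise ValidationException.
            return False
        if waited >= timeout_seconds:
            raise TimeoutError(f"{name} still {status} after {waited} seconds")
        sleep(poll_seconds)
        waited += poll_seconds
```

In the finally block above, this would be called as `stop_when_in_service(boto3.client('sagemaker'), get_notebook_name())`. The `sleep` parameter exists only so the polling loop can be exercised without real delays.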