fastai / course-v3

The 3rd edition of course.fast.ai
https://course.fast.ai/
Apache License 2.0

SageMaker CloudFormation stack won't start #518

Closed bonnici closed 1 year ago

bonnici commented 4 years ago

My CloudFormation stack has not been starting up successfully since Sunday. When I try to start the SageMaker Notebook, it sits in Pending for a while then goes to Failed. The error message is:

Notebook Instance Lifecycle Config 'arn:aws:sagemaker:ap-southeast-2:xxxxxxx:notebook-instance-lifecycle-config/fastainblifecycleconfig-xxxxx' for Notebook Instance 'arn:aws:sagemaker:ap-southeast-2:xxxxxxx:notebook-instance/fastai' took longer than 5 minutes. Please check your CloudWatch logs for more details if your Notebook Instance has Internet access.

Looking at the logs, I just see these log lines:

Creating symlinks
Install a new kernel for fastai with name 'Python 3'
Installed kernelspec fastai in /home/ec2-user/.local/share/jupyter/kernels/fastai
Update fastai library

I've also tried to start up a couple of new stacks in a few different regions using the templates linked in the course here but I get the same error. I'm pretty new to fastai so I'm not really sure what the problem is, apologies if this is the wrong place to report the issue.

CaseGuide commented 4 years ago

I'm having the exact same issue in us-east-1. It completely blocks me from training models in a reasonable time. See below for a workaround.

CaseGuide commented 4 years ago

AWS has been super helpful in diagnosing and solving so far.

I used fastai's start and create scripts, copied from the notebook lifecycle configuration tab in the AWS console, to get it working again.

Issue Cause

Getting past line 17 of the start script takes 12 minutes on a p3.xl, and the whole start script took 17 minutes. This exceeds the 5 minute timeout.
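To see which step of a start script eats the time, each command can be wrapped in a small timer. This is only a minimal sketch: `time_step` and `BUDGET_SECONDS` are names made up for illustration and are not part of fastai's lifecycle script; the 300 second value simply mirrors the 5 minute timeout from the error message.

```shell
#!/bin/bash
# Hypothetical helper, not part of the fastai scripts.
# Budget taken from the "took longer than 5 minutes" error above.
BUDGET_SECONDS=300

# time_step LABEL COMMAND [ARGS...]
# Runs the command, prints how long it took, and passes its exit code through.
time_step() {
    local label="$1"
    shift
    local start end rc elapsed
    start=$(date +%s)
    "$@"
    rc=$?
    end=$(date +%s)
    elapsed=$(( end - start ))
    echo "[$label] took ${elapsed}s (exit $rc)"
    if [ "$elapsed" -gt "$BUDGET_SECONDS" ]; then
        echo "[$label] exceeded the ${BUDGET_SECONDS}s budget"
    fi
    return "$rc"
}

# Example (run inside the notebook terminal):
#   time_step "update fastai" conda install -y fastai -c fastai
```

Running the slow steps this way from a Jupyter terminal, rather than from the lifecycle configuration itself, avoids hitting the timeout while measuring.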

Quick and dirty way to get it working again (this MIGHT not persist all the installs and MIGHT need to be done every time you start):

create script:

sudo -H -i -u ec2-user bash << EOF
# create symlinks to EBS volume
echo "Creating symlinks"
mkdir /home/ec2-user/SageMaker/.torch && ln -s /home/ec2-user/SageMaker/.torch /home/ec2-user/.torch
mkdir /home/ec2-user/SageMaker/.fastai && ln -s /home/ec2-user/SageMaker/.fastai /home/ec2-user/.fastai

# clone the course notebooks
echo "Clone the course repo"
git clone https://github.com/fastai/course-v3.git /home/ec2-user/SageMaker/course-v3

echo "Finished running onCreate script"
EOF

start script part 1

sudo -H -i -u ec2-user bash << EOF
echo "Creating symlinks"
[ ! -L "/home/ec2-user/.torch" ] && ln -s /home/ec2-user/SageMaker/.torch /home/ec2-user/.torch
[ ! -L "/home/ec2-user/.fastai" ] && ln -s /home/ec2-user/SageMaker/.fastai /home/ec2-user/.fastai

echo "Install a new kernel for fastai with name 'Python 3'"
source /home/ec2-user/anaconda3/bin/activate pytorch_p36
python -m ipykernel install --name 'fastai' --display-name 'Python 3' --user

# uncomment if you want to update PyTorch on every start
#echo "Update PyTorch library"
#conda install -y pytorch torchvision -c pytorch

echo "Update fastai library"
conda install -y fastai -c fastai

echo "Install jupyter nbextension"
source /home/ec2-user/anaconda3/bin/activate JupyterSystemEnv
pip install jupyter_contrib_nbextensions
jupyter contrib nbextensions install --user

echo "Restarting jupyter notebook server"
# Kills the jupyter terminal, requires you refresh the page
pkill -f jupyter-notebook
EOF

start script part 2

echo "Getting latest version of fastai course"
cd /home/ec2-user/SageMaker/course-v3
git pull

echo "Finished running onStart script"

AWS said they'll send me scripts tonight that will persist the changes by making a new notebook instance and then creating a new conda environment stored in the /SageMaker/ folder. The changes will then persist, so you'd just need to run the scripts once on startup. It isn't as convenient as the CloudFormation method, though.

@bonnici, @mattmcclean

bonnici commented 4 years ago

Thanks so much for the workaround, I'll give that a go tonight and hopefully I can keep going on the course.

CaseGuide commented 4 years ago

Steps for what appears to be a permanent fix, though not thoroughly tested.

Tested this by setting up and running the lesson1-pets.ipynb up through the first .save()

1) Create a new notebook instance. Choose whatever settings you need (GPU instance, 50GB storage, etc.). Once it is created, open Jupyter.

2) Upload the shell script setupKernel.sh (below). Open a Jupyter terminal (New -> Terminal) and run this:

cd SageMaker
chmod +x ./setupKernel.sh

setupKernel.sh

#!/bin/bash

echo "Creating new kernel"
conda create -y --prefix /home/ec2-user/SageMaker/anaconda3/envs/sm-fastai python=3.6 ipykernel 
source activate /home/ec2-user/SageMaker/anaconda3/envs/sm-fastai

# Create the folders on the persistent volume first, so the symlinks below
# never point at a missing directory and are not created twice.
echo "Creating .fastai and .torch folders"
mkdir -p /home/ec2-user/SageMaker/.torch /home/ec2-user/SageMaker/.fastai

echo "Creating symlinks"
[ ! -L "/home/ec2-user/.torch" ] && ln -s /home/ec2-user/SageMaker/.torch /home/ec2-user/.torch
[ ! -L "/home/ec2-user/.fastai" ] && ln -s /home/ec2-user/SageMaker/.fastai /home/ec2-user/.fastai

echo "Update fastai library"
conda install -y fastai -c fastai

echo "Update torchvision library"
conda install -y pytorch-gpu torchvision -c anaconda

echo "Clone the course repo"
git clone https://github.com/fastai/course-v3.git /home/ec2-user/SageMaker/course-v3

echo "Finished running setupKernel.sh"

This takes roughly 15 or so minutes. Maybe less.
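Because the script may be run more than once, it helps if the folder and symlink step is safe to repeat. A minimal sketch of that idea follows; `ensure_link` is a hypothetical helper name, not something from the fastai or AWS scripts.

```shell
#!/bin/bash
# Hypothetical helper: make sure TARGET exists as a directory and LINK is a
# symlink pointing at it. Safe to call repeatedly.
ensure_link() {
    local target="$1"
    local link="$2"
    mkdir -p "$target"
    if [ -L "$link" ]; then
        # Already a symlink: leave it alone.
        echo "exists: $link"
    elif [ -e "$link" ]; then
        # A real file or directory is in the way: do not touch it.
        echo "skipped: $link is not a symlink"
        return 1
    else
        ln -s "$target" "$link"
        echo "linked: $link -> $target"
    fi
}

# Example, using the paths from the scripts in this thread:
#   ensure_link /home/ec2-user/SageMaker/.torch  /home/ec2-user/.torch
#   ensure_link /home/ec2-user/SageMaker/.fastai /home/ec2-user/.fastai
```

The explicit check for a non-symlink avoids `ln -s` silently creating a nested link inside an existing directory.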

3) This should create a new kernel for you, called sm-fastai, with the necessary libraries installed (including fastai). This will also clone the fastai course repo from GitHub.

4) Open one of the example notebooks from fastai, or create a new notebook. Under Kernel -> Change Kernel, you should be able to locate conda_sm-fastai. Try using it to make sure everything is good.

5) Stop the notebook instance, create a new lifecycle configuration policy and add the following to its OnStart portion

#!/bin/bash

set -e

sudo -u ec2-user -i <<'EOF'
source /home/ec2-user/anaconda3/bin/activate /home/ec2-user/SageMaker/anaconda3/envs/sm-fastai

# Create symlink to kernel
ln -s /home/ec2-user/SageMaker/anaconda3/envs/sm-fastai /home/ec2-user/anaconda3/envs/sm-fastai

EOF

Make sure #!/bin/bash is on line 1. It cannot be on line 2 and it cannot have any spaces around it. If you copy and paste this on Windows, make sure you're using Unix-style line endings. In Notepad++, go to Edit -> EOL Conversion -> Unix to convert.
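If the script was already uploaded with Windows line endings, it can also be cleaned up from a terminal. The following is a rough sketch under the assumption that the script is a plain text file; `check_lifecycle_script` is an illustrative name, not an AWS or fastai tool.

```shell
#!/bin/bash
# Hypothetical helper: remove carriage returns from FILE in place, then verify
# that line 1 is exactly "#!/bin/bash".
check_lifecycle_script() {
    local file="$1"
    # Delete every carriage return so CRLF line endings become LF.
    tr -d '\r' < "$file" > "$file.tmp" && mv "$file.tmp" "$file"
    if [ "$(head -n 1 "$file")" = '#!/bin/bash' ]; then
        echo "ok: shebang is on line 1"
    else
        echo "error: line 1 is not exactly #!/bin/bash"
        return 1
    fi
}

# Example:
#   check_lifecycle_script ./setupKernel.sh
```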

6) Attach this lifecycle policy to your notebook instance and start it. You should now be able to see your kernel as normal, going forward. The start times are much faster and the policy should not fail due to the 5 minute timeout, since all the work has already been done and is permanently stored under the /home/ec2-user/SageMaker/ directory.

For any brand new instances created after this, you need to follow these steps once per instance, but never again afterwards (you can just stop/start as normal).

bonnici commented 4 years ago

Thanks - seems to be working. I did run into that line ending issue on step 1 but copying and pasting it into an editor in the terminal worked. I'm planning to run through the lesson 3 notebooks today so if those all work I think we're in business.

If it's all good I might see if I can copy all this stuff into a lifecycle configuration and make a new CloudFormation template.

Edit: Actually I started getting issues with not being able to select the kernel. What I ended up just doing was commenting out the line:

conda install -y fastai -c fastai

in my original notebook's startup lifecycle configuration, then just running that in a new terminal after I started up the notebook. That way I also get to keep all my old data files etc.

adielsa commented 4 years ago

sagemaker-cfn.yml needs to be updated. I created the following fix, but I am unable to push it:

--- a/docs/setup/sagemaker-cfn.yml
+++ b/docs/setup/sagemaker-cfn.yml
@@ -77,7 +77,7 @@ Resources:
           conda install -y pytorch torchvision -c pytorch

           echo "Update fastai library"
-          conda install -y fastai -c fastai
+          nohup /home/ec2-user/anaconda3/bin/conda install -y fastai -c fastai -v &

           echo "Install jupyter nbextension"
           source /home/ec2-user/anaconda3/bin/activate JupyterSystemEnv

sagemaker-cfn.yml.zip
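The idea behind the patch is that a command started with nohup and & returns control immediately, so the lifecycle script can finish within the timeout while the install keeps running. A minimal sketch of that pattern, with a log file added so progress can be inspected later; `run_detached` and the log path are assumptions for illustration and do not appear in the patch.

```shell
#!/bin/bash
# Hypothetical helper: start COMMAND in the background, immune to hangups,
# with all output sent to LOG. Prints the background process ID.
run_detached() {
    local log="$1"
    shift
    nohup "$@" > "$log" 2>&1 &
    echo "$!"
}

# Example, mirroring the patched line (log path is an assumption):
#   run_detached /home/ec2-user/SageMaker/fastai-update.log \
#       /home/ec2-user/anaconda3/bin/conda install -y fastai -c fastai -v
```

The trade-off is that the notebook may become available before the update has finished, so it is worth checking the log before relying on the updated library.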