Closed bonnici closed 1 year ago
Having the exact same issue on us-east-1. It completely disables my ability to train models in a reasonable time. See below for a workaround.
AWS has been super helpful in diagnosing and solving so far.
I'm using fastai's start and create scripts from the notebook lifecycle configuration tab in the AWS console to get it working again.
Getting past line 17 of the start script takes 12 minutes on a p3.xl, and the whole start script took 17 minutes. This exceeds the 5 minute timeout.
The create script is below. It's exactly the script from fastai's lifecycle config. The create script runs in a few seconds.
Paste the start script below into the Jupyter terminal. It will take ~20 minutes; most of that time is Anaconda solving the environment. The last step kills the terminal.
create script:
sudo -H -i -u ec2-user bash << EOF
# create symlinks to EBS volume
echo "Creating symlinks"
mkdir /home/ec2-user/SageMaker/.torch && ln -s /home/ec2-user/SageMaker/.torch /home/ec2-user/.torch
mkdir /home/ec2-user/SageMaker/.fastai && ln -s /home/ec2-user/SageMaker/.fastai /home/ec2-user/.fastai
# clone the course notebooks
echo "Clone the course repo"
git clone https://github.com/fastai/course-v3.git /home/ec2-user/SageMaker/course-v3
echo "Finished running onCreate script"
EOF
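The mkdir and ln -s pairs in the create script fail if they are run a second time, because the directory and link already exist. A minimal sketch of a re-runnable variant follows. The helper name link_persistent is hypothetical and is not part of the fastai scripts; it assumes the home-side path is either absent or already a symlink.

```shell
#!/bin/bash
# link_persistent is a hypothetical helper, not part of the fastai scripts.
# It creates a directory under a persistent root and points a symlink at it,
# and can be run more than once without error.
link_persistent() {
  persistent_root="$1"   # e.g. /home/ec2-user/SageMaker
  home_dir="$2"          # e.g. /home/ec2-user
  name="$3"              # e.g. .torch

  # -p: no error if the directory already exists
  mkdir -p "${persistent_root}/${name}"
  # -s symbolic link, -f replace an existing link,
  # -n do not follow an existing link that points at a directory
  ln -sfn "${persistent_root}/${name}" "${home_dir}/${name}"
}
```

Usage would look like: link_persistent /home/ec2-user/SageMaker /home/ec2-user .torch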
start script part 1
sudo -H -i -u ec2-user bash << EOF
echo "Creating symlinks"
[ ! -L "/home/ec2-user/.torch" ] && ln -s /home/ec2-user/SageMaker/.torch /home/ec2-user/.torch
[ ! -L "/home/ec2-user/.fastai" ] && ln -s /home/ec2-user/SageMaker/.fastai /home/ec2-user/.fastai
echo "Install a new kernel for fastai with name 'Python 3'"
source /home/ec2-user/anaconda3/bin/activate pytorch_p36
python -m ipykernel install --name 'fastai' --display-name 'Python 3' --user
# uncomment if you want to update PyTorch on every start
#echo "Update PyTorch library"
#conda install -y pytorch torchvision -c pytorch
echo "Update fastai library"
conda install -y fastai -c fastai
echo "Install jupyter nbextension"
source /home/ec2-user/anaconda3/bin/activate JupyterSystemEnv
pip install jupyter_contrib_nbextensions
jupyter contrib nbextensions install --user
echo "Restarting jupyter notebook server"
# Kills the jupyter terminal, requires you refresh the page
pkill -f jupyter-notebook
EOF
start script part 2
echo "Getting latest version of fastai course"
cd /home/ec2-user/SageMaker/course-v3
git pull
echo "Finished running onStart script"
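To see which step pushes a script past the 5 minute timeout mentioned above, it can help to time pieces of it by hand in a terminal. This is a rough sketch only; fits_in_budget is a hypothetical helper name, and the 300 second budget is simply the 5 minute figure from this thread.

```shell
#!/bin/bash
# fits_in_budget is a hypothetical helper for illustration.
# Usage: fits_in_budget <budget_seconds> <command> [args...]
# Runs the command, prints how long it took, and returns 0 only if the
# elapsed wall-clock time is within the budget. The command's own exit
# status is ignored here.
fits_in_budget() {
  budget="$1"
  shift
  start=$(date +%s)
  "$@"
  end=$(date +%s)
  elapsed=$(( end - start ))
  echo "elapsed=${elapsed}s budget=${budget}s"
  [ "$elapsed" -le "$budget" ]
}
```

For example: fits_in_budget 300 conda install -y fastai -c fastai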
AWS said they'll send me scripts tonight that persist the changes by making a new notebook instance and then creating a new conda environment stored in the /SageMaker/ folder. The changes then persist, so you'd just need to run the scripts once on startup. It isn't as convenient as the CloudFormation method, though.
@bonnici, @mattmcclean
Thanks so much for the workaround, I'll give that a go tonight and hopefully I can keep going on the course.
Tested this by setting up and running the lesson1-pets.ipynb up through the first .save()
1) Create a new notebook instance. Choose the settings you need (GPU instance, 50 GB storage, etc.). Once it is created, open Jupyter.
2) Upload the shell script setupKernel.sh (below). Open a Jupyter terminal (New -> Terminal) and run this:
cd SageMaker
chmod +x ./setupKernel.sh
./setupKernel.sh
Contents of setupKernel.sh:
#!/bin/bash
echo "Creating new kernel"
conda create -y --prefix /home/ec2-user/SageMaker/anaconda3/envs/sm-fastai python=3.6 ipykernel
source activate /home/ec2-user/SageMaker/anaconda3/envs/sm-fastai
# create the folders on the EBS volume first, then link to them
echo "Creating .fastai and .torch folders"
mkdir -p /home/ec2-user/SageMaker/.torch /home/ec2-user/SageMaker/.fastai
echo "Creating symlinks"
[ ! -L "/home/ec2-user/.torch" ] && ln -s /home/ec2-user/SageMaker/.torch /home/ec2-user/.torch
[ ! -L "/home/ec2-user/.fastai" ] && ln -s /home/ec2-user/SageMaker/.fastai /home/ec2-user/.fastai
echo "Update fastai library"
conda install -y fastai -c fastai
echo "Update torchvision library"
conda install -y pytorch-gpu torchvision -c anaconda
echo "Clone the course repo"
git clone https://github.com/fastai/course-v3.git /home/ec2-user/SageMaker/course-v3
echo "Finished running onStart script"
This takes roughly 15 or so minutes. Maybe less.
3) This should create a new kernel for you, called sm-fastai, with the necessary libraries installed (including fastai). This will also clone the fastai course-v3 GitHub repo.
4) Open one of the fastai example notebooks, or create a new notebook. Under Kernel -> Change Kernel, you should be able to locate conda_sm-fastai. Try using it to make sure everything is good.
5) Stop the notebook instance, create a new lifecycle configuration policy, and add the following to its OnStart portion:
#!/bin/bash
set -e
sudo -u ec2-user -i <<'EOF'
source /home/ec2-user/anaconda3/bin/activate /home/ec2-user/SageMaker/anaconda3/envs/sm-fastai
# Create symlink to kernel
ln -s /home/ec2-user/SageMaker/anaconda3/envs/sm-fastai /home/ec2-user/anaconda3/envs/sm-fastai
EOF
Make sure #!/bin/bash is on line 1. It cannot be on line 2, and it cannot have any spaces around it. If you copy and paste this on Windows, make sure you're using Unix-style line endings. In Notepad++, go to Edit -> EOL Conversion -> Unix to convert.
6) Attach this lifecycle policy to your notebook instance and start it. You should now be able to see your kernel as normal, going forward. The start times are much faster, and the policy will never fail due to the 5 minute timeout, since all the work has already been done and is permanently stored under the /home/ec2-user/SageMaker/ directory.
For any brand new instance created after this, you need to follow these steps once, but never again afterwards (you can just stop/start as normal then).
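The shebang and line-ending requirements above can be checked before pasting a script into the lifecycle configuration. The following is a sketch under assumptions: both function names are hypothetical, and it only checks the two conditions described in this thread.

```shell
#!/bin/bash
# check_lifecycle_script and to_unix_line_endings are hypothetical helpers.

# Returns 0 if the file has no carriage returns and line 1 is exactly
# "#!/bin/bash"; returns non-zero and prints a reason otherwise.
check_lifecycle_script() {
  file="$1"
  cr=$(printf '\r')
  if grep -q "$cr" "$file"; then
    echo "file contains carriage returns (Windows line endings)" >&2
    return 2
  fi
  first_line=""
  # read only the first line of the file
  IFS= read -r first_line < "$file" || true
  if [ "$first_line" != '#!/bin/bash' ]; then
    echo "line 1 is not exactly #!/bin/bash" >&2
    return 1
  fi
  return 0
}

# Removes every carriage return from the file, keeping its permissions.
to_unix_line_endings() {
  file="$1"
  tmp=$(mktemp)
  tr -d '\r' < "$file" > "$tmp" && cat "$tmp" > "$file"
  rm -f "$tmp"
}
```

For example, running to_unix_line_endings onstart.sh followed by check_lifecycle_script onstart.sh (onstart.sh being a placeholder file name) should report nothing and return 0 for a well-formed script.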
Thanks - seems to be working. I did run into that line ending issue on step 1 but copying and pasting it into an editor in the terminal worked. I'm planning to run through the lesson 3 notebooks today so if those all work I think we're in business.
If it's all good I might see if I can copy all this stuff into a lifecycle configuration and make a new CloudFormation template.
Edit: Actually I started getting issues with not being able to select the kernel. What I ended up just doing was commenting out the line:
conda install -y fastai -c fastai
in my original notebook's startup lifecycle configuration, then just running that in a new terminal after I started up the notebook. That way I also get to keep all my old data files etc.
sagemaker-cfn.yml needs to be updated. I created the following fix, but I am unable to push it:
--- a/docs/setup/sagemaker-cfn.yml
+++ b/docs/setup/sagemaker-cfn.yml
@@ -77,7 +77,7 @@ Resources:
 echo "Update fastai library"
-conda install -y fastai -c fastai
+nohup /home/ec2-user/anaconda3/bin/conda install -y fastai -c fastai -v &
 echo "Install jupyter nbextension"
 source /home/ec2-user/anaconda3/bin/activate JupyterSystemEnv
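The change above moves the slow conda install into the background so the start script can finish without waiting for it. A minimal sketch of that pattern is below. The helper name run_in_background and the idea of writing to a log file are assumptions for illustration and are not part of the patch.

```shell
#!/bin/bash
# run_in_background is a hypothetical helper for illustration.
# Usage: run_in_background <log_file> <command> [args...]
# Starts the command detached from the calling script and returns
# immediately, so the caller is not blocked while it runs.
run_in_background() {
  log_file="$1"
  shift
  # nohup: keep the command running after the parent shell exits
  # > log 2>&1: capture stdout and stderr in the log file
  # &: do not wait for the command to finish
  nohup "$@" > "$log_file" 2>&1 &
  echo "started PID $!, logging to $log_file"
}
```

The trade-off is that the script returns before the install finishes, so the library may not be updated yet when Jupyter first opens; checking the log file shows when it is done.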
My CloudFormation stack has not been starting up successfully since Sunday. When I try to start the SageMaker Notebook, it sits in Pending for a while then goes to Failed. The error message is:
Looking at the logs, I just see these log lines:
I've also tried to start up a couple of new stacks in a few different regions using the templates linked in the course here but I get the same error. I'm pretty new to fastai so I'm not really sure what the problem is, apologies if this is the wrong place to report the issue.