aws / amazon-sagemaker-examples

Example 📓 Jupyter notebooks that demonstrate how to build, train, and deploy machine learning models using 🧠 Amazon SageMaker.
https://sagemaker-examples.readthedocs.io
Apache License 2.0

Launching a Display in Sagemaker #469

Closed by bhairavmehta95 5 years ago

bhairavmehta95 commented 5 years ago

Reinforcement learning often requires a display or a virtual framebuffer.

What's the best way to launch an Xvfb process when training with SageMaker?

laurenyu commented 5 years ago

Thanks for your interest in SageMaker!

To launch an xvfb process, you'll need to extend one of our containers or build your own container. You'll want the entrypoint of the container to do something like:

xvfb-run --auto-servernum -s "-screen 0 1400x900x24" train

where train starts the container code responsible for training.

bhairavmehta95 commented 5 years ago

So I extended my own container, but the xvfb-run method did not work for me because I need to specify SAGEMAKER_PROGRAM. I ended up launching Xvfb from inside my Python file with subprocess.

How do I use both SAGEMAKER_PROGRAM (which seems to expect a Python file) and the traditional Docker ENTRYPOINT? Can I just specify both?

laurenyu commented 5 years ago

Using subprocess to launch Xvfb also works. The other option is to write a shell script that both starts the Xvfb process and calls the Python script, and to make that shell script the Docker entry point.
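As a minimal sketch of that second option, a wrapper along the following lines could start the display server, run the training command, and stop the server afterwards. It is an illustration under assumptions, not the SageMaker containers' actual entry point: the function name `run_with_display`, the variables `DISPLAY_SERVER` and `DISPLAY_NUM`, and the display number `:99` are all made up for this example, and it assumes Xvfb is installed in the image.

```shell
#!/bin/sh
# Hypothetical helper for a Docker entry point script (names are illustrative).
# Assumes Xvfb is installed in the image; both defaults can be overridden.
: "${DISPLAY_SERVER:=Xvfb}"   # command that starts the display server
: "${DISPLAY_NUM:=:99}"       # arbitrary display number chosen for this sketch

run_with_display() {
    # Start the virtual framebuffer in the background and remember its PID.
    "$DISPLAY_SERVER" "$DISPLAY_NUM" -screen 0 1400x900x24 &
    server_pid=$!

    # Crude pause to give the server time to start accepting connections.
    sleep 1

    # Run the wrapped command with DISPLAY pointing at the virtual screen,
    # and keep its exit status.
    status=0
    DISPLAY="$DISPLAY_NUM" "$@" || status=$?

    # Stop the display server so no background process is left running.
    kill "$server_pid" 2>/dev/null || true
    wait "$server_pid" 2>/dev/null || true

    return "$status"
}
```

An entry point script could then call `run_with_display python train.py`, where `train.py` stands in for whatever training script the container runs. The wrapper returns the training command's exit status, so a failed run still surfaces as a failure.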

nadiaya commented 5 years ago

You can check out our PyTorch container, which does something similar to what you need. It uses a shell script as its entrypoint:

https://github.com/aws/sagemaker-pytorch-container/blob/master/docker/0.4.0/final/Dockerfile.cpu#L19
https://github.com/aws/sagemaker-pytorch-container/blob/master/lib/start_with_right_hostname.sh

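If you do the following in the shell script:

# Quote "$1" so the test still works when no argument is passed.
if [ "$1" = 'train' ]
then
    # Training: run under a virtual display.
    xvfb-run --auto-servernum -s "-screen 0 1400x900x24" train
else
    serve
fi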

That would launch training in the regular way, but with a display available, and you should still be able to use sagemaker_program.

bhairavmehta95 commented 5 years ago

So I'm not sure subprocess (at least with Xvfb) is always the way to go. It seemed to hang the exit process (#470). The traditional way people run Xvfb in Docker (with a custom ENTRYPOINT) seems fine, though.

I think it'd be nice to have a Reinforcement Learning tutorial inside of your Sagemaker tutorials list. I've been writing one for our own organization (Duckietown - Sagemaker is our cloud sponsor for a NIPS competition) and would be happy to contribute to the tutorials once I clean it up and make sure it works.