aws / sagemaker-pytorch-training-toolkit

Toolkit for running PyTorch training scripts on SageMaker. Dockerfiles used for building SageMaker PyTorch containers are at https://github.com/aws/deep-learning-containers.
Apache License 2.0

Worker initialization #115

Open romank87 opened 5 years ago

romank87 commented 5 years ago

It seems like there is a bug in the initialization logic. Gunicorn workers are initialized not at container start but when the first request arrives. The global app variable here is not shared between gunicorn processes, so each worker is initialized only when it receives its first request.

This causes nondeterministic behavior. If a request lands on a worker that is already initialized, it is processed quickly; if it lands on a worker that is not yet initialized, the response is delayed for quite some time (>30 sec in my case). This can even cause /ping requests to time out, making it impossible to deploy the container to AWS.
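For illustration, the lazy-initialization pattern being described looks roughly like the sketch below (names are illustrative, not the toolkit's actual source). Each gunicorn worker process holds its own copy of the global, so the expensive setup runs on that worker's first request rather than at container start:

# Hypothetical sketch of per-worker lazy initialization (illustrative only)
import time

app = None

def build_app():
    # Placeholder for the real, expensive setup: importing the framework,
    # loading the model, constructing the WSGI app, etc.
    time.sleep(30)  # simulate the slow first-request initialization
    def wsgi_app(environ, start_response):
        start_response('200 OK', [('Content-Type', 'text/plain')])
        return [b'ok']
    return wsgi_app

def main(environ, start_response):
    # Each gunicorn worker has its own copy of `app`, so this branch runs
    # on that worker's first request, not at container start.
    global app
    if app is None:
        app = build_app()
    return app(environ, start_response)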

stale[bot] commented 5 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

icywang86rui commented 5 years ago

@romank87 Thanks for the feedback. We have an internal backlog item tracking this issue and will keep you updated on the progress.

scottpletcher commented 5 years ago

Seeing this as well when trying to deploy a plain 1.1.0 container... it fails the health check and never completes deployment. Frustratingly enough, I'm trying to use this PyTorch container to complete an instructional video on how to submit custom models to the AWS Marketplace...

chuyang-deng commented 5 years ago

Hi @scottpletcher, I apologize for the inconvenience. We have assigned a dedicated engineer to work on this issue.

One workaround you can try is to load pre-installed modules into the container instead of installing dependencies at runtime.

Thanks for your patience!

nbeuchat commented 5 years ago

I have been referred to this thread by AWS support.

Regarding "One workaround you can try is to load pre-installed modules into the container instead of installing dependencies at runtime": @ChuyangDeng, could you please give more information on how to do this? We are deploying our model through a Jupyter notebook on SageMaker.

from sagemaker.session import Session
from sagemaker.pytorch import PyTorchModel

model_data = Session().upload_data(path='model.tar.gz', key_prefix='model')

env = {
    "SAGEMAKER_REQUIREMENTS": "requirements.txt", # path relative to `source_dir` below.
}

model = PyTorchModel(model_data=model_data,
                     entry_point='generate.py',
                     role=role,
                     env=env,
                     source_dir='.',
                     name=endpoint_name,
                     framework_version='1.0.0')

predictor = model.deploy(initial_instance_count=1, instance_type='ml.m5.large')

In the requirements.txt, we have pytorch_pretrained_bert.

I tried removing pytorch_pretrained_bert from the requirements file and using !pip install pytorch_pretrained_bert in the notebook, but I am still not able to deploy. I receive the following error:

ValueError: Error hosting endpoint bert-offer-type-multilang: Failed Reason:  The primary container for production variant AllTraffic did not pass the ping health check. Please check CloudWatch logs for this endpoint.

Note that I used a huge machine here as we had "No space left" issues in the logs (although initially, that very same model could be successfully deployed to an ml.t2.large instance).

ChoiByungWook commented 5 years ago

@nbeuchat,

Is the requirements.txt installation not working for you?

The notebook environment where you installed pytorch_pretrained_bert is different from the one where your model is hosted. You will need to do an explicit install in your generate.py file, which is what gets persisted over to the hosting container.
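For example, an explicit install at the top of the entry point could look like the sketch below (a plain pip subprocess call; this is just one way to do it, not an official toolkit mechanism):

# generate.py -- hypothetical sketch: install a dependency before importing it
import subprocess
import sys

subprocess.check_call([sys.executable, '-m', 'pip', 'install', 'pytorch_pretrained_bert'])

from pytorch_pretrained_bert import BertModel  # importable after the install above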

However, that script execution happens after the ping (during worker initialization on the first request), so @romank87 would still have the same issue. There doesn't seem to be a nice solution other than to modify the Docker container to contain your dependency, if you need to get past worker initialization.

You would have to either modify the container at runtime or build it yourself with a modified Dockerfile.

Modify at runtime

  1. Log in to our ECR repo
    • $(aws ecr get-login --no-include-email --registry-id 520713654638)
  2. Pull down the PyTorch image from ECR
    • docker pull 520713654638.dkr.ecr.us-west-2.amazonaws.com/sagemaker-pytorch:1.1.0-cpu-py3
  3. Run your container with a bash session
    • docker run -it 520713654638.dkr.ecr.us-west-2.amazonaws.com/sagemaker-pytorch:1.1.0-cpu-py3 bash
  4. Install your dependencies
    • pip install blah blah blah
  5. In another bash session, commit the running container as a new image
    • docker commit --change='ENTRYPOINT ["bash", "-m", "start_with_right_hostname.sh"]' $(docker ps -q) sagemaker-pytorch-container:1.0

The command above assumes there is only one running container on your Docker host; otherwise you will need to replace docker ps -q with the right container ID.

Build

  1. Modify the Dockerfile to install your dependencies (see the sketch after this list)
  2. Follow the instructions here: https://github.com/aws/sagemaker-pytorch-container#building-your-image
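As a rough illustration of step 1, a Dockerfile that bakes the dependency in could be as small as the sketch below; here it extends the published image from earlier in this thread rather than editing the repository's Dockerfile, and the package is just the one discussed above:

# Hypothetical Dockerfile: extend the prebuilt SageMaker PyTorch image
FROM 520713654638.dkr.ecr.us-west-2.amazonaws.com/sagemaker-pytorch:1.1.0-cpu-py3

# Install dependencies at build time so workers don't install them at runtime
RUN pip install pytorch_pretrained_bert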

Testing your new image

  1. Change the image constructor parameter in PyTorchModel to your new image (sagemaker-pytorch-container:1.0)
  2. Change instance_type to 'local'.
  3. Deploy

This runs the container on your local machine, which lets you iterate much more quickly than waiting for instances to provision. Once the container runs as expected, you can push the image to an ECR repo and deploy in SageMaker.
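A rough sketch of that local test, assuming the SageMaker Python SDK v1 API used elsewhere in this thread and the image name from the docker commit step above:

from sagemaker.session import Session
from sagemaker.pytorch import PyTorchModel

model_data = Session().upload_data(path='model.tar.gz', key_prefix='model')

model = PyTorchModel(model_data=model_data,
                     role=role,                                 # the IAM role used earlier in this thread
                     entry_point='generate.py',
                     source_dir='.',
                     image='sagemaker-pytorch-container:1.0',   # the image committed/built above
                     framework_version='1.1.0')

# instance_type='local' runs the container on this machine with Docker
# instead of provisioning a SageMaker instance.
predictor = model.deploy(initial_instance_count=1, instance_type='local')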

I apologize for all of the inconvenience. To a certain extent, being required to do either of the options listed above defeats the purpose of these images existing, since we want them to be an abstraction.

Please let me know if there is anything I can clarify.

sivakhno commented 5 years ago

@ChoiByungWook - thanks for the clarification above. This is exactly what we are trying to do:

sagemaker_serving = PyTorchModel(model_data=merged_models_file_path,
                                 source_dir='./',
                                 image=image_name,
                                 role=os.environ['SAGEMAKER_ROLE'],
                                 framework_version='1.0.0',
                                 entry_point='serving.py',
                                 predictor_cls=utils.JSONPredictor)

where image_name is our custom image (an extension of sagemaker-pytorch). However, as per https://github.com/aws/sagemaker-containers/blob/master/TRAINING_IN_DETAIL.rst it seems that

One difference between a Framework Container and a BYOC is  ... the former doesn't include the user entry point and needs to download it from S3

which in our case seems to lead to the creation of a massive source.tar.gz file (I also could not find any documentation on how to control what goes into this file; it seems to contain a snapshot of all files in the current directory and its subdirectories). Can we set up PyTorchModel such that the entrypoint in the Docker image is used instead? Thanks!

icywang86rui commented 5 years ago

@sivakhno
PyTorchModel tars up everything under source_dir, and that becomes the source.tar.gz file. So if you would like to control what goes into it, you can create a folder containing only the relevant files and point source_dir to that folder. entry_point should point to the entry point script you would like the container to execute; in your case that's serving.py, assuming it's directly under source_dir.
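A minimal sketch of that setup (the folder name serving_src is hypothetical; the other names follow the snippet above):

# Hypothetical layout: ./serving_src contains only serving.py (plus requirements.txt if needed)
import os
from sagemaker.pytorch import PyTorchModel

sagemaker_serving = PyTorchModel(model_data=merged_models_file_path,   # as in the snippet above
                                 source_dir='./serving_src',           # only this folder is tarred into source.tar.gz
                                 entry_point='serving.py',             # path relative to source_dir
                                 image=image_name,                      # your custom image, as above
                                 role=os.environ['SAGEMAKER_ROLE'],
                                 framework_version='1.0.0',
                                 predictor_cls=utils.JSONPredictor)     # utils is your own module, as above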

Hope this answers your question. Please let us know if you have any further questions.