aws / sagemaker-training-toolkit

Train machine learning models within a 🐳 Docker container using 🧠 Amazon SageMaker.

Bash Command ENTRYPOINT Expects `train` argument #65

Open uwaisiqbal opened 4 years ago

uwaisiqbal commented 4 years ago

Describe the bug

I would like to create a SageMaker training job using a custom Docker container which executes a bash command I have created. I am using the kedro framework to organise and structure my code into pipelines and nodes. I would like to execute my training code with the bash command

kedro run --tag train_pipeline

For some reason, SageMaker passes train as a default execution parameter.

To reproduce

The following is my Dockerfile:

FROM python:3.7-stretch

# install project requirements
COPY src/requirements_sm.txt /tmp/requirements.txt
RUN pip install -r /tmp/requirements.txt && rm -f /tmp/requirements.txt

# install nltk dependencies
RUN python -m nltk.downloader punkt

# Copy the whole project except what is in .dockerignore
COPY . /opt/ml/code

# Set working directory
WORKDIR /opt/ml/code

ENTRYPOINT ["kedro", "run"]

I am creating and running a SageMaker job with the following code:

hyperparams = {
    'tag': 'train_pipeline',
}

estimator = Estimator(
    image_name=IMAGE_NAME,
    role=IAM_ROLE,
    train_instance_count=1,
    train_instance_type='local',
    tags=TAGS,
    subnets=SUBNETS,
    security_group_ids=SG_IDS,
    hyperparameters=hyperparams,
    output_kms_key=KMS_KEY,
    output_path=BUCKET_PATH
)

estimator.fit()

When I execute estimator.fit() I get the following error:

Creating tmpvixmgk5s_algo-1-p8xwk_1 ... done
Attaching to tmpvixmgk5s_algo-1-p8xwk_1
algo-1-p8xwk_1  | Usage: kedro run [OPTIONS]
algo-1-p8xwk_1  | Try 'kedro run -h' for help.
algo-1-p8xwk_1  | 
algo-1-p8xwk_1  | Error: Got unexpected extra argument (train)
tmpvixmgk5s_algo-1-p8xwk_1 exited with code 2

Why does SageMaker pass a train argument by default to the bash command?

Expected behavior

I would expect the SageMaker job to execute the following bash command within the job:

kedro run --tag train_pipeline

laurenyu commented 4 years ago

Why does SageMaker pass a train argument by default to the bash command?

This is SageMaker's contract: https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo-dockerfile.html. Unfortunately, there's not a way to change that.

What you can do is create a script that responds to train and executes the bash command that you need for running your script.
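
For illustration, a minimal sketch of such a shim, assuming it is copied into the image as an executable file named train somewhere on the PATH (e.g. /usr/local/bin/train) and that no ENTRYPOINT is set, so that SageMaker's docker run <image> train resolves to it:

#!/usr/bin/env python3
# Hypothetical shim: SageMaker launches the container as `docker run <image> train`,
# so an executable called `train` on the PATH can simply hand off to the real command.
import os

if __name__ == "__main__":
    # Ignore the `train` argument itself and exec the kedro pipeline run.
    os.execvp("kedro", ["kedro", "run", "--tag", "train_pipeline"])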

uwaisiqbal commented 4 years ago

Thanks for the help. I added an alias within my CLI tool to respond to the train argument. But I'm running into another problem. I've modified the entrypoint in my Dockerfile as follows:

ENTRYPOINT ["kedro"]

However, the hyperparameters I have specified when creating the Estimator are not passed down to the command. It seems as though the hyperparameters are only passed down if you specify a Python script via SAGEMAKER_PROGRAM.

metrizable commented 4 years ago

tl;dr

The hyperparameters should be available. The SageMaker service makes them available in a hyperparameters.json file, and, if you've used the sagemaker-training-toolkit, they are read in and made available as environment variables to your script/entry point.

a deeper story

Following the call path of estimator.fit():

  1. When fit() is invoked, a call to start a new training job is made.
  2. In this call, the hyperparameters are formatted as a dict[str, str] and added to the train_args, and the train method is invoked on the SageMaker session instance.
  3. In the session instance, the hyperparameters are added to the request object, and a create_training_job call is made to the SageMaker API.

On the SageMaker service side, the breadcrumbs lead us through the docs:

  1. How Training Works
  2. Bring Your Own Model, Use Your Own Training Algorithms, How SageMaker Runs It
  3. Create Docker Container. Of note, the SageMaker service makes the hyperparameters.json file available, which contains the hyperparameters passed in the CreateTrainingJob request. Of particular interest, referenced in this doc, is a link to the sagemaker-containers code (deprecated; see sagemaker-training-toolkit instead).

In the sagemaker-training-toolkit code:

  1. The train method in the toolkit computes several arguments before invoking run on the entry_point. Of note, it invokes the to_env_vars() method on the Environment instance. Comments in the module mention what SageMaker has done with the hyperparameters.
  2. The env.to_env_vars() method converts them to a dictionary of env vars.
  3. The entry_point run method takes the env_vars dict it was passed and writes them to environment variables which are available to the script/entry point, as mentioned here.
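
In practice, a user script launched through the toolkit can read the hyperparameters back from those environment variables. A minimal sketch (the SM_* names below match the ones shown in the log output further down this thread):

# Hypothetical user entry point reading the environment variables
# the sagemaker-training-toolkit exports before invoking the script.
import json
import os

hyperparameters = json.loads(os.environ.get("SM_HPS", "{}"))  # all hyperparameters as one JSON document
test = os.environ.get("SM_HP_TEST")                           # individual SM_HP_<NAME> variables
model_dir = os.environ.get("SM_MODEL_DIR", "/opt/ml/model")   # standard SageMaker model path

print(hyperparameters, test, model_dir)
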
uwaisiqbal commented 4 years ago

@metrizable thanks for the explanation. I have to remark that the documentation isn't the clearest, and there really isn't an example demonstrating this functionality with a bash entry point. I've had a read through the code and have a better understanding of how SageMaker works under the hood.

I've set up another minimal example to test the functionality, and it isn't behaving as expected.

I have set the entrypoint to the echo command in my Dockerfile as follows:

ENTRYPOINT ["echo"]

I set up my Estimator with hyperparameters as follows:

hyperparams = {'test': 10, 'a': 50, 'b': 'some text'}
estimator = Estimator(
            image_name=image,
            role=iam_role,
            output_path=f"s3://{aws_params['SCW_S3_BUCKET']}/sagemaker/output/",
            train_instance_count=instance_count,
            input_mode='File',
            train_instance_type='local',
            tags=TAGS,
            subnets=aws_params['VPC_SUBNETS'],
            security_group_ids=aws_params['VPC_SGS'],
            output_kms_key=aws_params['SCW_KMS_KEY'],
            hyperparameters=hyperparams
        )

estimator.fit()

Then I build my Docker container and run the estimator, and the output is the following:

2020-07-03 11:42:09,057 - sagemaker.local.image - INFO - docker command: docker-compose -f /private/var/folders/hb/qlcnb3ps2gz4v75__n9jws_40000gp/T/tmp4hke8hn_/docker-compose.yaml up --build --abort-on-container-exit
Creating tmp4hke8hn__algo-1-9jeh2_1 ... done
Attaching to tmp4hke8hn__algo-1-9jeh2_1
algo-1-9jeh2_1  | train
tmp4hke8hn__algo-1-9jeh2_1 exited with code 0
Aborting on container exit...
2020-07-03 11:42:10,771 - sagemaker - WARNING - 'upload_data' method will be deprecated in favor of 'S3Uploader' class (https://sagemaker.readthedocs.io/en/stable/s3.html#sagemaker.s3.S3Uploader) in SageMaker Python SDK v2.
===== Job Complete =====

I was expecting the hyperparameters to be printed to the terminal via the echo command, but it just prints train. Unless I'm misunderstanding how the sagemaker-training-toolkit works, the hyperparameters should also be printed, as here.

However, if I modify my Dockerfile and set SAGEMAKER_PROGRAM to a test.sh script, things work as expected:

ENV SAGEMAKER_PROGRAM test.sh

where test.sh simply echoes the arguments to the terminal:

#!/usr/bin/env bash
echo "Inside test script"
for i; do
  echo $i
done

I build my container again and run the Estimator to get the following output:

2020-07-03 11:11:42,361 - sagemaker.local.image - INFO - docker command: docker-compose -f /private/var/folders/hb/qlcnb3ps2gz4v75__n9jws_40000gp/T/tmpcfh9sq30/docker-compose.yaml up --build --abort-on-container-exit
Creating tmpcfh9sq30_algo-1-8evcu_1 ... done
Attaching to tmpcfh9sq30_algo-1-8evcu_1
algo-1-8evcu_1  | 2020-07-03 10:11:43,914 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
algo-1-8evcu_1  | 2020-07-03 10:11:43,926 sagemaker-training-toolkit INFO     Failed to parse hyperparameter b value some text to Json.
algo-1-8evcu_1  | Returning the value itself
algo-1-8evcu_1  | 2020-07-03 10:11:43,951 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
algo-1-8evcu_1  | 2020-07-03 10:11:43,975 sagemaker-training-toolkit INFO     Failed to parse hyperparameter b value some text to Json.
algo-1-8evcu_1  | Returning the value itself
algo-1-8evcu_1  | 2020-07-03 10:11:43,998 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
algo-1-8evcu_1  | 2020-07-03 10:11:44,025 sagemaker-training-toolkit INFO     Failed to parse hyperparameter b value some text to Json.
algo-1-8evcu_1  | Returning the value itself
algo-1-8evcu_1  | 2020-07-03 10:11:44,048 sagemaker-training-toolkit INFO     Invoking user script
algo-1-8evcu_1  | 
algo-1-8evcu_1  | Training Env:
algo-1-8evcu_1  | 
algo-1-8evcu_1  | {
algo-1-8evcu_1  |     "additional_framework_parameters": {},
algo-1-8evcu_1  |     "channel_input_dirs": {},
algo-1-8evcu_1  |     "current_host": "algo-1-8evcu",
algo-1-8evcu_1  |     "framework_module": null,
algo-1-8evcu_1  |     "hosts": [
algo-1-8evcu_1  |         "algo-1-8evcu"
algo-1-8evcu_1  |     ],
algo-1-8evcu_1  |     "hyperparameters": {
algo-1-8evcu_1  |         "test": 10,
algo-1-8evcu_1  |         "a": 50,
algo-1-8evcu_1  |         "b": "some text"
algo-1-8evcu_1  |     },
algo-1-8evcu_1  |     "input_config_dir": "/opt/ml/input/config",
algo-1-8evcu_1  |     "input_data_config": {},
algo-1-8evcu_1  |     "input_dir": "/opt/ml/input",
algo-1-8evcu_1  |     "is_master": true,
algo-1-8evcu_1  |     "job_name": job_name,
algo-1-8evcu_1  |     "log_level": 20,
algo-1-8evcu_1  |     "master_hostname": "algo-1-8evcu",
algo-1-8evcu_1  |     "model_dir": "/opt/ml/model",
algo-1-8evcu_1  |     "module_dir": "/opt/ml/code",
algo-1-8evcu_1  |     "module_name": "test.sh",
algo-1-8evcu_1  |     "network_interface_name": "eth0",
algo-1-8evcu_1  |     "num_cpus": 2,
algo-1-8evcu_1  |     "num_gpus": 0,
algo-1-8evcu_1  |     "output_data_dir": "/opt/ml/output/data",
algo-1-8evcu_1  |     "output_dir": "/opt/ml/output",
algo-1-8evcu_1  |     "output_intermediate_dir": "/opt/ml/output/intermediate",
algo-1-8evcu_1  |     "resource_config": {
algo-1-8evcu_1  |         "current_host": "algo-1-8evcu",
algo-1-8evcu_1  |         "hosts": [
algo-1-8evcu_1  |             "algo-1-8evcu"
algo-1-8evcu_1  |         ]
algo-1-8evcu_1  |     },
algo-1-8evcu_1  |     "user_entry_point": "test.sh"
algo-1-8evcu_1  | }
algo-1-8evcu_1  | 
algo-1-8evcu_1  | Environment variables:
algo-1-8evcu_1  | 
algo-1-8evcu_1  | SM_HOSTS=["algo-1-8evcu"]
algo-1-8evcu_1  | SM_NETWORK_INTERFACE_NAME=eth0
algo-1-8evcu_1  | SM_HPS={"a":50,"b":"some text","test":10}
algo-1-8evcu_1  | SM_USER_ENTRY_POINT=test.sh
algo-1-8evcu_1  | SM_FRAMEWORK_PARAMS={}
algo-1-8evcu_1  | SM_RESOURCE_CONFIG={"current_host":"algo-1-8evcu","hosts":["algo-1-8evcu"]}
algo-1-8evcu_1  | SM_INPUT_DATA_CONFIG={}
algo-1-8evcu_1  | SM_OUTPUT_DATA_DIR=/opt/ml/output/data
algo-1-8evcu_1  | SM_CHANNELS=[]
algo-1-8evcu_1  | SM_CURRENT_HOST=algo-1-8evcu
algo-1-8evcu_1  | SM_MODULE_NAME=test.sh
algo-1-8evcu_1  | SM_LOG_LEVEL=20
algo-1-8evcu_1  | SM_FRAMEWORK_MODULE=
algo-1-8evcu_1  | SM_INPUT_DIR=/opt/ml/input
algo-1-8evcu_1  | SM_INPUT_CONFIG_DIR=/opt/ml/input/config
algo-1-8evcu_1  | SM_OUTPUT_DIR=/opt/ml/output
algo-1-8evcu_1  | SM_NUM_CPUS=2
algo-1-8evcu_1  | SM_NUM_GPUS=0
algo-1-8evcu_1  | SM_MODEL_DIR=/opt/ml/model
algo-1-8evcu_1  | SM_MODULE_DIR=/opt/ml/code
algo-1-8evcu_1  | SM_TRAINING_ENV={"additional_framework_parameters":{},"channel_input_dirs":{},"current_host":"algo-1-8evcu","framework_module":null,"hosts":["algo-1-8evcu"],"hyperparameters":{"a":50,"b":"some text","test":10},"input_config_dir":"/opt/ml/input/config","input_data_config":{},"input_dir":"/opt/ml/input","is_master":true,"job_name":"a204311-kedro-sagemaker-example-2020-07-03-11-11-42-11S","log_level":20,"master_hostname":"algo-1-8evcu","model_dir":"/opt/ml/model","module_dir":"/opt/ml/code","module_name":"test.sh","network_interface_name":"eth0","num_cpus":2,"num_gpus":0,"output_data_dir":"/opt/ml/output/data","output_dir":"/opt/ml/output","output_intermediate_dir":"/opt/ml/output/intermediate","resource_config":{"current_host":"algo-1-8evcu","hosts":["algo-1-8evcu"]},"user_entry_point":"test.sh"}
algo-1-8evcu_1  | SM_USER_ARGS=["-a","50","-b","some text","--test","10"]
algo-1-8evcu_1  | SM_OUTPUT_INTERMEDIATE_DIR=/opt/ml/output/intermediate
algo-1-8evcu_1  | SM_HP_TEST=10
algo-1-8evcu_1  | SM_HP_A=50
algo-1-8evcu_1  | SM_HP_B=some text
algo-1-8evcu_1  | PYTHONPATH=/opt/ml/code:/usr/local/bin:/usr/local/lib/python37.zip:/usr/local/lib/python3.7:/usr/local/lib/python3.7/lib-dynload:/usr/local/lib/python3.7/site-packages
algo-1-8evcu_1  | 
algo-1-8evcu_1  | Invoking script with the following command:
algo-1-8evcu_1  | 
algo-1-8evcu_1  | /bin/sh -c ./test.sh -a 50 -b 'some text' --test 10
algo-1-8evcu_1  | 
algo-1-8evcu_1  | 
algo-1-8evcu_1  | Inside test script
algo-1-8evcu_1  | -a
algo-1-8evcu_1  | 50
algo-1-8evcu_1  | -b
algo-1-8evcu_1  | some text
algo-1-8evcu_1  | --test
algo-1-8evcu_1  | 10
algo-1-8evcu_1  | 2020-07-03 10:11:44,067 sagemaker-training-toolkit INFO     Reporting training SUCCESS
tmpcfh9sq30_algo-1-8evcu_1 exited with code 0
Aborting on container exit...
2020-07-03 11:11:44,301 - sagemaker - WARNING - 'upload_data' method will be deprecated in favor of 'S3Uploader' class (https://sagemaker.readthedocs.io/en/stable/s3.html#sagemaker.s3.S3Uploader) in SageMaker Python SDK v2.
===== Job Complete =====

It seems like there is some difference in functionality when using ENTRYPOINT vs SAGEMAKER_PROGRAM. I'm not familiar enough with the SageMaker codebase to find where the fork in behaviour happens, but it seems like the entry_point function (https://github.com/aws/sagemaker-training-toolkit/blob/v3.6.0/src/sagemaker_training/entry_point.py#L44) isn't called when an ENTRYPOINT is defined in the Dockerfile.

nadiaya commented 4 years ago

You get train printed out because this is how the container is launched by SageMaker (https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo-dockerfile.html):

docker run image train

The hyperparameters are by default available in the /opt/ml/input/config/hyperparameters.json file (https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo-running-container.html#your-algorithms-training-algo-running-container-hyperparameters).
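
If you are not going through the sagemaker-training library at all (as in the bare ENTRYPOINT ["echo"] example above), you can read that file directly. A minimal sketch; note that, since the hyperparameters are sent as a dict[str, str], the values arrive as strings:

# Hypothetical snippet for a custom container that does not use the toolkit:
# read the hyperparameters SageMaker mounts into the container.
import json

with open("/opt/ml/input/config/hyperparameters.json") as f:
    hyperparameters = json.load(f)  # e.g. {"test": "10", "a": "50", "b": "some text"}

print(hyperparameters)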

The sagemaker-training library provides additional functionality: for example, it sets the hyperparameters as environment variables and calls the provided training script/entry_point/SAGEMAKER_PROGRAM with the hyperparameters as arguments. In the first example, with ENTRYPOINT ["echo"], the sagemaker-training library hasn't been invoked on container start.

As for entry_point in the code (https://github.com/aws/sagemaker-training-toolkit/blob/v3.6.0/src/sagemaker_training/entry_point.py#L44), it refers to SAGEMAKER_PROGRAM, i.e. the user training script entry_point. It's called the same way in the Python SDK Framework estimators. Support for the user training script (entry_point) being passed to the container as a parameter, instead of being built into the image, is one of the main features of the sagemaker-training library. This makes it easy to iterate on the training script/module without rebuilding the image, or, for example, to let other people use the same image with different training scripts.

tvoipio commented 3 years ago

I realize that this issue was last updated over a year ago, but on the off chance that somebody else also stumbles upon it, I wanted to fill in a gap as to why the container works the way it does even if no CMD or ENTRYPOINT is defined.

As several people have pointed out, the container is invoked like docker run <image> train. The missing link is why this actually works, and it took me, at least, some time to figure out.

When the container setup installs sagemaker-training via pip, the setup.py file in the repository root is also used to determine how to install the package. Buried near the bottom is the magic:

https://github.com/aws/sagemaker-training-toolkit/blob/447e8f32c108950b8778bfc31ab3f20174f04a38/setup.py#L92

This essentially creates a shim executable at /usr/local/bin/train (the exact location may vary, but it is on $PATH nonetheless), with the following contents:

#!/usr/bin/python3
# -*- coding: utf-8 -*-
import re
import sys
from sagemaker_training.cli.train import main
if __name__ == '__main__':
    sys.argv[0] = re.sub(r'(-script\.pyw|\.exe)?$', '', sys.argv[0])
    sys.exit(main())

So, when the container is defined without an ENTRYPOINT and is invoked with the single argument train, the function sagemaker_training.cli.train.main() is invoked. That function then calls trainer.train(), and we end up at the start of the last list in this answer: https://github.com/aws/sagemaker-training-toolkit/issues/65#issuecomment-653386396
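
For reference, the declaration behind that shim is a console_scripts entry point; paraphrased (see the linked setup.py for the exact wording), it looks roughly like:

# Paraphrased sketch of the relevant part of setup.py: pip turns this
# console_scripts entry point into the /usr/local/bin/train executable.
from setuptools import setup

setup(
    name="sagemaker_training",
    # ... other arguments elided ...
    entry_points={"console_scripts": ["train=sagemaker_training.cli.train:main"]},
)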

Dan-Treacher commented 3 years ago

@tvoipio Thanks for your comment, and I'm sure lots of poor people will eventually stumble here given the comically dreadful state of the SageMaker documentation.

Can I ask for clarification on how I currently understand the situation, as well as a question about moving forward:

There are two options for training a container

The Framework estimator is made to run with the sagemaker-training method

Given the point about Framework estimators: if you had successfully trained a Framework container and wanted to create a transformer from it to do batch inference, how would you run the .transform() method? Would this then not pass serve as an env var? I assume the answer lies, awfully documented, deep within an issue thread on the sagemaker-inference library somewhere...

kiyer-godaddy commented 1 year ago

There is some excellent research by various commenters here which provides great insight into the inner workings of the sagemaker-training package. One only wishes it were not this convoluted. Here are some of my findings:

SageMaker does allow you to essentially run a plain vanilla, arbitrary script file as a training job without needing the sagemaker-training package. See the note here. All you need to do is provide the

"AlgorithmSpecification": {
        "ContainerEntrypoint": ["string"],   
          "ContainerArguments": ["arg1", "arg2"],
        ...
}

args, and it should work with your native script file as expected.
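
For illustration, a hedged sketch of going straight to the CreateTrainingJob API with boto3, which does expose those fields (the image URI, role ARN, bucket and job name below are placeholders):

# Hypothetical sketch: call CreateTrainingJob directly so that ContainerEntrypoint /
# ContainerArguments can be set, bypassing the SDK Estimator class.
import boto3

sm = boto3.client("sagemaker")
sm.create_training_job(
    TrainingJobName="my-custom-entrypoint-job",
    AlgorithmSpecification={
        "TrainingImage": "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-image:latest",
        "TrainingInputMode": "File",
        "ContainerEntrypoint": ["kedro", "run"],
        "ContainerArguments": ["--tag", "train_pipeline"],
    },
    RoleArn="arn:aws:iam::123456789012:role/MySageMakerRole",
    OutputDataConfig={"S3OutputPath": "s3://my-bucket/sagemaker/output/"},
    ResourceConfig={"InstanceType": "ml.m5.large", "InstanceCount": 1, "VolumeSizeInGB": 30},
    StoppingCondition={"MaxRuntimeInSeconds": 86400},
)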

But wait a minute: how do I do this through the SageMaker SDK Estimator class? Guess what, the Estimator class even has an entry_point attribute! What happens if I use that? Will it magically get converted to the ContainerEntrypoint arg in the training job API call? NO! The entry_point set in the Estimator is fashioned into a script variable here, which is then later set as SAGEMAKER_PROGRAM here.

So the end effect is that there is NO way to express ContainerEntrypoint via the SageMaker SDK. The entry_point is meant to be interpreted by the sagemaker-training package on the receiving end. This leads to the following pernicious dependency: if you want to use the SageMaker SDK, you have to use the sagemaker-training package, which means you cannot override it with your custom entry point.

There are other issues. The sagemaker-training package does not work in SageMaker Studio because the Python kernel in Studio does not have gcc installed. It will also not work on Windows systems due to this issue. What happens is that, when you are working inside your Python project, you are forced to include the sagemaker-training package as a dependency so that you can add it to your installs when building your custom container. But because you need the other libraries in your project for development (pandas, numpy, etc.) and you are forced to include sagemaker-training (so that it can support your use of the SageMaker SDK), your local development environment is broken (at least on Windows)!

To fix the above, what I recommend is to remove sagemaker-training as a dependency from your project (pyproject.toml, say) and instead do a separate pip install in the Dockerfile.

Moral of the story: you cannot use the SageMaker SDK if you cannot use sagemaker-training.

And a final caveat (whew!). If you decide to go down the path of implementing your own custom entry point, as @uwaisiqbal has done, local mode testing will break too, since the SageMaker SDK forces the same train argument in local mode, which your custom entrypoint will not expect and will therefore break on.

I would love for some AWS expert to confirm (or push back on) my findings.