uwaisiqbal opened this issue 4 years ago
Why does Sagemaker pass a train argument by default to the bash command?
This is SageMaker's contract: https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo-dockerfile.html. Unfortunately, there's no way to change that.
What you can do is create a script that responds to train and executes the bash command that you need for running your script.
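For illustration, a minimal sketch of such a shim, written here in Python (the kedro run command is only an example taken from this thread; any executable named train on the PATH will do):

#!/usr/bin/env python3
# Hypothetical /usr/local/bin/train shim. SageMaker launches the container as
# "docker run <image> train", so an executable named "train" on the PATH can
# catch that and forward to whatever command you actually want to run.
import subprocess
import sys

if __name__ == "__main__":
    # Replace the command below with your own training command.
    sys.exit(subprocess.call(["kedro", "run"] + sys.argv[1:]))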
Thanks for the help. I added an alias within my CLI tool to respond to the train argument. But I'm running into another problem. I've modified the entrypoint in my Dockerfile as follows:
ENTRYPOINT ["kedro"]
However, the hyperparameters I have specified when creating the Estimator are not passed down to the command. It seems as though the hyperparameters are only passed down if you specify a Python script with SAGEMAKER_PROGRAM.
The hyperparameters should be available. The SageMaker service makes them available in a hyperparameters.json file, and if you've used the sagemaker-training-toolkit, they are read in and exposed as environment variables to your script/entry point.
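For reference, a minimal sketch of reading that file from inside the container, independent of the toolkit (the path is part of SageMaker's documented container contract):

import json

# SageMaker writes the hyperparameters from the CreateTrainingJob request here.
with open("/opt/ml/input/config/hyperparameters.json") as f:
    hyperparameters = json.load(f)  # note: values arrive as strings

print(hyperparameters)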
Following the call path of estimator.fit():

In the SageMaker Python SDK:
- When fit() is invoked, a call to start a new training job is made.
- The hyperparameters are collected as a dict[str, str], added to the train_args, and the train method is invoked on the SageMaker session instance.
- A create_training_job call is made to the SageMaker API.

On the SageMaker service side, the breadcrumbs lead us through the docs:
- The container has a hyperparameters.json file available, which contains the hyperparameters passed in the CreateTrainingJob request. Of particular interest, referenced in this doc, is a link to the sagemaker-containers code (deprecated, see sagemaker-training-toolkit instead).

In the sagemaker-training-toolkit code:
- The train method in the toolkit computes several arguments before invoking run on the entry_point. Of note, it invokes the to_env_vars() method on the Environment instance. Comments in the module mention what SageMaker has done with the hyperparameters.
- The env.to_env_vars() method converts them to a dictionary of env vars.
- The entry_point run method takes the env_vars dict it was passed and writes them to environment variables which are available to the script/entry point, as mentioned here.

@metrizable thanks for the explanation. I have to remark that the documentation isn't the clearest, and there really isn't an example demonstrating this functionality with a bash entry point. I've read through the code and have a better understanding of how SageMaker is working under the hood.
I've set up another minimal example to test the functionality, and it isn't behaving as expected. I have set the entrypoint to the echo command in my Dockerfile as follows:
ENTRYPOINT ["echo"]
I set up my Estimator with hyperparameters as follows:
hyperparameters = {'test': 10, 'a': 50, 'b': 'some text'}
estimator = Estimator(
image_name=image,
role=iam_role,
output_path=f"s3://{aws_params['SCW_S3_BUCKET']}/sagemaker/output/",
train_instance_count=instance_count,
input_mode='File',
train_instance_type='local',
tags=TAGS,
subnets=aws_params['VPC_SUBNETS'],
security_group_ids=aws_params['VPC_SGS'],
output_kms_key=aws_params['SCW_KMS_KEY'],
hyperparameters=hyperparameters
)
estimator.fit()
Then I build my docker container and run the estimator and the output is the following:
2020-07-03 11:42:09,057 - sagemaker.local.image - INFO - docker command: docker-compose -f /private/var/folders/hb/qlcnb3ps2gz4v75__n9jws_40000gp/T/tmp4hke8hn_/docker-compose.yaml up --build --abort-on-container-exit
Creating tmp4hke8hn__algo-1-9jeh2_1 ... done
Attaching to tmp4hke8hn__algo-1-9jeh2_1
algo-1-9jeh2_1 | train
tmp4hke8hn__algo-1-9jeh2_1 exited with code 0
Aborting on container exit...
2020-07-03 11:42:10,771 - sagemaker - WARNING - 'upload_data' method will be deprecated in favor of 'S3Uploader' class (https://sagemaker.readthedocs.io/en/stable/s3.html#sagemaker.s3.S3Uploader) in SageMaker Python SDK v2.
===== Job Complete =====
I was expecting the hyperparameters to be printed to the terminal using the echo command, but it just prints the train command. Unless I'm misunderstanding how the sagemaker-training toolkit works, the hyperparameters should also be printed, as here.
However, if I modify my Dockerfile and set SAGEMAKER_PROGRAM to a test.sh script, things work as expected:
ENV SAGEMAKER_PROGRAM test.sh
where test.sh simply echoes the arguments to the terminal:
#!/usr/bin/env bash
echo "Inside test script"
for i; do
echo $i
done
I build my container again and run the Estimator to get the following output:
2020-07-03 11:11:42,361 - sagemaker.local.image - INFO - docker command: docker-compose -f /private/var/folders/hb/qlcnb3ps2gz4v75__n9jws_40000gp/T/tmpcfh9sq30/docker-compose.yaml up --build --abort-on-container-exit
Creating tmpcfh9sq30_algo-1-8evcu_1 ... done
Attaching to tmpcfh9sq30_algo-1-8evcu_1
algo-1-8evcu_1 | 2020-07-03 10:11:43,914 sagemaker-training-toolkit INFO No GPUs detected (normal if no gpus installed)
algo-1-8evcu_1 | 2020-07-03 10:11:43,926 sagemaker-training-toolkit INFO Failed to parse hyperparameter b value some text to Json.
algo-1-8evcu_1 | Returning the value itself
algo-1-8evcu_1 | 2020-07-03 10:11:43,951 sagemaker-training-toolkit INFO No GPUs detected (normal if no gpus installed)
algo-1-8evcu_1 | 2020-07-03 10:11:43,975 sagemaker-training-toolkit INFO Failed to parse hyperparameter b value some text to Json.
algo-1-8evcu_1 | Returning the value itself
algo-1-8evcu_1 | 2020-07-03 10:11:43,998 sagemaker-training-toolkit INFO No GPUs detected (normal if no gpus installed)
algo-1-8evcu_1 | 2020-07-03 10:11:44,025 sagemaker-training-toolkit INFO Failed to parse hyperparameter b value some text to Json.
algo-1-8evcu_1 | Returning the value itself
algo-1-8evcu_1 | 2020-07-03 10:11:44,048 sagemaker-training-toolkit INFO Invoking user script
algo-1-8evcu_1 |
algo-1-8evcu_1 | Training Env:
algo-1-8evcu_1 |
algo-1-8evcu_1 | {
algo-1-8evcu_1 | "additional_framework_parameters": {},
algo-1-8evcu_1 | "channel_input_dirs": {},
algo-1-8evcu_1 | "current_host": "algo-1-8evcu",
algo-1-8evcu_1 | "framework_module": null,
algo-1-8evcu_1 | "hosts": [
algo-1-8evcu_1 | "algo-1-8evcu"
algo-1-8evcu_1 | ],
algo-1-8evcu_1 | "hyperparameters": {
algo-1-8evcu_1 | "test": 10,
algo-1-8evcu_1 | "a": 50,
algo-1-8evcu_1 | "b": "some text"
algo-1-8evcu_1 | },
algo-1-8evcu_1 | "input_config_dir": "/opt/ml/input/config",
algo-1-8evcu_1 | "input_data_config": {},
algo-1-8evcu_1 | "input_dir": "/opt/ml/input",
algo-1-8evcu_1 | "is_master": true,
algo-1-8evcu_1 | "job_name": job_name,
algo-1-8evcu_1 | "log_level": 20,
algo-1-8evcu_1 | "master_hostname": "algo-1-8evcu",
algo-1-8evcu_1 | "model_dir": "/opt/ml/model",
algo-1-8evcu_1 | "module_dir": "/opt/ml/code",
algo-1-8evcu_1 | "module_name": "test.sh",
algo-1-8evcu_1 | "network_interface_name": "eth0",
algo-1-8evcu_1 | "num_cpus": 2,
algo-1-8evcu_1 | "num_gpus": 0,
algo-1-8evcu_1 | "output_data_dir": "/opt/ml/output/data",
algo-1-8evcu_1 | "output_dir": "/opt/ml/output",
algo-1-8evcu_1 | "output_intermediate_dir": "/opt/ml/output/intermediate",
algo-1-8evcu_1 | "resource_config": {
algo-1-8evcu_1 | "current_host": "algo-1-8evcu",
algo-1-8evcu_1 | "hosts": [
algo-1-8evcu_1 | "algo-1-8evcu"
algo-1-8evcu_1 | ]
algo-1-8evcu_1 | },
algo-1-8evcu_1 | "user_entry_point": "test.sh"
algo-1-8evcu_1 | }
algo-1-8evcu_1 |
algo-1-8evcu_1 | Environment variables:
algo-1-8evcu_1 |
algo-1-8evcu_1 | SM_HOSTS=["algo-1-8evcu"]
algo-1-8evcu_1 | SM_NETWORK_INTERFACE_NAME=eth0
algo-1-8evcu_1 | SM_HPS={"a":50,"b":"some text","test":10}
algo-1-8evcu_1 | SM_USER_ENTRY_POINT=test.sh
algo-1-8evcu_1 | SM_FRAMEWORK_PARAMS={}
algo-1-8evcu_1 | SM_RESOURCE_CONFIG={"current_host":"algo-1-8evcu","hosts":["algo-1-8evcu"]}
algo-1-8evcu_1 | SM_INPUT_DATA_CONFIG={}
algo-1-8evcu_1 | SM_OUTPUT_DATA_DIR=/opt/ml/output/data
algo-1-8evcu_1 | SM_CHANNELS=[]
algo-1-8evcu_1 | SM_CURRENT_HOST=algo-1-8evcu
algo-1-8evcu_1 | SM_MODULE_NAME=test.sh
algo-1-8evcu_1 | SM_LOG_LEVEL=20
algo-1-8evcu_1 | SM_FRAMEWORK_MODULE=
algo-1-8evcu_1 | SM_INPUT_DIR=/opt/ml/input
algo-1-8evcu_1 | SM_INPUT_CONFIG_DIR=/opt/ml/input/config
algo-1-8evcu_1 | SM_OUTPUT_DIR=/opt/ml/output
algo-1-8evcu_1 | SM_NUM_CPUS=2
algo-1-8evcu_1 | SM_NUM_GPUS=0
algo-1-8evcu_1 | SM_MODEL_DIR=/opt/ml/model
algo-1-8evcu_1 | SM_MODULE_DIR=/opt/ml/code
algo-1-8evcu_1 | SM_TRAINING_ENV={"additional_framework_parameters":{},"channel_input_dirs":{},"current_host":"algo-1-8evcu","framework_module":null,"hosts":["algo-1-8evcu"],"hyperparameters":{"a":50,"b":"some text","test":10},"input_config_dir":"/opt/ml/input/config","input_data_config":{},"input_dir":"/opt/ml/input","is_master":true,"job_name":"a204311-kedro-sagemaker-example-2020-07-03-11-11-42-11S","log_level":20,"master_hostname":"algo-1-8evcu","model_dir":"/opt/ml/model","module_dir":"/opt/ml/code","module_name":"test.sh","network_interface_name":"eth0","num_cpus":2,"num_gpus":0,"output_data_dir":"/opt/ml/output/data","output_dir":"/opt/ml/output","output_intermediate_dir":"/opt/ml/output/intermediate","resource_config":{"current_host":"algo-1-8evcu","hosts":["algo-1-8evcu"]},"user_entry_point":"test.sh"}
algo-1-8evcu_1 | SM_USER_ARGS=["-a","50","-b","some text","--test","10"]
algo-1-8evcu_1 | SM_OUTPUT_INTERMEDIATE_DIR=/opt/ml/output/intermediate
algo-1-8evcu_1 | SM_HP_TEST=10
algo-1-8evcu_1 | SM_HP_A=50
algo-1-8evcu_1 | SM_HP_B=some text
algo-1-8evcu_1 | PYTHONPATH=/opt/ml/code:/usr/local/bin:/usr/local/lib/python37.zip:/usr/local/lib/python3.7:/usr/local/lib/python3.7/lib-dynload:/usr/local/lib/python3.7/site-packages
algo-1-8evcu_1 |
algo-1-8evcu_1 | Invoking script with the following command:
algo-1-8evcu_1 |
algo-1-8evcu_1 | /bin/sh -c ./test.sh -a 50 -b 'some text' --test 10
algo-1-8evcu_1 |
algo-1-8evcu_1 |
algo-1-8evcu_1 | Inside test script
algo-1-8evcu_1 | -a
algo-1-8evcu_1 | 50
algo-1-8evcu_1 | -b
algo-1-8evcu_1 | some text
algo-1-8evcu_1 | --test
algo-1-8evcu_1 | 10
algo-1-8evcu_1 | 2020-07-03 10:11:44,067 sagemaker-training-toolkit INFO Reporting training SUCCESS
tmpcfh9sq30_algo-1-8evcu_1 exited with code 0
Aborting on container exit...
2020-07-03 11:11:44,301 - sagemaker - WARNING - 'upload_data' method will be deprecated in favor of 'S3Uploader' class (https://sagemaker.readthedocs.io/en/stable/s3.html#sagemaker.s3.S3Uploader) in SageMaker Python SDK v2.
===== Job Complete =====
It seems like when using ENTRYPOINT vs SAGEMAKER_PROGRAM there is some difference in functionality. I'm not familiar enough with the sagemaker codebase to find where the fork in behaviour happens, but it seems like the entry_point function (https://github.com/aws/sagemaker-training-toolkit/blob/v3.6.0/src/sagemaker_training/entry_point.py#L44) isn't called when an ENTRYPOINT is defined in the Dockerfile.
You get train printed out because this is how the container is launched by SageMaker (https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo-dockerfile.html):
docker run image train
The hyperparameters are by default available in the /opt/ml/input/config/hyperparameters.json file (https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo-running-container.html#your-algorithms-training-algo-running-container-hyperparameters).
The sagemaker-training library provides additional functionality: for example, it sets the hyperparameters as environment variables and calls the provided training script/entry_point/SAGEMAKER_PROGRAM with the hyperparameters as arguments. In the first example, with ENTRYPOINT ["echo"], the sagemaker-training library was not invoked on container start.
As for entry_point in the code (https://github.com/aws/sagemaker-training-toolkit/blob/v3.6.0/src/sagemaker_training/entry_point.py#L44), it refers to the SAGEMAKER_PROGRAM or the user's training script entry_point. It's called the same way by the Framework estimators in the Python SDK.
Support for the user training script (entry_point) being passed to the container as a parameter, instead of being built into the image, is one of the main features of the sagemaker-training library. This makes it easy to iterate on the training script/module without rebuilding the image, or, for example, to let other people use the same image with different training scripts.
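For context, this is the usage pattern the Framework estimators expose. A rough sketch using SDK v1 parameter names (the estimator class, framework version, role and instance type are placeholders, not taken from this issue):

from sagemaker.pytorch import PyTorch

# The training script lives outside the image; the SDK uploads it and the
# sagemaker-training toolkit inside the image runs it as the SAGEMAKER_PROGRAM.
estimator = PyTorch(
    entry_point="train.py",            # local script, not baked into the image
    role="arn:aws:iam::123456789012:role/SageMakerRole",
    framework_version="1.5.0",
    train_instance_count=1,
    train_instance_type="ml.m5.xlarge",
    hyperparameters={"test": 10, "a": 50, "b": "some text"},
)
estimator.fit()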
I realize that this issue was last updated over a year ago, but on the off chance that somebody else also stumbles here, I wanted to fill in a gap as to why the container works like it does even if no CMD or ENTRYPOINT is defined.
Like several people have pointed out, the container is invoked like docker run <image> train. The missing link is why this actually works, and it took at least me some time to figure it out.
When the container setup installs sagemaker-training via pip, the setup.py file in the repository root is also used to determine how to install the package. Buried near the bottom is the magic:
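The relevant part looks roughly like this (a reconstruction consistent with the shim below; the exact setup.py contents may differ between toolkit versions):

from setuptools import setup

setup(
    # ... name, version, packages, etc. ...
    entry_points={
        # pip turns this into an executable called "train" on the PATH
        "console_scripts": ["train=sagemaker_training.cli.train:main"],
    },
)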
This essentially creates a shim executable at /usr/local/bin/train (the exact location may vary, but it is on $PATH nonetheless), with the contents of:
#!/usr/bin/python3
# -*- coding: utf-8 -*-
import re
import sys
from sagemaker_training.cli.train import main
if __name__ == '__main__':
sys.argv[0] = re.sub(r'(-script\.pyw|\.exe)?$', '', sys.argv[0])
sys.exit(main())
So, when the container is defined without an ENTRYPOINT and is invoked with the single argument train, the function sagemaker_training.cli.train.main() is invoked. That function then calls trainer.train(), and we end up at the start of the last list mentioned in the answer https://github.com/aws/sagemaker-training-toolkit/issues/65#issuecomment-653386396.
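Building on that, a custom ENTRYPOINT can keep the toolkit's behaviour by delegating to the same function. A rough sketch (assuming sagemaker-training is installed in the image; the fallback branch is only illustrative):

#!/usr/bin/env python3
# Custom container entrypoint that hands control back to the sagemaker-training
# toolkit when SageMaker invokes the container with the "train" argument.
import sys

if __name__ == "__main__":
    if len(sys.argv) > 1 and sys.argv[1] == "train":
        from sagemaker_training.cli.train import main
        sys.exit(main())
    print("custom entrypoint invoked with:", sys.argv[1:])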
@tvoipio Thanks for your comment, and I'm sure lots of poor people will eventually stumble here given the comically dreadful state of sagemaker documentation.
Can I ask for clarification on how I currently understand the situation, as well as a question about moving forward?
There are two options for training a container:
- docker run <image_name> train, overriding the CMD in the Dockerfile if specified.
- The sagemaker-training method where, if it is installed in the Dockerfile, the container then does not pass train when invoked; instead, it passes a set of environment variables that the user defines themselves. This essentially means you could define a new env var of mode='train' which your code could look for to trigger certain logic.
The Framework estimator is made to run with the sagemaker-training method.
Given the point about Framework estimators: if you had successfully trained a Framework container and wanted to create a transformer from it to do batch inference, how would you run the .transform() method? Would this then not pass serve as an env var? I assume the answer lies awfully documented deep within an issue thread on the sagemaker-inference library somewhere...
There is some excellent research by various commenters here, which provides great insights into the inner workings of the sagemaker-training package. One only wishes it were not this convoluted. Here are some of my findings:
Sagemaker does allow you to run a plain vanilla, arbitrary script file as a training job without needing the sagemaker-training package. See the note here. All you need to do is provide the
"AlgorithmSpecification": {
"ContainerEntrypoint": ["string"],
"ContainerArguments": ["arg1", "arg2"],
...
}
args and it should work with your native script file as expected.
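For example, a rough sketch of a direct boto3 CreateTrainingJob call with these fields (every name, ARN, image URI and bucket below is a placeholder; the required fields follow the API reference):

import boto3

sm = boto3.client("sagemaker")

sm.create_training_job(
    TrainingJobName="custom-entrypoint-job",
    AlgorithmSpecification={
        "TrainingImage": "123456789012.dkr.ecr.eu-west-1.amazonaws.com/my-image:latest",
        "TrainingInputMode": "File",
        # runs the image with your own command instead of the "train" default
        "ContainerEntrypoint": ["/bin/bash", "-c"],
        "ContainerArguments": ["kedro run"],
    },
    RoleArn="arn:aws:iam::123456789012:role/SageMakerRole",
    OutputDataConfig={"S3OutputPath": "s3://my-bucket/sagemaker/output/"},
    ResourceConfig={
        "InstanceType": "ml.m5.xlarge",
        "InstanceCount": 1,
        "VolumeSizeInGB": 30,
    },
    StoppingCondition={"MaxRuntimeInSeconds": 3600},
)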
But wait a minute: how do I do this through the Sagemaker SDK Estimator class? Guess what, the Estimator class even has an entry_point attribute! What happens if I use that? Will it magically get converted to the ContainerEntrypoint arg in the training job API call?
NO!
The entry_point set in the Estimator is fashioned into a script var here, which is then later set as the SAGEMAKER_PROGRAM here.
So the end effect is that there is NO way to express ContainerEntrypoint via the Sagemaker SDK. It is meant to be interpreted by the sagemaker-training package on the receiving end.
This leads to the following pernicious dependency: if you want to use the Sagemaker SDK, you have to use the sagemaker-training package, which means you cannot override it with your custom entry point.
There are other issues. The sagemaker-training package does not work in Sagemaker Studio because the Python kernel in Studio does not have gcc installed. It will also not work on Windows systems due to this issue.
Now what happens is that when you are working inside your Python project, you are forced to include the sagemaker-training package as a dependency so that you can add it to your installs when building your custom container. But because you need the other libs in your project for development (pandas, numpy, etc.) and you are forced to include sagemaker-training (so that it can support your use of the sagemaker SDK), your local development environment is broken (at least on Windows)!
To fix the above, what I recommend is to remove sagemaker-training as a dependency from your project (from pyproject.toml, say) and instead do a separate pip install of it in the Dockerfile.
Moral of the story: you cannot use the Sagemaker SDK if you cannot use sagemaker-training.
And a final caveat (whew!). If you decide to go down the path of implementing your own custom entry point as @uwaisiqbal has done, local mode testing will break too, since the Sagemaker SDK forces the same train command in local mode, which your custom entrypoint will not expect, and it will hence break.
I would love for some AWS expert to confirm (or push back on) my findings.
Describe the bug
I would like to create a SageMaker training job using a custom Docker container which executes a bash command I have created. I am using the kedro framework to organise and structure my code into pipelines and nodes, and I would like to execute my training code with that bash command. For some reason, Sagemaker passes train as a default execution parameter.
To reproduce
The following is my Dockerfile:
I am creating and running a sagemaker job with the following code:
When I execute estimator.fit() I get the following error:
Why does Sagemaker pass a train argument by default to the bash command?
Expected behavior
I would expect the sagemaker job to execute the following bash command within the job: