hi @yifeim, thanks for your feedback! Supporting newer versions of deep learning frameworks and offering features across all of our deep learning images are things we're always aiming to do, and we're always re-evaluating our backlog based on customer feedback.
Hi @laurenyu, Thanks for the quick comments. Let me rephrase my question:
What is the easiest way to get the latest mxnet in a custom docker image for sagemaker?
Let's pretend that the module I want to install is mxnet-cu90mkl==1.3.0b20180625.
No extensive tests are necessary; it's fine as long as the system works for now.
Btw, I tried subprocess.check_call([sys.executable, '-m', 'pip', 'uninstall', 'mxnet-cu90', '-y'])
followed by subprocess.check_call([sys.executable, '-m', 'pip', 'install', '-U', 'mxnet-cu90mkl==1.3.0b20180625']).
The uninstall threw ImportError: No module named 'mxnet.callback', so I do not think my hack was successful.
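For reference, here is a minimal sketch of the runtime swap I was attempting. My guess (not verified) is that it only works if it runs before anything imports mxnet; uninstalling the wheel after mxnet has been loaded seems to leave a broken package tree, which would explain the ImportError above.

# Minimal sketch: swap the MXNet wheel before mxnet is imported anywhere in
# the process. The package names and version pin are the ones from this
# thread, not a recommendation.
import subprocess
import sys

def swap_mxnet(old_pkg='mxnet-cu90', new_pkg='mxnet-cu90mkl==1.3.0b20180625'):
    subprocess.check_call([sys.executable, '-m', 'pip', 'uninstall', '-y', old_pkg])
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', '-U', new_pkg])

if __name__ == '__main__':
    swap_mxnet()
    import mxnet as mx  # import only after the swap has completed
    print(mx.__version__)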
Probably the easiest (and fastest) way right now would be to create your own docker image with the right mxnet version and all the dependencies.
Here is our Dockerfile example: https://github.com/aws/sagemaker-mxnet-containers/blob/master/docker/1.1.0/final/Dockerfile.gpu
@nadiaya Thanks! I guess I will explore the lower-level details more. I would love to leave this open for a bit until I get the desired environment, if you don't mind.
Definitely. As Lauren mentioned, supporting newer framework versions is something we aim to do.
@nadiaya Got the solution while reading between the lines:
# For building images of MXNet versions 1.1 and above
docker build -t preprod-mxnet:1.1.0-cpu-py2 --build-arg py_version=2 \
    --build-arg framework_installable=mxnet-1.1.0-py2.py3-none-manylinux1_x86_64.whl -f Dockerfile.cpu .
However, when I try to run test/integ, I get a 403 error when downloading source.tar.gz at the beginning of a training job. It also happens after I do a docker system prune in local mode. Is this a common issue?
Does your aws account have proper access to the s3 bucket? Can you download source.tar.gz directly, for example by using the aws cli?
Could you please post logs with the full error message?
Yes, I can download it from s3 directly. The built image is actually fine when I use it; it is just the integ test that somehow fails. In the following logs, I removed the actual account id. The other failure is very similar -- it also happens when downloading the source code.
test/integ/test_default_model_fn.py::test_default_model_fn PASSED [ 16%]
test/integ/test_gluon_hosting.py::test_gluon_hosting PASSED [ 33%]
test/integ/test_hosting.py::test_hosting PASSED [ 50%]
test/integ/test_linear_regression.py::test_linear_regression FAILED [ 66%]
test/integ/test_py_version.py::test_train_py_version FAILED [ 83%]
test/integ/test_py_version.py::test_hosting_py_version PASSED [100%]
======================================================== FAILURES =========================================================
_________________________________________________ test_linear_regression __________________________________________________
docker_image = 'mxnet-mkl-1.3.0b20180625-py3:latest'
sagemaker_session = <sagemaker.session.Session object at 0x7f595f2f0978>, opt_ml = '/tmp/tmpqmgj8ban', processor = 'cpu'
    def test_linear_regression(docker_image, sagemaker_session, opt_ml, processor):
        resource_path = 'test/resources/linear_regression'
        # create training data
        train_data = np.random.uniform(0, 1, [1000, 2])
        train_label = np.array([train_data[i][0] + 2 * train_data[i][1] for i in range(1000)])
        # eval data... repeat so there's enough to cover multicpu/gpu contexts
        eval_data = np.array([[7, 2], [6, 10], [12, 2]]).repeat(32, 0)
        eval_label = np.array([11, 26, 16]).repeat(32, 0)
        # save training data
        for path in ['training', 'evaluation']:
            os.makedirs(os.path.join(opt_ml, 'input', 'data', path))
        np.savetxt(os.path.join(opt_ml, 'input/data/training/train_data.txt.gz'), train_data)
        np.savetxt(os.path.join(opt_ml, 'input/data/training/train_label.txt.gz'), train_label)
        np.savetxt(os.path.join(opt_ml, 'input/data/evaluation/eval_data.txt.gz'), eval_data)
        np.savetxt(os.path.join(opt_ml, 'input/data/evaluation/eval_label.txt.gz'), eval_label)
        s3_source_archive = fw_utils.tar_and_upload_dir(session=sagemaker_session.boto_session,
                                                        bucket=sagemaker_session.default_bucket(),
                                                        s3_key_prefix=sagemaker_timestamp(),
                                                        script='linear_regression.py',
                                                        directory=resource_path)
        utils.create_config_files('linear_regression.py', s3_source_archive.s3_prefix, opt_ml)
        os.makedirs(os.path.join(opt_ml, 'model'))
>       docker_utils.train(docker_image, opt_ml, processor)
test/integ/test_linear_regression.py:51:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
test/integ/docker_utils.py:46: in train
check_call(cmd)
test/integ/docker_utils.py:53: in check_call
subprocess.check_call(cmd, *popenargs, **kwargs)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
popenargs = (['docker', 'run', '--rm', '-h', 'algo-1', '-v', ...],), kwargs = {}, retcode = 1
cmd = ['docker', 'run', '--rm', '-h', 'algo-1', '-v', ...]
    def check_call(*popenargs, **kwargs):
        """Run command with arguments. Wait for command to complete. If
        the exit code was zero then return, otherwise raise
        CalledProcessError. The CalledProcessError object will have the
        return code in the returncode attribute.
        The arguments are the same as for the call function. Example:
        check_call(["ls", "-l"])
        """
        retcode = call(*popenargs, **kwargs)
        if retcode:
            cmd = kwargs.get("args")
            if cmd is None:
                cmd = popenargs[0]
>               raise CalledProcessError(retcode, cmd)
E subprocess.CalledProcessError: Command '['docker', 'run', '--rm', '-h', 'algo-1', '-v', '/tmp/tmpqmgj8ban:/opt/ml', '-e', 'AWS_ACCESS_KEY_ID', '-e', 'AWS_SECRET_ACCESS_KEY', '-e', 'AWS_SESSION_TOKEN', 'mxnet-mkl-1.3.0b20180625-py3:latest', 'train']' returned non-zero exit status 1.
../../anaconda3/envs/JupyterSystemEnv/lib/python3.6/subprocess.py:291: CalledProcessError
-------------------------------------------------- Captured stderr setup --------------------------------------------------
INFO:botocore.vendored.requests.packages.urllib3.connectionpool:Starting new HTTP connection (1): 169.254.169.254
INFO:botocore.vendored.requests.packages.urllib3.connectionpool:Starting new HTTP connection (1): 169.254.169.254
--------------------------------------------------- Captured log setup ----------------------------------------------------
connectionpool.py 203 INFO Starting new HTTP connection (1): 169.254.169.254
connectionpool.py 203 INFO Starting new HTTP connection (1): 169.254.169.254
-------------------------------------------------- Captured stdout call ---------------------------------------------------
executing docker command: docker run --rm -h algo-1 -v /tmp/tmpqmgj8ban:/opt/ml -e AWS_ACCESS_KEY_ID -e AWS_SECRET_ACCESS_KEY -e AWS_SESSION_TOKEN mxnet-mkl-1.3.0b20180625-py3:latest train
Downloading s3://sagemaker-us-west-2-{acct}/2018-06-29-08-09-02-183/sourcedir.tar.gz to /tmp/script.tar.gz
-------------------------------------------------- Captured stderr call ---------------------------------------------------
INFO:botocore.vendored.requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): sts.amazonaws.com
INFO:botocore.vendored.requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): sagemaker-us-west-2-{acct}.s3.us-west-2.amazonaws.com
INFO:botocore.vendored.requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): sagemaker-us-west-2-{acct}.s3.us-west-2.amazonaws.com
2018-06-29 08:09:02,885 INFO - root - running container entrypoint
2018-06-29 08:09:02,885 INFO - root - starting train task
2018-06-29 08:09:02,889 INFO - container_support.training - Training starting
2018-06-29 08:09:03,491 INFO - mxnet_container.train - MXNetTrainingEnvironment: {'enable_cloudwatch_metrics': False, 'output_dir': '/opt/ml/output', 'channel_dirs': {'evaluation': '/opt/ml/input/data/evaluation', 'training': '/opt/ml/input/data/training', 'Validation': '/opt/ml/input/data/Validation'}, 'base_dir': '/opt/ml', 'model_dir': '/opt/ml/model', '_ps_verbose': 0, 'available_cpus': 4, 'user_script_name': 'linear_regression.py', 'container_log_level': 20, 'input_config_dir': '/opt/ml/input/config', 'hyperparameters': {'sagemaker_region': 'us-west-2', 'sagemaker_container_log_level': 20, 'sagemaker_submit_directory': 's3://sagemaker-us-west-2-{acct}/2018-06-29-08-09-02-183/sourcedir.tar.gz', 'sagemaker_program': 'linear_regression.py'}, '_ps_port': 8000, '_scheduler_host': 'algo-1', 'code_dir': '/opt/ml/code', 'channels': {'evaluation': {'ContentType': 'evalContentType'}, 'training': {'ContentType': 'trainingContentType'}, 'Validation': {}}, 'sagemaker_region': 'us-west-2', 'user_script_archive': 's3://sagemaker-us-west-2-{acct}/2018-06-29-08-09-02-183/sourcedir.tar.gz','user_requirements_file': None, 'available_gpus': 0, 'output_data_dir': '/opt/ml/output/data/', 'hosts': ['algo-1'], 'current_host': 'algo-1', 'resource_config': {'hosts': ['algo-1'], 'current_host': 'algo-1'}, 'job_name': None, 'input_dir': '/opt/ml/input', '_scheduler_ip': '172.17.0.2'}
2018-06-29 08:09:03,506 INFO - botocore.vendored.requests.packages.urllib3.connectionpool - Starting new HTTP connection (1): 169.254.169.254
2018-06-29 08:09:03,509 INFO - botocore.vendored.requests.packages.urllib3.connectionpool - Starting new HTTP connection (1): 169.254.169.254
2018-06-29 08:09:03,546 INFO - botocore.vendored.requests.packages.urllib3.connectionpool - Starting new HTTPS connection (1): sagemaker-us-west-2-{acct}.s3.amazonaws.com
2018-06-29 08:09:03,586 INFO - botocore.vendored.requests.packages.urllib3.connectionpool - Starting new HTTPS connection (2): sagemaker-us-west-2-{acct}.s3.amazonaws.com
2018-06-29 08:09:03,603 INFO - botocore.vendored.requests.packages.urllib3.connectionpool - Starting new HTTPS connection (1): sagemaker-us-west-2-{acct}.s3.us-west-2.amazonaws.com
2018-06-29 08:09:03,630 INFO - botocore.vendored.requests.packages.urllib3.connectionpool - Starting new HTTPS connection (2): sagemaker-us-west-2-{acct}.s3.us-west-2.amazonaws.com
2018-06-29 08:09:03,662 ERROR - container_support.training - uncaught exception during training: An error occurred (403) when calling the HeadObject operation: Forbidden
Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/container_support/training.py", line 36, in start
fw.train()
File "/usr/local/lib/python3.5/dist-packages/mxnet_container/train.py", line 169, in train
mxnet_env.download_user_module()
File "/usr/local/lib/python3.5/dist-packages/container_support/environment.py", line 89, in download_user_module
cs.download_s3_resource(self.user_script_archive, tmp)
File "/usr/local/lib/python3.5/dist-packages/container_support/utils.py", line 37, in download_s3_resource
script_bucket.download_file(script_key_name, target)
File "/usr/local/lib/python3.5/dist-packages/boto3/s3/inject.py", line 246, in bucket_download_file
ExtraArgs=ExtraArgs, Callback=Callback, Config=Config)
File "/usr/local/lib/python3.5/dist-packages/boto3/s3/inject.py", line 172, in download_file
extra_args=ExtraArgs, callback=Callback)
File "/usr/local/lib/python3.5/dist-packages/boto3/s3/transfer.py", line 307, in download_file
future.result()
File "/usr/local/lib/python3.5/dist-packages/s3transfer/futures.py", line 73, in result
return self._coordinator.result()
File "/usr/local/lib/python3.5/dist-packages/s3transfer/futures.py", line 233, in result
raise self._exception
File "/usr/local/lib/python3.5/dist-packages/s3transfer/tasks.py", line 255, in _main
self._submit(transfer_future=transfer_future, **kwargs)
File "/usr/local/lib/python3.5/dist-packages/s3transfer/download.py", line 353, in _submit
**transfer_future.meta.call_args.extra_args
File "/usr/local/lib/python3.5/dist-packages/botocore/client.py", line 314, in _api_call
return self._make_api_call(operation_name, kwargs)
File "/usr/local/lib/python3.5/dist-packages/botocore/client.py", line 612, in _make_api_call
raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (403) when calling the HeadObject operation: Forbidden
---------------------------------------------------- Captured log call ----------------------------------------------------
connectionpool.py 735 INFO Starting new HTTPS connection (1): sts.amazonaws.com
connectionpool.py 735 INFO Starting new HTTPS connection (1): sagemaker-us-west-2-{acct}.s3.us-west-2.amazonaws.com
connectionpool.py 735 INFO Starting new HTTPS connection (1): sagemaker-us-west-2-{acct}.s3.us-west-2.amazonaws.com
Hi @yifeim ,
How do you set your credentials? Is it by exporting them as environment variables? For this line in the error message, Docker picks up credentials from environment variables:
docker run --rm -h algo-1 -v /tmp/tmpqmgj8ban:/opt/ml -e AWS_ACCESS_KEY_ID -e AWS_SECRET_ACCESS_KEY -e AWS_SESSION_TOKEN mxnet-mkl-1.3.0b20180625-py3:latest train
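One quick way to check, as a rough diagnostic sketch rather than part of the test suite, is to repeat the same HeadObject call on the host using only the credentials exported in your shell. The bucket and key below are placeholders; use the values printed in the "Downloading s3://..." line of the test output.

# Diagnostic sketch: reproduce the HeadObject call that fails inside the
# container, using whatever credentials are exported in the current shell.
# A 403 here points at the exported credentials rather than the image itself.
import os
import boto3

bucket = 'sagemaker-us-west-2-<account-id>'       # placeholder
key = '2018-06-29-08-09-02-183/sourcedir.tar.gz'  # key from the test output

for name in ('AWS_ACCESS_KEY_ID', 'AWS_SECRET_ACCESS_KEY', 'AWS_SESSION_TOKEN'):
    print(name, 'is set:', name in os.environ)

# This is the same operation the container performs before downloading the
# user script archive.
boto3.client('s3').head_object(Bucket=bucket, Key=key)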
Hi @yangaws ,
We worked around this issue by using SageMaker bring-your-own Docker images. That approach also gives better packaging, since no custom code needs to be accessed at training time.
Unfortunately, I don't think we have dug further since the last update. It is understandable that accessing s3 buckets can be tricky at times, especially in nonstandard environments like docker or swf.
I suggest closing this ticket, but please feel free to reopen it if there are additional questions that we may be able to answer.
Thanks for all the support in the previous posts.
Hi,
I was wondering if the mxnet docker image could support a requirements.txt file, just like the tensorflow container does. The latest mxnet releases offer significant performance improvements, and the latest cu92 builds fix several memory-leak bugs. I would love to use these latest features in my work.
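For concreteness, this is roughly the behavior I have in mind. It is only a sketch of my assumption, not how the container is actually implemented; the /opt/ml/code path is taken from the training logs earlier in this thread.

# Sketch of the requested behavior (assumed, not the container's actual code):
# if a requirements.txt was uploaded alongside the user script, install it
# before the training script is imported.
import os
import subprocess
import sys

def install_user_requirements(code_dir='/opt/ml/code'):
    """Install a user-supplied requirements.txt, if one exists."""
    requirements = os.path.join(code_dir, 'requirements.txt')
    if os.path.exists(requirements):
        subprocess.check_call(
            [sys.executable, '-m', 'pip', 'install', '-r', requirements])

if __name__ == '__main__':
    install_user_requirements()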
Thanks.