aws / sagemaker-mxnet-training-toolkit

Toolkit for running MXNet training scripts on SageMaker. Dockerfiles used for building SageMaker MXNet Containers are at https://github.com/aws/deep-learning-containers.
Apache License 2.0

Feature request: support latest mxnet package via requirements.txt? #26

Closed. yifeim closed this issue 6 years ago.

yifeim commented 6 years ago

Hi,

I was wondering if the mxnet docker image could support a requirements.txt file, just like the tensorflow container does. The latest mxnet releases offer significant performance improvements, and the latest cu92 build fixes several memory leak bugs. I would love to use these latest features in my work.

Thanks.
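
For context, a rough sketch of the requested workflow, mirroring how the TensorFlow container picks up a requirements.txt placed next to the entry point. The estimator arguments, role, paths, and bucket below are placeholders, and this shows the behavior being asked for, not something the MXNet image does today:

# Hypothetical usage: source_dir ships train.py plus a requirements.txt that
# the container would pip-install before training starts.
from sagemaker.mxnet import MXNet

estimator = MXNet(
    entry_point='train.py',
    source_dir='src/',                      # contains train.py and requirements.txt
    role='SageMakerExecutionRole',          # placeholder IAM role
    train_instance_count=1,
    train_instance_type='ml.p2.xlarge',
    framework_version='1.1.0',
)
estimator.fit('s3://my-bucket/training-data/')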

laurenyu commented 6 years ago

Hi @yifeim, thanks for your feedback! Supporting newer versions of deep learning frameworks and offering consistent features across all of our deep learning images are things we're always aiming for, and we re-evaluate our backlog based on customer feedback.

yifeim commented 6 years ago

Hi @laurenyu, thanks for the quick response. Let me rephrase my question:

What is the easiest way to get the latest mxnet in a custom docker image for sagemaker?

Let's pretend that the module I want to install is mxnet-cu90mkl==1.3.0b20180625.

No extensive tests are necessary; I just need something that works for now.

yifeim commented 6 years ago

Btw, I tried

subprocess.check_call([sys.executable, '-m', 'pip', 'uninstall', 'mxnet-cu90', '-y'])
subprocess.check_call([sys.executable, '-m', 'pip', 'install', '-U', 'mxnet-cu90mkl==1.3.0b20180625'])

but the uninstall threw ImportError: No module named 'mxnet.callback'. I do not think my hack was successful.
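
For reference, a minimal sketch of the same runtime swap, assuming it runs before anything has imported mxnet; the package names and pinned version are simply the ones from the comment above:

import subprocess
import sys

def swap_mxnet(old_pkg='mxnet-cu90', new_pkg='mxnet-cu90mkl==1.3.0b20180625'):
    # Remove the build baked into the image first so the two wheels do not
    # shadow each other on the import path, then install the newer one.
    subprocess.check_call([sys.executable, '-m', 'pip', 'uninstall', '-y', old_pkg])
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', '-U', new_pkg])

if __name__ == '__main__':
    swap_mxnet()
    import mxnet as mx   # import only after the swap so the new build is loaded
    print(mx.__version__)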

nadiaya commented 6 years ago

Probably the easiest (and fastest) way right now would be to create your own docker image with the right mxnet version and all the dependencies.

Here is our dockerfile example: https://github.com/aws/sagemaker-mxnet-containers/blob/master/docker/1.1.0/final/Dockerfile.gpu

yifeim commented 6 years ago

@nadiaya Thanks! I guess I will dig into the lower-level details a bit more. If you don't mind, I would love to leave this open for a bit until I get the desired environment.

nadiaya commented 6 years ago

Definitely. As Lauren mentioned, supporting newer framework versions is something we aim to do.

yifeim commented 6 years ago

@nadiaya Got the solution while reading between the lines:

# For building images of MXNet versions 1.1 and above
docker build -t preprod-mxnet:1.1.0-cpu-py2 \
    --build-arg py_version=2 \
    --build-arg framework_installable=mxnet-1.1.0-py2.py3-none-manylinux1_x86_64.whl \
    -f Dockerfile.cpu .

However, when I try to run test/integ, I get a 403 error when downloading sourcedir.tar.gz at the beginning of a training job. It also happens after I do a docker system prune in local mode. Is this a common issue?

nadiaya commented 6 years ago

Does your AWS account have proper access to the S3 bucket? Can you download sourcedir.tar.gz directly, for example using the AWS CLI?

Can you please post logs with the full error message?

yifeim commented 6 years ago

Yes, I can download from S3 directly. The built image is actually fine when I use it; it is just the integ test that somehow fails. In the following logs, I removed the actual account id. The other failure is very similar: it also happens when downloading the source code.

test/integ/test_default_model_fn.py::test_default_model_fn PASSED                                                   [ 16%]
test/integ/test_gluon_hosting.py::test_gluon_hosting PASSED                                                         [ 33%]
test/integ/test_hosting.py::test_hosting PASSED                                                                     [ 50%]
test/integ/test_linear_regression.py::test_linear_regression FAILED                                                 [ 66%]
test/integ/test_py_version.py::test_train_py_version FAILED                                                         [ 83%]
test/integ/test_py_version.py::test_hosting_py_version PASSED                                                       [100%]

======================================================== FAILURES =========================================================
_________________________________________________ test_linear_regression __________________________________________________

docker_image = 'mxnet-mkl-1.3.0b20180625-py3:latest'
sagemaker_session = <sagemaker.session.Session object at 0x7f595f2f0978>, opt_ml = '/tmp/tmpqmgj8ban', processor = 'cpu'

    def test_linear_regression(docker_image, sagemaker_session, opt_ml, processor):
        resource_path = 'test/resources/linear_regression'

        # create training data
        train_data = np.random.uniform(0, 1, [1000, 2])
        train_label = np.array([train_data[i][0] + 2 * train_data[i][1] for i in range(1000)])

        # eval data... repeat so there's enough to cover multicpu/gpu contexts
        eval_data = np.array([[7, 2], [6, 10], [12, 2]]).repeat(32, 0)
        eval_label = np.array([11, 26, 16]).repeat(32, 0)

        # save training data
        for path in ['training', 'evaluation']:
            os.makedirs(os.path.join(opt_ml, 'input', 'data', path))
        np.savetxt(os.path.join(opt_ml, 'input/data/training/train_data.txt.gz'), train_data)
        np.savetxt(os.path.join(opt_ml, 'input/data/training/train_label.txt.gz'), train_label)
        np.savetxt(os.path.join(opt_ml, 'input/data/evaluation/eval_data.txt.gz'), eval_data)
        np.savetxt(os.path.join(opt_ml, 'input/data/evaluation/eval_label.txt.gz'), eval_label)

        s3_source_archive = fw_utils.tar_and_upload_dir(session=sagemaker_session.boto_session,
                                    bucket=sagemaker_session.default_bucket(),
                                    s3_key_prefix=sagemaker_timestamp(),
                                    script='linear_regression.py',
                                    directory=resource_path)

        utils.create_config_files('linear_regression.py', s3_source_archive.s3_prefix, opt_ml)
        os.makedirs(os.path.join(opt_ml, 'model'))

>       docker_utils.train(docker_image, opt_ml, processor)

test/integ/test_linear_regression.py:51:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
test/integ/docker_utils.py:46: in train
    check_call(cmd)
test/integ/docker_utils.py:53: in check_call
    subprocess.check_call(cmd, *popenargs, **kwargs)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

popenargs = (['docker', 'run', '--rm', '-h', 'algo-1', '-v', ...],), kwargs = {}, retcode = 1
cmd = ['docker', 'run', '--rm', '-h', 'algo-1', '-v', ...]

    def check_call(*popenargs, **kwargs):
        """Run command with arguments.  Wait for command to complete.  If
        the exit code was zero then return, otherwise raise
        CalledProcessError.  The CalledProcessError object will have the
        return code in the returncode attribute.

        The arguments are the same as for the call function.  Example:

        check_call(["ls", "-l"])
        """
        retcode = call(*popenargs, **kwargs)
        if retcode:
            cmd = kwargs.get("args")
            if cmd is None:
                cmd = popenargs[0]
>           raise CalledProcessError(retcode, cmd)
E           subprocess.CalledProcessError: Command '['docker', 'run', '--rm', '-h', 'algo-1', '-v', '/tmp/tmpqmgj8ban:/opt/ml', '-e', 'AWS_ACCESS_KEY_ID', '-e', 'AWS_SECRET_ACCESS_KEY', '-e', 'AWS_SESSION_TOKEN', 'mxnet-mkl-1.3.0b20180625-py3:latest', 'train']' returned non-zero exit status 1.

../../anaconda3/envs/JupyterSystemEnv/lib/python3.6/subprocess.py:291: CalledProcessError
-------------------------------------------------- Captured stderr setup --------------------------------------------------
INFO:botocore.vendored.requests.packages.urllib3.connectionpool:Starting new HTTP connection (1): 169.254.169.254
INFO:botocore.vendored.requests.packages.urllib3.connectionpool:Starting new HTTP connection (1): 169.254.169.254
--------------------------------------------------- Captured log setup ----------------------------------------------------
connectionpool.py          203 INFO     Starting new HTTP connection (1): 169.254.169.254
connectionpool.py          203 INFO     Starting new HTTP connection (1): 169.254.169.254
-------------------------------------------------- Captured stdout call ---------------------------------------------------
executing docker command: docker run --rm -h algo-1 -v /tmp/tmpqmgj8ban:/opt/ml -e AWS_ACCESS_KEY_ID -e AWS_SECRET_ACCESS_KEY -e AWS_SESSION_TOKEN mxnet-mkl-1.3.0b20180625-py3:latest train
Downloading s3://sagemaker-us-west-2-{acct}/2018-06-29-08-09-02-183/sourcedir.tar.gz to /tmp/script.tar.gz
-------------------------------------------------- Captured stderr call ---------------------------------------------------
INFO:botocore.vendored.requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): sts.amazonaws.com
INFO:botocore.vendored.requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): sagemaker-us-west-2-{acct}.s3.us-west-2.amazonaws.com
INFO:botocore.vendored.requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): sagemaker-us-west-2-{acct}.s3.us-west-2.amazonaws.com
2018-06-29 08:09:02,885 INFO - root - running container entrypoint
2018-06-29 08:09:02,885 INFO - root - starting train task
2018-06-29 08:09:02,889 INFO - container_support.training - Training starting
2018-06-29 08:09:03,491 INFO - mxnet_container.train - MXNetTrainingEnvironment: {'enable_cloudwatch_metrics': False, 'output_dir': '/opt/ml/output', 'channel_dirs': {'evaluation': '/opt/ml/input/data/evaluation', 'training': '/opt/ml/input/data/training', 'Validation': '/opt/ml/input/data/Validation'}, 'base_dir': '/opt/ml', 'model_dir': '/opt/ml/model', '_ps_verbose': 0, 'available_cpus': 4, 'user_script_name': 'linear_regression.py', 'container_log_level': 20, 'input_config_dir': '/opt/ml/input/config', 'hyperparameters': {'sagemaker_region': 'us-west-2', 'sagemaker_container_log_level': 20, 'sagemaker_submit_directory': 's3://sagemaker-us-west-2-{acct}/2018-06-29-08-09-02-183/sourcedir.tar.gz', 'sagemaker_program': 'linear_regression.py'}, '_ps_port': 8000, '_scheduler_host': 'algo-1', 'code_dir': '/opt/ml/code', 'channels': {'evaluation': {'ContentType': 'evalContentType'}, 'training': {'ContentType': 'trainingContentType'}, 'Validation': {}}, 'sagemaker_region': 'us-west-2', 'user_script_archive': 's3://sagemaker-us-west-2-{acct}/2018-06-29-08-09-02-183/sourcedir.tar.gz','user_requirements_file': None, 'available_gpus': 0, 'output_data_dir': '/opt/ml/output/data/', 'hosts': ['algo-1'], 'current_host': 'algo-1', 'resource_config': {'hosts': ['algo-1'], 'current_host': 'algo-1'}, 'job_name': None, 'input_dir': '/opt/ml/input', '_scheduler_ip': '172.17.0.2'}
2018-06-29 08:09:03,506 INFO - botocore.vendored.requests.packages.urllib3.connectionpool - Starting new HTTP connection (1): 169.254.169.254
2018-06-29 08:09:03,509 INFO - botocore.vendored.requests.packages.urllib3.connectionpool - Starting new HTTP connection (1): 169.254.169.254
2018-06-29 08:09:03,546 INFO - botocore.vendored.requests.packages.urllib3.connectionpool - Starting new HTTPS connection (1): sagemaker-us-west-2-{acct}.s3.amazonaws.com
2018-06-29 08:09:03,586 INFO - botocore.vendored.requests.packages.urllib3.connectionpool - Starting new HTTPS connection (2): sagemaker-us-west-2-{acct}.s3.amazonaws.com
2018-06-29 08:09:03,603 INFO - botocore.vendored.requests.packages.urllib3.connectionpool - Starting new HTTPS connection (1): sagemaker-us-west-2-{acct}.s3.us-west-2.amazonaws.com
2018-06-29 08:09:03,630 INFO - botocore.vendored.requests.packages.urllib3.connectionpool - Starting new HTTPS connection (2): sagemaker-us-west-2-{acct}.s3.us-west-2.amazonaws.com
2018-06-29 08:09:03,662 ERROR - container_support.training - uncaught exception during training: An error occurred (403) when calling the HeadObject operation: Forbidden
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/container_support/training.py", line 36, in start
    fw.train()
  File "/usr/local/lib/python3.5/dist-packages/mxnet_container/train.py", line 169, in train
    mxnet_env.download_user_module()
  File "/usr/local/lib/python3.5/dist-packages/container_support/environment.py", line 89, in download_user_module
    cs.download_s3_resource(self.user_script_archive, tmp)
  File "/usr/local/lib/python3.5/dist-packages/container_support/utils.py", line 37, in download_s3_resource
    script_bucket.download_file(script_key_name, target)
  File "/usr/local/lib/python3.5/dist-packages/boto3/s3/inject.py", line 246, in bucket_download_file
    ExtraArgs=ExtraArgs, Callback=Callback, Config=Config)
  File "/usr/local/lib/python3.5/dist-packages/boto3/s3/inject.py", line 172, in download_file
    extra_args=ExtraArgs, callback=Callback)
  File "/usr/local/lib/python3.5/dist-packages/boto3/s3/transfer.py", line 307, in download_file
    future.result()
  File "/usr/local/lib/python3.5/dist-packages/s3transfer/futures.py", line 73, in result
    return self._coordinator.result()
  File "/usr/local/lib/python3.5/dist-packages/s3transfer/futures.py", line 233, in result
    raise self._exception
  File "/usr/local/lib/python3.5/dist-packages/s3transfer/tasks.py", line 255, in _main
    self._submit(transfer_future=transfer_future, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/s3transfer/download.py", line 353, in _submit
    **transfer_future.meta.call_args.extra_args
  File "/usr/local/lib/python3.5/dist-packages/botocore/client.py", line 314, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/usr/local/lib/python3.5/dist-packages/botocore/client.py", line 612, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (403) when calling the HeadObject operation: Forbidden

---------------------------------------------------- Captured log call ----------------------------------------------------
connectionpool.py          735 INFO     Starting new HTTPS connection (1): sts.amazonaws.com
connectionpool.py          735 INFO     Starting new HTTPS connection (1): sagemaker-us-west-2-{acct}.s3.us-west-2.amazonaws.com
connectionpool.py          735 INFO     Starting new HTTPS connection (1): sagemaker-us-west-2-{acct}.s3.us-west-2.amazonaws.com
yangaws commented 6 years ago

Hi @yifeim ,

How do you set your credentials? Is it by exporting them to environment variables? For this line in the error message, docker consumes the credentials from environment variables:

docker run --rm -h algo-1 -v /tmp/tmpqmgj8ban:/opt/ml -e AWS_ACCESS_KEY_ID -e AWS_SECRET_ACCESS_KEY -e AWS_SESSION_TOKEN mxnet-mkl-1.3.0b20180625-py3:latest train
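
For reference, docker's -e VAR flags generally only forward a variable that is already set in the host shell, so one way to make sure the container sees credentials is to export whatever boto3's default chain resolves (instance role, ~/.aws/credentials, etc.) before invoking the tests. A minimal sketch; the surrounding test harness is assumed:

import os
import boto3

# Resolve credentials via the default chain and export them so that
# `docker run -e AWS_ACCESS_KEY_ID ...` has values to pass through.
creds = boto3.Session().get_credentials().get_frozen_credentials()
os.environ['AWS_ACCESS_KEY_ID'] = creds.access_key
os.environ['AWS_SECRET_ACCESS_KEY'] = creds.secret_key
if creds.token:
    os.environ['AWS_SESSION_TOKEN'] = creds.token
# Note: credentials obtained from an instance role are temporary and expire.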

yifeim commented 6 years ago

Hi @yangaws ,

We overcame this issue by using SageMaker bring-your-own docker images. That approach also provides better packaging, since no custom code needs to be downloaded at training time.
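
For anyone landing here later, a minimal sketch of that bring-your-own-container route using the generic Estimator from the SageMaker Python SDK of that era; the ECR image URI, role, and bucket are placeholders:

from sagemaker.estimator import Estimator

estimator = Estimator(
    image_name='<account-id>.dkr.ecr.us-west-2.amazonaws.com/preprod-mxnet:1.3.0-gpu-py3',
    role='SageMakerExecutionRole',
    train_instance_count=1,
    train_instance_type='ml.p3.2xlarge',
)
# The training code is baked into the image, so no sourcedir.tar.gz needs to
# be fetched from S3 when the job starts.
estimator.fit({'training': 's3://my-bucket/training-data/'})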

Unfortunately, we have not dug further since the last update. It is understandable that accessing S3 buckets can be tricky at times, especially in nonstandard environments like Docker or SWF.

I suggest closing this ticket, but please feel free to reopen it if there are additional questions that we may be able to answer.

Thanks for all the support in the previous posts.