aws / sagemaker-python-sdk

A library for training and deploying machine learning models on Amazon SageMaker
https://sagemaker.readthedocs.io/
Apache License 2.0

Ability to run locally? #143

Closed professoroakz closed 6 years ago

professoroakz commented 6 years ago

Hey!

I have built a Keras model using the SageMaker API, but my development process is incredibly slow: I have to wait 4-5 minutes after each code change to run my code on SageMaker. I would love to run this 100% locally, so that I only push code that I know will work on SageMaker.

I saw in the documentation that you should be able to set train_instance_type='local', but when I try to do this, I get the following error:

ClientError: An error occurred (ValidationException) when calling the CreateTrainingJob operation: 1 validation error detected: Value 'local' at 'resourceConfig.instanceType' failed to satisfy constraint: Member must satisfy enum value set: [ml.p2.xlarge, ml.m5.4xlarge, ml.m4.16xlarge, ml.p3.16xlarge, ml.m5.large, ml.p2.16xlarge, ml.c4.2xlarge, ml.c5.2xlarge, ml.c4.4xlarge, ml.c5.4xlarge, ml.c4.8xlarge, ml.c5.9xlarge, ml.c5.xlarge, ml.c4.xlarge, ml.c5.18xlarge, ml.p3.2xlarge, ml.m5.xlarge, ml.m4.10xlarge, ml.m5.12xlarge, ml.m4.xlarge, ml.m5.24xlarge, ml.m4.2xlarge, ml.p2.8xlarge, ml.m5.2xlarge, ml.p3.8xlarge, ml.m4.4xlarge]

and I am invoking it in the following way:

estimator = TensorFlow(entry_point='itemembd.py',
                       role=role,
                       training_steps=100,
                       evaluation_steps=100,
                       train_instance_count=1,
                       train_instance_type='ml.c4.xlarge',
                       output_path='s3://ml/artifacts/itemembd')

estimator.fit('s3://ml/data/itemembd', job_name='itememdb-notebook-12')

This feature would be amazing to have, since this is a huge bottleneck while I'm trying to evaluate SageMaker for enterprise use.

iquintero commented 6 years ago

Hi @OktayGardener,

From your conversation in the other issue (137), we figured you were running an older SDK. I remember you posted another update, but I don't see it anymore. Anyway, it looked to me like you didn't have docker and docker-compose installed on your system, both of which are required for local mode. Note that there is currently a bug, for which I opened a PR that I will merge ASAP.
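A quick way to confirm that both tools are visible before trying local mode is sketched below. This is only an illustration, not part of the SDK: `missing_tools` and `REQUIRED_TOOLS` are hypothetical names, and the sketch assumes Python 3 (for `shutil.which`) and that both executables are expected on your PATH.

```python
import shutil

# Executables that local mode is said to require in this thread.
REQUIRED_TOOLS = ("docker", "docker-compose")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the names in `tools` that cannot be found on PATH."""
    return [name for name in tools if which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("docker and docker-compose found on PATH")
```

The `which` parameter exists only so the lookup can be swapped out when testing; in normal use you would call `missing_tools()` with no arguments. Finding the executables does not prove the Docker daemon is running, so treat an empty result as a necessary check rather than a sufficient one.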

professoroakz commented 6 years ago

Hey @iquintero, thank you so much for your reply here and in the other issue, and also for the quick implementation. When pointing to S3 and building the container locally, I'm getting the following error:

algo-1-R6MV7_1  | 2018-04-15 20:43:12,188 INFO - root - running container entrypoint
algo-1-R6MV7_1  | 2018-04-15 20:43:12,188 INFO - root - starting train task
algo-1-R6MV7_1  | 2018-04-15 20:43:12,206 INFO - container_support.training - Training starting
algo-1-R6MV7_1  | /usr/local/lib/python2.7/dist-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated.In future, it will be treated as `np.float64 == np.dtype(float).type`.
algo-1-R6MV7_1  |   from ._conv import register_converters as _register_converters
algo-1-R6MV7_1  | 2018-04-15 20:43:12,922 INFO - botocore.credentials - Found credentials in environment variables.
algo-1-R6MV7_1  | Downloading s3://com.tictail.sagemaker/customcode/itemembd/sagemaker-tensorflow-2018-04-15-20-42-33-467/source/sourcedir.tar.gz to /tmp/script.tar.gz
algo-1-R6MV7_1  | 2018-04-15 20:43:13,019 INFO - botocore.vendored.requests.packages.urllib3.connectionpool - Starting new HTTPS connection (1): s3.amazonaws.com
algo-1-R6MV7_1  | 2018-04-15 20:43:14,117 INFO - botocore.vendored.requests.packages.urllib3.connectionpool - Starting new HTTPS connection (1): s3.eu-west-1.amazonaws.com
algo-1-R6MV7_1  | 2018-04-15 20:43:14,354 INFO - botocore.vendored.requests.packages.urllib3.connectionpool - Starting new HTTPS connection (2): s3.eu-west-1.amazonaws.com
algo-1-R6MV7_1  | 2018-04-15 20:43:14,356 INFO - botocore.vendored.requests.packages.urllib3.connectionpool - Starting new HTTPS connection (3): s3.eu-west-1.amazonaws.com
algo-1-R6MV7_1  | 2018-04-15 20:43:21,328 INFO - botocore.vendored.requests.packages.urllib3.connectionpool - Starting new HTTPS connection (1): s3.amazonaws.com
algo-1-R6MV7_1  | 2018-04-15 20:43:22,163 INFO - tf_container - ----------------------TF_CONFIG--------------------------
algo-1-R6MV7_1  | 2018-04-15 20:43:22,163 INFO - tf_container - {"environment": "cloud", "cluster": {"master": ["algo-1-R6MV7:2222"]}, "task": {"index": 0, "type": "master"}}
algo-1-R6MV7_1  | 2018-04-15 20:43:22,163 INFO - tf_container - ---------------------------------------------------------
algo-1-R6MV7_1  | 2018-04-15 20:43:22,163 INFO - tf_container - creating RunConfig:
algo-1-R6MV7_1  | 2018-04-15 20:43:22,164 INFO - tf_container - {'save_checkpoints_secs': 300}
algo-1-R6MV7_1  | 2018-04-15 20:43:22,164 INFO - tensorflow - TF_CONFIG environment variable: {u'environment': u'cloud', u'cluster': {u'master': [u'algo-1-R6MV7:2222']}, u'task': {u'index': 0, u'type': u'master'}}
algo-1-R6MV7_1  | 2018-04-15 20:43:22,165 INFO - tf_container - creating the estimator
algo-1-R6MV7_1  | 2018-04-15 20:43:22,165 INFO - tensorflow - Using config: {'_save_checkpoints_secs': 300, '_session_config': None, '_keep_checkpoint_max': 5, '_tf_random_seed': None, '_task_type': u'master', '_global_id_in_cluster': 0, '_is_chief': True, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f159c35b390>, '_model_dir': u's3://com.tictail.sagemaker/artifacts/itemembd/sagemaker-tensorflow-2018-04-15-20-42-33-467/checkpoints', '_num_worker_replicas': 1, '_task_id': 0, '_log_step_count_steps': 100, '_master': '', '_save_checkpoints_steps': None, '_keep_checkpoint_every_n_hours': 10000, '_evaluation_master': '', '_service': None, '_save_summary_steps': 100, '_num_ps_replicas': 0}
algo-1-R6MV7_1  | 2018-04-15 20:43:22,167 INFO - tensorflow - Skip starting Tensorflow server as there is only one node in the cluster.
algo-1-R6MV7_1  | 2018-04-15 20:43:22.168469: I tensorflow/core/platform/s3/aws_logging.cc:54] Initializing config loader against fileName /root//.aws/config and using profilePrefix = 1
algo-1-R6MV7_1  | 2018-04-15 20:43:22.168568: I tensorflow/core/platform/s3/aws_logging.cc:54] Initializing config loader against fileName /root//.aws/credentials and using profilePrefix = 0
algo-1-R6MV7_1  | 2018-04-15 20:43:22.168716: I tensorflow/core/platform/s3/aws_logging.cc:54] Setting provider to read credentials from /root//.aws/credentials for credentials file and/root//.aws/config for the config file , for use with profile default
algo-1-R6MV7_1  | 2018-04-15 20:43:22.168800: I tensorflow/core/platform/s3/aws_logging.cc:54] Creating HttpClient with max connections2 and scheme http
algo-1-R6MV7_1  | 2018-04-15 20:43:22.168854: I tensorflow/core/platform/s3/aws_logging.cc:54] Initializing CurlHandleContainer with size 2
algo-1-R6MV7_1  | 2018-04-15 20:43:22.168979: I tensorflow/core/platform/s3/aws_logging.cc:54] Creating Instance with default EC2MetadataClient and refresh rate 900000
algo-1-R6MV7_1  | 2018-04-15 20:43:22.169044: I tensorflow/core/platform/s3/aws_logging.cc:54] Found secret key
algo-1-R6MV7_1  | 2018-04-15 20:43:22.169890: I tensorflow/core/platform/s3/aws_logging.cc:54] Initializing CurlHandleContainer with size 25
algo-1-R6MV7_1  | 2018-04-15 20:43:22.170024: I tensorflow/core/platform/s3/aws_logging.cc:54] Found secret key
algo-1-R6MV7_1  | 2018-04-15 20:43:22.170289: I tensorflow/core/platform/s3/aws_logging.cc:54] Pool grown by 2
algo-1-R6MV7_1  | 2018-04-15 20:43:22.170420: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
algo-1-R6MV7_1  | 2018-04-15 20:43:22.493854: E tensorflow/core/platform/s3/aws_logging.cc:60] No response body. Response code: 404
algo-1-R6MV7_1  | 2018-04-15 20:43:22.494042: W tensorflow/core/platform/s3/aws_logging.cc:57] If the signature check failed. This could be because of a time skew. Attempting to adjust the signer.
algo-1-R6MV7_1  | 2018-04-15 20:43:22.494499: I tensorflow/core/platform/s3/aws_logging.cc:54] Found secret key
algo-1-R6MV7_1  | 2018-04-15 20:43:22.494871: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
algo-1-R6MV7_1  | 2018-04-15 20:43:23,039 ERROR - container_support.training - uncaught exception during training: /opt/ml/input/data/training/user_item_3month_cleaned.csv; No such file or directory

.........

algo-1-LGL40_1  | NotFoundError: /opt/ml/input/data/training/data.csv; No such file or directory

.........

    raise Exception("Failed to run %s, exit code: %s" % (",".join(cmd), exit_code))
Exception: Failed to run docker-compose,-f,/private/var/folders/n8/hrylchcd2vl8s0r6j2n5th9r0000gn/T/tmpTdmsPp/docker-compose.yaml,up,--build,--abort-on-container-exit, exit code: 1

Here's how I'm pointing to the files:

import os

import numpy as np
import tensorflow as tf

def train_input_fn(training_dir, params):
    return _input_fn(training_dir, 'user_item_3month_cleaned.csv')

def eval_input_fn(training_dir, params):
    return _input_fn(training_dir, 'user_item_1month_cleaned.csv')

def _input_fn(training_dir, training_filename):
    training_set = tf.contrib.learn.datasets.base.load_csv_without_header(
        filename=os.path.join(training_dir, training_filename),
        target_dtype=np.int,
        features_dtype=np.int
    )

It seems like there is some path issue when doing this locally. I'm not sure if it's relevant for the PR. Am I doing something wrong?
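To narrow this down, a small check along the following lines could show whether the expected files ever reached the training directory. It is a rough sketch: `missing_files` is a hypothetical helper of my own, not something from the SDK or the container, and the demo uses a temporary directory as a stand-in for /opt/ml/input/data/training.

```python
import os
import tempfile


def missing_files(training_dir, expected_names):
    """Return the names in expected_names that are absent from training_dir."""
    if not os.path.isdir(training_dir):
        # The channel directory itself was never created or mounted.
        return list(expected_names)
    present = set(os.listdir(training_dir))
    return [name for name in expected_names if name not in present]


# Demo against a throwaway directory that contains only one of the two files.
demo_dir = tempfile.mkdtemp()
open(os.path.join(demo_dir, "user_item_3month_cleaned.csv"), "w").close()
print(missing_files(demo_dir, ["user_item_3month_cleaned.csv",
                               "user_item_1month_cleaned.csv"]))
```

Calling something like this at the top of train_input_fn and logging the result would distinguish "the data was never downloaded into the container" from "the file name in my script is wrong".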

iquintero commented 6 years ago

Hi @OktayGardener, are you using the default bucket for your training? If not, you are hitting the bug that I fixed in #144. You can upgrade to the master branch, and then you can use whatever bucket you want:

pip install git+https://github.com/aws/sagemaker-python-sdk

This fix will be released to PyPI on Tuesday afternoon (PDT), so you won't have to install from the git master branch after that.

professoroakz commented 6 years ago

Works like a charm. Thank you so much <3

iquintero commented 6 years ago

This is now on PyPI (version 1.2.3). I'm going to close this issue.
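For anyone who wants to confirm that their installed SDK is at least version 1.2.3, here is a minimal sketch of the comparison. `parse_version` and `has_fix` are hypothetical helper names, and the sketch only handles plain dotted numeric versions such as "1.2.3"; a string with a suffix like "dev0" would raise a ValueError.

```python
def parse_version(text):
    """Turn a dotted numeric version such as '1.2.3' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def has_fix(installed, minimum="1.2.3"):
    """Return True if `installed` is at least `minimum`."""
    # Tuples compare element by element, so (1, 10, 0) > (1, 2, 3).
    return parse_version(installed) >= parse_version(minimum)


print(has_fix("1.2.2"))
print(has_fix("1.2.3"))
```

Assuming the installed package exposes a `__version__` attribute, the check would be `has_fix(sagemaker.__version__)`. Comparing integer tuples rather than raw strings matters because "1.10.0" sorts before "1.2.3" as text.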