aws / sagemaker-containers

WARNING: This package has been deprecated. Please use the SageMaker Training Toolkit for model training and the SageMaker Inference Toolkit for model serving.
https://github.com/aws/sagemaker-training-toolkit
Apache License 2.0
185 stars 89 forks source link

Errors when training with a custom MXNet container #77

Closed juliensimon closed 6 years ago

juliensimon commented 6 years ago

Hi,

I built a custom MXNet container using https://github.com/aws/sagemaker-mxnet-containers, and pushed it to ECR. The container is fine as far as I can tell (inspecting with 'docker run', etc).

When I run this:

mxnet = sagemaker.estimator.Estimator(
                       image_name=image,
                       role=role, 
                       train_instance_count=1, 
                       train_instance_type='ml.c4.2xlarge',
                       output_path="s3://{}/output".format(sess.default_bucket()),
                       sagemaker_session=sess)

mxnet.fit(data_location)

The training job fails with:

ValueError: Error training mxnet-2018-05-27-12-08-04-153: Failed Reason: AlgorithmError: uncaught exception during training: 'sagemaker_region'
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/container_support/training.py", line 31, in start
    env = TrainingEnvironment()
  File "/usr/local/lib/python3.5/dist-packages/container_support/environment.py", line 219, in __init__
    self.sagemaker_region = self.hyperparameters[ContainerEnvironment.SAGEMAKER_REGION_PARAM_NAME]
KeyError: 'sagemaker_region'

It looks like I'd need to set a 'sagemaker_region' parameter, which is weird because SageMaker should know what the region is.

Anyway, if I try to set it (in the Estimator or with set_hyperparameters):

mxnet.set_hyperparameters(sagemaker_region=region)
mxnet.fit(data_location)

Then the job fails because it can't deserialize hyperparameters:

ValueError: Error training mxnet-2018-05-27-12-22-10-626: Failed Reason: AlgorithmError: uncaught exception during training: Expecting value: line 1 column 1 (char 0)
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/container_support/training.py", line 31, in start
    env = TrainingEnvironment()
  File "/usr/local/lib/python3.5/dist-packages/container_support/environment.py", line 182, in __init__
    os.path.join(self.input_config_dir, TrainingEnvironment.HYPERPARAMETERS_FILE))
  File "/usr/local/lib/python3.5/dist-packages/container_support/environment.py", line 224, in _load_hyperparameters
    return self._deserialize_hyperparameters(serialized)
  File "/usr/local/lib/python3.5/dist-packages/container_support/environment.py", line 238, in _deserialize_hyperparameters
    hyperparameter_dict[k] = json.loads(v)
  File "/usr/lib/python3.5/json/__init__.py", line 319, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python3.5/json/decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib

Have I missed anything? Thanks for your help.

JohnCalhoun commented 6 years ago

had the same problem, the code will run json.loads on each hyperparameter, so you have to set you region parameters to '"us-east-1"' or something like that (notice the double quotes)

juliensimon commented 6 years ago

Thanks. I ended up reading that code too :)

This should definitely be fixed:

winstonaws commented 6 years ago

Hi, thanks for the feedback!

There's a problem we've known about for a while with the Python SDK - the "official" MXNet/TensorFlow images were meant to be used with the MXNet/TensorFlow estimator classes, which automatically provide the region and hyperparameter serialization.

We need to add a constructor arg to those classes to allow the default images to be overridden - this would be the best way to have a clean experience. (there's some trickiness with attach we have to work out in order to implement this.) Then, you would you the MXNet class instead of the generic Estimator class with your image.

I'll +1 the priority of this issue in our backlog.

In the meantime, you can also do something like: https://github.com/aws/sagemaker-mxnet-containers/blob/master/test/functional/test_mnist_distributed.py#L22

juliensimon commented 6 years ago

Thanks Winston.

winstonaws commented 6 years ago

Closing, please reopen if you encounter more blockers or have more feedback.