aws / sagemaker-python-sdk

A library for training and deploying machine learning models on Amazon SageMaker
https://sagemaker.readthedocs.io/
Apache License 2.0

Ray RLLib examples not saving model output #581

Closed lhorus closed 5 years ago

lhorus commented 5 years ago

System Information

Describe the problem

Model output is never saved when training completes; that is, output.tar.gz is never written. Even the base examples, such as roboschool on Ray, fail to save the output file. All intermediate files, snapshots, and so on are saved to the correct folder.

Minimal repro / logs

As a concrete example, here is the basic test from the provided Jupyter notebook for Ray RLlib:

metric_definitions = RLEstimator.default_metric_definitions(RLToolkit.RAY)

estimator = RLEstimator(entry_point="train-%s.py" % roboschool_problem,
                        source_dir='src',
                        dependencies=["common/sagemaker_rl"],
                        image_name=custom_image_name,
                        role=role,
                        train_instance_type=instance_type,
                        train_instance_count=1,
                        output_path=s3_output_path,
                        base_job_name=job_name_prefix,
                        metric_definitions=metric_definitions,
)

estimator.fit(wait=local_mode)
job_name = estimator.latest_training_job.job_name
print("Training job: %s" % job_name)

INFO:sagemaker:Creating training-job with name: rl-roboschool-reacher-2018-12-27-17-38-44-838
2018-12-27 17:47:54 Uploading - Uploading generated training model
2018-12-27 17:47:54 Completed - Training job completed
Billable seconds: 441
Training job: rl-roboschool-reacher-2018-12-27-17-38-44-838
CPU times: user 1.22 s, sys: 55.2 ms, total: 1.27 s
Wall time: 9min 46s

However, whether I browse to the output location manually or follow the artifact link on the Training Jobs page, neither output.tar.gz nor model.tar.gz is anywhere to be seen. To show that the model file truly isn't created, here is what happens when I try to deploy it:

predictor = estimator.deploy(instance_type='ml.t2.medium',
                             initial_instance_count=1)

ClientError: An error occurred (ValidationException) when calling the CreateModel operation: Could not find model data at s3://sagemaker-us-east-2-109563362291/rl-roboschool-reacher-2018-12-27-17-38-44-838/output/model.tar.gz.
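For reference, the S3 locations mentioned in this thread follow a simple pattern under the job's output path. The helper below is only an illustrative sketch: `expected_artifact_uris` is a hypothetical name, not part of the SageMaker SDK, and the layout is taken from the error message above and from the intermediate location described later in the thread.

```python
def expected_artifact_uris(output_path, job_name):
    """Build the S3 URIs where this thread expects a job's artifacts.

    Hypothetical helper: the layout <output_path>/<job_name>/output/...
    is inferred from the URIs quoted in this issue.
    """
    base = "%s/%s/output" % (output_path.rstrip("/"), job_name)
    return {
        # Location CreateModel looked in, per the ValidationException.
        "model": base + "/model.tar.gz",
        # Location where intermediate checkpoints are uploaded.
        "intermediate": base + "/intermediate/",
    }
```

Calling it with `s3_output_path` and `job_name` from the notebook gives the URIs to check by hand in the S3 console.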

The only thing I found that may be Ray-specific is this small snippet in RLEstimator.create_model():

if self.toolkit == RLToolkit.RAY.value:
    raise NotImplementedError(
        'Automatic deployment of Ray models is not currently available.'
        ' Train policy parameters are available in model checkpoints'
        ' in the TrainingJob output.'
    )

That doesn't seem to be related to the missing output, though: it only indicates that there is no model container suitable for serving a trained Ray model.

nadiaya commented 5 years ago

Ray doesn't save a model in a format that TensorFlow Serving (the serving solution SageMaker uses by default) would accept. This is the reason for the NotImplementedError in the Python SDK.

You didn't get the NotImplementedError because the roboschool example uses BYOC (bring your own container) instead of the Ray container provided by SageMaker.

According to the user script in the example, checkpoints should be saved to the '/opt/ml/output/intermediate' folder and uploaded to s3://<your_s3_bucket>/<training_job_name>/output/intermediate during training. You can modify the user script to save checkpoints to the /opt/ml/model directory at the end of training instead.
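A minimal sketch of that modification, to be called at the end of the training script, might look like the following. The function name `copy_latest_checkpoint` is hypothetical, and the assumption that checkpoints live in directories whose names start with "checkpoint" somewhere under the intermediate folder is a guess about the Ray output layout that should be checked against your own job.

```python
import os
import shutil


def copy_latest_checkpoint(intermediate_dir="/opt/ml/output/intermediate",
                           model_dir="/opt/ml/model"):
    """Copy the most recently modified checkpoint directory into model_dir.

    Assumption: checkpoints are directories named "checkpoint*" somewhere
    under intermediate_dir. Returns the copied path, or None if none exist.
    """
    candidates = []
    for root, dirs, _files in os.walk(intermediate_dir):
        for name in dirs:
            if name.startswith("checkpoint"):
                candidates.append(os.path.join(root, name))
    if not candidates:
        return None

    # Pick the checkpoint directory with the newest modification time.
    latest = max(candidates, key=os.path.getmtime)
    target = os.path.join(model_dir, os.path.basename(latest))

    os.makedirs(model_dir, exist_ok=True)
    if os.path.exists(target):
        shutil.rmtree(target)  # replace a stale copy from an earlier call
    shutil.copytree(latest, target)
    return target
```

Anything placed in /opt/ml/model this way should end up in model.tar.gz, but the archive would hold Ray checkpoints rather than a servable model, so estimator.deploy() is not expected to work on it as-is.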

We appreciate your feedback and always prioritize our work based on customers' requests. We'll look into improving the roboschool Ray example when we can.

I am going to close this issue since it is not related to the SageMaker Python SDK.

Please feel free to open any questions about the example notebooks in the corresponding repository.