Closed: lhorus closed this issue 5 years ago
Ray doesn't save a model in a format that TensorFlow Serving accepts (TensorFlow Serving is the serving solution used by default in SageMaker). This is the reason for the NotImplementedError in the Python SDK.
You didn't get a NotImplementedError because the roboschool example uses BYOC (bring your own container) instead of the Ray container provided by SageMaker.
According to the user script in the example, checkpoints should be saved to the /opt/ml/output/intermediate folder and moved to the s3://<your_s3_bucket>/<training_job_name>/output/intermediate location during training.
You can modify the user script to save checkpoints to the /opt/ml/model directory at the end of training instead.
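As a rough illustration of that suggestion, the sketch below copies whatever the training run wrote to the intermediate folder into /opt/ml/model, which SageMaker archives into model.tar.gz when the job finishes. The two paths are the ones mentioned in this reply (assuming the absolute form /opt/ml/output/intermediate); the function name copy_checkpoints_to_model_dir and its parameters are hypothetical and are not part of the roboschool example or the SageMaker SDK.

```python
import shutil
from pathlib import Path

# Paths taken from the reply above; the helper itself is a hypothetical sketch.
INTERMEDIATE_DIR = Path("/opt/ml/output/intermediate")
MODEL_DIR = Path("/opt/ml/model")


def copy_checkpoints_to_model_dir(src=INTERMEDIATE_DIR, dst=MODEL_DIR):
    """Copy every file under src into dst, preserving the directory layout.

    Returns the list of files written, or an empty list if src does not exist.
    """
    src, dst = Path(src), Path(dst)
    if not src.is_dir():
        return []
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    for path in sorted(src.rglob("*")):
        if path.is_file():
            target = dst / path.relative_to(src)
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(path, target)  # copy contents and file metadata
            copied.append(target)
    return copied
```

One way to use it would be to call copy_checkpoints_to_model_dir() as the last step of the user script, after training has returned. Whether the copied checkpoints are usable for serving is a separate question, since the resulting archive is still not in a format TensorFlow Serving accepts.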
We appreciate your feedback and always prioritize our work based on customers' requests. We'll look into improving the roboschool Ray example when we can.
I am going to close this issue since it is not related to the SageMaker Python SDK.
Please feel free to raise any questions related to the notebook examples in the corresponding repository.
System Information
Describe the problem
Model saving upon training completion never occurs, meaning output.tar.gz is never saved. Even running the base examples, such as roboschool on Ray, fails to save the output file. All the intermediate files, snapshots and so on are saved in the correct folder.
Minimal repro / logs
To provide an actual example, here is the basic test from the provided Jupyter notebook for Ray's RLlib:
However, whether I go to the directory manually or follow the Training Job's artifact directory link, neither output.tar.gz nor model.tar.gz is anywhere to be seen. Additionally, to show it truly isn't created, here is what happens when I try to fetch it:
The only thing I found that may be Ray-specific is this small snippet in RL.Estimator.create_model():
It doesn't seem to have anything to do with the problem, as it simply doesn't have a model container suitable for returning a compiled Ray model.