aws / sagemaker-python-sdk

A library for training and deploying machine learning models on Amazon SageMaker
https://sagemaker.readthedocs.io/
Apache License 2.0

Add output_data property to EstimatorBase class #1936

Open ynouri opened 3 years ago

ynouri commented 3 years ago

Describe the feature you'd like

As documented here: https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html#sagemaker.estimator.EstimatorBase.model_data, the EstimatorBase class provides a convenient property pointing to the model .tar.gz archive location in S3 once estimator.fit() has been called.

In addition to model data, SageMaker can generate "output data" (distinct from "model data") from files dumped during training (e.g. experiment logs) to the directory defined by the environment variable SM_OUTPUT_DATA_DIR.
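For illustration, a minimal training-script sketch that writes an experiment log to that directory; the file name and contents are hypothetical, but SM_OUTPUT_DATA_DIR is the environment variable SageMaker sets inside the training container (defaulting to /opt/ml/output/data):

```python
import json
import os

# SageMaker sets SM_OUTPUT_DATA_DIR inside the training container
# (default: /opt/ml/output/data); anything written there is archived
# to output.tar.gz in S3 once the training job finishes.
output_dir = os.environ.get("SM_OUTPUT_DATA_DIR", "/opt/ml/output/data")

# Hypothetical example: dump experiment logs next to the model artifacts.
with open(os.path.join(output_dir, "experiment_log.json"), "w") as f:
    json.dump({"epochs": 10, "final_loss": 0.123}, f)
```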

Having a similar property on the EstimatorBase class, pointing to the output data .tar.gz archive location in S3, would be useful for developers wishing to manipulate that archive. It could be named, for example, estimator.output_data.

How would this feature be used? Please describe.

Example use case:

  1. Fit an estimator
  2. During training, dump some files of interest to the output data directory
  3. Files are archived in output.tar.gz by SageMaker after .fit()
  4. Access the S3 location via estimator.output_data and download the output data archive locally (see the sketch after this list)
  5. Use the output data archive locally, e.g. to consult experiment logs
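A sketch of what that workflow could look like, assuming the property existed; estimator.output_data is the proposed (not yet existing) attribute, and the image URI, role, and S3 paths are placeholders:

```python
from sagemaker.estimator import Estimator
from sagemaker.s3 import S3Downloader

# Hypothetical estimator configuration; replace the placeholders.
estimator = Estimator(
    image_uri="<training-image-uri>",
    role="<execution-role-arn>",
    instance_count=1,
    instance_type="ml.m5.xlarge",
)

# Steps 1-3: fit; the training script writes to SM_OUTPUT_DATA_DIR, and
# SageMaker archives that directory to output.tar.gz next to model.tar.gz.
estimator.fit("s3://<bucket>/<training-data-prefix>")

# Steps 4-5: with the proposed property, download and inspect the archive.
S3Downloader.download(
    s3_uri=estimator.output_data,  # proposed property, does not exist today
    local_path="./output_data",
)
```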

Describe alternatives you've considered

Compute the output_data location manually (potentially re-using the .model_data property)
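A minimal sketch of that workaround, assuming the standard SageMaker layout in which output.tar.gz is written next to model.tar.gz under <job-name>/output/ in the output bucket:

```python
def output_data(estimator):
    """Derive the output data S3 URI from the existing model_data property.

    Relies on the standard layout where output.tar.gz sits next to
    model.tar.gz under <job-name>/output/ in the output location.
    """
    model_data = estimator.model_data  # e.g. s3://bucket/prefix/<job>/output/model.tar.gz
    assert model_data.endswith("/model.tar.gz")
    return model_data.rsplit("/", 1)[0] + "/output.tar.gz"
```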

metrizable commented 3 years ago

Hello @ynouri ,

Thank you for using Amazon SageMaker.

This is an interesting feature request. The output data from Processor instances is of a similar ilk, though there the output data plays the central role, rather than the model artifacts of Estimator training jobs.

As you mentioned, there are workarounds, but they are not exposed as attributes of an estimator instance.

We are always re-evaluating our backlog of features based on customer requests, so we appreciate the feedback on this feature.