aws / sagemaker-python-sdk

A library for training and deploying machine learning models on Amazon SageMaker
https://sagemaker.readthedocs.io/
Apache License 2.0
2.1k stars 1.14k forks source link

(TF) estimator.attach() doesn't override latest_training_job for deployment #1438

Open athewsey opened 4 years ago

athewsey commented 4 years ago

Describe the bug

When estimator.attach()ing to a training job older than the most recent job previously kicked off by .fit() on that same estimator object; Calling .deploy() does not properly deploy the attached job as expected; but tries to interrogate artifacts from the more recent (perhaps still in-progress) job - which can trigger an error.

This issue is probably only really relevant for interactive session use cases e.g. notebooks.

To reproduce

Expected behavior

Because we've explicitly attached the estimator to a job since the last time we triggered one, would expect subsequent deploy() calls to refer to the attach()ed job, regardless of the relative age of the job.

Screenshots or logs (Actual behavior)

In actual fact, a KeyError is thrown because the estimator tries accessing model artifacts from the self.latest_training_job.name rather than the attached job (and we ran both attach() and deploy() before that most recent job finished):

KeyError                                  Traceback (most recent call last)
<timed exec> in <module>()

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/estimator.py in deploy(self, initial_instance_count, instance_type, accelerator_type, endpoint_name, use_compiled_model, update_endpoint, wait, model_name, kms_key, data_capture_config, tags, **kwargs)
    685         else:
    686             kwargs["model_kms_key"] = self.output_kms_key
--> 687             model = self.create_model(**kwargs)
    688 
    689         model.name = model_name

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/tensorflow/estimator.py in create_model(self, model_server_workers, role, vpc_config_override, endpoint_type, entry_point, source_dir, dependencies, **kwargs)
    592                 source_dir=source_dir,
    593                 dependencies=dependencies,
--> 594                 **kwargs
    595             )
    596 

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/tensorflow/estimator.py in _create_tfs_model(self, role, vpc_config_override, entry_point, source_dir, dependencies, **kwargs)
    623 
    624         return Model(
--> 625             model_data=self.model_data,
    626             role=role,
    627             image=(image or self.image_name),

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/estimator.py in model_data(self)
    709             model_uri = self.sagemaker_session.sagemaker_client.describe_training_job(
    710                 TrainingJobName=self.latest_training_job.name
--> 711             )["ModelArtifacts"]["S3ModelArtifacts"]
    712         else:
    713             logging.warning(

KeyError: 'ModelArtifacts'

System information A description of your system. Please provide:

Additional context Add any other context about the problem here.

laurenyu commented 4 years ago

the reason you're seeing this behavior is because attach() is a class method that returns a new instance of the estimator. This is so you don't have to instantiate a new estimator just to attach an old training job, i.e.

estimator = TensorFlow.attach(...)

instead of

tf = TensorFlow(...)
estimator = tf.attach(...)

I can see how this is unintuitive, though, when using it with an existing estimator, so I'll leave this issue open and marked as an "enhancement"

athewsey commented 4 years ago

Understood, thanks!