(TF) estimator.attach() doesn't override latest_training_job for deployment

athewsey commented 4 years ago

Describe the bug

When estimator.attach()ing to a training job older than the most recent job previously kicked off by .fit() on that same estimator object; Calling .deploy() does not properly deploy the attached job as expected; but tries to interrogate artifacts from the more recent (perhaps still in-progress) job - which can trigger an error.

This issue is probably only really relevant for interactive session use cases e.g. notebooks.

To reproduce

Create a TensorFlow estimator
Prepare at least one successful historical training job that the estimator can be attach()ed to and deploy.
Trigger a new training job (.fit()) and resume the interactive shell before that job is finished (e.g. by specifying wait=False, or just using Ctrl+C)
estimator.attach() to the previous, completed and valid training job
Call estimator.deploy()

Expected behavior

Because we've explicitly attached the estimator to a job since the last time we triggered one, would expect subsequent deploy() calls to refer to the attach()ed job, regardless of the relative age of the job.

Screenshots or logs (Actual behavior)

In actual fact, a KeyError is thrown because the estimator tries accessing model artifacts from the self.latest_training_job.name rather than the attached job (and we ran both attach() and deploy() before that most recent job finished):

KeyError                                  Traceback (most recent call last)
<timed exec> in <module>()

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/estimator.py in deploy(self, initial_instance_count, instance_type, accelerator_type, endpoint_name, use_compiled_model, update_endpoint, wait, model_name, kms_key, data_capture_config, tags, **kwargs)
    685         else:
    686             kwargs["model_kms_key"] = self.output_kms_key
--> 687             model = self.create_model(**kwargs)
    688 
    689         model.name = model_name

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/tensorflow/estimator.py in create_model(self, model_server_workers, role, vpc_config_override, endpoint_type, entry_point, source_dir, dependencies, **kwargs)
    592                 source_dir=source_dir,
    593                 dependencies=dependencies,
--> 594                 **kwargs
    595             )
    596 

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/tensorflow/estimator.py in _create_tfs_model(self, role, vpc_config_override, entry_point, source_dir, dependencies, **kwargs)
    623 
    624         return Model(
--> 625             model_data=self.model_data,
    626             role=role,
    627             image=(image or self.image_name),

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/estimator.py in model_data(self)
    709             model_uri = self.sagemaker_session.sagemaker_client.describe_training_job(
    710                 TrainingJobName=self.latest_training_job.name
--> 711             )["ModelArtifacts"]["S3ModelArtifacts"]
    712         else:
    713             logging.warning(

KeyError: 'ModelArtifacts'

System information A description of your system. Please provide:

SageMaker Python SDK version: 1.55.3 (SageMaker conda_python3 kernel)
Framework name (eg. PyTorch) or algorithm (eg. KMeans): TensorFlow
Framework version: 1.12
Python version: 3.6.5 (SageMaker conda_python3 kernel)
CPU or GPU: CPU
Custom Docker image (Y/N): N

Additional context Add any other context about the problem here.

laurenyu commented 4 years ago

the reason you're seeing this behavior is because attach() is a class method that returns a new instance of the estimator. This is so you don't have to instantiate a new estimator just to attach an old training job, i.e.

estimator = TensorFlow.attach(...)

instead of

tf = TensorFlow(...)
estimator = tf.attach(...)

I can see how this is unintuitive, though, when using it with an existing estimator, so I'll leave this issue open and marked as an "enhancement"

athewsey commented 4 years ago

Understood, thanks!

aws / sagemaker-python-sdk

(TF) estimator.attach() doesn't override latest_training_job for deployment #1438