aws / sagemaker-experiments

Experiment tracking and metric logging for Amazon SageMaker notebooks and model training.
Apache License 2.0

fix experiment.delete_all() throttle #91

Closed yzhu0 closed 4 years ago

yzhu0 commented 4 years ago

SIM: https://sim.amazon.com/issues/AML-78535

  1. Fix DisassociateTrialComponent throttling. tc.delete(force_disassociate=True) already disassociates and deletes the trial component, so the following t.remove_trial_component(tc) call is not needed.

  2. Add sleep time: trial and experiment deletion are also subject to a 1s throttle, so add a time.sleep between trial deletion and experiment deletion. Based on https://tiny.amazon.com/ftd9gdrx/codeamazpackIronbloba686conf
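
The two changes above can be sketched roughly as follows. This is an illustration with stand-in objects, not the library's code: `FakeComponent`, `FakeTrial`, `FakeExperiment`, and `delete_all_sketch` are hypothetical names, and the exact placement of the pauses is an assumption. The sleep function is passed in so the sketch can run without waiting; the stand-ins record calls in order, which shows that no separate disassociate call is issued and that a pause follows each delete before the next one.

```python
import time
from typing import Callable, List


class FakeComponent:
    """Hypothetical stand-in for a trial component; records calls only."""

    def __init__(self, name: str, log: List[str]):
        self.name = name
        self.log = log

    def delete(self, force_disassociate: bool = False) -> None:
        # With force_disassociate=True the component is disassociated as part
        # of the delete, so the caller needs no extra disassociate call.
        if force_disassociate:
            self.log.append(f"disassociate {self.name}")
        self.log.append(f"delete component {self.name}")


class FakeTrial:
    """Hypothetical stand-in for a trial."""

    def __init__(self, name: str, components: List[FakeComponent], log: List[str]):
        self.name = name
        self.components = components
        self.log = log

    def delete(self) -> None:
        self.log.append(f"delete trial {self.name}")


class FakeExperiment:
    """Hypothetical stand-in for an experiment."""

    def __init__(self, name: str, trials: List[FakeTrial], log: List[str]):
        self.name = name
        self.trials = trials
        self.log = log

    def delete(self) -> None:
        self.log.append(f"delete experiment {self.name}")


def delete_all_sketch(
    experiment: FakeExperiment,
    pause: float = 1.2,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Delete components, trials, then the experiment, pausing between calls."""
    for trial in experiment.trials:
        for tc in trial.components:
            tc.delete(force_disassociate=True)
            # No trial.remove_trial_component(tc) here: it would repeat the
            # disassociate that the forced delete already performed.
            sleep(pause)
        trial.delete()
        # Assumed placement: pause after each trial delete, so the
        # experiment delete does not follow it immediately.
        sleep(pause)
    experiment.delete()
```

Passing `sleep=lambda seconds: None` makes the sketch run instantly, which is convenient when checking the call order.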

Error:

my_experiment.delete_all(action="--force")

ClientError                               Traceback (most recent call last)
/opt/conda/lib/python3.7/site-packages/smexperiments/experiment.py in delete_all(self, action)
    261                         tc.delete(force_disassociate=True)
--> 262                         t.remove_trial_component(tc)
    263                         # to prevent throttling

/opt/conda/lib/python3.7/site-packages/smexperiments/trial.py in remove_trial_component(self, tc)
    241         self.sagemaker_boto_client.disassociate_trial_component(
--> 242             TrialName=self.trial_name, TrialComponentName=trial_component_name
    243         )

/opt/conda/lib/python3.7/site-packages/botocore/client.py in _api_call(self, *args, **kwargs)
    315             # The "self" in this scope is referring to the BaseClient.
--> 316             return self._make_api_call(operation_name, kwargs)
    317

/opt/conda/lib/python3.7/site-packages/botocore/client.py in _make_api_call(self, operation_name, api_params)
    634             error_class = self.exceptions.from_code(error_code)
--> 635             raise error_class(parsed_response, operation_name)
    636         else:

ClientError: An error occurred (ThrottlingException) when calling the DisassociateTrialComponent operation (reached max retries: 4): Rate exceeded

The above exception was the direct cause of the following exception:

Exception                                 Traceback (most recent call last)
in
----> 1 my_experiment.delete_all(action="--force")

/opt/conda/lib/python3.7/site-packages/smexperiments/experiment.py in delete_all(self, action)
    248         while True:
    249             if delete_count == 3:
--> 250                 raise Exception("Fail to delete, please try again.") from last_exception
    251             try:
    252                 for trial_summary in self.list_trials():

Exception: Fail to delete, please try again.
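
The second traceback comes from the bounded retry loop in delete_all, which re-raises with `from last_exception`; that is why Python prints "The above exception was the direct cause of the following exception". A minimal sketch of that pattern is below. `run_with_retries` and `FakeThrottle` are hypothetical names used only for illustration; `FakeThrottle` imitates the `.response["Error"]["Code"]` layout of a botocore ClientError so the sketch runs without AWS access.

```python
class FakeThrottle(Exception):
    """Hypothetical stand-in for a botocore ClientError caused by throttling."""

    def __init__(self):
        super().__init__("Rate exceeded")
        # botocore's ClientError exposes the parsed error under .response
        self.response = {"Error": {"Code": "ThrottlingException"}}


def run_with_retries(operation, max_attempts=3):
    """Call operation() up to max_attempts times.

    If every attempt fails, raise a generic Exception chained to the last
    failure, so the original error stays visible in the traceback.
    """
    attempts = 0
    last_exception = None
    while True:
        if attempts == max_attempts:
            raise Exception("Fail to delete, please try again.") from last_exception
        try:
            return operation()
        except Exception as ex:
            # Remember the failure; it becomes __cause__ of the final error.
            last_exception = ex
        finally:
            # Runs on success and on failure, so every attempt is counted.
            attempts += 1
```

The chained error can be inspected through `__cause__`; for a throttled call, `err.__cause__.response["Error"]["Code"]` is `"ThrottlingException"`, which tells the caller that waiting and retrying is the appropriate reaction.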

Test

Tested delete_all() with an Autopilot experiment in SageMaker Studio; it deleted the experiment and its related trials and trial components (t/tc).

input:

from smexperiments.experiment import Experiment
from smexperiments.trial import Trial
from smexperiments.trial_component import TrialComponent
from smexperiments.tracker import Tracker
import time

def cleanup(experiment):
    delete_count = 0
    last_exception = None
    while True:
        if delete_count == 3:
            raise Exception("Fail to delete, please try again.") from last_exception
        try:
            for trial_summary in experiment.list_trials():
                t = Trial.load(
                    sagemaker_boto_client=experiment.sagemaker_boto_client, trial_name=trial_summary.trial_name
                )
                for trial_component_summary in t.list_trial_components():
                    tc = TrialComponent.load(
                        sagemaker_boto_client=experiment.sagemaker_boto_client,
                        trial_component_name=trial_component_summary.trial_component_name,
                    )
                    tc.delete(force_disassociate=True)
                    # to prevent throttling
                    time.sleep(1.2)
                t.delete()
            experiment.delete()
            break
        except Exception as ex:
            last_exception = ex
        finally:
            delete_count = delete_count + 1

cleanup(my_experiment)

my_experiment = experiment.Experiment.load('tutorial-autopilot-aws-auto-ml-job', sagemaker_boto_client=cc)

output:


ResourceNotFound                          Traceback (most recent call last)
in
----> 1 my_experiment = experiment.Experiment.load('tutorial-autopilot-aws-auto-ml-job', sagemaker_boto_client=cc)

/opt/conda/lib/python3.7/site-packages/smexperiments/experiment.py in load(cls, experiment_name, sagemaker_boto_client)
     92         """
     93         return cls._construct(
---> 94             cls._boto_load_method, experiment_name=experiment_name, sagemaker_boto_client=sagemaker_boto_client,
     95         )
     96