aws / sagemaker-python-sdk

A library for training and deploying machine learning models on Amazon SageMaker
https://sagemaker.readthedocs.io/
Apache License 2.0

debuggerHook is not saving tensors in s3 #1907

Open tiru1930 opened 4 years ago

tiru1930 commented 4 years ago

Describe the bug

Exception during rule evaluation: Customer Error: No debugging data was saved by the training job. Check that the debugger hook was configured correctly before starting the training job. Exception: Training job has ended. All the collection files could not be loaded

To reproduce

Train framework-mode XGBoost with the debugger hook configured as below:

import time

import boto3
import sagemaker
from sagemaker.xgboost import XGBoost
from sagemaker.debugger import rule_configs, Rule, DebuggerHookConfig, CollectionConfig
from smexperiments.trial import Trial

# bucket, prefix, le, sm_sess, destination_prediction_experiment, and the
# s3_input_* channels are defined earlier in the notebook.

hyperparams = {"max_depth":5,
               "subsample":0.8,
               "num_round":600,
               "eta":0.2,
               "gamma":4,
               "min_child_weight":6,
               "silent":0,
               "objective":'multi:softmax',
               "num_class":len(le.classes_),
               "smdebug_path":f"s3://{bucket}/{prefix}/debug",
               "smdebug_collections":"metrics,feature_importance"
              }
save_interval = 5

entry_point_script = "xgboost_dest_prediction.py"

trial = Trial.create(trial_name="framework-mode-trial-{}".format(time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())), 
                     experiment_name=destination_prediction_experiment.experiment_name,
                     sagemaker_boto_client=boto3.client('sagemaker'))

framework_xgb = XGBoost(
    entry_point=entry_point_script,
    role=sagemaker.get_execution_role(),
    framework_version="0.90-2",
    py_version="py3",
    hyperparameters=hyperparams,
    instance_count=1,
    instance_type="ml.m4.xlarge",
    output_path="s3://{}/{}/output".format(bucket, prefix),
    base_job_name="demo-xgboost-destination-prediction",
    sagemaker_session=sm_sess,
    use_spot_instances=True,
    max_run=3600,
    max_wait=3600,
    input_mode="File",
    debugger_hook_config=DebuggerHookConfig(
        s3_output_path=f"s3://{bucket}/{prefix}/debug",  # Required
        collection_configs=[
            CollectionConfig(
                name="metrics",
                parameters={"save_interval": str(save_interval)},
            )
        ],
    ),
    rules=[
        Rule.sagemaker(
            rule_configs.loss_not_decreasing(),
            rule_parameters={
                "collection_names": "metrics",
                "num_steps": str(save_interval * 2),
            },
        ),
    ],
)

framework_xgb.fit({'train': s3_input_train,
                   'validation': s3_input_validation}, 
                  experiment_config={
                      "ExperimentName": destination_prediction_experiment.experiment_name, 
                      "TrialName": trial.trial_name,
                      "TrialComponentDisplayName": "Training",
                  })
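One likely cause here: with the script-mode XGBoost framework container (0.90-2), the debugger hook is not injected into the training script automatically; the entry point itself has to create the hook from the JSON config that SageMaker writes into the container and register it as an XGBoost callback. A minimal sketch of what `xgboost_dest_prediction.py` would need (the `dtrain`/`dval` names and data loading are placeholders):

```python
# Sketch of the smdebug wiring inside the training entry point
# (e.g. xgboost_dest_prediction.py). smdebug and xgboost are preinstalled
# in the SageMaker XGBoost framework container.

def train_with_debugger(dtrain, dval, params, num_round):
    """Train XGBoost with the smdebug hook registered so tensors are saved."""
    import xgboost as xgb
    from smdebug.xgboost import Hook

    # Builds the hook from the debug hook config JSON that SageMaker
    # writes into the container from the estimator's DebuggerHookConfig.
    hook = Hook.create_from_json_file()

    return xgb.train(
        params,
        dtrain,
        num_boost_round=num_round,
        evals=[(dtrain, "train"), (dval, "validation")],
        callbacks=[hook],  # without this callback, no collection files reach S3
    )
```

If the script calls `xgb.train` without the hook callback, the job still trains and produces a model, which matches the symptoms in this issue: training succeeds but no debugging data is saved.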

Expected behavior

Tensors should be saved to S3.

Screenshots or logs

[{'RuleConfigurationName': 'LossNotDecreasing',
  'RuleEvaluationJobArn': 'arn:aws:sagemaker:us-west-2:990360540682:processing-job/demo-xgboost-destination-p-lossnotdecreasing-abb2296f',
  'RuleEvaluationStatus': 'Error',
  'StatusDetails': 'ClientError: No debugging data was saved by the training job. Check that the debugger hook was configured correctly before starting the training job. Exception: Training job has ended. All the collection files could not be loaded\nTraceback (most recent call last):\n  File "evaluate.py", line 112, in _create_trials\n    range_steps=(self.start_step, self.end_step))\n  File "/usr/local/lib/python3.7/site-packages/smdebug/trials/utils.py", line 20, in create_trial\n    return LocalTrial(name=name, dirname=path, **kwargs)\n  File "/usr/local/lib/python3.7/site-packages/smdebug/trials/local_trial.py", line 36, in __init__\n    self._load_collections()\n  File "/usr/local/lib/python3.7/site-packages/smdebug/trials/trial.py", line 168, in _load_collections\n    _wait_for_collection_files(1)  # wait for the first collection file\n  File "/usr/local/lib/python3.7/site-packages/smdebug/trials/trial.py", line 165, in _wait_for_collection_files\n    raise MissingCollectionFiles\nsmdebug.exceptions.MissingCollectionFiles: Trainin',
  'LastModifiedTime': datetime.datetime(2020, 9, 18, 11, 6, 27, 290000, tzinfo=tzlocal())}]

System information

- SageMaker Python SDK version: 2.6
- Framework name (e.g. PyTorch) or algorithm (e.g. KMeans): XGBoost framework
- Framework version: 0.90-2
- Python version: 3.8
- CPU or GPU: CPU
- Custom Docker image (Y/N): N

icywang86rui commented 4 years ago

@tiru1930 Sorry that you ran into this problem.

tiru1930 commented 4 years ago

> Does the IAM role you used have the proper permissions set up to access the debug S3 bucket s3://{bucket}/{prefix}/debug?

Yes.

> Could you show me the entire training job if possible?

Do you mean logs?

> How long does the training job run? Did it successfully generate a model?

It ran for a few minutes, and yes, it generated a model and stored it in S3. I was able to deploy it.

icywang86rui commented 4 years ago

> Could you show me the entire training job if possible? Do you mean logs?

Yes. Sorry for the delay. Are you still experiencing this problem?

craigbosco commented 1 year ago

Hi, I'm having this issue as well. The model will train successfully and I can deploy it to an endpoint, but the "training_job_end.ts" file is empty.

Here's the estimator object I'm using:

import sagemaker
from sagemaker.debugger import rule_configs, Rule, DebuggerHookConfig, CollectionConfig

# container, role, s3_xgb_output_location, sagemaker_session,
# hyperparameters, and save_interval are defined earlier in the notebook.
xgb = sagemaker.estimator.Estimator(
    container,
    role,
    instance_count=1,
    instance_type='ml.m4.xlarge',
    output_path=s3_xgb_output_location,
    sagemaker_session=sagemaker_session,
    hyperparameters=hyperparameters,
    debugger_hook_config=DebuggerHookConfig(
        s3_output_path=s3_xgb_output_location,
        collection_configs=[
            CollectionConfig(name="metrics", parameters={"save_interval": str(save_interval)}),
            CollectionConfig(
                name="feature_importance", parameters={"save_interval": str(save_interval)}
            ),
            CollectionConfig(name="full_shap", parameters={"save_interval": str(save_interval)}),
            CollectionConfig(name="average_shap", parameters={"save_interval": str(save_interval)}),
        ],
    ),
    rules=[
        Rule.sagemaker(
            rule_configs.loss_not_decreasing(),
            rule_parameters={
                "collection_names": "metrics",
                "num_steps": str(save_interval * 2),
            },
        ),
    ],                          
)

Then, I try to get the smdebug trial artifacts:

import smdebug.trials as smd

s3_output_path = xgb.latest_job_debugger_artifacts_path()
trial = smd.create_trial(s3_output_path)

Exception: MissingCollectionFiles: Training job has ended. All the collection files could not be loaded
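`MissingCollectionFiles` means smdebug found the end-of-job marker under the output path but no collection files, i.e. the hook never actually wrote anything during training. Given a listing of the S3 keys under the debug prefix, a quick classification helper (the `/collections/` layout and the `training_job_end.ts` marker name follow what smdebug writes; exact paths may vary by version):

```python
def diagnose_debug_output(keys):
    """Classify a list of S3 keys found under the debugger output path."""
    has_end_marker = any(k.endswith("training_job_end.ts") for k in keys)
    collection_files = [k for k in keys if "/collections/" in k and k.endswith(".json")]

    if has_end_marker and not collection_files:
        # Exactly the state that makes smdebug raise MissingCollectionFiles.
        return "job ended but no collections were saved: the hook never ran"
    if collection_files:
        return "found %d collection file(s)" % len(collection_files)
    return "no debugger output yet (job may still be running)"
```

If this reports that no collections were saved, the fix is on the training side (hook configuration), not in how `create_trial` is called.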