[Bug Report] Rule - class_imbalance on XGBoost

Prathzee commented 3 years ago

I am facing this error while running the class_imbalance rule on XGBoost:

'RuleConfigurationName': 'ClassImbalance', 'RuleEvaluationStatus': 'Error', 'StatusDetails': 'ClientError: No debugging data was saved by the training job. Check that the debugger hook was configured correctly before starting the training job. Exception: Training job has ended. All the collection files could not be loaded\nTraceback (most recent call last):\n File "evaluate.py" .......

I have added the following collections to the DebuggerHookConfig: metrics, predictions, labels.

Can you suggest what I am missing here?
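
For reference, here is a minimal sketch of how those three collections and the rule would look in the low-level training job request. The field names follow the `DebugHookConfig` / `DebugRuleConfigurations` shape that `describe_training_job` returns further down this thread. The helper name `build_debugger_config`, the bucket path, and the evaluator image placeholder are assumptions for illustration, not values from the original job.

```python
def build_debugger_config(s3_output_path, collection_names, rule_name, evaluator_image):
    """Build the debugger-related fields of a training job request (illustrative helper)."""
    return {
        "DebugHookConfig": {
            "S3OutputPath": s3_output_path,
            # One entry per tensor collection the hook should save.
            "CollectionConfigurations": [
                {"CollectionName": name} for name in collection_names
            ],
        },
        "DebugRuleConfigurations": [
            {
                "RuleConfigurationName": rule_name,
                "RuleEvaluatorImage": evaluator_image,
                "RuleParameters": {"rule_to_invoke": rule_name},
            }
        ],
    }


config = build_debugger_config(
    s3_output_path="s3://my-bucket/debugger-output/",  # hypothetical path
    collection_names=["metrics", "predictions", "labels"],
    rule_name="ClassImbalance",
    evaluator_image="<rule-evaluator-image-uri>",  # placeholder, region-specific
)
print(config["DebugHookConfig"]["CollectionConfigurations"])
```

Comparing a dict like this against the `DebugHookConfig` that `describe_training_job` reports is a quick way to confirm the collections actually reached the job.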

RajkoReijnen commented 3 years ago

I have the exact same issue and I do not have a clue what causes it. I spent the whole day changing the code, trying a subset of my dataset, and waiting endlessly. SageMaker does not provide any clues on how to fix the problem.

jkroll-aws commented 3 years ago

@Prathzee @RajkoReijnen Can you provide the job configuration or job logs?

Lewington-pitsos commented 2 years ago

Exact same issue here. My config looks like:

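import sagemaker
from sagemaker.debugger import CollectionConfig, DebuggerHookConfig, Rule, rule_configs

estimator_config = {
    # ... (other estimator arguments elided)
    "debugger_hook_config": DebuggerHookConfig(
        # Ask the hook to save every tensor collection.
        collection_configs=[CollectionConfig(name="all")],
    ),
    # Built-in rules to evaluate against the saved tensors.
    "rules": [
        Rule.sagemaker(base_config=rule_configs.dead_relu()),
        Rule.sagemaker(base_config=rule_configs.vanishing_gradient()),
    ],
}
# ...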

image_classifier = sagemaker.estimator.Estimator(**estimator_config)

Here are my logs:

2022-01-11 01:31:13 Starting - Launching requested ML instances
DeadRelu: InProgress
VanishingGradient: InProgress
ProfilerReport-1641864648: InProgress
......
2022-01-11 01:32:14 Starting - Preparing the instances for training.........
2022-01-11 01:33:49 Downloading - Downloading input data
2022-01-11 01:33:49 Training - Downloading the training image......
2022-01-11 01:34:49 Training - Training image download completed. Training in progress.
Docker entrypoint called with argument(s): train
[01/11/2022 01:34:39 INFO 139782187484992] Reading default configuration from /opt/amazon/lib/python3.7/site-packages/image_classification/default-input.json: {'use_pretrained_model': 0, 'num_layers': 152, 'epochs': 30, 'learning_rate': 0.1, 'lr_scheduler_factor': 0.1, 'optimizer': 'sgd', 'momentum': 0, 'weight_decay': 0.0001, 'beta_1': 0.9, 'beta_2': 0.999, 'eps': 1e-08, 'gamma': 0.9, 'mini_batch_size': 32, 'image_shape': '3,224,224', 'precision_dtype': 'float32'}
[01/11/2022 01:34:39 INFO 139782187484992] Merging with provided configuration from /opt/ml/input/config/hyperparameters.json: {'augmentation_type': 'crop_color_transform', 'num_classes': '3', 'eps': '1e-07', 'num_training_samples': '150', 'early_stopping': 'True', 'multi_label': '1', 'image_shape': '3,405,720', 'early_stopping_patience': '2', 'optimizer': 'adam', 'use_pretrained_model': '1', 'precision_dtype': 'float32', 'num_layers': '101', 'epochs': '3', 'learning_rate': '0.0009', 'early_stopping_min_epochs': '1', 'mini_batch_size': '8'}
[01/11/2022 01:34:39 INFO 139782187484992] Final configuration: {'use_pretrained_model': '1', 'num_layers': '101', 'epochs': '3', 'learning_rate': '0.0009', 'lr_scheduler_factor': 0.1, 'optimizer': 'adam', 'momentum': 0, 'weight_decay': 0.0001, 'beta_1': 0.9, 'beta_2': 0.999, 'eps': '1e-07', 'gamma': 0.9, 'mini_batch_size': '8', 'image_shape': '3,405,720', 'precision_dtype': 'float32', 'augmentation_type': 'crop_color_transform', 'num_classes': '3', 'num_training_samples': '150', 'early_stopping': 'True', 'multi_label': '1', 'early_stopping_patience': '2', 'early_stopping_min_epochs': '1'}
[01/11/2022 01:34:39 INFO 139782187484992] label-format is multi-hot
[01/11/2022 01:34:39 INFO 139782187484992] use_pretrained_model: 1
[01/11/2022 01:34:39 INFO 139782187484992] multi_label: 1
[01/11/2022 01:34:39 INFO 139782187484992] Using pretrained model for initializing weights and transfer learning.
[01/11/2022 01:34:39 INFO 139782187484992] ---- Parameters ----
[01/11/2022 01:34:39 INFO 139782187484992] num_layers: 101
[01/11/2022 01:34:39 INFO 139782187484992] data type: <class 'numpy.float32'>
[01/11/2022 01:34:39 INFO 139782187484992] epochs: 3
[01/11/2022 01:34:39 INFO 139782187484992] optimizer: adam
[01/11/2022 01:34:39 INFO 139782187484992] beta_1: 0.9
[01/11/2022 01:34:39 INFO 139782187484992] beta_2: 0.999
[01/11/2022 01:34:39 INFO 139782187484992] eps: 1e-07
[01/11/2022 01:34:39 INFO 139782187484992] learning_rate: 0.0009
[01/11/2022 01:34:39 INFO 139782187484992] num_training_samples: 150
[01/11/2022 01:34:39 INFO 139782187484992] mini_batch_size: 8
[01/11/2022 01:34:39 INFO 139782187484992] image_shape: 3,405,720
[01/11/2022 01:34:39 INFO 139782187484992] num_classes: 3
[01/11/2022 01:34:39 INFO 139782187484992] augmentation_type: crop_color_transform
[01/11/2022 01:34:39 INFO 139782187484992] kv_store: device
[01/11/2022 01:34:39 INFO 139782187484992] checkpoint_frequency not set, will store the best model
[01/11/2022 01:34:39 INFO 139782187484992] Using early stopping for training
[01/11/2022 01:34:39 INFO 139782187484992] Early stopping minimum epochs: 1
[01/11/2022 01:34:39 INFO 139782187484992] Early stopping patience: 2
[01/11/2022 01:34:39 INFO 139782187484992] Early stopping tolerance: 0.01
[01/11/2022 01:34:39 INFO 139782187484992] --------------------
[01:34:39] /opt/brazil-pkg-cache/packages/AIAlgorithmsMXNet/AIAlgorithmsMXNet-1.3.x_ecl_Cuda_10.1.x.10042.0/AL2_x86_64/generic-flavor/src/src/nnvm/legacy_json_util.cc:209: Loading symbol saved by previous version v0.8.0. Attempting to upgrade...
[01:34:39] /opt/brazil-pkg-cache/packages/AIAlgorithmsMXNet/AIAlgorithmsMXNet-1.3.x_ecl_Cuda_10.1.x.10042.0/AL2_x86_64/generic-flavor/src/src/nnvm/legacy_json_util.cc:217: Symbol successfully upgraded!
[01/11/2022 01:34:41 INFO 139782187484992] Setting number of threads: 31
[01:35:00] /opt/brazil-pkg-cache/packages/AIAlgorithmsMXNet/AIAlgorithmsMXNet-1.3.x_ecl_Cuda_10.1.x.10042.0/AL2_x86_64/generic-flavor/src/src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:97: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
[01/11/2022 01:35:11 INFO 139782187484992] Epoch[0] Train-accuracy=0.594907
[01/11/2022 01:35:11 INFO 139782187484992] Epoch[0] Time cost=11.586
[01/11/2022 01:35:12 INFO 139782187484992] Epoch[0] Validation-accuracy=0.333333
[01/11/2022 01:35:13 INFO 139782187484992] Storing the best model with validation accuracy: 0.333333
[01/11/2022 01:35:13 INFO 139782187484992] Saved checkpoint to "/opt/ml/model/image-classification-0001.params"
[01/11/2022 01:35:15 INFO 139782187484992] Epoch[1] Train-accuracy=0.594907
[01/11/2022 01:35:15 INFO 139782187484992] Epoch[1] Time cost=2.182
[01/11/2022 01:35:17 INFO 139782187484992] Epoch[1] Validation-accuracy=0.604167
[01/11/2022 01:35:17 INFO 139782187484992] Storing the best model with validation accuracy: 0.604167
[01/11/2022 01:35:17 INFO 139782187484992] Saved checkpoint to "/opt/ml/model/image-classification-0002.params"
[01/11/2022 01:35:20 INFO 139782187484992] Epoch[2] Train-accuracy=0.581019
[01/11/2022 01:35:20 INFO 139782187484992] Epoch[2] Time cost=2.116
[01/11/2022 01:35:21 INFO 139782187484992] Epoch[2] Validation-accuracy=0.666667
[01/11/2022 01:35:21 INFO 139782187484992] Storing the best model with validation accuracy: 0.666667
[01/11/2022 01:35:22 INFO 139782187484992] Saved checkpoint to "/opt/ml/model/image-classification-0003.params"
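
As a side note on the run above, the hyperparameters in this log imply a very short job compared with the `save_interval` of `500` that appears in the job's `DebugHookConfig`. The sketch below works through that arithmetic. It assumes one step per mini-batch with the last partial batch counted, and that a step is saved when `step % save_interval == 0`; both are assumptions about how the algorithm and hook count steps, not something the logs confirm.

```python
import math


def saved_steps(num_samples, batch_size, epochs, save_interval):
    """Return (total_steps, steps that would be saved) under the stated assumptions."""
    steps_per_epoch = math.ceil(num_samples / batch_size)  # assumes partial batch counts
    total_steps = steps_per_epoch * epochs
    # Assumed save rule: a step is written when it is a multiple of save_interval.
    saved = [step for step in range(total_steps) if step % save_interval == 0]
    return total_steps, saved


# Values from the log: num_training_samples=150, mini_batch_size=8, epochs=3.
total, saved = saved_steps(num_samples=150, batch_size=8, epochs=3, save_interval=500)
print(total, saved)  # 57 [0]
```

Under these assumptions only step 0 would ever be saved, so a short run like this gives the rules very little data even when the hook works. That alone would not obviously explain the `MissingCollectionFiles` error, because step 0 should still produce output.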

And then describe_training_job gives

 'TrainingJobArn': 'arn:aws:sagemaker:ap-southeast-2:950765595897:training-job/image-classification-2022-01-11-01-30-48-551',
 'ModelArtifacts': {'S3ModelArtifacts': 's3://frontlinedatasystems-ml-data/training_jobs/image-classification-2022-01-11-01-30-48-551/output/model.tar.gz'},
 'TrainingJobStatus': 'Completed',
 'SecondaryStatus': 'Completed',
 'HyperParameters': {'augmentation_type': 'crop_color_transform',
  'early_stopping': 'True',
  'early_stopping_min_epochs': '1',
  'early_stopping_patience': '2',
  'epochs': '3',
  'eps': '1e-07',
  'image_shape': '3,405,720',
  'learning_rate': '0.0009',
  'mini_batch_size': '8',
  'multi_label': '1',
  'num_classes': '3',
  'num_layers': '101',
  'num_training_samples': '150',
  'optimizer': 'adam',
  'precision_dtype': 'float32',
  'use_pretrained_model': '1'},
 'AlgorithmSpecification': {'TrainingImage': '544295431143.dkr.ecr.ap-southeast-2.amazonaws.com/image-classification:1',
  'TrainingInputMode': 'File',
  'MetricDefinitions': [{'Name': 'train:accuracy',
    'Regex': 'Epoch\\S* Train-accuracy=(\\S*)'},
   {'Name': 'validation:accuracy',
    'Regex': 'Epoch\\S* Validation-accuracy=(\\S*)'},
   {'Name': 'train:accuracy:epoch',
    'Regex': 'Epoch\\S* Train-accuracy=(\\S*)'},
   {'Name': 'validation:accuracy:epoch',
    'Regex': 'Epoch\\S* Validation-accuracy=(\\S*)'}],
  'EnableSageMakerMetricsTimeSeries': False},
 'RoleArn': 'arn:aws:iam::950765595897:role/service-role/AmazonSageMaker-ExecutionRole-20210403T100038',
 'InputDataConfig': [{'ChannelName': 'train',
   'DataSource': {'S3DataSource': {'S3DataType': 'S3Prefix',
     'S3Uri': 's3://frontlinedatasystems-ml-data/RDD-13/data/trn/trn.rec',
     'S3DataDistributionType': 'FullyReplicated'}},
   'ContentType': 'application/x-recordio',
   'CompressionType': 'None',
   'RecordWrapperType': 'None',
   'InputMode': 'Pipe'},
  {'ChannelName': 'validation',
   'DataSource': {'S3DataSource': {'S3DataType': 'S3Prefix',
     'S3Uri': 's3://frontlinedatasystems-ml-data/RDD-13/data/val/val.rec',
     'S3DataDistributionType': 'FullyReplicated'}},
   'ContentType': 'application/x-recordio',
   'CompressionType': 'None',
   'RecordWrapperType': 'None',
   'InputMode': 'Pipe'}],
 'OutputDataConfig': {'KmsKeyId': '',
  'S3OutputPath': 's3://frontlinedatasystems-ml-data/training_jobs/'},
 'ResourceConfig': {'InstanceType': 'ml.p3.8xlarge',
  'InstanceCount': 1,
  'VolumeSizeInGB': 100},
 'StoppingCondition': {'MaxRuntimeInSeconds': 360000},
 'CreationTime': datetime.datetime(2022, 1, 11, 1, 30, 48, 841000, tzinfo=tzlocal()),
 'TrainingStartTime': datetime.datetime(2022, 1, 11, 1, 33, 33, 852000, tzinfo=tzlocal()),
 'TrainingEndTime': datetime.datetime(2022, 1, 11, 1, 35, 55, 160000, tzinfo=tzlocal()),
 'LastModifiedTime': datetime.datetime(2022, 1, 11, 1, 40, 37, 74000, tzinfo=tzlocal()),
 'SecondaryStatusTransitions': [{'Status': 'Starting',
   'StartTime': datetime.datetime(2022, 1, 11, 1, 30, 48, 841000, tzinfo=tzlocal()),
   'EndTime': datetime.datetime(2022, 1, 11, 1, 33, 33, 852000, tzinfo=tzlocal()),
   'StatusMessage': 'Preparing the instances for training'},
  {'Status': 'Downloading',
   'StartTime': datetime.datetime(2022, 1, 11, 1, 33, 33, 852000, tzinfo=tzlocal()),
   'EndTime': datetime.datetime(2022, 1, 11, 1, 33, 40, 927000, tzinfo=tzlocal()),
   'StatusMessage': 'Downloading input data'},
  {'Status': 'Training',
   'StartTime': datetime.datetime(2022, 1, 11, 1, 33, 40, 927000, tzinfo=tzlocal()),
   'EndTime': datetime.datetime(2022, 1, 11, 1, 35, 26, 683000, tzinfo=tzlocal()),
   'StatusMessage': 'Training image download completed. Training in progress.'},
  {'Status': 'Uploading',
   'StartTime': datetime.datetime(2022, 1, 11, 1, 35, 26, 683000, tzinfo=tzlocal()),
   'EndTime': datetime.datetime(2022, 1, 11, 1, 35, 55, 160000, tzinfo=tzlocal()),
   'StatusMessage': 'Uploading generated training model'},
  {'Status': 'Completed',
   'StartTime': datetime.datetime(2022, 1, 11, 1, 35, 55, 160000, tzinfo=tzlocal()),
   'EndTime': datetime.datetime(2022, 1, 11, 1, 35, 55, 160000, tzinfo=tzlocal()),
   'StatusMessage': 'Training job completed'}],
 'FinalMetricDataList': [{'MetricName': 'train:accuracy',
   'Value': 0.5810189843177795,
   'Timestamp': datetime.datetime(2022, 1, 11, 1, 35, 20, tzinfo=tzlocal())},
  {'MetricName': 'validation:accuracy',
   'Value': 0.6666669845581055,
   'Timestamp': datetime.datetime(2022, 1, 11, 1, 35, 21, tzinfo=tzlocal())},
  {'MetricName': 'train:accuracy:epoch',
   'Value': 0.5810189843177795,
   'Timestamp': datetime.datetime(2022, 1, 11, 1, 35, 20, tzinfo=tzlocal())},
  {'MetricName': 'validation:accuracy:epoch',
   'Value': 0.6666669845581055,
   'Timestamp': datetime.datetime(2022, 1, 11, 1, 35, 21, tzinfo=tzlocal())}],
 'EnableNetworkIsolation': False,
 'EnableInterContainerTrafficEncryption': False,
 'EnableManagedSpotTraining': False,
 'TrainingTimeInSeconds': 142,
 'BillableTimeInSeconds': 142,
 'DebugHookConfig': {'S3OutputPath': 's3://frontlinedatasystems-ml-data/training_jobs/',
  'CollectionConfigurations': [{'CollectionName': 'relu_output',
    'CollectionParameters': {'include_regex': '.*relu_output',
     'save_interval': '500'}},
   {'CollectionName': 'gradients',
    'CollectionParameters': {'save_interval': '500'}},
   {'CollectionName': 'all'}]},
 'DebugRuleConfigurations': [{'RuleConfigurationName': 'DeadRelu',
   'RuleEvaluatorImage': '184798709955.dkr.ecr.ap-southeast-2.amazonaws.com/sagemaker-debugger-rules:latest',
   'VolumeSizeInGB': 0,
   'RuleParameters': {'rule_to_invoke': 'DeadRelu'}},
  {'RuleConfigurationName': 'VanishingGradient',
   'RuleEvaluatorImage': '184798709955.dkr.ecr.ap-southeast-2.amazonaws.com/sagemaker-debugger-rules:latest',
   'VolumeSizeInGB': 0,
   'RuleParameters': {'rule_to_invoke': 'VanishingGradient'}}],
 'DebugRuleEvaluationStatuses': [{'RuleConfigurationName': 'DeadRelu',
   'RuleEvaluationJobArn': 'arn:aws:sagemaker:ap-southeast-2:950765595897:processing-job/image-classification-2022--deadrelu-b439637b',
   'RuleEvaluationStatus': 'Error',
   'StatusDetails': 'ClientError: No debugging data was saved by the training job. Check that the debugger hook was configured correctly before starting the training job. Exception: Training job has ended. All the collection files could not be loaded\nTraceback (most recent call last):\n  File "evaluate.py", line 119, in _create_trials\n    range_steps=(self.start_step, self.end_step))\n  File "/usr/local/lib/python3.7/site-packages/smdebug/trials/utils.py", line 25, in create_trial\n    return LocalTrial(name=name, dirname=path, **kwargs)\n  File "/usr/local/lib/python3.7/site-packages/smdebug/trials/local_trial.py", line 36, in __init__\n    self._load_collections()\n  File "/usr/local/lib/python3.7/site-packages/smdebug/trials/trial.py", line 168, in _load_collections\n    _wait_for_collection_files(1)  # wait for the first collection file\n  File "/usr/local/lib/python3.7/site-packages/smdebug/trials/trial.py", line 165, in _wait_for_collection_files\n    raise MissingCollectionFiles\nsmdebug.exceptions.MissingCollectionFiles: Trainin',
   'LastModifiedTime': datetime.datetime(2022, 1, 11, 1, 40, 37, 68000, tzinfo=tzlocal())},
  {'RuleConfigurationName': 'VanishingGradient',
   'RuleEvaluationJobArn': 'arn:aws:sagemaker:ap-southeast-2:950765595897:processing-job/image-classification-2022--vanishinggradient-1c13d10e',
   'RuleEvaluationStatus': 'Error',
   'StatusDetails': 'ClientError: No debugging data was saved by the training job. Check that the debugger hook was configured correctly before starting the training job. Exception: Training job has ended. All the collection files could not be loaded\nTraceback (most recent call last):\n  File "evaluate.py", line 119, in _create_trials\n    range_steps=(self.start_step, self.end_step))\n  File "/usr/local/lib/python3.7/site-packages/smdebug/trials/utils.py", line 25, in create_trial\n    return LocalTrial(name=name, dirname=path, **kwargs)\n  File "/usr/local/lib/python3.7/site-packages/smdebug/trials/local_trial.py", line 36, in __init__\n    self._load_collections()\n  File "/usr/local/lib/python3.7/site-packages/smdebug/trials/trial.py", line 168, in _load_collections\n    _wait_for_collection_files(1)  # wait for the first collection file\n  File "/usr/local/lib/python3.7/site-packages/smdebug/trials/trial.py", line 165, in _wait_for_collection_files\n    raise MissingCollectionFiles\nsmdebug.exceptions.MissingCollectionFiles: Trainin',
   'LastModifiedTime': datetime.datetime(2022, 1, 11, 1, 40, 37, 68000, tzinfo=tzlocal())}],
 'ProfilerConfig': {'S3OutputPath': 's3://frontlinedatasystems-ml-data/training_jobs/',
  'ProfilingIntervalInMilliseconds': 500},
 'ProfilerRuleConfigurations': [{'RuleConfigurationName': 'ProfilerReport-1641864648',
   'RuleEvaluatorImage': '184798709955.dkr.ecr.ap-southeast-2.amazonaws.com/sagemaker-debugger-rules:latest',
   'VolumeSizeInGB': 0,
   'RuleParameters': {'rule_to_invoke': 'ProfilerReport'}}],
 'ProfilerRuleEvaluationStatuses': [{'RuleConfigurationName': 'ProfilerReport-1641864648',
   'RuleEvaluationJobArn': 'arn:aws:sagemaker:ap-southeast-2:950765595897:processing-job/image-classification-2022--profilerreport-1641864648-62bf0af2',
   'RuleEvaluationStatus': 'NoIssuesFound',
   'LastModifiedTime': datetime.datetime(2022, 1, 11, 1, 36, 10, 144000, tzinfo=tzlocal())}],
 'ProfilingStatus': 'Enabled',
 'ResponseMetadata': {'RequestId': '99b4ca0a-0536-4ab2-97b7-81ed96b99f01',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '99b4ca0a-0536-4ab2-97b7-81ed96b99f01',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '8260',
   'date': 'Tue, 11 Jan 2022 01:41:40 GMT'},
  'RetryAttempts': 0}}
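
The rule results are buried in a response this long, so a small helper that pulls out just the status of each rule makes them easier to read. This is a sketch: `summarize_rule_statuses` is a hypothetical helper name, and the sample dict is trimmed from the output above rather than a full response.

```python
def summarize_rule_statuses(response):
    """Map each debugger rule name to (status, first line of its status details)."""
    summary = {}
    for entry in response.get("DebugRuleEvaluationStatuses", []):
        details = entry.get("StatusDetails", "")
        # Keep only the first line; the rest is usually a traceback.
        first_line = details.splitlines()[0] if details else ""
        summary[entry["RuleConfigurationName"]] = (entry["RuleEvaluationStatus"], first_line)
    return summary


# Trimmed sample shaped like the describe_training_job output above.
sample_response = {
    "DebugRuleEvaluationStatuses": [
        {
            "RuleConfigurationName": "DeadRelu",
            "RuleEvaluationStatus": "Error",
            "StatusDetails": "ClientError: No debugging data was saved by the training job.\nTraceback (most recent call last):",
        },
        {
            "RuleConfigurationName": "VanishingGradient",
            "RuleEvaluationStatus": "Error",
            "StatusDetails": "ClientError: No debugging data was saved by the training job.\nTraceback (most recent call last):",
        },
    ]
}

for rule, (status, reason) in summarize_rule_statuses(sample_response).items():
    print(f"{rule}: {status} - {reason}")
```

The same function can be pointed at the real response from `describe_training_job`, since it only reads keys that appear in the output above.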

It seems that nothing gets saved at all. Any clue as to what is going on would be great. I'm in ap-southeast-2, in case that helps.
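
One way to check the "nothing gets saved" suspicion directly is to list the object keys under the job's debugger output prefix and look for collection files. The sketch below works on a plain list of keys, so it runs without AWS access. The `collections/.../worker_N_collections.json` layout is an assumption based on how the hook usually writes its output, and `has_collection_files` and the example keys are hypothetical.

```python
def has_collection_files(keys, prefix=""):
    """Return True if any key under `prefix` looks like a debugger collection file."""
    return any(
        key.startswith(prefix)
        and "collections/" in key
        and key.endswith("_collections.json")  # assumed file naming
        for key in keys
    )


# In practice the keys could come from an S3 listing, e.g. the "Key" field of
# each item in the "Contents" of a boto3 list_objects_v2 response.
keys_with_debug_data = [
    "training_jobs/my-job/debug-output/collections/000000000/worker_0_collections.json",
    "training_jobs/my-job/output/model.tar.gz",
]
keys_without_debug_data = [
    "training_jobs/my-job/output/model.tar.gz",
]

print(has_collection_files(keys_with_debug_data, prefix="training_jobs/my-job/"))
print(has_collection_files(keys_without_debug_data, prefix="training_jobs/my-job/"))
```

If the check comes back negative for a completed job, the hook most likely never ran inside the training container, which would match the `MissingCollectionFiles` traceback.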

Lewington-pitsos commented 2 years ago

@jkroll-aws This could be a red herring, but none of the debugger hook examples I can find use sagemaker.estimator.Estimator (which is what I am using); they always use sagemaker.pytorch.PyTorch or something similar. Perhaps there is some nuance with sagemaker.estimator.Estimator.fit that I'm not aware of?

Lewington-pitsos commented 2 years ago

I can confirm that the profiler is working as expected.

Lewington-pitsos commented 2 years ago

I have managed to replicate the issue with existing AWS code. If you run this notebook (which is just https://github.com/aws/amazon-sagemaker-examples/blob/master/introduction_to_amazon_algorithms/imageclassification_mscoco_multi_label/Image-classification-multilabel-lst.ipynb with some rules added), no collections get saved and the rules fail with: ClientError: No debugging data was saved by the training job. Check that the debugger hook was configured correctly before starting the training job

To me this looks like a bug.

Lewington-pitsos commented 2 years ago

@juliensimon if you happen to have time

aws / amazon-sagemaker-examples

[Bug Report] Rule - class_imbalance on XGBoost #2746