aws / sagemaker-python-sdk

A library for training and deploying machine learning models on Amazon SageMaker
https://sagemaker.readthedocs.io/
Apache License 2.0

HyperparameterTuner and Experiments UI not showing data #3893

Open fjpa121197 opened 1 year ago

fjpa121197 commented 1 year ago

Describe the bug
The SageMaker Experiments console shows no data when creating charts for the metrics defined in the job.

I'm currently running an HPO job, created using the HyperparameterTuner object and a TensorFlow estimator. This is the code for creating the estimator and the HPO job:


from datetime import datetime

from sagemaker.tensorflow import TensorFlow
from sagemaker.tuner import HyperparameterTuner

# output_path, role, hyperparameter_ranges and processed_data_path are defined elsewhere

objective_metric_name = 'test loss'
objective_type = 'Minimize'
metric_definitions = [
    {
        "Name": "test loss",
        "Regex": "Test loss: ([0-9\\.]+)",
    },
    {
        "Name": "train loss",
        "Regex": "categorical_crossentropy: ([0-9\\.]+)",
    },
    {
        "Name": "val loss",
        "Regex": "val_categorical_crossentropy: ([0-9\\.]+)",
    },

]

ts = datetime.now().strftime('%Y-%m-%d-%H-%M-%S')

experiment_name = 'DM-AMT-exp-' + ts

job_name=f'DM-exp-amt-{ts}'

trials_output_path = output_path + '/' + experiment_name
code_location_output_path = output_path + '/' + experiment_name

tf_estimator = TensorFlow(entry_point               = 'entrypoint-amt.py',
                          source_dir                = 'src',
                          output_path               = trials_output_path,
                          code_location             = code_location_output_path,
                          role                      = role,
                          metric_definitions        = metric_definitions,
                          instance_count            = 1,
                          enable_sagemaker_metrics  = True,
                          instance_type             = 'ml.m5.4xlarge',
                          framework_version         ='2.2',
                          py_version                ='py37',)

tuner = HyperparameterTuner(estimator             = tf_estimator,
                            objective_metric_name = objective_metric_name,
                            hyperparameter_ranges = hyperparameter_ranges,
                            metric_definitions    = metric_definitions,
                            max_jobs              = 2,
                            max_parallel_jobs     = 2,
                            objective_type        = objective_type,
                            random_seed           = 14
        )

tuner.fit(processed_data_path, job_name=job_name)

I waited for the training jobs to finish, and they appeared in the Experiments console (in SageMaker Studio). These are the metrics for one of the two jobs:

[screenshot: metrics recorded for one of the tuning jobs]

However, when trying to create a chart to see the train loss over the epochs, I get a message that there is no data.

[screenshot: "No data" message when creating the chart]

When I look at the training job settings in the SageMaker console, I see that "SageMaker metrics time series" is disabled, even though I set it to True in my TensorFlow estimator.

I'm not sure why the estimator configuration is not kept when using the HyperparameterTuner object. When calling the .fit() method directly on the estimator, enable_sagemaker_metrics = True is preserved.
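One way to confirm what was actually sent to the service is to read the flag back from the DescribeTrainingJob response, where it appears under AlgorithmSpecification. A minimal sketch (the helper name and the client-injection parameter are mine, not part of the SDK):

```python
def metrics_time_series_enabled(training_job_name, client=None):
    """Return True if the given training job was created with the
    SageMaker metrics time-series flag enabled."""
    if client is None:
        import boto3  # assumes AWS credentials and region are configured
        client = boto3.client("sagemaker")
    desc = client.describe_training_job(TrainingJobName=training_job_name)
    # The flag lives under AlgorithmSpecification in the describe response;
    # treat a missing key as disabled.
    return desc["AlgorithmSpecification"].get("EnableSageMakerMetricsTimeSeries", False)
```

Running this against each training job launched by the tuner would presumably return False here, matching the disabled "SageMaker metrics time series" setting seen in the console.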


mufaddal-rohawala commented 1 year ago

@fjpa121197 thanks for reaching out to SageMaker! It seems like we are setting enable_sagemaker_metrics = True when calling the create_training_job API from the SageMaker Python SDK. Can you provide the SageMaker training job ARN for further debugging?

mikaelrobomagi commented 1 year ago

I have the same problem. Is it safe to display the ARN here so that you can debug? Any other information that you need?