aws / sagemaker-python-sdk

A library for training and deploying machine learning models on Amazon SageMaker
https://sagemaker.readthedocs.io/
Apache License 2.0
2.1k stars 1.14k forks source link

Tune Step with Script Mode Estimator won't start due to lacking MetricDefinition in Pipeline definition #4326

Closed lorenzwalthert closed 10 months ago

lorenzwalthert commented 10 months ago

Describe the bug

Running a Sagemaker Pipeline with a tune step in which the estimator used is in script mode fails due to a lacking metrics definition in the pipeline definition, albeit it was passed to the estimator upon pipeline creation with the SDK.

This means a pipeline created and upserted with the Python SDK with a tuning step that contains a script mode estimator can't run.

I opened a support case with AWS premium support (170258349801168). The below code snippets are a simplified version.

To reproduce

I have a sagemaker pipeline with a hyper-parameter tuning step based on the SKLearn script mode estimator. The relevant snippet

metrics_mapping =  [
   {
        "Name": "test:lift-momentum",
        "Regex": "Lift \\(candidate all classes\\) over momentum: (.*?);",
    },
]
    model = sagemaker.sklearn.SKLearn(
        entry_point="step_train.py",
        role=role,
        instance_type=instance_type,
        instance_count=1,
        image_uri=image_uri,
        image_uri_region="eu-central-1",
        metric_definitions=metrics_mapping,
        hyperparameters=hyper_params,
        enable_sagemaker_metrics=True,
        sagemaker_session=sagemaker.workflow.pipeline_context.PipelineSession(),
    )

    inputs = {
        "all": sagemaker.TrainingInput(manifest_uri, s3_data_type="ManifestFile"),
    }
    tune = sagemaker.tuner.HyperparameterTuner(
        estimator=model,
        objective_metric_name="test:lift-random",
        objective_type="Maximize",
        hyperparameter_ranges=hyperparameter_ranges,
        max_jobs=2,
        max_parallel_jobs=1,
    )
    tune_args = tune.fit(inputs=inputs, wait=True)

    sagemaker.workflow.steps.TuningStep(
        name="TrainModel", step_args=tune_args
    )

I clearly pass a metrics definition into the estimator.

The pipeline can be upserted and started, but I get the following runtime error in the hyper-parameter tuning step (even before the Hyper-parameter tuning job is created in the sagemaker console):

ClientError: Failed to invoke sagemaker:CreateHyperParameterTuningJob. Error Details: A metric is required for this hyperparameter tuning job objective. Provide a metric in the metric definitions.

As expected, to create a hyper-parameter tuning job with CreateHyperParameterTuningJob the key TrainingJobDefinitions must contain a MetricDefinition. The following snippet from the docs:

"TrainingJobDefinitions": [ 
      { 
         "AlgorithmSpecification": { 
            "AlgorithmName": "string",
            "MetricDefinitions": [ 
               { 
                  "Name": "string",
                  "Regex": "string"
               }
            ],
            "TrainingImage": "string",
            "TrainingInputMode": "string"
         },
         ...

However, my pipeline definition file did not contain it, although I passed metric_definitions into the estimator. I believe it must be lost on the way. In my previous version of the pipeline, I used a train step instead of a tune step, and the metrics made it into the pipeline definition file. Here the relevant excerpt:

{
    "Name": "train_model",
    "Type": "Training",
    "Arguments": {
        "AlgorithmSpecification": {
            "TrainingInputMode": "File",
            "TrainingImage": "982361546614.dkr.ecr.eu-central-1.amazonaws.com/signals.processor:staging",
            "MetricDefinitions": [
                {
                    "Name": "train:loss",
                    "Regex": "train loss: (.+?),"
              ...

Expected behavior

The Python SDK takes into account the metrics_definition key supplied to the SKLearn estimator when creating the tune step of the pipeline.

System information A description of your system. Please provide:

lorenzwalthert commented 10 months ago

It seems that a metric definition has to be supplied in sagemaker.tuner.HyperparameterTuner instead of the estimator. Thanks AWS support for providing the solution.