aws / sagemaker-experiments

Experiment tracking and metric logging for Amazon SageMaker notebooks and model training.
Apache License 2.0

log_metrics does not appear to work #151

Closed Lewington-pitsos closed 2 years ago

Lewington-pitsos commented 2 years ago

I have nearly exactly the same issue as @athewsey had originally in Issue #73.

I have spent several hours saving experiments, trials and trial components in various orders, trying to get log_metrics to actually log any metrics. I am calling log_metrics inside a training job, from a tracker created using load rather than create. No warnings are printed, but no matter what I do, SageMaker Studio and the SageMaker Experiments API seem unable to retrieve these metrics later (though parameters and artifacts are certainly logged).

@danabens can you provide a code snippet, or the full code you ran before August 8, 2020, that indicated to you that metrics were working as intended? This might allow me to determine the source of my issue.

Presently I am unable to find any similar code outlining the intended log_metrics workflow in either this repo or in amazon-sagemaker-examples.

ghost commented 2 years ago

I am experiencing exactly the same issue with Tracker.log_metrics from inside a training job.

ghost commented 2 years ago

For reference, I got this to work by setting enable_sagemaker_metrics=True inside the Estimator init. The documentation around this is really quite poor; it would be helpful for users to be able to work this out without reading the source code and/or guessing.
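A minimal sketch of that setting, assuming the SageMaker Python SDK's SKLearn estimator. The helper names `build_estimator_kwargs` and `make_estimator`, the entry point name, and the framework version are illustrative assumptions, not something confirmed in this thread:

```python
def build_estimator_kwargs(role, instance_type, entry_point="train.py"):
    """Collect Estimator arguments with SageMaker metrics explicitly enabled.

    The entry point name and framework version are placeholder assumptions.
    """
    return {
        "entry_point": entry_point,
        "role": role,
        "framework_version": "0.23-1",
        "instance_count": 1,
        "instance_type": instance_type,
        # Per the comment above, metrics written by the Tracker only showed up
        # once this flag was set on the Estimator.
        "enable_sagemaker_metrics": True,
    }


def make_estimator(role, instance_type):
    """Build the estimator; needs the sagemaker package installed."""
    # Imported lazily so build_estimator_kwargs can be inspected without the SDK.
    from sagemaker.sklearn.estimator import SKLearn

    return SKLearn(**build_estimator_kwargs(role, instance_type))
```

Keeping the arguments in a plain dict makes it easy to check, before launching a job, that the flag is actually being passed.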

danabens commented 2 years ago

> unable to retrieve these metrics later

Looks like swattstgt identified the root cause: enable_sagemaker_metrics was not set on the Estimator.

> Presently I am unable to find any similar code outlining the intended log_metrics workflow in either this repo or in amazon-sagemaker-examples.

Ya, will add an example notebook.

> The documentation around this is really quite poor

Ya, the behavior of enable_sagemaker_metrics is complex, and the docs make no reference to the relationship between this parameter and log_metric in the Tracker. Will update docs. For reference: https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_AlgorithmSpecification.html#sagemaker-Type-AlgorithmSpecification-EnableSageMakerMetricsTimeSeries
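Until an example notebook exists, here is a rough sketch of the entry-point side of that relationship. It assumes the job was launched with enable_sagemaker_metrics=True and that Tracker.load() with no arguments resolves the trial component of the current training job. The helper `log_epoch_metrics` and the metric values are hypothetical:

```python
def log_epoch_metrics(tracker, epoch, metrics):
    """Send each metric to the tracker, tagged with the epoch number.

    `tracker` is expected to be a smexperiments Tracker, or any object with a
    compatible log_metric(metric_name, value, iteration_number) method.
    """
    for name, value in sorted(metrics.items()):
        tracker.log_metric(
            metric_name=name, value=float(value), iteration_number=epoch
        )


def main():
    # Needs the sagemaker-experiments package and a real SageMaker training job.
    from smexperiments.tracker import Tracker

    # Assumption: with no arguments, load() picks up the trial component that
    # SageMaker created for this training job.
    with Tracker.load() as tracker:
        log_epoch_metrics(tracker, epoch=0, metrics={"loss": 0.42})
```

`main()` is deliberately not called here; it would be invoked from the training script itself.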

lorenzwalthert commented 2 years ago

In addition to training jobs, it would be very useful if metrics could also be logged from SageMaker processing jobs. I note that the SageMaker API has been opened up to log from anywhere, except for log_metrics(), according to https://github.com/aws/sagemaker-experiments/issues/142.

jlloyd-widen commented 2 years ago

I too have this same issue. I have set enable_sagemaker_metrics=True but still no luck. My "pipeline" script has the following. I'm currently running this in local mode (i.e., instance_type='local'), which I worry is triggering the warning "WARNING:root:Cannot write metrics in this environment.", but doesn't really make sense since it's running in Sagemaker's SKLearn container:

import sagemaker
from sagemaker.sklearn.estimator import SKLearn

# instance_type, model_s3_uri, code_s3_uri, model_id and tags are defined elsewhere in the script
sk_model = SKLearn(
    source_dir="src/",
    entry_point="training/model.py",
    role=sagemaker.get_execution_role(),
    framework_version="0.23-1",
    instance_count=1,
    instance_type=instance_type,
    output_path=model_s3_uri,
    code_location=code_s3_uri,
    base_job_name=model_id,
    enable_sagemaker_metrics=True,
    environment={"MODEL_ID": model_id},
    tags=tags,
)

and my entry_point code looks like the following:

    with Tracker.create(display_name="evaluation", sagemaker_boto_client=sm) as tracker:
        tracker.log_metric(metric_name="best_cv_score", value=cv_best_score, timestamp=t)
        tracker.log_metric(metric_name="score", value=scor, timestamp=t)
        tracker.log_confusion_matrix(y_test, predictions, title="conf-mtrx")
        tracker.log_metric(metric_name="roc", value=roc, timestamp=t)
        tracker.log_roc_curve(y_test, predictions, title="roc-curve")

    Trial.load(trial_name=model_id).add_trial_component(tracker.trial_component)

I can see the evaluation trial component in the sagemaker UI but there is nothing logged inside of it. Any form of guidance would be useful.

danabens commented 2 years ago

> In addition to training jobs, it would be very useful if metrics could also be logged from sagemaker processing jobs. I note that the Sagemaker API has been opened to log from anywhere except log_metrics() according to https://github.com/aws/sagemaker-experiments/issues/142.

@lorenzwalthert - Can you provide some additional detail on your use case for metrics in processing jobs? Please create a new issue in this repo for it. Thanks.

danabens commented 2 years ago

> but doesn't really make sense since it's running in Sagemaker's SKLearn container:

@jlloyd-widen Tracker.log_metric requires an agent running on the training host, which ingests metrics into SageMaker from the file that log_metric writes to. log_metric doesn't work in local mode because the metrics agent isn't present in the local container. The inability to log metrics to SageMaker from local/non-SageMaker environments is a known limitation we are investigating.
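As a possible workaround while that limitation stands, a script could skip the Tracker in local mode and write the metric to the standard Python logger instead. This is only a sketch: the helper names are hypothetical, and it assumes the instance type ("local" or "local_gpu" for local mode) is passed into the entry point somehow, for example as a hyperparameter or environment variable:

```python
import logging

logger = logging.getLogger(__name__)

# Instance types the SageMaker Python SDK uses for local mode.
LOCAL_INSTANCE_TYPES = {"local", "local_gpu"}


def is_local_mode(instance_type):
    """Return True when the job runs in a local container."""
    return instance_type in LOCAL_INSTANCE_TYPES


def log_metric_or_fallback(tracker, instance_type, metric_name, value):
    """Log through the tracker on SageMaker hosts, or to the logger locally.

    Returns True if the metric was handed to the tracker, False if it was
    only written to the local log.
    """
    if is_local_mode(instance_type):
        # No metrics agent in the local container, so keep the value visible.
        logger.info("metric %s=%s (local mode, not sent to SageMaker)",
                    metric_name, value)
        return False
    tracker.log_metric(metric_name=metric_name, value=value)
    return True
```

This does not make metrics appear in SageMaker for local runs; it only avoids silently losing the values.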