aws / sagemaker-experiments

Experiment tracking and metric logging for Amazon SageMaker notebooks and model training.
Apache License 2.0

smexperiments.tracker.Tracker exhibited inconsistent recording behavior in SageMaker Describe Trial component #147

Closed iDataist closed 2 years ago

iDataist commented 2 years ago

Describe the bug
smexperiments.tracker.Tracker exhibited inconsistent behavior in whether recorded values appear under Describe Trial Component in SageMaker Studio.

I was able to record artifacts and parameters 11 days ago.

(screenshots: the artifacts and parameters recorded 11 days ago)

However, I was not able to see the recorded artifacts and parameters when I executed the same code (see the To Reproduce section) today.

(screenshots: the same artifact and parameter views today, with nothing recorded)

Metric tracking resulted in a warning - "WARNING:root:Cannot write metrics in this environment". If I want to record any metrics to SageMaker Studio (see the image below), what arguments should be passed to tracker.log_metric()?

(screenshot: the metrics view in SageMaker Studio)

To Reproduce
Steps to reproduce the behavior:

import sys
!{sys.executable} -m pip install sagemaker-experiments==0.1.31 matplotlib

import boto3
from sagemaker import get_execution_role
from smexperiments.experiment import Experiment
from smexperiments.tracker import Tracker
from smexperiments.trial import Trial
import time

sess = boto3.Session()
sm = sess.client("sagemaker")

experiment = Experiment.create(
    experiment_name="test-exp", 
    sagemaker_boto_client=sm)
first_trial = Trial.create("test-exp")
tracker = Tracker.create(display_name=f"test-tracker-{int(time.time())}", 
                    sagemaker_boto_client=sm)
first_trial.add_trial_component(tracker)

inputs = 's3://data'
outputs = 's3://model.pickle'

tracker.log_input(name="data", media_type="s3/uri", value=inputs)
tracker.log_output(name="model-uri", media_type="s3/uri", value=outputs)
tracker.log_parameters({
    "K": 10, 
    "LEARNING_RATE": 0.25, 
    "NO_COMPONENTS": 21, 
    "NO_EPOCHS": 20, 
    "NO_THREADS": 32, 
    "ITEM_ALPHA": 1e-06, 
    "USER_ALPHA": 1e-06, 
    "MAX_SAMPLED": 13, 
    "NUM_EPOCHS": 39,
    "SEEDNO": 42})

tracker.log_metric(metric_name="AUC", value=0.9645113)
tracker.log_metric(metric_name="Precision@10", value=0.062067)
tracker.log_metric(metric_name="Recall@10", value=0.079707)

metrics = {
    "AUC": [0.9645113],
    "Precision@10": [0.062067],
    "Recall@10": [0.079707]

}
tracker.log_table('MetricData', metrics)

Expected behavior
I expect to see the artifacts, parameters, and metrics under Describe Trial Component in SageMaker Studio after executing the code.

Environment: SageMaker Studio
Instance type: ml.t3.medium
Kernel: Python 3 (Data Science)

Additional context
I'm building a recommender with the LightFM library. I want to log the artifacts, parameters, and metrics to SageMaker Studio manually, so that they appear under Describe Trial Component. In other words, I'm not passing the experiment_config argument when I fit the model.
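
For reference, this is roughly the call I am omitting (a sketch with placeholder values, not my actual training code):

from sagemaker.estimator import Estimator
from sagemaker import get_execution_role

# Placeholder estimator for illustration only; the image URI is not a real one
estimator = Estimator(
    image_uri="<training-image-uri>",
    role=get_execution_role(),
    instance_count=1,
    instance_type="ml.m5.large",
)
# Passing experiment_config here would attach the training job's own
# trial component to the trial instead of the hand-logged one above
estimator.fit(
    inputs,
    experiment_config={
        "ExperimentName": "test-exp",
        "TrialName": first_trial.trial_name,
        "TrialComponentDisplayName": "Training",
    },
)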

Any advice would be greatly appreciated.

ManuelMartinG commented 2 years ago

This exact behavior is happening to me too.

danabens commented 2 years ago

However, I was not able to see the recorded artifacts and parameters when I executed the same code (see the To Reproduce section) today.

Looks like you need to call tracker.close().
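
For example, a minimal sketch against the repro code above: adding an explicit close at the end of the script flushes everything recorded on the tracker to SageMaker (an UpdateTrialComponent call):

# ... log_input / log_output / log_parameters / log_table calls as above ...

# Flush the recorded values to SageMaker; without this (or a with block,
# shown in the next comment) the trial component is never updated.
tracker.close()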

WARNING:root:Cannot write metrics in this environment

log_metric will only work when called from a training job host; see the docs. Specifically, log_metric just writes metrics to a file and relies on a process running on the training job host to ingest them into SageMaker. When you call log_metric outside of a training job host, you get this warning.

If I want to record any metrics to SageMaker Studio (see the image below), what arguments should be passed to tracker.log_metric()?

From your screenshots it's likely that your code never successfully wrote metrics. So the issue is not how you are calling log_metric but the context you are calling it from (Studio instead of a training job host).
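
For reference, a minimal sketch of where log_metric does take effect: a training script running on the training job host, where Tracker.load() resolves the trial component created for the job (this assumes the job was started with an experiment_config):

# train.py - runs on the training job host, not in Studio
from smexperiments.tracker import Tracker

# With no arguments, Tracker.load() picks up the trial component associated
# with the current training job, so these metrics land on that component.
with Tracker.load() as tracker:
    tracker.log_metric(metric_name="AUC", value=0.9645113)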

danabens commented 2 years ago

A with statement will call close() when it exits:

import sys
!{sys.executable} -m pip install sagemaker-experiments==0.1.31 matplotlib

import boto3
from sagemaker import get_execution_role
from smexperiments.experiment import Experiment
from smexperiments.tracker import Tracker
from smexperiments.trial import Trial
import time

sess = boto3.Session()
sm = sess.client("sagemaker")
unique_id = int(time.time())

experiment_name = f"test-exp-{unique_id}"

experiment = Experiment.create(
    experiment_name=experiment_name, 
    sagemaker_boto_client=sm)
first_trial = Trial.create(experiment_name)
tracker_name = f"test-tracker-{unique_id}"

# when the with block exits tracker.close() will be called resulting in an UpdateTrialComponent API call
with Tracker.create(
        display_name=tracker_name, 
        sagemaker_boto_client=sm) as tracker:
    first_trial.add_trial_component(tracker)

    inputs = 's3://data'
    outputs = 's3://model.pickle'

    tracker.log_input(name="data", media_type="s3/uri", value=inputs)
    tracker.log_output(name="model-uri", media_type="s3/uri", value=outputs)
    tracker.log_parameters({
        "K": 10, 
        "LEARNING_RATE": 0.25, 
        "NO_COMPONENTS": 21, 
        "NO_EPOCHS": 20, 
        "NO_THREADS": 32, 
        "ITEM_ALPHA": 1e-06, 
        "USER_ALPHA": 1e-06, 
        "MAX_SAMPLED": 13, 
        "NUM_EPOCHS": 39,
        "SEEDNO": 42})

    # log_metric has no effect when called from Studio, needs Training Job host
    #tracker.log_metric(metric_name="AUC", value=0.9645113)
    #tracker.log_metric(metric_name="Precision@10", value=0.062067)
    #tracker.log_metric(metric_name="Recall@10", value=0.079707)

    metrics = {
        "AUC": [0.9645113],
        "Precision@10": [0.062067],
        "Recall@10": [0.079707]

    }
    tracker.log_table('MetricData', metrics)
print(f'Trial Component {tracker_name} updated.')
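
One way to confirm the update landed (not part of the snippet above; it assumes the tracker and sm client are still in scope) is to describe the trial component directly:

tc_name = tracker.trial_component.trial_component_name
response = sm.describe_trial_component(TrialComponentName=tc_name)
print(response["Parameters"])       # the logged parameters
print(response["InputArtifacts"])   # the "data" input
print(response["OutputArtifacts"])  # the "model-uri" output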