aws / sagemaker-tensorflow-training-toolkit

Toolkit for running TensorFlow training scripts on SageMaker. Dockerfiles used for building SageMaker TensorFlow Containers are at https://github.com/aws/deep-learning-containers.
Apache License 2.0

How to get evaluation metrics in output logs #392

Open MelissaKR opened 4 years ago

MelissaKR commented 4 years ago

Hi,

This is my first time working with SageMaker. I successfully trained a model, but I'm having difficulty getting it to output evaluation metrics to the log files.

Here is a snippet of my model:

def metric_fn(label_ids, predicted_labels):
    # Each tf.compat.v1.metrics.* call returns a (value_op, update_op) pair.
    accuracy = tf.compat.v1.metrics.accuracy(label_ids, predicted_labels)
    recall = tf.compat.v1.metrics.recall(label_ids, predicted_labels)
    precision = tf.compat.v1.metrics.precision(label_ids, predicted_labels)

    return {"eval_accuracy": accuracy,
            "precision": precision,
            "recall": recall}

# Excerpt from the body of the model function:
if mode == tf.estimator.ModeKeys.EVAL:
    eval_metrics = metric_fn(label_ids, predicted_labels)
    return tf.estimator.EstimatorSpec(mode=mode, loss=loss, eval_metric_ops=eval_metrics)

And this is how the model is fit:

estimator = TensorFlow(
    entry_point='script.py',
    source_dir=[#Source_dir],
    train_instance_type='ml.m5.2xlarge',
    train_instance_count=4,
    output_path=s3_output_location,
    hyperparameters=hyperparameters,
    role=role,
    py_version='py3',
    framework_version='1.15.2',
    sagemaker_session=sess,
    # Raw strings keep the regex backslashes literal.
    metric_definitions=[{'Name': 'eval-accuracy', 'Regex': r'eval-accuracy=(\d\.\d+)'},
                        {'Name': 'precision', 'Regex': r'precision=(\d\.\d+)'},
                        {'Name': 'recall', 'Regex': r'recall=(\d\.\d+)'}],
    enable_sagemaker_metrics=True,
    distributions={'parameter_server': {'enabled': True}})

When the training finishes, I don't see any of these metrics in the logs, nor in the 'training jobs' section. This is how the Metrics section looks:

Metrics

Name            Regex
eval-accuracy   eval-accuracy=(\d\.\d+)
precision       precision=(\d\.\d+)
recall          recall=(\d\.\d+)

I don't know why it should be so obscure. I've run the script multiple times with SageMaker, and no luck so far! I'd appreciate any help!

metrizable commented 4 years ago

@MelissaKR thanks for filing the issue. I noticed that your metric definition regex seeks to match eval-accuracy which differs slightly from the dict key eval_accuracy your metric_fn returns for your EstimatorSpec. Is this difference intentional?

On a side note, you mentioned that you "don't see any of these metrics in the logs". Could you clarify?
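
A quick local check of the mismatch described above, as a minimal sketch. The sample log line is hypothetical: it assumes the evaluation output resembles the "key = value" format that tf.estimator typically prints, which may differ from what this job actually logs. The adjusted pattern is likewise an illustration, not a confirmed fix.

```python
import re

# Hypothetical log line (an assumption, not taken from the issue).
sample_log = (
    "INFO:tensorflow:Saving dict for global step 1000: "
    "eval_accuracy = 0.91, global_step = 1000, loss = 0.35, "
    "precision = 0.9, recall = 0.88"
)

# Regex from the issue: hyphenated name and no spaces around '='.
original = r"eval-accuracy=(\d\.\d+)"

# Adjusted pattern (assumption): underscore in the name, optional
# whitespace around '=', and a more permissive number.
adjusted = r"eval_accuracy\s*=\s*([0-9.]+)"


def first_match(pattern, text):
    """Return the first captured value as a float, or None if nothing matches."""
    match = re.search(pattern, text)
    return float(match.group(1)) if match else None


original_result = first_match(original, sample_log)
adjusted_result = first_match(adjusted, sample_log)
```

Under these assumptions the original pattern finds nothing, while the adjusted one captures the accuracy value. Running the real regexes against a line copied from the job's actual log stream is the more reliable test.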

MelissaKR commented 4 years ago

@metrizable Thank you for your input. I will correct the difference in the accuracy metric. More generally, I was wondering where I can track the model's values for these metrics. I thought they would be written to the logs, or would show up in the "Metrics" section for the training job.

laurenyu commented 4 years ago

Sorry for the delayed response here. The metrics should be viewable in CloudWatch: scroll down to the "Monitor" section in the AWS console when looking at a training job.

docs: https://docs.aws.amazon.com/sagemaker/latest/dg/training-metrics.html
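
Besides the console, the last reported value of each metric can also be read programmatically. The following is a minimal sketch, assuming boto3 is installed and credentials are configured; `final_metrics` and `fetch_final_metrics` are hypothetical helper names, and the sample response is made up for illustration. It relies on the `FinalMetricDataList` field of the `DescribeTrainingJob` response.

```python
def final_metrics(description):
    """Map metric name -> final value from a DescribeTrainingJob response dict."""
    return {
        item["MetricName"]: item["Value"]
        for item in description.get("FinalMetricDataList", [])
    }


def fetch_final_metrics(job_name):
    """Call SageMaker for a finished job; needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the helper above works without it

    client = boto3.client("sagemaker")
    return final_metrics(client.describe_training_job(TrainingJobName=job_name))


# Made-up response fragment, used only to show the expected shape.
fake_response = {
    "TrainingJobName": "example-job",
    "FinalMetricDataList": [
        {"MetricName": "precision", "Value": 0.9},
        {"MetricName": "recall", "Value": 0.88},
    ],
}
parsed = final_metrics(fake_response)
```

An empty result here usually means no log line matched any metric regex, which points back at the metric definitions rather than at CloudWatch.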

Miles1996 commented 3 years ago

Hi @MelissaKR, did you manage to resolve this? I am having the same issue.

lhideki commented 2 years ago

I had a similar situation, but the cause was IAM Policy permissions; checking CloudWatch/Logs permissions may help.
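
For reference, a sketch of what such permissions might look like on the training job's execution role. The comment above does not say which permissions were missing, so this list is an assumption based on the CloudWatch and CloudWatch Logs actions commonly granted to SageMaker execution roles; the wildcard resource is for illustration only and should be narrowed in practice.

```python
import json

# Hypothetical policy document (assumption: these are the relevant actions).
logging_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "cloudwatch:PutMetricData",
                "logs:CreateLogGroup",
                "logs:CreateLogStream",
                "logs:DescribeLogStreams",
                "logs:PutLogEvents",
            ],
            "Resource": "*",  # illustration only; scope this down
        }
    ],
}

# JSON form, suitable for pasting into the IAM policy editor.
policy_json = json.dumps(logging_policy, indent=2)
```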