aws / sagemaker-experiments

Experiment tracking and metric logging for Amazon SageMaker notebooks and model training.
Apache License 2.0
127 stars 36 forks source link

Additional parameter validation for `Tracker` logging methods. #126

Closed jmahlik closed 1 year ago

jmahlik commented 3 years ago

Describe the bug When the tracker logs something that is not JSON compliant, it causes training jobs to hang and eventually die. This is likely due to a boto request not being able to be sent. The biggest offenders are things like np.nan, None, np.inf, -np.inf. They can be quite common in model building if a hyperparameter causes an odd metric value. I think this might become especially problematic using log_table.

Current workaround is to write validation code around the metric logging. It's become cumbersome to copy it or have jobs fail for forgetting to copy it.

I propose adding some additional parameter validation to the tracker itself. Happy to submit a PR for this but wanted to get any thoughts before undertaking it.

Side note to anyone who might read this, isinstance(np.nan, numbers.Number) = True. So, it will likely have to be some JSON compatibility check vs. isinstance checking.

To Reproduce Steps to reproduce the behavior: Log np.nan or None as the value in log_metric or log_parameter.

Expected behavior Either raise an error or throw out the metric. Throwing away the metric with a warning or logging some kind of default or blank value would be ideal.

Environment: Framework (e.g. TensorFlow) / Algorithm (e.g. KMeans): BYOC Framework Version: N/A Python Version: python 3.8/3.9 CPU or GPU: CPU Python SDK Version: Are you using a custom image: Yes

danabens commented 3 years ago

https://github.com/aws/sagemaker-experiments/pull/125

jmahlik commented 3 years ago

Any thoughts on the best approach here? I was going to start on a PR soon for log_parameter and log_metric. IMO the best approach might be to not log values to the tracker and have a standard library logger log a warning like f"{metric_name} was {metric_value}, metric not logged to experiment". Then when looking at experiment analytics, the metric just shows as na for the trial component.

There might be other resolutions that I'm not aware of though.

krtin commented 3 years ago

I am unable to log_parameters in Tracker with boolean types, is this a known issue?

sagemaker version: 0.1.30


with Tracker.create(display_name="example", sagemaker_boto_client=sm_boto3) as tracker: 

    tracker.log_parameters(
        {example_boolean_param: True}
    ) 

Error: ClientError: An error occurred (400) when calling the UpdateTrialComponent operation:

jerrypeng7773 commented 3 years ago

https://github.com/aws/sagemaker-experiments/blob/main/src/smexperiments/tracker.py#L229

the parameter value the log_parameters accepts are string and number, so it should work if you pass in "true"

jerrypeng7773 commented 3 years ago

Any thoughts on the best approach here? I was going to start on a PR soon for log_parameter and log_metric. IMO the best approach might be to not log values to the tracker and have a standard library logger log a warning like f"{metric_name} was {metric_value}, metric not logged to experiment". Then when looking at experiment analytics, the metric just shows as na for the trial component.

There might be other resolutions that I'm not aware of though.

Hi jmahlik, thanks for the suggestion, and your idea sounds reasonable, I wonder if you already submitted the PR yet?

krtin commented 3 years ago

https://github.com/aws/sagemaker-experiments/blob/main/src/smexperiments/tracker.py#L229

the parameter value the log_parameters accepts are string and number, so it should work if you pass in "true"

Okay, will try but ideally a python bool should also work as it is a valid Number type, but it does not.

jmahlik commented 3 years ago

Any thoughts on the best approach here? I was going to start on a PR soon for log_parameter and log_metric. IMO the best approach might be to not log values to the tracker and have a standard library logger log a warning like f"{metric_name} was {metric_value}, metric not logged to experiment". Then when looking at experiment analytics, the metric just shows as na for the trial component. There might be other resolutions that I'm not aware of though.

Hi jmahlik, thanks for the suggestion, and your idea sounds reasonable, I wonder if you already submitted the PR yet?

Kind of hit a wall because it is hard to test what values are valid and what will fail without waiting for it to time out. I haven't had time to work on it further.

jmahlik commented 1 year ago

161