aws / sagemaker-experiments

Experiment tracking and metric logging for Amazon SageMaker notebooks and model training.
Apache License 2.0

Training metrics not being recorded in Sagemaker Experiments #169

Open santoshmedisetty opened 1 year ago

santoshmedisetty commented 1 year ago

Hi,

I'm training a YOLOv5 model on SageMaker. I've created an Experiment and a Trial for the training job, but training metrics such as precision, recall, and mAP are not being recorded in SageMaker Experiments.

I followed a process similar to https://github.com/aws/amazon-sagemaker-examples/blob/main/sagemaker-experiments/mnist-handwritten-digits-classification-experiment/mnist-handwritten-digits-classification-experiment.ipynb

Is it a problem with the IAM role or something like that?

I'm triggering the training process using 'Estimator' as shown below.

yolov5_experiment = Experiment.create(
    experiment_name=f"yolov5-training-job-{timenow}",
    description="yolov5n model training",
    sagemaker_boto_client=sm,
)

yolov5_training_job_name = f'yolov5-training-job-{timenow}'

trial_name = f"yolov5-training-job-{timenow}"
yolov5_trial = Trial.create(
    trial_name=trial_name,
    experiment_name=yolov5_experiment.experiment_name,
    sagemaker_boto_client=sm,
)

estimator = Estimator(
    image_uri=container,
    role=role,
    instance_count=1,
    instance_type='ml.m4.xlarge',
    # instance_type='local',
    input_mode='File',
    output_path=outpath,
    base_job_name='yolov5',
    sagemaker_session=sagemaker.Session(sagemaker_client=sm),
    metric_definitions=[
        {'Name': 'metrics/mAP_0.5', "Regex": "metrics/mAP_0.5: (.*?);"},
        {'Name': 'metrics/mAP_0.5:0.95', "Regex": "metrics/mAP_0.5:0.95: (.*?);"},
        {'Name': 'metrics/recall', "Regex": "metrics/recall: (.*?);"},
        {'Name': 'metrics/precision', "Regex": "metrics/precision: (.*?);"},
        {'Name': 'train/box_loss', "Regex": "train/box_loss: (.*?);"},
        {'Name': 'train/cls_loss', "Regex": "train/cls_loss: (.*?);"},
        {'Name': 'train/obj_loss', "Regex": "train/obj_loss: (.*?);"},
        {'Name': 'val/cls_loss', "Regex": "val/cls_loss: (.*?);"},
        {'Name': 'val/obj_loss', "Regex": "val/obj_loss: (.*?);"},
        {'Name': 'val/box_loss', "Regex": "val/box_loss: (.*?);"},
        {'Name': 'Epoch', "Regex": "Epoch: (.*?);"},
    ],
    enable_sagemaker_metrics=True,
)

estimator.fit(
    inputs,
    job_name=yolov5_training_job_name,
    experiment_config={
        "ExperimentName": yolov5_experiment.experiment_name,
        "TrialName": yolov5_trial.trial_name,
        "TrialComponentDisplayName": "Training",
    },
    wait=True,
)
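SageMaker fills in the metrics from metric_definitions by applying each regex to the log output of the training container, so one thing worth ruling out before looking at IAM is whether the regexes match what the training script prints. The snippet below is only a minimal local sketch of that check using Python's re module, which may not behave identically to the regex engine SageMaker uses. The sample log line and the extract_metrics helper are hypothetical and not taken from the actual YOLOv5 output; the line should be replaced with one copied from the job's CloudWatch logs.

```python
import re

# Two of the metric definitions passed to the Estimator above.
metric_definitions = [
    {"Name": "metrics/precision", "Regex": "metrics/precision: (.*?);"},
    {"Name": "train/box_loss", "Regex": "train/box_loss: (.*?);"},
]

# Hypothetical log line (an assumption, not real YOLOv5 output).
# Replace it with a line copied from the training job's logs.
sample_log_line = "Epoch: 3; train/box_loss: 0.0412; metrics/precision: 0.731;"


def extract_metrics(line, definitions):
    """Return {metric name: captured text} for every regex that matches the line."""
    found = {}
    for definition in definitions:
        match = re.search(definition["Regex"], line)
        if match:
            # The first capture group holds the metric value.
            found[definition["Name"]] = match.group(1)
    return found


# A line in the "name: value;" format matches both definitions.
print(extract_metrics(sample_log_line, metric_definitions))

# A tabular progress line without "name: value;" pairs matches nothing.
print(extract_metrics("      3/99     1.2G    0.0412", metric_definitions))
```

If the second case is what the real logs look like, the regexes capture nothing and no metrics would appear, regardless of the experiment configuration.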