aws / sagemaker-experiments

Experiment tracking and metric logging for Amazon SageMaker notebooks and model training.
Apache License 2.0

Experiment provides function to load artifacts and parameters during sagemaker training jobs #124

Closed BaoshengHeTR closed 2 years ago

BaoshengHeTR commented 3 years ago

Load artifacts during training jobs

A Tracker is created to upload/track experiment metadata, i.e., it is constructed before we call the estimator.fit(**kwargs) job, and so the metadata upload also happens before the job runs. However, some results are only available once a model has been trained by the estimator.fit(**kwargs) job. Approaches to load artifacts during the training job would be helpful.

Describe the solution you'd like

In the entry point script train.py used to create the estimator, if we could call:

import argparse

import pandas as pd
import torch
...

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    ...
    # Proposed API: assumes a tracker is available inside the training job.
    tracker.load_artifact('report.txt')

By doing so, artifacts generated during the training job can be tracked.

Describe alternatives you've considered

Additional context

I know the tracker provides methods like log_confusion_matrix, etc. However, I think these methods assume we deploy the model (i.e., after the training job is done), run inference, and then log the results. But consider a scenario where we want to call load_artifact to upload the evaluation results on the dev dataset: since the evaluation is already done in the training job, we don't need to redo it afterwards.

jordan-melendez commented 3 years ago

Is it really true that log_confusion_matrix, etc., do not work in training jobs? I've been debugging for hours wondering why these methods weren't returning anything in my training jobs. If this is the problem, then these docs are very misleading, as they make it appear that a training job generated the output charts. Furthermore, these docs state that

Note that this method must be run from a SageMaker context such as studio or training job due to restrictions on the CreateArtifact API.

which also implies that they could be run from a training job. It would be great if these charts and other artifacts could be generated/logged from within training jobs.

danabens commented 3 years ago

Hi Jordan,

Approaches to load artifacts during the training job would be helpful.

You should be able to call Tracker.load() in train.py and then call tracker methods. In non-training-job host contexts you can still use the Tracker, but you need to supply the training job name, for example Tracker.load(training_job_name=...); keep in mind that log_metric will not work outside of a training host.
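
A minimal sketch of that pattern (the smexperiments import path comes from this library; the job name and logged values are placeholder assumptions):

# Inside train.py, running as a SageMaker training job:
from smexperiments.tracker import Tracker

with Tracker.load() as tracker:  # resolves the trial component auto-created for this job
    tracker.log_parameter("lr", 0.001)        # placeholder hyperparameter
    tracker.log_metric("dev:accuracy", 0.92)  # log_metric only works on a training host

# Outside a training job (e.g. a notebook), point the Tracker at the job explicitly:
tracker = Tracker.load(training_job_name="my-training-job")  # placeholder job name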

tracker.load_artifact('report.txt')

I assume you mean tracker.upload_artifact ?

to upload the evaluation results on the dev dataset: since the evaluation is already done in the training job, we don't need to redo it afterwards

You should be able to upload artifacts during a training job, including a confusion matrix. Also, some pipelines split training and evaluation into two separate steps, for example Orchestrating Jobs with Amazon SageMaker Model Building Pipelines.
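
As a sketch of what that could look like inside train.py after a dev-set evaluation (the label/prediction variables, the metric value, and the keyword arguments are assumptions and may differ by version):

# Inside train.py, after evaluating on the dev set; y_true and y_pred are assumed to exist:
from smexperiments.tracker import Tracker

with Tracker.load() as tracker:
    # Logs a confusion-matrix chart artifact associated with this job's trial component.
    tracker.log_confusion_matrix(y_true, y_pred, title="dev-set confusion matrix")
    tracker.log_metric("dev:f1", 0.87)  # placeholder value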

Is it really true that log_confusion_matrix, etc., do not work in training jobs?

It does work in training jobs. Are you using Tracker.load() in your train.py?

weren't returning anything in my training jobs.

log_confusion_matrix will create a lineage Artifact entity with an association to the Trial Component that represents the training job. For example, you should be able to see the created association by listing output associations: Associations.list(source_arn=<trial component arn>).
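
The Associations.list call above presumably maps to the SageMaker Python SDK's lineage Association class; a sketch under that assumption, with a placeholder ARN:

# List output associations of the training job's trial component
# (assumes the SageMaker Python SDK lineage module; the ARN below is a placeholder).
from sagemaker.lineage.association import Association

trial_component_arn = "arn:aws:sagemaker:us-east-1:111122223333:experiment-trial-component/example"  # placeholder

for summary in Association.list(source_arn=trial_component_arn):
    print(summary.source_arn, "->", summary.destination_arn, summary.association_type)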

Note that this method must be run from a SageMaker context such as studio or training job due to restrictions on the CreateArtifact API.

This limitation was lifted recently; I'll update that documentation.

It would be great if these charts and other artifacts could be generated/logged from within training jobs.

If you could send me some pseudo-code of what you are doing in your train.py, that would be great. Normally the Tracker works during the training job.

danabens commented 3 years ago

If this is the problem, then these docs are very misleading, as it makes it appear that a training job generated the output charts.

Ya, I can see that all it says is "The graphs were produced using the Tracker APIs.", which doesn't give the full picture.