Azure / MachineLearningNotebooks

Python notebooks with ML and deep learning examples with Azure Machine Learning Python SDK | Microsoft
https://docs.microsoft.com/azure/machine-learning/service/
MIT License
4.07k stars 2.52k forks

[azureml python sdk v2] access files in URI_FOLDER output after job has finished? #1891

Open movingabout opened 1 year ago

movingabout commented 1 year ago

I have a training job that persists some files in a URI_FOLDER output. How can I access those files through the v2 SDK API after the job has finished?

1. job setup

The output is set up like this in the command:

from azure.ai.ml import Output, command
from azure.ai.ml.constants import AssetTypes

job = command(
    # ...
    outputs=dict(
        # the output is named "outputs" and mounted read-write as a folder
        outputs=Output(type=AssetTypes.URI_FOLDER, mode='rw_mount'),
    ),
    command=(
        "python training_script.py "
        "--outputs_dir ${{outputs.outputs}} "
        # ...other arguments...
    ),
)

This seems to work fine: the corresponding folder is mounted correctly and is accessible in the training script.

2. training script

In the training script, I persist a dataframe like this:

parser.add_argument("--outputs_dir", dest="outputs_dir", default=DEFAULT_MODEL_DIR)
# ...
some_dataframe.to_csv(os.path.join(args.outputs_dir, 'some_dataframe.csv'), index=True)

This works fine.
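For reference, the script side can be reduced to the following minimal sketch. It is not my actual training script: the default directory and the rows are made-up placeholders, and the standard-library csv module stands in for the pandas to_csv call so that the snippet runs without extra dependencies.

```python
import argparse
import csv
import os


def write_outputs(argv=None):
    """Parse --outputs_dir and write a small CSV file into that folder."""
    parser = argparse.ArgumentParser()
    # "./outputs" is a hypothetical default, not the real DEFAULT_MODEL_DIR
    parser.add_argument("--outputs_dir", dest="outputs_dir", default="./outputs")
    args = parser.parse_args(argv)

    # Make sure the folder exists, e.g. when running locally without a mount
    os.makedirs(args.outputs_dir, exist_ok=True)
    out_path = os.path.join(args.outputs_dir, "some_dataframe.csv")

    # Placeholder rows standing in for some_dataframe
    rows = [{"epoch": 1, "loss": 0.9}, {"epoch": 2, "loss": 0.7}]
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["epoch", "loss"])
        writer.writeheader()
        writer.writerows(rows)
    return out_path
```

When run as an Azure ML job, the value of --outputs_dir is the path that ${{outputs.outputs}} resolves to, so anything written there ends up in the URI_FOLDER output.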

3. resulting dataset

After the job has finished, the outputs are available as a dataset. This is what is shown in Azure ML Studio in the "Overview" tab for job ivory_octopus_yd6by49kxf:

image

The dataset is successfully stored in the workspaceblobstore datastore. I checked it in the Azure ML Studio and it looks fine.

4. accessing the persisted data

After the job has finished, I access the run using an MlflowClient():

import mlflow
from mlflow.tracking import MlflowClient

# ml_client is an authenticated azure.ai.ml.MLClient for the workspace
MLFLOW_TRACKING_URI = ml_client.workspaces.get(name=ml_client.workspace_name).mlflow_tracking_uri
mlflow.set_tracking_uri(MLFLOW_TRACKING_URI)
mlflow_client = MlflowClient()

mlflow_run = mlflow_client.get_run("ivory_octopus_yd6by49kxf")

or

run = ml_client.jobs.get('ivory_octopus_yd6by49kxf')
# returns NodeOutput class

How can I programmatically list / get / download the outputs connected to the job?

Thanks!
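One approach that may work is sketched below. It assumes that ml_client is an authenticated azure.ai.ml.MLClient and relies on the SDK's ml_client.jobs.download method with its output_name and download_path parameters. The helper name download_named_output and the default local folder are hypothetical, and the exact subfolder layout that the SDK creates under the download path is not assumed here, which is why the helper searches recursively.

```python
from pathlib import Path


def download_named_output(ml_client, job_name, output_name, download_path="./job_outputs"):
    """Download one named output of a finished job and list the local files.

    `ml_client` is assumed to be an authenticated azure.ai.ml.MLClient.
    """
    target = Path(download_path)
    target.mkdir(parents=True, exist_ok=True)

    # Ask the SDK to fetch the named output of the job into the local folder
    ml_client.jobs.download(
        name=job_name,
        output_name=output_name,
        download_path=str(target),
    )

    # Return every file found below the download folder, in sorted order
    return sorted(p for p in target.rglob("*") if p.is_file())
```

For the job above, the call would be download_named_output(ml_client, "ivory_octopus_yd6by49kxf", "outputs"), where "outputs" is the key used in the outputs dict of the command.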