Azure / azureml-examples

Official community-driven Azure Machine Learning examples, tested with GitHub Actions.
https://docs.microsoft.com/azure/machine-learning
MIT License

Using ParallelRunStep output as an input to another step #961

Open alexszym opened 2 years ago

alexszym commented 2 years ago


This is specific to Python SDK.

I'm attempting to use ParallelRunStep output as an input to another step, which I haven't been able to find an example of anywhere. My use case is simple: I want to save the output of the pipeline, with some additional transforms, as a CSV.

The closest example I've been able to find is here: https://docs.microsoft.com/en-us/azure/machine-learning/tutorial-pipeline-batch-scoring-classification#download-and-review-output, but it has a bug: the delimiter for "parallel_run_step.txt" is whitespace, not a colon.
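To make the delimiter issue concrete, here is a minimal local illustration using a fabricated two-row sample in the `append_row` format (values separated by single spaces, no header line):

```python
import io

import pandas as pd

# Fabricated sample rows imitating parallel_run_step.txt content
sample = "img1.png 0.93 cat\nimg2.png 0.12 dog\n"

# delimiter=" " is what actually parses this file; delimiter=":" (as in
# the linked tutorial) would read each row as a single column.
df = pd.read_csv(io.StringIO(sample), delimiter=" ", header=None)
print(df.shape)  # (2, 3)
```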

Here is the code in my transform.py (run after the ParallelRunStep) that I eventually got working:

transform.py

```python
import argparse
import os

import pandas as pd
from azureml.core import Run

parser = argparse.ArgumentParser(description="Transform")
parser.add_argument('--output_path', dest="output_path", required=True)

args, _ = parser.parse_known_args()

run = Run.get_context()
# Resolves to the folder containing the ParallelRunStep append_row output
input_dir = run.input_datasets["input_data"]

input_data_path = os.path.join(input_dir, "parallel_run_step.txt")

# The append_row output is whitespace-delimited and has no header row
input_df = pd.read_csv(input_data_path, delimiter=" ", header=None)

# Transform (placeholder: replace with your real transformations)
transformed_df = input_df

os.makedirs(args.output_path, exist_ok=True)
transformed_df.to_csv(os.path.join(args.output_path, "processed_data.csv"))
```
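The same read/transform/write logic can be sanity-checked locally without AzureML by pointing the paths at temp directories (the input file and the column names are fabricated for this sketch):

```python
import os
import tempfile

import pandas as pd

# Simulate the folder that run.input_datasets["input_data"] would resolve to
input_dir = tempfile.mkdtemp()
output_dir = tempfile.mkdtemp()

with open(os.path.join(input_dir, "parallel_run_step.txt"), "w") as f:
    f.write("img1.png 0.93\nimg2.png 0.12\n")

input_df = pd.read_csv(
    os.path.join(input_dir, "parallel_run_step.txt"), delimiter=" ", header=None
)
# Stand-in transform: just label the columns
transformed_df = input_df.rename(columns={0: "file", 1: "score"})
transformed_df.to_csv(os.path.join(output_dir, "processed_data.csv"), index=False)

result = pd.read_csv(os.path.join(output_dir, "processed_data.csv"))
print(result.shape)  # (2, 2)
```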

PythonScriptStep

```python
transform_step = PythonScriptStep(
    source_directory=src_dir,
    name="transform",
    script_name="transform.py",
    compute_target=compute_target,
    runconfig=aml_run_config,
    inputs=[parallel_step_output.as_input('input_data')],
    arguments=["--output_path", saved_output],
    outputs=[saved_output],  # saved_output is a PipelineData for the CSV output
)
```
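For what it's worth, the `arguments=["--output_path", saved_output]` list is passed to the script verbatim (with the output object materialized into a concrete path), so the script's `parse_known_args` picks it up like this; the path below is a made-up example:

```python
import argparse

parser = argparse.ArgumentParser(description="Transform")
parser.add_argument("--output_path", dest="output_path", required=True)

# AzureML resolves the PipelineData into a real mount path before the
# script runs; "/mnt/outputs/transform" is hypothetical.
args, unknown = parser.parse_known_args(["--output_path", "/mnt/outputs/transform"])
print(args.output_path)  # /mnt/outputs/transform
```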

It would be great to provide a similar example in the documentation and to fix the wrong delimiter in the existing examples.

ylnhari commented 1 year ago

@cloga Is anyone working on this issue? I have experience working with both SDK versions and would like to contribute by adding more comprehensive examples to the documentation. In my experience, the learning curve for the V2 SDK is a bit steep. For instance, in V1 we can upload files to a datastore with the upload_files_to_datastore() method, but in V2 we need to use azure.storage.blob and BlobServiceClient to upload directly to the blob container, which took some time to figure out. I suggest creating notebooks that compare how to perform the same tasks in each version, to help users transition more easily. What are your thoughts on this?

cloga commented 1 year ago

@alainli0928 for help.

alainli0928 commented 1 year ago

@alexszym I assume you've used the 'append_row' output of your ParallelRunStep as the input of your second transformation PythonScriptStep. If so, the current V1 SDK doesn't support customizing the output file format, but we know users are looking for this capability. In an upcoming release of the V2 SDK, we will add new parallel job attributes that let users: 1) customize the append-row output format, and 2) predefine append-row output headers. Please note that new parallel job features are not planned to be applied to the V1 SDK, so please consider trying out our V2 experience. Here is our V2 SDK example repository: https://github.com/Azure/azureml-examples/tree/main/sdk/python/jobs/parallel