Azure / MachineLearningNotebooks

Python notebooks with ML and deep learning examples with Azure Machine Learning Python SDK | Microsoft
https://docs.microsoft.com/azure/machine-learning/service/
MIT License

Pipeline parameters used with DataPath and DataPathComputeBinding to specify side inputs of Parallel pipeline #1801

Open pop134 opened 2 years ago

pop134 commented 2 years ago

I'm following this example to create a PipelineParameter for my parallel pipeline:

from azureml.core.datastore import Datastore
from azureml.data.datapath import DataPath, DataPathComputeBinding
from azureml.pipeline.steps import PythonScriptStep
from azureml.pipeline.core import PipelineParameter

datastore = Datastore(workspace=workspace, name="workspaceblobstore")
datapath = DataPath(datastore=datastore, path_on_datastore='input_data')

# The docs example passes a (PipelineParameter, DataPathComputeBinding) tuple
data_path_pipeline_param = (PipelineParameter(name="input_data", default_value=datapath),
                            DataPathComputeBinding(mode='mount'))

train_step = PythonScriptStep(script_name="train.py",
                              arguments=["--input", data_path_pipeline_param],
                              inputs=[data_path_pipeline_param],
                              compute_target=compute_target,
                              source_directory=project_folder)
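
For reference, the point of this pattern is that the path can then be overridden at submission time. Roughly like this (the experiment name here is just illustrative):

from azureml.core import Experiment
from azureml.pipeline.core import Pipeline

pipeline = Pipeline(workspace=workspace, steps=[train_step])

# Override the DataPath default at submission, keyed by the PipelineParameter name
run = Experiment(workspace, "datapath-param-demo").submit(
    pipeline,
    pipeline_parameters={"input_data": DataPath(datastore=datastore,
                                                path_on_datastore="new_input_data")}
)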

This is my code that creates the pipeline with the parameter:

from azureml.data.datapath import DataPath, DataPathComputeBinding
from azureml.pipeline.core import PipelineParameter
from azureml.pipeline.steps import ParallelRunConfig, ParallelRunStep

path = DataPath(datastore=default_store, path_on_datastore='path')
input_param = (PipelineParameter(name="param_name", default_value=path),
               DataPathComputeBinding(mode='mount'))

parallel_run_config = ParallelRunConfig(
    source_directory=script_dir,
    entry_script='script.py',  # the user script to run against each input
    partition_keys=['key'],
    error_threshold=50,
    output_action='append_row',
    environment=environment,
    compute_target=compute_target,
    node_count=2,
    run_invocation_timeout=1200
)

parallel_run_step = ParallelRunStep(
    name='test-batch-inference',
    inputs=[partition_input],
    side_inputs=[input1, input2, input_param],  # input_param is the (PipelineParameter, binding) tuple
    output=output_dir,
    parallel_run_config=parallel_run_config,
    arguments=['--input_param', input_param],
    allow_reuse=False
)

And it raised this error:

Exception: Step input must be of any type: (<class 'azureml.data.dataset_consumption_config.DatasetConsumptionConfig'>, <class 'azureml.pipeline.core.pipeline_output_dataset.PipelineOutputFileDataset'>, <class 'azureml.pipeline.core.pipeline_output_dataset.PipelineOutputTabularDataset'>, <class 'azureml.data.output_dataset_config.OutputFileDatasetConfig'>, <class 'azureml.data.output_dataset_config.OutputTabularDatasetConfig'>, <class 'azureml.data.output_dataset_config.LinkFileOutputDatasetConfig'>, <class 'azureml.data.output_dataset_config.LinkTabularOutputDatasetConfig'>), found <class 'tuple'>

I'm using azureml-core==1.40.0.post2 and azureml-pipeline==1.40.0. It seems the sample code is not supported with these versions? Before trying the DataPath as a pipeline parameter, I tried an int-typed parameter and it worked fine.
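
For comparison, the int-typed parameter I tested looked roughly like this (names here are illustrative, not my exact code):

from azureml.pipeline.core import PipelineParameter

# A plain scalar PipelineParameter passes through ParallelRunStep arguments without error
batch_size_param = PipelineParameter(name="batch_size", default_value=32)

parallel_run_step = ParallelRunStep(
    name='test-batch-inference',
    inputs=[partition_input],
    output=output_dir,
    parallel_run_config=parallel_run_config,
    arguments=['--batch_size', batch_size_param],
    allow_reuse=False
)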



Zeleni9 commented 1 year ago

It seems that PythonScriptStep takes the tuple defined as you did above, but for ParallelRunStep you need to provide one of the types listed in the error message:


# Works for PythonScriptStep:
datastore = Datastore(workspace=workspace, name="workspaceblobstore")
datapath = DataPath(datastore=datastore, path_on_datastore='input_data')
data_path_pipeline_param = (PipelineParameter(name="input_data", default_value=datapath),
                            DataPathComputeBinding(mode='mount'))

# Should be changed to this for ParallelRunStep:
from azureml.data.dataset_consumption_config import DatasetConsumptionConfig

datastore = Datastore(workspace=workspace, name="workspaceblobstore")
datapath = DataPath(datastore=datastore, path_on_datastore='input_data')
input_data_parameter = PipelineParameter(name="input_data", default_value=datapath)
input_data_consumption = DatasetConsumptionConfig("input_data_videos", input_data_parameter).as_mount()
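
The resulting DatasetConsumptionConfig can then go straight into the step. A sketch based on the snippet from the issue, reusing its placeholders (partition_input, output_dir, parallel_run_config):

parallel_run_step = ParallelRunStep(
    name='test-batch-inference',
    inputs=[partition_input],
    side_inputs=[input_data_consumption],  # DatasetConsumptionConfig, not a tuple
    output=output_dir,
    parallel_run_config=parallel_run_config,
    arguments=['--input_param', input_data_consumption],
    allow_reuse=False
)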