Azure / MachineLearningNotebooks

Python notebooks with ML and deep learning examples using the Azure Machine Learning Python SDK | Microsoft
https://docs.microsoft.com/azure/machine-learning/service/
MIT License

How to create Pipeline parameters for data stored in DataLakeGen2 and use in Azure Synapse/Data Factory? #1784

Open · hiob95 opened this issue 2 years ago

hiob95 commented 2 years ago

I am trying to create pipeline parameters for variable data access to a Synapse DataLakeGen2 datastore and to invoke the pipeline with the 'Machine Learning Execute Pipeline' activity in Azure Synapse. According to the Microsoft docs, datasets are the recommended way to interact with the AzureDataLakeGen2Datastore class. I verified this by trying DataPathComputeBinding with both the 'mount' and the 'download' mode; neither is supported for Gen2 datastores.
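Roughly, that attempt looked like the sketch below (assumptions: `datastore`, `compute_target` and the training script already exist; the names and path are illustrative placeholders):

from azureml.data.datapath import DataPath, DataPathComputeBinding
from azureml.pipeline.core import PipelineParameter
from azureml.pipeline.steps import PythonScriptStep

datapath = DataPath(datastore=datastore, path_on_datastore="some/folder")
datapath_param = PipelineParameter(name="input_data", default_value=datapath)

# both 'mount' and 'download' are rejected for Gen2 datastores
binding = DataPathComputeBinding(mode="mount")

step = PythonScriptStep(
    script_name="train.py",
    arguments=["--input", datapath_param],
    inputs=[(datapath_param, binding)],
    compute_target=compute_target,
)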

So I then tried the DatasetConsumptionConfig class to pass the data to the compute target, which requires a dataset as a pipeline parameter. Unfortunately, the 'Machine Learning Execute Pipeline' activity only supports string or DataPath parameters, so I could not find a way to pass a Dataset:

[screenshot: the activity's parameter type options, limited to string and DataPath]

I then tried to use the DataPath as the parameter input and convert it to a dataset, but the PipelineParameter class does not seem to provide any method to retrieve the underlying DataPath:

from azureml.core import Dataset
from azureml.data.datapath import DataPath
from azureml.data.dataset_consumption_config import DatasetConsumptionConfig
from azureml.pipeline.core import PipelineParameter

datapath = DataPath(datastore=datastore, path_on_datastore=path)
data_path_pipeline_param = PipelineParameter(name="input_data", default_value=datapath)

# does not work: from_parquet_files() cannot resolve a PipelineParameter,
# and PipelineParameter exposes no way to retrieve the wrapped DataPath
dataset_parquet = Dataset.Tabular.from_parquet_files(data_path_pipeline_param)
ds_consumption = DatasetConsumptionConfig("input", dataset_parquet)
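
For context, the dataset-as-pipeline-parameter pattern that DatasetConsumptionConfig expects looks roughly like this (a sketch; the default dataset and path are illustrative), and as far as I can tell there is no way to supply such a parameter from the Synapse activity:

from azureml.core import Dataset
from azureml.data.dataset_consumption_config import DatasetConsumptionConfig
from azureml.pipeline.core import PipelineParameter

# a default dataset baked in at authoring time
default_ds = Dataset.Tabular.from_parquet_files(
    path=(datastore, "some/folder/*.parquet")
)
dataset_param = PipelineParameter(name="input_dataset", default_value=default_ds)
ds_consumption = DatasetConsumptionConfig("input", dataset_param)
# ds_consumption would then be passed to a step's inputs, but the Synapse
# 'Machine Learning Execute Pipeline' activity cannot supply a Dataset at
# invocation time (only string or DataPath parameters)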

Is there a recommended way to do this?