Can't use OutputFileDatasetConfig as AutoML pipeline input

metazool commented 3 years ago

We have been trying to use the OutputFileDatasetConfig of a ParallelRunStep as input to an AutoMLStep and found that we couldn't; it appears not to work as intended. The input to AutoML needs to be a registered Dataset, but calling register_on_complete on the OutputFileDatasetConfig isn't enough to be able to reuse it.

Like others logging issues here, we've been struggling to get the AutoML part of the SDK to work as the documentation suggests it should, and are running into limitations around column type inference that aren't apparent when you drive it via the web studio.

We spent a lot of time creating a custom FeaturizationConfig to specify column types while trying to resolve the issue of being unable to directly reuse a pipeline-created tabular dataset.
The label_column_name parameter doesn't accept integers as column indexes for datasets without header rows, although the documentation says it should.
The read_delimited_files() method on OutputFileDatasetConfig behaves differently from from_delimited_files() on TabularDatasetFactory and won't produce the right inferred column types for a large number of columns.

In outline what we are doing is this:

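from azureml.automl.core.featurization import FeaturizationConfig
from azureml.data import OutputFileDatasetConfig
from azureml.pipeline.core import Pipeline
from azureml.pipeline.steps import AutoMLStep, ParallelRunStep
from azureml.train.automl import AutoMLConfig

# This step produces a CSV with a label and 1600 numeric columns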
features_output = OutputFileDatasetConfig(name="feature_extraction").read_delimited_files().register_on_complete('dataset-name')
batch_feature_extract_step = ParallelRunStep(
    name=parallel_step_name,
    inputs=[previous_output.as_input()],
    output=features_output,
    parallel_run_config=parallel_run_config,
    allow_reuse=False
)

automl_settings = {
    "experiment_timeout_minutes": 40,
    "max_concurrent_iterations": 4,
    "primary_metric": "accuracy"
}

featurization_config = FeaturizationConfig()
for i in range(2, 1602):
    featurization_config.add_column_purpose(f'Column{i}', 'Numeric')

# This step attempts to use the previous step's output as input
automl_config = AutoMLConfig(compute_target=compute_target,
                             task="classification",
                             path="../src",
                             training_data=features_output.as_input(),
                             label_column_name="Column1",  # int doesn't work like docs say
                             enable_early_stopping=True,
                             featurization=featurization_config,  # turning this off leads to an error
                             debug_log="automl_errors.log",
                             test_size=0.1,
                             validation_size=0.1,
                             **automl_settings
                            )

automl_step = AutoMLStep(
     name='automl_classification',
     automl_config=automl_config,
     outputs=[model_output, metrics_output],
     allow_reuse=True)

pipeline = Pipeline(workspace=ws, steps=[batch_feature_extract_step, automl_step])

We're seeing errors like this:

raise ValueError("The DatasetConsumptionConfig for {} must be constructed with a ".format(arg) +
ValueError: The DatasetConsumptionConfig for training_data must be constructed with a TabularDataset or OutputTabularDatasetConfig.

This happens even though we are calling read_delimited_files on the OutputFileDatasetConfig to convert it to tabular form, as the documentation suggests (with sparse detail) we should.

Feeding AutoML the output of a previous pipeline step seems like a reasonable thing to want to do; is there a better suggested workaround than breaking our pipeline in half? We've also seen the behaviours in #1494 and #1605. It generally feels like the AutoML SDK is undercooked; what is the web studio doing internally that we are missing? Is it possible to provide a more informative notebook than the one available in this repository, for those of us struggling with it?

https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.data.output_dataset_config.outputtabulardatasetconfig?view=azure-ml-py

metazool commented 3 years ago

We know this isn't a flaw in the notebook repository, rather in the underlying Azure ML SDK, but there doesn't seem to be a public repo where we can lodge SDK-specific issues and/or look at the product team's bug backlog.

AHaryanto commented 2 years ago

I had the same issue with label_column_name and integer.

"message": "Argument [label_column_name] is of unsupported type: [<class 'int'>]. Supported type(s): [int, str]"

azureml-sdk v.1.39.0

mjgolebiewski commented 2 years ago

encountered the same issue. @metazool did you manage to find any workaround?

this helped me (azureML docs)

metazool commented 1 year ago

No, we never did find a fix - ended up breaking the pipeline into two, registering a TabularDataset at the end of the first one and passing its name into the second.
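A minimal sketch of the second half of that split might look like the following. It assumes the first pipeline registered its output under the name 'dataset-name' (as in the register_on_complete call in the original post). The helper names load_registered_dataset and build_automl_config are made up for illustration; Dataset.get_by_name and AutoMLConfig are the SDK v1 calls, and the remaining AutoMLConfig arguments from the original snippet are left out for brevity.

```python
def load_registered_dataset(workspace, dataset_name):
    """Fetch the TabularDataset that the first pipeline registered."""
    # Imported lazily so the SDK is only required when this actually runs.
    from azureml.core import Dataset
    return Dataset.get_by_name(workspace, name=dataset_name)


def build_automl_config(training_data, compute_target, label_column_name="Column1"):
    """Build the AutoMLConfig for the second pipeline (hypothetical helper)."""
    from azureml.train.automl import AutoMLConfig
    return AutoMLConfig(
        task="classification",
        compute_target=compute_target,
        training_data=training_data,  # a registered TabularDataset, not a step output
        label_column_name=label_column_name,
        primary_metric="accuracy",
    )
```

In the second pipeline, the result of load_registered_dataset(ws, "dataset-name") would be passed to build_automl_config, and the resulting config handed to AutoMLStep as before.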

Meanwhile MS have rushed on to Azure ML SDK v2 / AutoML for Images, and every experience of them just feels like lost time sunk into undercooked and underdocumented code. It's such promising functionality, but it really misses the mark on user experience.

byronverzmoter commented 1 year ago

@metazool this might be a bit late, but I had the same issue when trying to pass data to an AutoML step; the horrible dtype inference was mainly where I gave up on CSVs.

Try writing the tabular data (assuming a pandas dataframe) to a parquet file and then using .read_parquet_files().register_on_complete(). As you mentioned, .read_delimited_files() is different to .from_delimited_files(), but the parquet method seems to have some consistency (at least in my experience).
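A rough sketch of that suggestion, assuming the feature extraction script holds its results in a pandas DataFrame and is handed an output directory. How the script receives that path isn't covered in this thread, so output_dir is an assumption, as are the helper names and the default file name. DataFrame.to_parquet needs pyarrow or fastparquet installed.

```python
import os


def write_features_parquet(df, output_dir, filename="features.parquet"):
    """Write a pandas DataFrame into output_dir as a Parquet file and return its path."""
    os.makedirs(output_dir, exist_ok=True)
    path = os.path.join(output_dir, filename)
    # Parquet stores each column's dtype, so nothing has to be re-inferred from text.
    df.to_parquet(path, index=False)
    return path


def parquet_step_output(name, registered_name):
    """Describe a step output that is read back as tabular data and registered."""
    # Imported lazily so the SDK is only required on the pipeline-definition side.
    from azureml.data import OutputFileDatasetConfig
    return (
        OutputFileDatasetConfig(name=name)
        .read_parquet_files()
        .register_on_complete(name=registered_name)
    )
```

In the pipeline definition, parquet_step_output("feature_extraction", "dataset-name") would stand in for the read_delimited_files() line from the original post.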

Azure / MachineLearningNotebooks

Can't use OutputFileDatasetConfig as AutoML pipeline input #1607