metazool opened 3 years ago
We know this isn't a flaw in the notebook repository but rather in the underlying Azure ML SDK; however, there doesn't seem to be a public repo where we can lodge SDK-specific issues and/or look at the product team's bug backlog.
I had the same issue with `label_column_name` and an integer value:
"message": "Argument [label_column_name] is of unsupported type: [<class 'int'>]. Supported type(s): [int, str]"
azureml-sdk v1.39.0
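For context, a minimal sketch of the kind of call that triggers it (the dataset name is a placeholder; assume it points at a headerless delimited dataset):

```python
from azureml.core import Workspace, Dataset
from azureml.train.automl import AutoMLConfig

ws = Workspace.from_config()

# A registered tabular dataset without a header row, so columns only have
# positional names (placeholder dataset name).
train_ds = Dataset.get_by_name(ws, name="my-headerless-dataset")

# Passing the label column as an integer index, which the docs say should be
# allowed for headerless data, is rejected with
# "Argument [label_column_name] is of unsupported type: [<class 'int'>]".
automl_config = AutoMLConfig(
    task="classification",
    training_data=train_ds,
    label_column_name=0,
)
```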
Encountered the same issue. @metazool did you manage to find any workaround?
No, we never did find a fix; we ended up breaking the pipeline in two, registering a `TabularDataset` at the end of the first one and passing its name into the second.
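Roughly, the seam between the two pipelines looks like this (a sketch only; the datastore, paths and dataset name are placeholders):

```python
from azureml.core import Workspace, Dataset, Datastore
from azureml.train.automl import AutoMLConfig

ws = Workspace.from_config()
datastore = Datastore.get(ws, "workspaceblobstore")

# End of the first pipeline: register whatever the last step wrote out
# as a TabularDataset under a well-known name.
prepared = Dataset.Tabular.from_delimited_files(
    path=(datastore, "prepared/output/*.csv")
)
prepared.register(ws, name="automl-training-data", create_new_version=True)

# Start of the second pipeline: look the dataset up by name and hand it
# to AutoML as a registered dataset.
training_data = Dataset.get_by_name(ws, name="automl-training-data")
automl_config = AutoMLConfig(
    task="regression",
    training_data=training_data,
    label_column_name="target",
)
```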
Meanwhile MS have rushed on to Azure ML SDK v2 / AutoML for Images, and every experience of them just feels like lost time sunk into undercooked and underdocumented code. It's such promising functionality, but it really misses the mark on the user experience.
@metazool this might be a bit late, but I had the same issue when trying to pass data to an AutoML step; the horrible dtype inference on CSVs is mainly where I gave up.
Try writing the tabular data (assuming a pandas DataFrame) to a parquet file and then using `.read_parquet_files().register_on_complete()`. As you mentioned, `.read_delimited_files()` behaves differently to `.from_delimited_files()`, but the parquet method seems to have some consistency (at least in my experience).
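Something along these lines worked for me (a sketch with placeholder names for the datastore and output; the step script itself is trimmed down):

```python
# In the step script: write the prepared dataframe as parquet so the column
# dtypes survive instead of being re-inferred from CSV text.
import os
import pandas as pd

def save_output(df: pd.DataFrame, output_dir: str) -> None:
    os.makedirs(output_dir, exist_ok=True)
    df.to_parquet(os.path.join(output_dir, "prepared.parquet"))

# In the pipeline definition: declare the output as parquet-backed tabular
# data and register it when the step finishes.
from azureml.core import Workspace, Datastore
from azureml.data import OutputFileDatasetConfig

ws = Workspace.from_config()
datastore = Datastore.get(ws, "workspaceblobstore")

prepared_data = (
    OutputFileDatasetConfig(name="prepared_data",
                            destination=(datastore, "prepared/{run-id}"))
    .read_parquet_files()
    .register_on_complete(name="automl-training-data")
)
```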
We have been trying to use the `OutputFileDatasetConfig` of a `ParallelRunStep` as input to an `AutoMLStep` and found that we couldn't; it appears not to work as intended. The input to AutoML needs to be a registered `Dataset`, but calling `register_on_complete` on the `OutputFileDatasetConfig` isn't enough to be able to reuse it.

Like others logging issues here, we've been struggling to get the AutoML part of the SDK to work as the documentation suggests it should, and are running into limitations around column type inference that aren't apparent when you drive it via the web studio.
- The `label_column_name` parameter doesn't accept integers as column indexes for datasets without header rows, even though the documentation says it should.
- The `read_delimited_files()` method on `OutputFileDatasetConfig` has different behaviour to `from_delimited_files()` on `TabularDatasetFactory` and won't produce the right inferred column types for a large number of columns.

In outline, what we are doing is this:
We're seeing errors like this:
Even though we are calling `read_delimited_files` on the `OutputFileDatasetConfig` to convert it to tabular form, as the documentation (with sparse detail) suggests we should.

Feeding AutoML the output of a previous pipeline step seems like a reasonable thing to want to do; is there a better suggested workaround than breaking our pipeline in half? We've also seen the behaviours in #1494 and #1605; it generally feels like the AutoML SDK is undercooked. What is the web studio doing internally that we are missing? Is it possible to provide a more informative notebook, for those of us struggling with it, than the one available in this repository?
https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.data.output_dataset_config.outputtabulardatasetconfig?view=azure-ml-py