Azure / MachineLearningNotebooks

Python notebooks with ML and deep learning examples with Azure Machine Learning Python SDK | Microsoft
https://docs.microsoft.com/azure/machine-learning/service/
MIT License

Error: The input data is empty. Ensure data correctness and availability. #1374

Closed levalencia closed 3 years ago

levalencia commented 3 years ago

I have the following code, and I am very sure the dataset is not empty!

import logging

from azureml.core import Dataset
from azureml.core.workspace import Workspace
from azureml.core.experiment import Experiment
from azureml.core.compute import ComputeTarget
from azureml.train.automl import AutoMLConfig
from azureml.automl.core.forecasting_parameters import ForecastingParameters

# Connect to the workspace and load the registered dataset.
ws = Workspace(subscription_id, resource_group, workspace_name)
dstraining_datasensor1 = Dataset.get_by_name(ws, name='sensor1')

# Hourly data, 5-step horizon, one time series per "eui" value.
forecasting_parametersSensor1 = ForecastingParameters(time_column_name='EventEnqueuedUtcTime',
                                               forecast_horizon=5,
                                               time_series_id_column_names=["eui"],
                                               freq='H',
                                               target_lags='auto',
                                               target_rolling_window_size=10)

# Existing compute cluster and the experiment to run under.
amlcompute_cluster_name = "computecluster"
compute_target = ComputeTarget(workspace=ws, name=amlcompute_cluster_name)
experiment_name = 'iot-forecast'

experiment = Experiment(ws, experiment_name)

automl_configSensor1 = AutoMLConfig(task='forecasting',
                             primary_metric='normalized_root_mean_squared_error',
                             experiment_timeout_minutes=100,
                             enable_early_stopping=True,
                             training_data=dstraining_datasensor1,
                             compute_target=compute_target,
                             label_column_name='TempC_DS',
                             n_cross_validations=5,
                             enable_ensembling=False,
                             verbosity=logging.INFO,
                             forecasting_parameters=forecasting_parametersSensor1)

remote_run = experiment.submit(automl_configSensor1, show_output=True)

However after some minutes, in the experiment I get this:

Status Failed  Error: The input data is empty. Ensure data correctness and availability.

I checked the dataset and it's definitely not empty.

fausttiger007 commented 3 years ago

similar issue, classification, no missing data, no nulls.
azureml-core 1.24, azureml-train-automl 1.24

automl_config = AutoMLConfig(name=experiment_name,
                         task='classification',
                         compute_target=training_cluster,
                         training_data=train_ds,
                         validation_data=test_ds,
                         label_column_name=goal,
                         iterations=6,
                         primary_metric='AUC_weighted',
                         #primary_metric = 'accuracy',
                         #primary_metric = 'average_precision_score_weighted',
                         #primary_metric = 'norm_macro_recall',
                         max_concurrent_iterations=6,
                         featurization='auto'
                         )

ValidationException Traceback (most recent call last)

in
      4 print('Submitting Auto ML experiment...')
      5 automl_experiment = Experiment(ws, experiment_name)
----> 6 automl_run = automl_experiment.submit(automl_config)
      7 RunDetails(automl_run).show()
      8 automl_run.wait_for_completion(show_output=True)

fausttiger007 commented 3 years ago

Traceback:
  File "experiment_driver.py", line 213, in start
    kwargs=kwargs
  File "remote_experiment_launcher.py", line 79, in start
    validation_data=validation_data, test_data=test_data)
  File "driver_utilities.py", line 310, in start_remote_run
    training_data=training_data, validation_data=validation_data, test_data=test_data)
  File "driver_utilities.py", line 367, in _create_remote_parent_run
    test_data=test_data,
  File "driver_utilities.py", line 99, in create_and_validate_parent_run_dto
    validate_input(experiment_state, parent_run_dto)
  File "driver_utilities.py", line 138, in validate_input
    ExecutionFailure, operation_name="data/settings validation", error_details=msg)

ExceptionTarget: Unspecified
2021-03-10 05:48:24.885 - INFO - 29333 - azureml.train.automl._experiment_drivers.driver_utilities.fail_parent_run:245 - No parent run to fail
2021-03-10 05:51:36.333 - INFO - 29333 - azureml.train.automl._experiment_drivers.driver_utilities.modify_run_configuration:199 - Package azureml-train-automl missing from dependencies file.
2021-03-10 05:51:36.514 - INFO - 29333 - azureml.train.automl._experiment_drivers.driver_utilities.modify_run_configuration:205 - Using pinned version: azureml-train-automl==1.24.0.*
2021-03-10 05:51:36.520 - INFO - 29333 - azureml.train.automl._experiment_drivers.driver_utilities.modify_run_configuration:199 - Package pandas missing from dependencies file.
2021-03-10 05:51:36.527 - INFO - 29333 - azureml.train.automl._experiment_drivers.driver_utilities.modify_run_configuration:205 - Using pinned version: pandas==0.25.1
2021-03-10 05:51:36.533 - INFO - 29333 - azureml.train.automl._experiment_drivers.driver_utilities.modify_run_configuration:199 - Package psutil missing from dependencies file.
2021-03-10 05:51:36.540 - INFO - 29333 - azureml.train.automl._experiment_drivers.driver_utilities.modify_run_configuration:205 - Using pinned version: psutil>5.0.0,<6.0.0
2021-03-10 05:51:36.547 - INFO - 29333 - azureml.train.automl._experiment_drivers.driver_utilities.modify_run_configuration:199 - Package scikit-learn missing from dependencies file.
2021-03-10 05:51:36.553 - INFO - 29333 - azureml.train.automl._experiment_drivers.driver_utilities.modify_run_configuration:205 - Using pinned version: scikit-learn==0.22.1
2021-03-10 05:51:36.560 - INFO - 29333 - azureml.train.automl._experiment_drivers.driver_utilities.modify_run_configuration:199 - Package numpy missing from dependencies file.
2021-03-10 05:51:36.570 - INFO - 29333 - azureml.train.automl._experiment_drivers.driver_utilities.modify_run_configuration:205 - Using pinned version: numpy~=1.18.0
2021-03-10 05:51:36.577 - INFO - 29333 - azureml.train.automl._experiment_drivers.driver_utilities.modify_run_configuration:199 - Package py-xgboost missing from dependencies file.
2021-03-10 05:51:36.582 - INFO - 29333 - azureml.train.automl._experiment_drivers.driver_utilities.modify_run_configuration:205 - Using pinned version: py-xgboost<=0.90
2021-03-10 05:51:36.590 - INFO - 29333 - azureml.train.automl._experiment_drivers.driver_utilities.modify_run_configuration:199 - Package inference-schema missing from dependencies file.
2021-03-10 05:51:36.596 - INFO - 29333 - azureml.train.automl._experiment_drivers.driver_utilities.modify_run_configuration:199 - Package fbprophet missing from dependencies file.
2021-03-10 05:51:36.604 - INFO - 29333 - azureml.train.automl._experiment_drivers.driver_utilities.modify_run_configuration:205 - Using pinned version: fbprophet==0.5
2021-03-10 05:51:36.605 - INFO - 29333 - azureml.train.automl._experiment_drivers.driver_utilities.modify_run_configuration:199 - Package setuptools-git missing from dependencies file.
2021-03-10 05:51:36.605 - INFO - 29333 - azureml.train.automl._experiment_drivers.driver_utilities.modify_run_configuration:209 - Using installed version: setuptools-git==1.2
2021-03-10 05:51:36.622 - INFO - 29333 - azureml.train.automl._experiment_drivers.driver_utilities.validate_input:105 - Start data validation.
2021-03-10 05:51:45.379 - CRITICAL - 29333 - azureml.train.automl._experiment_drivers.driver_utilities.log_traceback:224 - Type: ExecutionFailure

Class: ValidationException
Message: ValidationException:
Message: Failed to execute the requested operation: data/settings validation. Error details: Validation error(s): [{
    "additional_properties": {
        "debugInfo": null
    },
    "code": "UserError",
    "severity": 2,
    "message": "The input data is empty. Ensure data correctness and availability.",
    "message_format": "The input data is empty. Ensure data correctness and availability.",
    "message_parameters": {
        "0": "System.Collections.Generic.Dictionary`2[System.String,System.String]"
    },
    "reference_code": null,
    "details_uri": null,
    "target": "training_data",
    "details": [
        {
            "additional_properties": {
                "debugInfo": null
            },
            "code": null,
            "severity": null,
            "message": "null",
            "message_format": null,
            "message_parameters": {},
            "reference_code": null,
            "details_uri": null,
            "target": null,
            "details": [],
            "inner_error": null
        }
    ],
    "inner_error": {
        "additional_properties": {},
        "code": "BadData",
        "inner_error": {
            "additional_properties": {},
            "code": "EmptyData",
            "inner_error": {
                "additional_properties": {},
                "code": "DatasetEmptyDatafile",
                "inner_error": null
            }
        }
    }
}]

fausttiger007 commented 3 years ago

Could it be that the TabularDataset connection to the source registered dataset is broken, and hence it's not pulling any data? My setup is a registered tabular dataset, split into training and testing tabular datasets (which are not registered) using TabularDataset-compatible filtering. Technically these are pointers to the source registered tabular dataset; the actual transformations/filtering are only applied when the compute pulls the data.

fausttiger007 commented 3 years ago

Looks like in my case the problem is using the .filter() construct to filter out data.

WORKS

train_ds = st130_dataset_ds.keep_columns(keep)
test_ds = st130_dataset_ds.keep_columns(keep)

DOES NOT WORK

train_ds = st130_dataset_ds.keep_columns(keep).filter((st130_dataset_ds['partition']=='train') & (st130_dataset_ds['WeightConsign_CountFrom_mod10']==True))
test_ds = st130_dataset_ds.keep_columns(keep).filter((st130_dataset_ds['partition']=='score') & (st130_dataset_ds['WeightConsign_CountFrom_mod10']==True))

The .filter() is causing AutoML to think there is no data... it's executing the filter when it pulls the data.

HOWEVER, IF I CHECK DATA IN TABULAR DATASET (pull into pandas), there is valid data there.

train_sample = train_ds.to_pandas_dataframe()
train_sample.head(10)

AutoML on remote cluster is not really being passed train_ds and test_ds tabular data, but a pointer to those tabular datasets (plus any filter instructions). But in this case the remote cluster is somehow not seeing the data post-filter from the tabular datasets.

source tabular dataset is registered... train_ds and test_ds are not registered (but supposedly do not need to be registered). Guess I can prep data beforehand and not use Filter(), but less ease of use. Other filters like supported .keep_columns() do work in this case on compute cluster.
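The idea described above, that a derived dataset is only a pointer to the source plus a list of recorded steps, can be pictured with a toy model. This is a minimal sketch in plain Python and not the azureml implementation: the class `LazyTable` and its `materialize()` method are hypothetical names that only mimic how `keep_columns()` and `filter()` defer their work until the data is actually pulled.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]


class LazyTable:
    """Toy stand-in for a lazily evaluated tabular dataset (hypothetical)."""

    def __init__(self, source: List[Row], steps=None):
        self._source = source            # pointer to the shared source rows
        self._steps = list(steps or [])  # recorded steps, not yet executed

    def keep_columns(self, columns: List[str]) -> "LazyTable":
        step = lambda rows: [{c: r[c] for c in columns} for r in rows]
        return LazyTable(self._source, self._steps + [step])

    def filter(self, predicate: Callable[[Row], bool]) -> "LazyTable":
        step = lambda rows: [r for r in rows if predicate(r)]
        return LazyTable(self._source, self._steps + [step])

    def materialize(self) -> List[Row]:
        # Only here are the recorded steps applied, in order.
        rows = self._source
        for step in self._steps:
            rows = step(rows)
        return rows


source = [
    {"partition": "train", "x": 1, "y": 0},
    {"partition": "score", "x": 2, "y": 1},
    {"partition": "train", "x": 3, "y": 1},
]
table = LazyTable(source)
train = table.filter(lambda r: r["partition"] == "train").keep_columns(["x", "y"])
train_rows = train.materialize()
```

In this model, whoever calls `materialize()` decides when and on how much data the filter runs, which is why a check that looks at only part of the source could disagree with a full pull in the notebook.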

fausttiger007 commented 3 years ago

levalencia,

Do you need a "validation_data =" setting in your config?

kimix92 commented 3 years ago

Hello,

We are looking into this issue and will reach back if any extra information is needed.

kimix92 commented 3 years ago

The error seems to be a data validation issue that occurs with remote runs. We have handled it before and are looking to avoid it in the future. If you retry using local compute and a pandas dataframe as input, it should work. Please try this and let us know if you still get an error.

fausttiger007 commented 3 years ago

Switching to local compute causes an identical error with filter() applied to the tabular dataset.

If I remove the .filter(...) statement from the TabularDataset prep, it runs, local or remote.

The dataset.keep_columns() method works fine with the tabular dataset.
The dataset.split() method works fine with the tabular dataset.

It's only the TabularDataset.filter() method that causes the empty dataset/validation error, even though I can pull all the data into a pandas dataframe.

But I'm running AutoML, which requires a TabularDataset for full functionality, and pandas isn't a substitute.

Can't use this AutoML workflow with pandas:

   1922 AzureMLError.create(
   1923 InvalidInputDatatype, target=target, input_type=input_type,
-> 1924 supported_types=SupportedInputDatatypes.TABULAR_DATASET

I could prep the data in pandas and then register it as a TabularDataset for the remote compute run, but that requires multiple registered datasets instead of just one with different filter/keep_columns methods applied.


fausttiger007 commented 3 years ago

cluster_name = "local_compute"

training_cluster = ComputeTarget(workspace=ws, name=cluster_name)

automl_config = AutoMLConfig(name=experiment_name,
                         task='classification',
                         compute_target=training_cluster,
                         training_data=train_ds,
                         validation_data=test_ds,
                         label_column_name=goal,
                         iterations=2,
                         primary_metric='AUC_weighted',
                         #primary_metric = 'accuracy',
                         #primary_metric = 'average_precision_score_weighted',
                         #primary_metric = 'norm_macro_recall',
                         max_concurrent_iterations=4,
                         featurization='auto'
                         )

from azureml.core.experiment import Experiment
from azureml.widgets import RunDetails

print('Submitting Auto ML experiment...')
automl_experiment = Experiment(ws, experiment_name)
automl_run = automl_experiment.submit(automl_config)


CESARDELATORRE commented 3 years ago

Following up with this issue. I believe there can be multiple/different bugs here, since the initial AutoML code from @levalencia is not using Filter() but the issue from @fausttiger uses Filter() and also fails in local runs.

@levalencia can you provide the AutoML parent Run ID where this issue happened to you?

@fausttiger can you provide the multiple AutoML parent Run IDs where this issue happened to you? (One Run ID for REMOTE RUN and one Run ID for LOCAL RUN)

Also, if you are willing to provide a sample dataset and sample notebook with the repro, please, send it to me by email to: cesardl at microsoft dot com

fausttiger007 commented 3 years ago

Agreed, levalencia is apparently not using filter(), so that's a different issue.

However, I don't have Run IDs when using filter(), because AutoML doesn't see any data: it just reports the data validation error (empty data) and terminates before starting a new run.

Somehow filter() is keeping the tabular datasets from being passed to local or remote compute via AutoML.

If I understand correctly, data is not actually passed to AutoML, but rather the TabularDataset location and the instructions to transform it (filter(), keep_columns(), random_split(), etc.), and the compute then pulls directly from the tabular dataset?

In the notebook I can pull all the data from the tabular datasets into a pandas dataframe, so the data is there, at least before the data instructions are passed to AutoML.


CESARDELATORRE commented 3 years ago

@fausttiger Could you send me a pointer to the sample dataset and notebook to repro the issue? email to: cesardl at microsoft dot com

fausttiger007 commented 3 years ago

Cesar

I’ll have to see if this is possible… this is my personal GitHub account, but this error is on one of our corporate Azure subscriptions. We’re not allowed to post to GitHub forums from our Enterprise accounts… hence I was posting via my personal.

There may be a sanitized dataset and simpler notebook I can cobble together to reproduce.


CESARDELATORRE commented 3 years ago

@fausttiger Please, send it to my email (or pointer to it) so we can repro and start investigating the bug/issue, ok? 👍

fausttiger007 commented 3 years ago

Maybe I can reproduce with the original diabetes dataset for this DP100 repo notebook.


fausttiger007 commented 3 years ago

Cesar

Been testing various options and attempting to reproduce in a smaller focused dataset and notebook to easily handoff… but,

I'm beginning to suspect this might be some data size/memory issue with Filter() when AutoML is pulling the tabular dataset for its data validation phase…

Whenever I attempt to replicate the problem with smaller datasets, it works, even if filter() is used.
But it fails with larger datasets. Without filter(), the other tabular dataset methods work with larger datasets.

This current tabular dataset is 192K records with 96 columns (110M)… even though my Filter() reduces this to about 7K records and 30 columns (into train and test tabular datasets)… I suspect the problem is before this data reduction (Pandas can still pull all the data from train and test datasets within the notebook)…, but AutoML is having issues when the tabular dataset/filter() instructions are passed to it for data validation and execution.

This Large dataset failed to validate (empty dataset)

I removed all but 2,000 records and it runs, Filters and all….

And I even attempted to replicate (using Filter()) with the example diabetes data (10,000 recs X 10 cols) and notebook (only a couple lines added to add Filter(s) )

08B - Using Automated Machine Learning_Test.ipynb, from https://github.com/MicrosoftLearning/DP100

With FILTER() commands added… and it works.

Maybe I can synthesize a large and small dataset with essentially the same data to reproduce my hypothesis. Maybe just bloat the diabetes dataset with duplicate data to test.


CESARDELATORRE commented 3 years ago

Hi @fausttiger I think you are on the right track. We believe the empty data error is probably caused by our quick profile result that automl validation is using. We'll involve the devs related to the quick dataset profile validation in order to investigate this issue/bug. Thanks for your feedback! 👍

CESARDELATORRE commented 3 years ago

@fausttiger Would you be able to provide us your data reproducing the issue (the larger dataset?) with mock data of same size replacing your values with random ones?

CESARDELATORRE commented 3 years ago

@fausttiger Hi, the team checked a few things, and it seems that this error happens when you have more than 10K rows and you also use a filter that doesn't match any rows in the first 10K rows. We reproduced it like this:

image

We'll create a bug to fix the error message.
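The failure mode described above can be simulated without Azure ML at all. The following is a hedged sketch in plain Python that assumes the quick profile simply looks at the first 10K rows; `make_rows`, `looks_empty_quick`, and `looks_empty_full` are hypothetical helper names, and the actual validation service may work differently.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]
QUICK_PROFILE_ROWS = 10_000  # sample size reported in this thread


def make_rows(total: int) -> List[Row]:
    """Synthetic rows: the first half is 'score', the second half 'train'."""
    half = total // 2
    return [
        {"id": i, "partition": "score" if i < half else "train"}
        for i in range(total)
    ]


def looks_empty_quick(rows: List[Row], predicate: Callable[[Row], bool],
                      sample_size: int = QUICK_PROFILE_ROWS) -> bool:
    """Emptiness check that only inspects the first `sample_size` rows."""
    return not any(predicate(r) for r in rows[:sample_size])


def looks_empty_full(rows: List[Row], predicate: Callable[[Row], bool]) -> bool:
    """Emptiness check that scans every row."""
    return not any(predicate(r) for r in rows)


def is_train(row: Row) -> bool:
    return row["partition"] == "train"


large = make_rows(40_000)  # no 'train' rows among the first 10K
small = make_rows(2_000)   # fits entirely inside the sample

large_quick = looks_empty_quick(large, is_train)
large_full = looks_empty_full(large, is_train)
small_quick = looks_empty_quick(small, is_train)
```

Under these assumptions the large dataset is wrongly reported as empty by the sampled check but not by the full scan, while the small dataset passes either way, which matches the behaviour reported earlier in the thread.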

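As a workaround for you, it looks like there are two options:

- Shuffle the data so that the filter condition has some rows in the first 10K rows (quick profile of the data).
- Use the full profile for the validation AutoML does (you need to have generated a Dataset profile first, from the AML UI Dataset page).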
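The shuffle workaround mentioned in this thread can be illustrated on synthetic rows. This is only a sketch under the assumption that the source data can be reordered before it is registered; `shuffled` and `matches_in_head` are hypothetical helpers, and a fixed seed is used so the result is repeatable.

```python
import random
from typing import Callable, Dict, List

Row = Dict[str, object]


def shuffled(rows: List[Row], seed: int = 0) -> List[Row]:
    """Return a shuffled copy of `rows`; the input list is left untouched."""
    rng = random.Random(seed)
    out = list(rows)
    rng.shuffle(out)
    return out


def matches_in_head(rows: List[Row], predicate: Callable[[Row], bool],
                    head: int = 10_000) -> int:
    """Count rows matching `predicate` among the first `head` rows."""
    return sum(1 for r in rows[:head] if predicate(r))


def is_train(row: Row) -> bool:
    return row["partition"] == "train"


# Ordered source: 20,000 'score' rows followed by 20,000 'train' rows.
rows = [
    {"id": i, "partition": "score" if i < 20_000 else "train"}
    for i in range(40_000)
]

before = matches_in_head(rows, is_train)
after = matches_in_head(shuffled(rows), is_train)
```

Before shuffling, none of the first 10K rows match the filter; after shuffling, roughly half of them should, so a check limited to the head of the data no longer sees an empty result.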

@fausttiger Can you confirm if this is aligned to your issue or if you think it's a different issue? Thanks! 👍

fausttiger007 commented 3 years ago

Thanks Cesar,

Sounds reasonable… The 1st of 2 filtered fields was a PARTITION field ( train or score) which would be the same in the first 10K records. There was a second field that was mostly every 10 records [ i.e. Filter ((condition 1) & (condition 2) ) ] but maybe one condition not changing in 10K is adequate to cause the issue.

Also explains why it worked with smaller subsets of records (or when I duplicated the Diabetes test database from 10K to 1M rows it still failed as I just duplicated rows for the test).

Regarding "Use the full profile for the validation AutoML does (You need to have generated a Dataset profile, first, from the AML UI Dataset page)": how do I reference the Dataset profile for data validation by AutoML? (The dataset was profiled already.)

I’ve profiled manually before from the workspace. I assume it can be checked for existence and generated programmatically? Which method?

FYI, I also did the data prep filtering in pandas, then wrote the data back out to a tabular dataset and ran AutoML successfully without issue (the resulting train and test datasets are 11K and 7K), with no tabular filter() method used, only .keep_columns().
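The "filter first, then register the result" approach described above can be sketched as follows. Note that this swaps in the standard-library `csv` module instead of pandas to stay dependency-free, and it stops at producing one CSV text per partition; uploading and registering those files as tabular datasets is left out. The function name `split_csv` and the sample column names are hypothetical.

```python
import csv
import io
from typing import Dict, List


def split_csv(source_csv: str, partition_column: str, keep: List[str]) -> Dict[str, str]:
    """Return {partition value: CSV text}, keeping only the `keep` columns."""
    reader = csv.DictReader(io.StringIO(source_csv))
    buffers: Dict[str, io.StringIO] = {}
    writers: Dict[str, csv.DictWriter] = {}
    for row in reader:
        key = row[partition_column]
        if key not in writers:
            # First row of a new partition: open a buffer and write the header.
            buffers[key] = io.StringIO()
            writers[key] = csv.DictWriter(
                buffers[key], fieldnames=keep, lineterminator="\n"
            )
            writers[key].writeheader()
        writers[key].writerow({column: row[column] for column in keep})
    return {key: buffer.getvalue() for key, buffer in buffers.items()}


source = "partition,x,label\ntrain,1,0\nscore,2,1\ntrain,3,1\n"
parts = split_csv(source, "partition", ["x", "label"])
train_csv = parts["train"]
score_csv = parts["score"]
```

Because each output already contains only the wanted rows, no `filter()` step has to be evaluated at validation time; the trade-off, as noted in the thread, is having several registered datasets instead of one.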

Run ID: AutoML_27fd175b-7db7-4060-9417-c03857e7b64c:

ITERATION  PIPELINE                                 DURATION  METRIC  BEST
        5  MinMaxScaler RandomForest                0:00:50   0.8970  0.8970
        0  MaxAbsScaler LightGBM                    0:00:54   0.8341  0.8970
        6  StandardScalerWrapper RandomForest       0:00:57   0.8965  0.8970
        2  MinMaxScaler RandomForest                0:00:52   0.8952  0.8970
        3  RobustScaler ExtremeRandomTrees          0:01:36   0.8043  0.8970
        4  MinMaxScaler RandomForest                0:01:43   0.8894  0.8970
        1  MaxAbsScaler XGBoostClassifier           0:00:47   0.8895  0.8970
        8  MinMaxScaler ExtremeRandomTrees          0:00:52   0.8269  0.8970
        7  MinMaxScaler ExtremeRandomTrees          0:00:54   0.9097  0.9097
       10  RobustScaler ExtremeRandomTrees          0:00:45   0.8521  0.9097
        9  MinMaxScaler ExtremeRandomTrees          0:00:46   0.7140  0.9097
       11  StandardScalerWrapper RandomForest       0:00:51   0.8026  0.9097
       12  StandardScalerWrapper SGD                0:00:49   0.7199  0.9097
       13  RobustScaler RandomForest                0:00:53   0.7654  0.9097
       14  MinMaxScaler RandomForest                0:00:47   0.7792  0.9097
       16  MaxAbsScaler RandomForest                0:00:48   0.6700  0.9097
       17  StandardScalerWrapper XGBoostClassifier  0:00:47   0.9300  0.9300
       15  MinMaxScaler ExtremeRandomTrees          0:00:58   0.7040  0.9300
       18  MaxAbsScaler RandomForest                0:00:48   0.8932  0.9300
       19  MaxAbsScaler ExtremeRandomTrees          0:00:44   0.7864  0.9300
       20  MaxAbsScaler ExtremeRandomTrees          0:00:48   0.7721  0.9300
       21  StandardScalerWrapper XGBoostClassifier  0:01:08   0.8158  0.9300
       22  VotingEnsemble                           0:00:56   0.9315  0.9315
       23  StackEnsemble                            0:01:06   0.8713  0.9315

Although

best_run, fitted_model = automl_run.get_output()

did cause an ImportError:

/anaconda/envs/azureml_py36/lib/python3.6/site-packages/azureml/automl/runtime/short_grain_padding.py in
      9
     10 from pandas.core.dtypes.dtypes import CategoricalDtype
---> 11 from pandas.tseries.offsets import OutOfBoundsDatetime
     12
     13 from azureml._common._error_definition.azureml_error import AzureMLError

ImportError: cannot import name 'OutOfBoundsDatetime'

But that’s a different issue.

On Mar 17, 2021, at 6:09 PM, Cesar De la Torre @.***> wrote:

@fausttiger Hi, the team checked few things and it seems that when you have more than 10K rows and you also use filter that doesn't match any rows in first 10K rows then this error happens. We reproed it like this:

https://user-images.githubusercontent.com/1712635/111554172-09afb200-8743-11eb-8143-5d2ed7f336bf.png

We'll create a bug to fix the error message.

As workaround for you, looks there are two options:

- Shuffle the data so that the filter condition has some rows in the first 10K rows (quick profile of the data)
- Use the full profile for the validation AutoML does (You need to have generated a Dataset profile, first, from the AML UI Dataset page).

@fausttiger Can you confirm if this is aligned to your issue or if you think it's a different issue? Thanks! 👍

CESARDELATORRE commented 3 years ago

@fausttiger We confirmed the issue with the Azure ML dataprep team, and the observation I described above is correct:

https://ml.azure.com/experiments/id/9a7e6270-1b34-44a5-93ac-45b90d089c1e/runs/AutoML_XXXXXXXX-ca76-4d6d-b563-9766df84821b?wsid=/subscriptions/XXXXXX-9840-4719-a5a0-61d9585e1e91/resourceGroups/sasum_centraluseuap_rg/providers/Microsoft.MachineLearningServices/workspaces/sasum_centraluseuap_ws&flight=validationwarning&tid=XXXXXXXX-86f1-41af-91ab-2d7cd011db47

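Basically, the current workarounds until we have the fix in the AutoML service are the following:

- If you already profiled the dataset in advance, the profile is picked up when you submit a new run, and you shouldn't face this issue.
- If you are triggering a run from the UI and the dataset was not profiled yet, add the URL parameter flight=validationwarning (as in the URL above) so that the full profile of the data is considered.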

Thanks,

fausttiger007 commented 3 years ago

Cesar

1) I’m triggering runs from Python. for example, from Python I check to see if the dataset is already registered, and if not, I create it and register it from Python.

2) I’ll make sure I have a Profile and test.

PROFILE EXISTS FOR REGISTERED DATASET: Number of columns: 69, Number of rows: 1000 (of 192624)

One of 2 filtered fields does not change till about 50% into dataset (the Partition field = Train/Score )

SAME ERROR from automl_experiment.submit(automl_config)

3) Not sure what you mean by PROFILE DATASET IN ADVANCE?

I’m profiling from the Home/Datasets/datasetname ML Workspace.

4) Note I’m getting the registered/profiled tabular dataset via ws.datasets.get()

and then creating separate TRAIN_DS and TEST_DS datasets in separate statements with their own FILTER() criteria

so… source datasets is profiled, but TRAIN_DS and TEST_DS are not, but are derived from profiled dataset.

(I thought these 2 datasets are only created from source tabular in datastore when referenced, following methods attached … either local or remote ?)

# get source registered dataset
st130_dataset_ds = ws.datasets.get(st130_dataset)

# filter by partition field, and mod10 field (every 10th part)
train_ds = st130_dataset_ds.filter((st130_dataset_ds['partition'] == "train") & (st130_dataset_ds['WeightConsign_CountFrom_mod10']==True)).keep_columns(keep)
test_ds = st130_dataset_ds.filter((st130_dataset_ds['partition'] == "score") & (st130_dataset_ds['WeightConsign_CountFrom_mod10']==True)).keep_columns(keep)

automl_config = AutoMLConfig(name=experiment_name,
                         task='classification',
                         compute_target=training_cluster,
                         training_data=train_ds,
                         validation_data=test_ds,
                         label_column_name=goal,
                         iterations=2,
                         primary_metric='AUC_weighted',
                         #primary_metric = 'accuracy',
                         #primary_metric = 'average_precision_score_weighted',
                         #primary_metric = 'norm_macro_recall',
                         max_concurrent_iterations=4,
                         featurization='auto'
                         )

automl_experiment = Experiment(ws, experiment_name)
automl_run = automl_experiment.submit(automl_config)

ValidationException: ValidationException: Message: Failed to execute the requested operation: data/settings validation. Error details: Validation error(s): [{ "additional_properties": { "debugInfo": null }, "code": "UserError", "severity": 2, "message": "The input data is empty. Ensure data correctness and availability.",

5) BTW, the previously mentioned error from automl_run.get_output() was really a pandas error, as I had pandas 1.x loaded instead of 0.25.3. Corrected now.

from azureml._common._error_definition.azureml_error import AzureMLError
ImportError: cannot import name 'OutOfBoundsDatetime'

I also fixed a few azureml dependencies to make sure they all were compatible with 1.24.0
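A mismatch like the pandas one above can be caught before submitting a run by comparing installed versions against the expected pins. The sketch below is a simplified assumption-laden helper, not part of the azureml SDK: it only understands exact pins and a trailing `.*` wildcard, the `pins` and `installed` dictionaries are hypothetical sample values, and in a real environment the installed versions could be read with `importlib.metadata.version` (Python 3.8+).

```python
from typing import Dict, Optional, Tuple


def _matches(version: str, pinned: str) -> bool:
    """True if `version` satisfies an exact pin or a trailing '.*' wildcard."""
    if pinned.endswith(".*"):
        prefix = pinned[:-2].split(".")
        return version.split(".")[:len(prefix)] == prefix
    return version == pinned


def violated_pins(installed: Dict[str, str],
                  pins: Dict[str, str]) -> Dict[str, Tuple[Optional[str], str]]:
    """Return {package: (installed version or None, pin)} for every mismatch."""
    problems = {}
    for name, pinned in pins.items():
        have = installed.get(name)
        if have is None or not _matches(have, pinned):
            problems[name] = (have, pinned)
    return problems


# Hypothetical pins and environments, for illustration only.
pins = {"pandas": "0.25.*", "scikit-learn": "0.22.1"}
broken_env = {"pandas": "1.1.5", "scikit-learn": "0.22.1"}
fixed_env = {"pandas": "0.25.3", "scikit-learn": "0.22.1"}

broken_report = violated_pins(broken_env, pins)
fixed_report = violated_pins(fixed_env, pins)
```

An empty report means every pinned package matches; anything else lists the installed version next to the pin it violates.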

On Mar 18, 2021, at 12:19 PM, Cesar De la Torre @.***> wrote:

@fausttiger We confirmed the issue with the Azure ML dataprep team, and the observation I made above is correct:

If you already profiled the dataset in advance, the profile would be picked up when you submit a new run, and we shouldn't face this issue. (Please confirm/double-check this, since I believe you said you had already profiled the dataset. Was that done before the training run?)

If you are triggering a run from the UI and the dataset was not profiled yet, you can add the URL parameter flight=validationwarning, as in the URL below, so that a full profile of the data is considered:

https://ml.azure.com/experiments/id/9a7e6270-1b34-44a5-93ac-45b90d089c1e/runs/AutoML_97388960-ca76-4d6d-b563-9766df84821b?wsid=/subscriptions/381b38e9-9840-4719-a5a0-61d9585e1e91/resourceGroups/sasum_centraluseuap_rg/providers/Microsoft.MachineLearningServices/workspaces/sasum_centraluseuap_ws&**flight=validationwarning**&tid=72f988bf-86f1-41af-91ab-2d7cd011db47

In short, "Quick profile with filter() applied cannot be trusted for empty results", so we will create a bug for that and fix it on the validation service side.

@fausttiger Still, I'm not 100% sure this is exactly the issue you are experiencing if you confirm that you actually profiled the dataset in advance before triggering the training run that got the errors. Can you double-confirm this point?

Thanks,

fausttiger007 commented 3 years ago

Cesar

I’ll also note that if I change the filtered field to a field name that doesn’t exist in the dataset, no error is reported when the code is executed.

i.e.

train_ds = st130_dataset_ds.filter((st130_dataset_ds['FIELD_DOESNT_EXIST'] == "train") & (st130_dataset_ds['WeightConsign_CountFrom_mod10']==True)).keep_columns(keep)
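Until the SDK reports this itself, one defensive option is to check the column names before building the filter. The sketch below is plain Python; `require_columns` is a hypothetical helper, not part of the Azure ML SDK, and the list of available columns is assumed to have been obtained separately (for example from a small sample of the data).

```python
def require_columns(available, needed):
    """Raise KeyError if any name in `needed` is absent from `available`."""
    available = set(available)
    missing = [name for name in needed if name not in available]
    if missing:
        raise KeyError(f"columns not found in dataset: {missing}")

# Hypothetical column list; the real one would come from the dataset itself.
columns = ["partition", "WeightConsign_CountFrom_mod10"]

require_columns(columns, ["partition"])  # passes silently

try:
    require_columns(columns, ["FIELD_DOESNT_EXIST"])
except KeyError as err:
    caught = str(err)
    print(caught)
```

Calling the helper right before `filter()` turns a silently empty result into an explicit failure that names the misspelled column.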

CESARDELATORRE commented 3 years ago

@fausttiger What we're saying is that right now, if you use the SDK (notebook) with filter() and no matching rows are within the first 10K rows, you will get the error, because the dataset profile validation (even if you created the dataset profile in advance) is not used by AutoML by default. This is an issue/bug on our side that should be fixed pretty soon.
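The effect can be illustrated with a toy model. This is only a sketch of the behavior described above, not the validation service's actual code: `looks_empty`, the row layout, and the sample data are all assumptions made for illustration.

```python
from itertools import islice

def looks_empty(rows, predicate, sample_size=10_000):
    """Return True if no row among the first `sample_size` rows matches."""
    head = islice(rows, sample_size)
    return not any(predicate(row) for row in head)

# Hypothetical data: 20,000 rows sorted so that every "train" row comes last.
rows = [{"partition": "score"}] * 15_000 + [{"partition": "train"}] * 5_000
is_train = lambda row: row["partition"] == "train"

head_says_empty = looks_empty(rows, is_train)                  # inspects only the first 10,000 rows
full_scan_says_empty = looks_empty(rows, is_train, len(rows))  # inspects every row
print(head_says_empty, full_scan_says_empty)  # → True False
```

A check limited to the head of the data reports "empty" even though 5,000 rows match, which is the kind of false result the error message reflects.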

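Hence, the workarounds you currently have are any of the following:

OPTION A. Shuffle the data beforehand so filters can match values in the first 10K rows.
OPTION B. Create a new dataset out of the filtered data before providing it to the AutoMLConfig class.
OPTION C. Use the UI (NOT the SDK) with the HTTP URL parameter provided above.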

fausttiger007 commented 3 years ago

I've already used Option B successfully, pre-filtering in pandas and outputting the results back to a tabular dataset.

I’ll test option A

Tony Pines

On Mar 18, 2021, at 4:21 PM, Cesar De la Torre @.***> wrote:

@fausttiger What we're saying is that right now, if you use the SDK (notebook) with filter() and no matching rows are within the first 10K rows, you will get the error, because the dataset profile validation (even if you created the dataset profile in advance) is not used by AutoML by default. This is an issue/bug on our side that should be fixed pretty soon.

Hence, the workarounds you currently have are any of the following:

OPTION A. Shuffle the data beforehand so filters can match values in the first 10K rows.
OPTION B. Create a new dataset out of the filtered data before providing it to the AutoMLConfig class.
OPTION C. Use the UI (NOT the SDK) with the HTTP URL parameter provided above.
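Options A and B can be sketched on in-memory rows as follows. This is a minimal illustration in plain Python, assuming the data fits in memory; `shuffle_rows`, `prefilter_rows`, and the sample data are hypothetical, and the step of saving the result back as a dataset is left out.

```python
import random

def shuffle_rows(rows, seed=0):
    """Option A: return a shuffled copy so matching rows are spread throughout."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    return shuffled

def prefilter_rows(rows, predicate):
    """Option B: materialize only the matching rows as a new collection."""
    return [row for row in rows if predicate(row)]

# Hypothetical data, sorted so that all "train" rows come after the first 10,000.
rows = [{"partition": "score"}] * 15_000 + [{"partition": "train"}] * 5_000
is_train = lambda row: row["partition"] == "train"

shuffled = shuffle_rows(rows)
train_only = prefilter_rows(rows, is_train)

head_has_train_before = any(is_train(r) for r in rows[:10_000])
head_has_train_after = any(is_train(r) for r in shuffled[:10_000])
print(head_has_train_before, head_has_train_after, len(train_only))
```

With pandas, Option A is typically done with `DataFrame.sample(frac=1)`; in either case the shuffled or pre-filtered data would be stored as a new dataset before being passed to AutoMLConfig.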

fausttiger007 commented 3 years ago

Option A (shuffle the data beforehand so filters can match values in the first 10K rows) also worked to avoid the empty dataset error.

It did result in a new "single value for binary goal" error in a subsequent validation step, so there's some second-tier 10K-row or other grouping-related error when it goes to a further validation step. (I did make sure the goal had multiple values in the first 10K rows after Filter().)
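A check along those lines could look like the sketch below. It reads "the first 10K rows after Filter()" as the first 10,000 rows that pass the filter, which is an assumption; `distinct_labels_in_head`, the column names, and the sample rows are hypothetical.

```python
from itertools import islice

def distinct_labels_in_head(rows, predicate, label, sample_size=10_000):
    """Collect the distinct label values among the first `sample_size` filtered rows."""
    filtered = (row for row in rows if predicate(row))
    return {row[label] for row in islice(filtered, sample_size)}

# Hypothetical rows: the label column "goal" alternates between 0 and 1.
rows = [{"partition": "train", "goal": i % 2} for i in range(12_000)]
is_train = lambda row: row["partition"] == "train"

labels = distinct_labels_in_head(rows, is_train, "goal")
print(labels)  # → {0, 1}
```

A result with fewer than two values would be consistent with a "single value" complaint for a binary label; two values here, as reported above, suggests the later validation step samples or groups the data differently.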

But I'm not following the problem further from here until Filter() is in official preview or GA.

thanks.