automl / auto-sklearn

Automated Machine Learning with scikit-learn
https://automl.github.io/auto-sklearn
BSD 3-Clause "New" or "Revised" License

Getting only StatusType.TIMEOUT when running on Spark Dataframe that has been converted with toPandas() #1099

Closed: krzischp closed this issue 3 years ago

krzischp commented 3 years ago

Describe the bug

I need to extract a sample from a Spark DataFrame.
I tested two situations; the first one fails for an unknown reason.
Both situations use the same Spark and auto-sklearn configuration: memory_limit=20000, time_left_for_this_task=400 seconds, per_run_time_limit=100 seconds, the search restricted to random forest and gradient boosting, and the default n_jobs; the Spark driver has 30G of memory, etc. In both cases I ran the script with spark-submit.

To Reproduce

Situation 1 (failing): sample the Spark DataFrame, convert it with toPandas(), and fit auto-sklearn directly on the result.
Situation 2 (succeeding): write the same converted dataframe to a CSV file, read it back with pandas.read_csv, and fit on that (see the workaround described in the comments below).
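A minimal sketch of the failing situation (the `spark_df` sample and the "label" target column are assumptions for illustration; the succeeding situation differs only in the CSV round-trip shown further down):

```python
import autosklearn.classification

# Situation 1: convert the sampled Spark DataFrame to pandas and fit
# auto-sklearn directly on the result.
pdf = spark_df.toPandas()
X, y = pdf.drop(columns=["label"]), pdf["label"]

automl = autosklearn.classification.AutoSklearnClassifier(
    memory_limit=20000,           # MB, as in the configuration above
    time_left_for_this_task=400,  # seconds
    per_run_time_limit=100,       # seconds
)
automl.fit(X, y)  # every run ends in StatusType.TIMEOUT
```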

Expected behavior

The same StatusType.SUCCESS results as in the second, succeeding situation.

Actual behavior, stacktrace or logfile

For the first (failing) situation, I got these logs:

RunValue(cost=1.0, time=100.08194756507874, status=<StatusType.TIMEOUT: 2>, starttime=1615854288.0671616, endtime=1615854389.1863475, additional_info={'error': 'Timeout', 'configuration_origin': 'Initial design'})

@@@@

Configuration:

  balancing:strategy, Value: 'none'

  classifier:__choice__, Value: 'random_forest'

  classifier:random_forest:bootstrap, Value: 'True'

  classifier:random_forest:criterion, Value: 'gini'

  classifier:random_forest:max_depth, Constant: 'None'

  classifier:random_forest:max_features, Value: 0.5

  classifier:random_forest:max_leaf_nodes, Constant: 'None'

  classifier:random_forest:min_impurity_decrease, Constant: 0.0

  classifier:random_forest:min_samples_leaf, Value: 1

  classifier:random_forest:min_samples_split, Value: 2

  classifier:random_forest:min_weight_fraction_leaf, Constant: 0.0

  data_preprocessing:categorical_transformer:categorical_encoding:__choice__, Value: 'one_hot_encoding'

  data_preprocessing:categorical_transformer:category_coalescence:__choice__, Value: 'minority_coalescer'

  data_preprocessing:categorical_transformer:category_coalescence:minority_coalescer:minimum_fraction, Value: 0.01

  data_preprocessing:numerical_transformer:imputation:strategy, Value: 'mean'

  data_preprocessing:numerical_transformer:rescaling:__choice__, Value: 'standardize'

  feature_preprocessor:__choice__, Value: 'no_preprocessing'

****************************************

RunValue(cost=1.0, time=100.11368203163147, status=<StatusType.TIMEOUT: 2>, starttime=1615854390.0143688, endtime=1615854491.1564965, additional_info={'error': 'Timeout', 'configuration_origin': 'Initial design'})

@@@@

Configuration:

  balancing:strategy, Value: 'weighting'

  classifier:__choice__, Value: 'random_forest'

  classifier:random_forest:bootstrap, Value: 'True'

  classifier:random_forest:criterion, Value: 'entropy'

  classifier:random_forest:max_depth, Constant: 'None'

  classifier:random_forest:max_features, Value: 0.6792349232781753

  classifier:random_forest:max_leaf_nodes, Constant: 'None'

  classifier:random_forest:min_impurity_decrease, Constant: 0.0

  classifier:random_forest:min_samples_leaf, Value: 5

  classifier:random_forest:min_samples_split, Value: 12

  classifier:random_forest:min_weight_fraction_leaf, Constant: 0.0

  data_preprocessing:categorical_transformer:categorical_encoding:__choice__, Value: 'one_hot_encoding'

  data_preprocessing:categorical_transformer:category_coalescence:__choice__, Value: 'minority_coalescer'

  data_preprocessing:categorical_transformer:category_coalescence:minority_coalescer:minimum_fraction, Value: 0.007478351211361768

  data_preprocessing:numerical_transformer:imputation:strategy, Value: 'median'

  data_preprocessing:numerical_transformer:rescaling:__choice__, Value: 'standardize'

  feature_preprocessor:__choice__, Value: 'feature_agglomeration'

  feature_preprocessor:feature_agglomeration:affinity, Value: 'euclidean'

  feature_preprocessor:feature_agglomeration:linkage, Value: 'complete'

  feature_preprocessor:feature_agglomeration:n_clusters, Value: 321

  feature_preprocessor:feature_agglomeration:pooling_func, Value: 'mean'

****************************************

RunValue(cost=1.0, time=100.02384996414185, status=<StatusType.TIMEOUT: 2>, starttime=1615854491.847664, endtime=1615854592.898519, additional_info={'error': 'Timeout', 'configuration_origin': 'Initial design'})

@@@@

Configuration:

  balancing:strategy, Value: 'none'

  classifier:__choice__, Value: 'random_forest'

  classifier:random_forest:bootstrap, Value: 'False'

  classifier:random_forest:criterion, Value: 'entropy'

  classifier:random_forest:max_depth, Constant: 'None'

  classifier:random_forest:max_features, Value: 0.20635736497355783

  classifier:random_forest:max_leaf_nodes, Constant: 'None'

  classifier:random_forest:min_impurity_decrease, Constant: 0.0

  classifier:random_forest:min_samples_leaf, Value: 3

  classifier:random_forest:min_samples_split, Value: 16

  classifier:random_forest:min_weight_fraction_leaf, Constant: 0.0

  data_preprocessing:categorical_transformer:categorical_encoding:__choice__, Value: 'one_hot_encoding'

  data_preprocessing:categorical_transformer:category_coalescence:__choice__, Value: 'no_coalescense'

  data_preprocessing:numerical_transformer:imputation:strategy, Value: 'most_frequent'

  data_preprocessing:numerical_transformer:rescaling:__choice__, Value: 'quantile_transformer'

  data_preprocessing:numerical_transformer:rescaling:quantile_transformer:n_quantiles, Value: 1300

  data_preprocessing:numerical_transformer:rescaling:quantile_transformer:output_distribution, Value: 'normal'

  feature_preprocessor:__choice__, Value: 'feature_agglomeration'

  feature_preprocessor:feature_agglomeration:affinity, Value: 'cosine'

  feature_preprocessor:feature_agglomeration:linkage, Value: 'average'

  feature_preprocessor:feature_agglomeration:n_clusters, Value: 386

  feature_preprocessor:feature_agglomeration:pooling_func, Value: 'median'

****************************************

RunValue(cost=1.0, time=50.05536484718323, status=<StatusType.TIMEOUT: 2>, starttime=1615854593.5711422, endtime=1615854644.6575012, additional_info={'error': 'Timeout', 'configuration_origin': 'Initial design'})

@@@@

Configuration:

  balancing:strategy, Value: 'none'

  classifier:__choice__, Value: 'random_forest'

  classifier:random_forest:bootstrap, Value: 'True'

  classifier:random_forest:criterion, Value: 'entropy'

  classifier:random_forest:max_depth, Constant: 'None'

  classifier:random_forest:max_features, Value: 0.912689259437897

  classifier:random_forest:max_leaf_nodes, Constant: 'None'

  classifier:random_forest:min_impurity_decrease, Constant: 0.0

  classifier:random_forest:min_samples_leaf, Value: 12

  classifier:random_forest:min_samples_split, Value: 11

  classifier:random_forest:min_weight_fraction_leaf, Constant: 0.0

  data_preprocessing:categorical_transformer:categorical_encoding:__choice__, Value: 'no_encoding'

  data_preprocessing:categorical_transformer:category_coalescence:__choice__, Value: 'minority_coalescer'

  data_preprocessing:categorical_transformer:category_coalescence:minority_coalescer:minimum_fraction, Value: 0.2533271508321726

  data_preprocessing:numerical_transformer:imputation:strategy, Value: 'median'

  data_preprocessing:numerical_transformer:rescaling:__choice__, Value: 'quantile_transformer'

  data_preprocessing:numerical_transformer:rescaling:quantile_transformer:n_quantiles, Value: 275

  data_preprocessing:numerical_transformer:rescaling:quantile_transformer:output_distribution, Value: 'normal'

  feature_preprocessor:__choice__, Value: 'no_preprocessing'

****************************************

RunValue(cost=2147483647.0, time=0.0, status=<StatusType.RUNNING: 9>, starttime=0.0, endtime=0.0, additional_info=None)

@@@@

Configuration:

  balancing:strategy, Value: 'none'

  classifier:__choice__, Value: 'random_forest'

  classifier:random_forest:bootstrap, Value: 'True'

  classifier:random_forest:criterion, Value: 'entropy'

  classifier:random_forest:max_depth, Constant: 'None'

  classifier:random_forest:max_features, Value: 0.4617335248365182

  classifier:random_forest:max_leaf_nodes, Constant: 'None'

  classifier:random_forest:min_impurity_decrease, Constant: 0.0

  classifier:random_forest:min_samples_leaf, Value: 1

  classifier:random_forest:min_samples_split, Value: 6

  classifier:random_forest:min_weight_fraction_leaf, Constant: 0.0

  data_preprocessing:categorical_transformer:categorical_encoding:__choice__, Value: 'one_hot_encoding'

  data_preprocessing:categorical_transformer:category_coalescence:__choice__, Value: 'minority_coalescer'

  data_preprocessing:categorical_transformer:category_coalescence:minority_coalescer:minimum_fraction, Value: 0.021330600464414196

  data_preprocessing:numerical_transformer:imputation:strategy, Value: 'mean'

  data_preprocessing:numerical_transformer:rescaling:__choice__, Value: 'quantile_transformer'

  data_preprocessing:numerical_transformer:rescaling:quantile_transformer:n_quantiles, Value: 925

  data_preprocessing:numerical_transformer:rescaling:quantile_transformer:output_distribution, Value: 'normal'

  feature_preprocessor:__choice__, Value: 'fast_ica'

  feature_preprocessor:fast_ica:algorithm, Value: 'parallel'

  feature_preprocessor:fast_ica:fun, Value: 'logcosh'

  feature_preprocessor:fast_ica:whiten, Value: 'False'

Environment and installation:

mfeurer commented 3 years ago

Thanks for reporting this issue @krzischp. I'm not familiar with Spark, so I'm afraid I can only guess here. Because the second option works, I suspect there is some weird interaction between Auto-sklearn and the way Spark converts the data into a pandas dataframe. How long does it take the random forest to fit in the working case?

krzischp commented 3 years ago

Hi @mfeurer thanks for the answer!

The memory consumption is pretty horrible when using Spark's toPandas() function; the memory usage is a lot larger than when using pandas directly.

PyArrow could resolve that issue, but I'm on Spark 2.4.0, and enabling it would cause a pandas incompatibility with the more recent libraries I'm using.

So the simplest solution I found is to write the toPandas() dataframe to a CSV file and read it back with pandas.read_csv.
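A sketch of that workaround (the file path is an assumption for illustration):

```python
import pandas as pd

# Assumed: `pdf` is the dataframe returned by spark_df.toPandas().
# Round-tripping it through a CSV file yields a dataframe with a much
# smaller memory footprint, and the auto-sklearn runs then succeed.
pdf.to_csv("/tmp/sample.csv", index=False)
pdf = pd.read_csv("/tmp/sample.csv")
```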

Concerning the auto-sklearn logs, I just didn't understand why they weren't showing out-of-memory exceptions after hours of execution. They only show timeouts, probably because the executing processes are spending all that time waiting for memory to be freed.

mfeurer commented 3 years ago

They only show timeouts, probably because the executing processes are spending all that time waiting for memory to be freed.

Are you running in a sequential fashion? If yes, Auto-sklearn by default uses fork to copy data to the subprocess, which might lead to performance degradation. You could try setting n_jobs=2 or passing a dask client with a single worker and see if this resolves your problem.
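A minimal sketch of both suggestions, assuming a recent auto-sklearn with the n_jobs and dask_client parameters (the cluster settings here are illustrative):

```python
from dask.distributed import Client, LocalCluster
import autosklearn.classification

# Option 1: n_jobs=2 makes auto-sklearn spawn worker processes instead of
# forking the (large) parent process for every evaluation.
automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=400,
    per_run_time_limit=100,
    memory_limit=20000,
    n_jobs=2,
)

# Option 2: pass an explicit dask client backed by a single worker.
cluster = LocalCluster(n_workers=1, processes=True, threads_per_worker=1)
automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=400,
    per_run_time_limit=100,
    memory_limit=20000,
    dask_client=Client(cluster),
)
```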

github-actions[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs for the next 7 days. Thank you for your contributions.

github-actions[bot] commented 3 years ago

This issue has been automatically closed due to inactivity.