automl / auto-sklearn

Automated Machine Learning with scikit-learn
https://automl.github.io/auto-sklearn
BSD 3-Clause "New" or "Revised" License

Getting only StatusType.TIMEOUT when running on Spark Dataframe that has been converted with toPandas() #1099

Closed: krzischp closed this issue 3 years ago

krzischp commented 3 years ago

Describe the bug

I need to extract a sample from a Spark DataFrame.
I tested two situations; the first one fails for an unknown reason.
Both situations use the same Spark and auto-sklearn configuration: memory_limit=20000, time_left_for_this_task=400 seconds, per_run_time_limit=100 seconds, the search restricted to random forest and gradient boosting, and the default n_jobs; the Spark driver has 30G of memory, etc. In both cases I ran the script with spark-submit.

To Reproduce

Situation 1 (failing): sample the Spark DataFrame, convert it with toPandas(), and fit auto-sklearn directly on the result.
Situation 2 (succeeding): write the same converted dataframe to a CSV file, read it back with pandas.read_csv, and fit on that (see the workaround described in the comments below).
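A minimal sketch of the failing situation (the `spark_df` sample and the "label" target column are assumptions for illustration; the succeeding situation differs only in the CSV round-trip shown further down):

```python
import autosklearn.classification

# Situation 1: convert the sampled Spark DataFrame to pandas and fit
# auto-sklearn directly on the result.
pdf = spark_df.toPandas()
X, y = pdf.drop(columns=["label"]), pdf["label"]

automl = autosklearn.classification.AutoSklearnClassifier(
    memory_limit=20000,           # MB, as in the configuration above
    time_left_for_this_task=400,  # seconds
    per_run_time_limit=100,       # seconds
)
automl.fit(X, y)  # every run ends in StatusType.TIMEOUT
```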

Expected behavior

The same StatusType.SUCCESS results as in the second, succeeding situation.

Actual behavior, stacktrace or logfile

For the first (failing) situation, I got these logs:

RunValue(cost=1.0, time=100.08194756507874, status=<StatusType.TIMEOUT: 2>, starttime=1615854288.0671616, endtime=1615854389.1863475, additional_info={'error': 'Timeout', 'configuration_origin': 'Initial design'})

@@@@

Configuration:

  balancing:strategy, Value: 'none'

  classifier:__choice__, Value: 'random_forest'

  classifier:random_forest:bootstrap, Value: 'True'

  classifier:random_forest:criterion, Value: 'gini'

  classifier:random_forest:max_depth, Constant: 'None'

  classifier:random_forest:max_features, Value: 0.5

  classifier:random_forest:max_leaf_nodes, Constant: 'None'

  classifier:random_forest:min_impurity_decrease, Constant: 0.0

  classifier:random_forest:min_samples_leaf, Value: 1

  classifier:random_forest:min_samples_split, Value: 2

  classifier:random_forest:min_weight_fraction_leaf, Constant: 0.0

  data_preprocessing:categorical_transformer:categorical_encoding:__choice__, Value: 'one_hot_encoding'

  data_preprocessing:categorical_transformer:category_coalescence:__choice__, Value: 'minority_coalescer'

  data_preprocessing:categorical_transformer:category_coalescence:minority_coalescer:minimum_fraction, Value: 0.01

  data_preprocessing:numerical_transformer:imputation:strategy, Value: 'mean'

  data_preprocessing:numerical_transformer:rescaling:__choice__, Value: 'standardize'

  feature_preprocessor:__choice__, Value: 'no_preprocessing'

****************************************

RunValue(cost=1.0, time=100.11368203163147, status=<StatusType.TIMEOUT: 2>, starttime=1615854390.0143688, endtime=1615854491.1564965, additional_info={'error': 'Timeout', 'configuration_origin': 'Initial design'})

@@@@

Configuration:

  balancing:strategy, Value: 'weighting'

  classifier:__choice__, Value: 'random_forest'

  classifier:random_forest:bootstrap, Value: 'True'

  classifier:random_forest:criterion, Value: 'entropy'

  classifier:random_forest:max_depth, Constant: 'None'

  classifier:random_forest:max_features, Value: 0.6792349232781753

  classifier:random_forest:max_leaf_nodes, Constant: 'None'

  classifier:random_forest:min_impurity_decrease, Constant: 0.0

  classifier:random_forest:min_samples_leaf, Value: 5

  classifier:random_forest:min_samples_split, Value: 12

  classifier:random_forest:min_weight_fraction_leaf, Constant: 0.0

  data_preprocessing:categorical_transformer:categorical_encoding:__choice__, Value: 'one_hot_encoding'

  data_preprocessing:categorical_transformer:category_coalescence:__choice__, Value: 'minority_coalescer'

  data_preprocessing:categorical_transformer:category_coalescence:minority_coalescer:minimum_fraction, Value: 0.007478351211361768

  data_preprocessing:numerical_transformer:imputation:strategy, Value: 'median'

  data_preprocessing:numerical_transformer:rescaling:__choice__, Value: 'standardize'

  feature_preprocessor:__choice__, Value: 'feature_agglomeration'

  feature_preprocessor:feature_agglomeration:affinity, Value: 'euclidean'

  feature_preprocessor:feature_agglomeration:linkage, Value: 'complete'

  feature_preprocessor:feature_agglomeration:n_clusters, Value: 321

  feature_preprocessor:feature_agglomeration:pooling_func, Value: 'mean'

****************************************

RunValue(cost=1.0, time=100.02384996414185, status=<StatusType.TIMEOUT: 2>, starttime=1615854491.847664, endtime=1615854592.898519, additional_info={'error': 'Timeout', 'configuration_origin': 'Initial design'})

@@@@

Configuration:

  balancing:strategy, Value: 'none'

  classifier:__choice__, Value: 'random_forest'

  classifier:random_forest:bootstrap, Value: 'False'

  classifier:random_forest:criterion, Value: 'entropy'

  classifier:random_forest:max_depth, Constant: 'None'

  classifier:random_forest:max_features, Value: 0.20635736497355783

  classifier:random_forest:max_leaf_nodes, Constant: 'None'

  classifier:random_forest:min_impurity_decrease, Constant: 0.0

  classifier:random_forest:min_samples_leaf, Value: 3

  classifier:random_forest:min_samples_split, Value: 16

  classifier:random_forest:min_weight_fraction_leaf, Constant: 0.0

  data_preprocessing:categorical_transformer:categorical_encoding:__choice__, Value: 'one_hot_encoding'

  data_preprocessing:categorical_transformer:category_coalescence:__choice__, Value: 'no_coalescense'

  data_preprocessing:numerical_transformer:imputation:strategy, Value: 'most_frequent'

  data_preprocessing:numerical_transformer:rescaling:__choice__, Value: 'quantile_transformer'

  data_preprocessing:numerical_transformer:rescaling:quantile_transformer:n_quantiles, Value: 1300

  data_preprocessing:numerical_transformer:rescaling:quantile_transformer:output_distribution, Value: 'normal'

  feature_preprocessor:__choice__, Value: 'feature_agglomeration'

  feature_preprocessor:feature_agglomeration:affinity, Value: 'cosine'

  feature_preprocessor:feature_agglomeration:linkage, Value: 'average'

  feature_preprocessor:feature_agglomeration:n_clusters, Value: 386

  feature_preprocessor:feature_agglomeration:pooling_func, Value: 'median'

****************************************

RunValue(cost=1.0, time=50.05536484718323, status=<StatusType.TIMEOUT: 2>, starttime=1615854593.5711422, endtime=1615854644.6575012, additional_info={'error': 'Timeout', 'configuration_origin': 'Initial design'})

@@@@

Configuration:

  balancing:strategy, Value: 'none'

  classifier:__choice__, Value: 'random_forest'

  classifier:random_forest:bootstrap, Value: 'True'

  classifier:random_forest:criterion, Value: 'entropy'

  classifier:random_forest:max_depth, Constant: 'None'

  classifier:random_forest:max_features, Value: 0.912689259437897

  classifier:random_forest:max_leaf_nodes, Constant: 'None'

  classifier:random_forest:min_impurity_decrease, Constant: 0.0

  classifier:random_forest:min_samples_leaf, Value: 12

  classifier:random_forest:min_samples_split, Value: 11

  classifier:random_forest:min_weight_fraction_leaf, Constant: 0.0

  data_preprocessing:categorical_transformer:categorical_encoding:__choice__, Value: 'no_encoding'

  data_preprocessing:categorical_transformer:category_coalescence:__choice__, Value: 'minority_coalescer'

  data_preprocessing:categorical_transformer:category_coalescence:minority_coalescer:minimum_fraction, Value: 0.2533271508321726

  data_preprocessing:numerical_transformer:imputation:strategy, Value: 'median'

  data_preprocessing:numerical_transformer:rescaling:__choice__, Value: 'quantile_transformer'

  data_preprocessing:numerical_transformer:rescaling:quantile_transformer:n_quantiles, Value: 275

  data_preprocessing:numerical_transformer:rescaling:quantile_transformer:output_distribution, Value: 'normal'

  feature_preprocessor:__choice__, Value: 'no_preprocessing'

****************************************

RunValue(cost=2147483647.0, time=0.0, status=<StatusType.RUNNING: 9>, starttime=0.0, endtime=0.0, additional_info=None)

@@@@

Configuration:

  balancing:strategy, Value: 'none'

  classifier:__choice__, Value: 'random_forest'

  classifier:random_forest:bootstrap, Value: 'True'

  classifier:random_forest:criterion, Value: 'entropy'

  classifier:random_forest:max_depth, Constant: 'None'

  classifier:random_forest:max_features, Value: 0.4617335248365182

  classifier:random_forest:max_leaf_nodes, Constant: 'None'

  classifier:random_forest:min_impurity_decrease, Constant: 0.0

  classifier:random_forest:min_samples_leaf, Value: 1

  classifier:random_forest:min_samples_split, Value: 6

  classifier:random_forest:min_weight_fraction_leaf, Constant: 0.0

  data_preprocessing:categorical_transformer:categorical_encoding:__choice__, Value: 'one_hot_encoding'

  data_preprocessing:categorical_transformer:category_coalescence:__choice__, Value: 'minority_coalescer'

  data_preprocessing:categorical_transformer:category_coalescence:minority_coalescer:minimum_fraction, Value: 0.021330600464414196

  data_preprocessing:numerical_transformer:imputation:strategy, Value: 'mean'

  data_preprocessing:numerical_transformer:rescaling:__choice__, Value: 'quantile_transformer'

  data_preprocessing:numerical_transformer:rescaling:quantile_transformer:n_quantiles, Value: 925

  data_preprocessing:numerical_transformer:rescaling:quantile_transformer:output_distribution, Value: 'normal'

  feature_preprocessor:__choice__, Value: 'fast_ica'

  feature_preprocessor:fast_ica:algorithm, Value: 'parallel'

  feature_preprocessor:fast_ica:fun, Value: 'logcosh'

  feature_preprocessor:fast_ica:whiten, Value: 'False'

Environment and installation:

mfeurer commented 3 years ago

Thanks for reporting this issue @krzischp. I'm not familiar with Spark, so I'm afraid I can only guess here. Because the second option works, I suspect there is some weird interaction between Auto-sklearn and the way Spark converts the data into a pandas dataframe. How long does it take the random forest to fit in the working case?

krzischp commented 3 years ago

Hi @mfeurer thanks for the answer!

The memory consumption is pretty horrible when using Spark's toPandas() function; the memory usage is a lot larger than when using pandas directly.

PyArrow could resolve that issue, but I'm on Spark 2.4.0, and enabling it would cause a pandas incompatibility with the more recent libraries I'm using.

So the simplest solution I found is to write the toPandas() dataframe to a CSV file and read it back with pandas.read_csv.
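A sketch of that workaround (the file path is an assumption for illustration):

```python
import pandas as pd

# Assumed: `pdf` is the dataframe returned by spark_df.toPandas().
# Round-tripping it through a CSV file yields a dataframe with a much
# smaller memory footprint, and the auto-sklearn runs then succeed.
pdf.to_csv("/tmp/sample.csv", index=False)
pdf = pd.read_csv("/tmp/sample.csv")
```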

Concerning the auto-sklearn logs, I just didn't understand why they weren't showing out-of-memory exceptions after hours of execution. They only show timeouts, probably because the executing processes are spending all that time waiting for memory to be freed.

mfeurer commented 3 years ago

They only show timeouts, probably because the executing processes are spending all that time waiting for memory to be freed.

Are you running in a sequential fashion? If yes, Auto-sklearn by default uses fork to copy data to the subprocess, which might lead to performance degradation. You could try setting n_jobs=2 or passing a dask client with a single worker and see if this resolves your problem.
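A minimal sketch of both suggestions, assuming a recent auto-sklearn with the n_jobs and dask_client parameters (the cluster settings here are illustrative):

```python
from dask.distributed import Client, LocalCluster
import autosklearn.classification

# Option 1: n_jobs=2 makes auto-sklearn spawn worker processes instead of
# forking the (large) parent process for every evaluation.
automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=400,
    per_run_time_limit=100,
    memory_limit=20000,
    n_jobs=2,
)

# Option 2: pass an explicit dask client backed by a single worker.
cluster = LocalCluster(n_workers=1, processes=True, threads_per_worker=1)
automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=400,
    per_run_time_limit=100,
    memory_limit=20000,
    dask_client=Client(cluster),
)
```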

github-actions[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs for the next 7 days. Thank you for your contributions.

github-actions[bot] commented 3 years ago

This issue has been automatically closed due to inactivity.