automl / auto-sklearn

Automated Machine Learning with scikit-learn
https://automl.github.io/auto-sklearn
BSD 3-Clause "New" or "Revised" License

Error running "fit" with many cores. #1236

Open sofidenner opened 2 years ago

sofidenner commented 2 years ago

Hi! I'm experiencing a problem when I fit an AutoSklearn instance in a virtual machine with many cores.

I have run exactly the same code, with the same dataset in three different virtual machines:

- in a vm with 4 cores and 15Gb of RAM: works ok ✅
- in a vm with 8 cores and 30Gb of RAM: works ok ✅
- in a vm with 40 cores and 157 Gb of RAM: fails ❌ with the following error:

ValueError: Dummy prediction failed with run state StatusType.CRASHED and additional output: {'error': 'Result queue is empty', 'exit_status': "<class 'pynisher.limit_function_call.AnythingException'>", 'subprocess_stdout': '', 'subprocess_stderr': 'Process pynisher function call:\nTraceback (most recent call last):\n File "/usr/local/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap\n self.run()\n File "/usr/local/lib/python3.7/multiprocessing/process.py", line 99, in run\n self._target(*self._args, **self._kwargs)\n File "/usr/local/lib/python3.7/site-packages/pynisher/limit_function_call.py", line 133, in subprocess_func\n return_value = ((func(*args, **kwargs), 0))\n File "/usr/local/lib/python3.7/site-packages/autosklearn/evaluation/__init__.py", line 40, in fit_predict_try_except_decorator\n return ta(queue=queue, **kwargs)\n File "/usr/local/lib/python3.7/site-packages/autosklearn/evaluation/train_evaluator.py", line 1164, in eval_holdout\n budget_type=budget_type,\n File "/usr/local/lib/python3.7/site-packages/autosklearn/evaluation/train_evaluator.py", line 194, in __init__\n budget_type=budget_type,\n File "/usr/local/lib/python3.7/site-packages/autosklearn/evaluation/abstract_evaluator.py", line 199, in __init__\n threadpool_limits(limits=1)\n File "/usr/local/lib/python3.7/site-packages/threadpoolctl.py", line 171, in __init__\n self._original_info = self._set_threadpool_limits()\n File "/usr/local/lib/python3.7/site-packages/threadpoolctl.py", line 280, in _set_threadpool_limits\n module.set_num_threads(num_threads)\n File "/usr/local/lib/python3.7/site-packages/threadpoolctl.py", line 659, in set_num_threads\n return set_func(num_threads)\nKeyboardInterrupt\n', 'exitcode': 1, 'configuration_origin': 'DUMMY'}.

This is the code I was running:

automl = AutoSklearnClassifier(time_left_for_this_task=600, metric=roc_auc)
automl.fit(x_train, y_train, x_validation, y_validation)

Limiting the number of cores with the param nproc seems to work, but it's a pity that we cannot take advantage of larger infra :(

The dataset doesn't seem to be the problem. I reproduced the bug with datasets of different sizes and different feature types, and every time it raises the same error (it's deterministic, not something that happens stochastically).

Also, the error is almost instantaneous: clearly it doesn't even start to fit when it fails.

Environment and installation:

sofidenner commented 2 years ago

The workaround I found to fix this issue is to limit the number of cores with the env var OPENBLAS_NUM_THREADS before importing anything from autosklearn.

For example:

import os

os.environ["OPENBLAS_NUM_THREADS"] = "8"

from autosklearn(...)
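
A slightly more defensive variant of the same idea, as a sketch only. Besides OPENBLAS_NUM_THREADS it also caps OMP_NUM_THREADS and MKL_NUM_THREADS, on the assumption that a different BLAS or OpenMP backend might be the one in use; only OPENBLAS_NUM_THREADS is confirmed to help in this issue. The helper name limit_native_threads and the value 8 are illustrative.

```python
import os

# Environment variables read by common native thread pools.
# Only OPENBLAS_NUM_THREADS is confirmed in this issue; the other two are
# an assumption, in case a different backend is in use.
THREAD_ENV_VARS = ("OPENBLAS_NUM_THREADS", "OMP_NUM_THREADS", "MKL_NUM_THREADS")


def limit_native_threads(n_threads: int) -> dict:
    """Set the thread-count variables and return what was set."""
    applied = {}
    for name in THREAD_ENV_VARS:
        os.environ[name] = str(n_threads)
        applied[name] = os.environ[name]
    return applied


# Call this before the first import of numpy, scikit-learn or autosklearn.
limit_native_threads(8)

# Only now import auto-sklearn, for example:
# from autosklearn.estimators import AutoSklearnClassifier
```

The variables have to be set before the first import, so this belongs at the very top of the script.
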
ricoms commented 2 years ago

Hello. I'm having a similar issue, and that solution does not work for me.

I'm running auto-sklearn = "0.14.0" on a 16-core MacBook (not M1).

eddiebergman commented 2 years ago

Hi @sofidenner,

We don't have the infrastructure (a machine with that many cores) to test this properly, which makes it difficult to debug. We just want to say here that we are aware of the issue, and we're sorry that we have no answer yet.

felidsche commented 2 years ago

Hi @sofidenner and others, can you make sure that the resources that you are providing for fit() are actually available? I managed to work around this error by freeing up resources on my machine. However, this could also be a coincidence because this error also occurs occasionally for me.
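
A quick way to see what the process can actually use, as a rough sketch with the standard library only. The helper names are made up for illustration, and note that the memory figure is total physical memory, not free memory:

```python
import os


def available_cpus() -> int:
    """CPUs this process may actually use (can be fewer than the machine has)."""
    if hasattr(os, "sched_getaffinity"):  # available on Linux
        return len(os.sched_getaffinity(0))
    return os.cpu_count() or 1


def total_memory_mb():
    """Total physical memory in MB, or None if the platform does not expose it."""
    try:
        pages = os.sysconf("SC_PHYS_PAGES")
        page_size = os.sysconf("SC_PAGE_SIZE")
    except (ValueError, OSError, AttributeError):
        return None
    return pages * page_size // (1024 * 1024)


print("usable CPUs:", available_cpus())
print("total memory (MB):", total_memory_mb())
```

In a container or a job scheduler, the usable CPU count can be lower than what the VM reports, so it is worth comparing the two.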

sofidenner commented 2 years ago

Hi @felidsche, in my case the resources are not the problem: having more than 150GB of RAM free, and running an experiment with an incredibly small dataset results in the same error, every time I run it.

This is the snippet with the incredibly small dataset that I just use to try it out:

import pandas
from autosklearn.estimators import AutoSklearnClassifier

train_x = pandas.DataFrame(
    {"column1": [1, 2, 3, 10, 20, 30]}
)

train_y = [True, True, True, False, False, False]

validation_x = pandas.DataFrame(
    {"column1": [10, 20, 30, 1, 2, 3]}
)
validation_y = [False, False, False, True, True, True]

automl = AutoSklearnClassifier()
automl.fit(train_x, train_y, validation_x, validation_y)

And this is the complete Traceback:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/tmp/ipykernel_1006/2928451066.py in <module>
     14 
     15 automl = AutoSklearnClassifier()
---> 16 automl.fit(train_x, train_y, validation_x, validation_y)

/usr/local/lib/python3.7/site-packages/autosklearn/estimators.py in fit(self, X, y, X_test, y_test, feat_type, dataset_name)
    945             y_test=y_test,
    946             feat_type=feat_type,
--> 947             dataset_name=dataset_name,
    948         )
    949 

/usr/local/lib/python3.7/site-packages/autosklearn/estimators.py in fit(self, **kwargs)
    338         if self.automl_ is None:
    339             self.automl_ = self.build_automl()
--> 340         self.automl_.fit(load_models=self.load_models, **kwargs)
    341 
    342         return self

/usr/local/lib/python3.7/site-packages/autosklearn/automl.py in fit(self, X, y, X_test, y_test, feat_type, dataset_name, only_return_configuration_space, load_models)
   1662             only_return_configuration_space=only_return_configuration_space,
   1663             load_models=load_models,
-> 1664             is_classification=True,
   1665         )
   1666 

/usr/local/lib/python3.7/site-packages/autosklearn/automl.py in fit(self, X, y, task, X_test, y_test, feat_type, dataset_name, only_return_configuration_space, load_models, is_classification)
    640         # == Perform dummy predictions
    641         # Dummy prediction always have num_run set to 1
--> 642         self.num_run += self._do_dummy_prediction(datamanager, num_run=1)
    643 
    644         # == RUN ensemble builder

/usr/local/lib/python3.7/site-packages/autosklearn/automl.py in _do_dummy_prediction(self, datamanager, num_run)
    422                 raise ValueError(
    423                     "Dummy prediction failed with run state %s and additional output: %s."
--> 424                     % (str(status), str(additional_info))
    425                 )
    426         return num_run

ValueError: Dummy prediction failed with run state StatusType.CRASHED and additional output: {'error': 'Result queue is empty', 'exit_status': "<class 'pynisher.limit_function_call.AnythingException'>", 'subprocess_stdout': '', 'subprocess_stderr': 'Process pynisher function call:\nTraceback (most recent call last):\n  File "/usr/local/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap\n    self.run()\n  File "/usr/local/lib/python3.7/multiprocessing/process.py", line 99, in run\n    self._target(*self._args, **self._kwargs)\n  File "/usr/local/lib/python3.7/site-packages/pynisher/limit_function_call.py", line 133, in subprocess_func\n    return_value = ((func(*args, **kwargs), 0))\n  File "/usr/local/lib/python3.7/site-packages/autosklearn/evaluation/__init__.py", line 40, in fit_predict_try_except_decorator\n    return ta(queue=queue, **kwargs)\n  File "/usr/local/lib/python3.7/site-packages/autosklearn/evaluation/train_evaluator.py", line 1164, in eval_holdout\n    budget_type=budget_type,\n  File "/usr/local/lib/python3.7/site-packages/autosklearn/evaluation/train_evaluator.py", line 194, in __init__\n    budget_type=budget_type,\n  File "/usr/local/lib/python3.7/site-packages/autosklearn/evaluation/abstract_evaluator.py", line 199, in __init__\n    threadpool_limits(limits=1)\n  File "/usr/local/lib/python3.7/site-packages/threadpoolctl.py", line 354, in __init__\n    super().__init__(ThreadpoolController(), limits=limits, user_api=user_api)\n  File "/usr/local/lib/python3.7/site-packages/threadpoolctl.py", line 159, in __init__\n    self._set_threadpool_limits()\n  File "/usr/local/lib/python3.7/site-packages/threadpoolctl.py", line 285, in _set_threadpool_limits\n    lib_controller.set_num_threads(num_threads)\n  File "/usr/local/lib/python3.7/site-packages/threadpoolctl.py", line 809, in set_num_threads\n    return set_func(num_threads)\nKeyboardInterrupt\n', 'exitcode': 1, 'configuration_origin': 'DUMMY'}.

raphaelTrench commented 2 years ago

I have been getting this error as well on macOS Monterey 12.0 and auto-sklearn==0.13.0, and I have not updated any libraries in my environment before this error started showing up. It happens when calling fit regardless of parameters:


File "/Users/c91195a/Library/Caches/pypoetry/virtualenvs/dragon-oQHUJD0o-py3.8/lib/python3.8/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/Users/c91195a/Library/Caches/pypoetry/virtualenvs/dragon-oQHUJD0o-py3.8/lib/python3.8/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/Users/c91195a/Library/Caches/pypoetry/virtualenvs/dragon-oQHUJD0o-py3.8/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/c91195a/Library/Caches/pypoetry/virtualenvs/dragon-oQHUJD0o-py3.8/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/c91195a/Library/Caches/pypoetry/virtualenvs/dragon-oQHUJD0o-py3.8/lib/python3.8/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/Users/c91195a/Library/Caches/pypoetry/virtualenvs/dragon-oQHUJD0o-py3.8/lib/python3.8/site-packages/typer/main.py", line 497, in wrapper
    return callback(**use_params)  # type: ignore
  File "/Users/c91195a/Documents/experian/dragon/dragon/console.py", line 504, in train_console
    train(
  File "/Users/c91195a/Library/Caches/pypoetry/virtualenvs/dragon-oQHUJD0o-py3.8/lib/python3.8/site-packages/gin/config.py", line 1069, in gin_wrapper
    utils.augment_exception_message_and_reraise(e, err_str)
  File "/Users/c91195a/Library/Caches/pypoetry/virtualenvs/dragon-oQHUJD0o-py3.8/lib/python3.8/site-packages/gin/utils.py", line 41, in augment_exception_message_and_reraise
    raise proxy.with_traceback(exception.__traceback__) from None
  File "/Users/c91195a/Library/Caches/pypoetry/virtualenvs/dragon-oQHUJD0o-py3.8/lib/python3.8/site-packages/gin/config.py", line 1046, in gin_wrapper
    return fn(*new_args, **new_kwargs)
  File "/Users/c91195a/Documents/experian/dragon/dragon/train.py", line 436, in train
    experiment.run()
  File "/Users/c91195a/Documents/experian/dragon/dragon/experiment/experiment.py", line 180, in run
    self.__fit()
  File "/Users/c91195a/Documents/experian/dragon/dragon/experiment/experiment.py", line 52, in __fit
    self.ml_estimator.fit(
  File "/Users/c91195a/Library/Caches/pypoetry/virtualenvs/dragon-oQHUJD0o-py3.8/lib/python3.8/site-packages/sklearn/pipeline.py", line 346, in fit
    self._final_estimator.fit(Xt, y, **fit_params_last_step)
  File "/Users/c91195a/Library/Caches/pypoetry/virtualenvs/dragon-oQHUJD0o-py3.8/lib/python3.8/site-packages/autosklearn/experimental/askl2.py", line 425, in fit
    return super().fit(
  File "/Users/c91195a/Library/Caches/pypoetry/virtualenvs/dragon-oQHUJD0o-py3.8/lib/python3.8/site-packages/autosklearn/estimators.py", line 941, in fit
    super().fit(
  File "/Users/c91195a/Library/Caches/pypoetry/virtualenvs/dragon-oQHUJD0o-py3.8/lib/python3.8/site-packages/autosklearn/estimators.py", line 340, in fit
    self.automl_.fit(load_models=self.load_models, **kwargs)
  File "/Users/c91195a/Library/Caches/pypoetry/virtualenvs/dragon-oQHUJD0o-py3.8/lib/python3.8/site-packages/autosklearn/automl.py", line 1655, in fit
    return super().fit(
  File "/Users/c91195a/Library/Caches/pypoetry/virtualenvs/dragon-oQHUJD0o-py3.8/lib/python3.8/site-packages/autosklearn/automl.py", line 642, in fit
    self.num_run += self._do_dummy_prediction(datamanager, num_run=1)
  File "/Users/c91195a/Library/Caches/pypoetry/virtualenvs/dragon-oQHUJD0o-py3.8/lib/python3.8/site-packages/autosklearn/automl.py", line 422, in _do_dummy_prediction
    raise ValueError(
ValueError: Dummy prediction failed with run state StatusType.CRASHED and additional output: {'error': 'Result queue is empty', 'exit_status': "<class 'pynisher.limit_function_call.AnythingException'>", 'subprocess_stdout': '', 'subprocess_stderr': 'Process pynisher function call:\nTraceback (most recent call last):\n  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap\n    self.run()\n  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/process.py", line 108, in run\n    self._target(*self._args, **self._kwargs)\n  File "/Users/c91195a/Library/Caches/pypoetry/virtualenvs/dragon-oQHUJD0o-py3.8/lib/python3.8/site-packages/pynisher/limit_function_call.py", line 108, in subprocess_func\n    resource.setrlimit(resource.RLIMIT_AS, (mem_in_b, mem_in_b))\nValueError: current limit exceeds maximum limit\n', 'exitcode': 1, 'configuration_origin': 'DUMMY'}.
  In call to configurable 'train' (<function train at 0x7f9cbcc5f8b0>)

mfeurer commented 2 years ago

Hey @raphaelTrench you are getting a different error message that is not due to the number of cores. Instead, the memory limit you provide is above what can be passed to MacOS. You may try with a lower memory limit to see whether this issue goes away, but we don't test OSX and therefore cannot make any guarantees about the behavior of Auto-sklearn under OSX.
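
To try a lower limit, the estimator constructor accepts a memory_limit argument in MB. A minimal sketch; the helper name build_classifier and the value 2048 are illustrative, not recommended settings:

```python
def build_classifier(memory_limit_mb: int = 2048, n_jobs: int = 1):
    """Create an AutoSklearnClassifier with an explicit memory limit."""
    # Imported lazily so this snippet can be loaded without auto-sklearn installed.
    from autosklearn.classification import AutoSklearnClassifier

    return AutoSklearnClassifier(
        time_left_for_this_task=600,   # seconds, as in the original report
        memory_limit=memory_limit_mb,  # in MB
        n_jobs=n_jobs,
    )


# Usage (requires auto-sklearn):
# automl = build_classifier(memory_limit_mb=2048)
# automl.fit(x_train, y_train)
```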

raphaelTrench commented 2 years ago

Hey @mfeurer, I understand. I tried with a lower memory configuration and the error still didn't go away. However, in case anyone else faces this issue: the error stopped happening when I downgraded my macOS version to below 12.0 (Monterey).

erinaldi commented 2 years ago

I get the same error (https://github.com/automl/auto-sklearn/issues/360#issuecomment-963293965) on macOS Monterey with an M1 Pro. The installation was successful and importing the package works. It seems to be related to this. I understand auto-sklearn is not tested on macOS, but I thought I'd report this known issue anyway in case someone finds a solution (one which does not require downgrading the OS).

eddiebergman commented 2 years ago

Hey @erinaldi,

We recently started reworking pynisher which is in charge of limiting resources for spawned processes. This error line is directly from pynisher and is in the comment you linked: resource.setrlimit(resource.RLIMIT_AS, (mem_in_b, mem_in_b))\nValueError: current limit exceeds maximum limit.

We hope to make another push on getting it to work tomorrow, but we still need a solution for Windows before we can make a release.

If you'd like more context or have any solutions: we can use the built-in Python module resource for limiting memory on Unix-based systems, but there is no Windows equivalent; it's a Unix-only module. We need to find a substitute and then set up some local testing for it (we have no Windows machines). There are also other discrepancies between the three core operating systems.

The error above seems to happen regardless of the memory you provide for RLIMIT_XXX, and we think that RLIMIT_AS only works on Linux, or at least doesn't work on newer macOS systems.
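
For anyone who wants to probe this on their own machine, here is a rough sketch using only the standard-library resource module (Unix only). It is not pynisher's implementation: pynisher sets both the soft and the hard limit to mem_in_b, whereas this hypothetical helper only changes the soft limit and reports failure instead of raising.

```python
import resource


def try_set_address_space_limit(mem_in_b: int) -> bool:
    """Try to cap this process's address space; return False instead of raising."""
    soft, hard = resource.getrlimit(resource.RLIMIT_AS)
    # Never ask for more than the existing hard limit allows.
    if hard != resource.RLIM_INFINITY:
        mem_in_b = min(mem_in_b, hard)
    try:
        # Only the soft limit is changed; the hard limit is passed back unchanged.
        resource.setrlimit(resource.RLIMIT_AS, (mem_in_b, hard))
    except (ValueError, OSError):
        # The kind of failure reported above ends up here.
        return False
    return True


# Usage: run this in a throwaway process, since the limit applies to the caller.
# print(try_set_address_space_limit(4 * 1024**3))  # 4 GiB
```

If this returns False for every value you try, that matches what we are seeing on macOS.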

If we can't get a Windows version working soon, we will push the Mac-fixed version as soon as we can, and hopefully it will solve the issue for you :)

Best, Eddie

erinaldi commented 2 years ago

Thanks for the quick reply @eddiebergman

I would be happy to test it on macOS Monterey when you have a working PR. I did try different limits with no success like you said.

Right now this is not impacting my work since I am able to use a many-core Linux system, but I’ll check this thread for any future updates.

eddiebergman commented 2 years ago

Hi @erinaldi,

I've updated the current status of this issue in Pynisher if you're interested: automl/pynisher#16

jtlz2 commented 1 year ago

So is this issue now resolved? :\

JonathanLehner commented 1 year ago

Is it?

eyalElb commented 1 year ago

it still happens to me...

grzesir commented 11 months ago

still getting this issue on 0.15.0

full output:

ValueError                                Traceback (most recent call last)
Cell In[16], line 54
     27 y_train = rolling_train['close_shift_15']
     29 #########
     30 ### Autogluon
     31 #########
   (...)
     52 ### AutoSklearn
     53 #########
---> 54 rf = cls.fit(X_train, y_train)
     56 # rf = RandomForestRegressor().fit(X, y) # Original code
     57
     58 # Predict on test_vals
     59 X_test = test_vals.drop('close_shift_15', axis = 1)

File ~/opt/anaconda3/lib/python3.8/site-packages/autosklearn/estimators.py:1587, in AutoSklearnRegressor.fit(self, X, y, X_test, y_test, feat_type, dataset_name)
   1576 raise ValueError(
   1577 "Regression with data of type {} is "
   1578 "not supported. Supported types are {}. "
   (...)
   1582 "".format(target_type, supported_types)
   1583 )
   1585 # Fit is supposed to be idempotent!
   1586 # But not if we use share_mode.
...
    488 self._logger.error(msg)
--> 489 raise ValueError(msg)
    491 return

ValueError: (' Dummy prediction failed with run state StatusType.CRASHED and additional output: {\'error\': \'Result queue is empty\', \'exit_status\': "", \'subprocess_stdout\': \'\', \'subprocess_stderr\': \'Process pynisher function call:\nTraceback (most recent call last):\n File "/Users/robertgrzesik/opt/anaconda3/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap\n self.run()\n File "/Users/robertgrzesik/opt/anaconda3/lib/python3.8/multiprocessing/process.py", line 108, in run\n self._target(*self._args, **self._kwargs)\n File "/Users/robertgrzesik/opt/anaconda3/lib/python3.8/site-packages/pynisher/limit_function_call.py", line 108, in subprocess_func\n resource.setrlimit(resource.RLIMIT_AS, (mem_in_b, mem_in_b))\nValueError: current limit exceeds maximum limit\n\', \'exitcode\': 1, \'configuration_origin\': \'DUMMY\'}.',)

Parvez-Khan-1 commented 11 months ago

The workaround I found to fix this issue is to limit the number of cores with the env var OPENBLAS_NUM_THREADS before importing anything from autosklearn.

For example:

import os

os.environ["OPENBLAS_NUM_THREADS"] = "8"

from autosklearn(...)

@sofidenner It works fine from a Python file (.py file), but when I try to execute it through a Jupyter notebook it's still throwing the same error.

I did verify that the environment variable is set properly: I can print it using os.environ['OPENBLAS_NUM_THREADS'].
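
One possible explanation (an assumption, not something confirmed in this thread) is that the notebook kernel has already imported numpy before the cell runs, so the native library was initialised before the variable was set. A small check using only the standard library; the helper name set_openblas_threads is made up for illustration:

```python
import os
import sys


def set_openblas_threads(n_threads: int) -> bool:
    """Set OPENBLAS_NUM_THREADS; return False if numpy was already imported."""
    already_loaded = "numpy" in sys.modules
    os.environ["OPENBLAS_NUM_THREADS"] = str(n_threads)
    # If numpy is already loaded, the variable may be ignored until the
    # kernel is restarted (assumption: the library reads it once at load time).
    return not already_loaded


if not set_openblas_threads(8):
    print("numpy was imported before the variable was set; try restarting the kernel")
```

If the message is printed, restarting the kernel and running this as the very first cell would be the next thing to try.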