When you create a Client with no arguments, a LocalCluster is created, which uses multiprocessing to communicate between "workers" (separate processes on your local machine). This will make copies of your dataset for each worker, which I suspect is blowing up your memory budget.
You could try
from distributed import Client, LocalCluster
cluster = LocalCluster(n_workers=1)
client = Client(cluster)
You won't get any parallelism, but this will confirm whether it's a memory issue or not.
Or if you provide a reproducible example I can take a deeper look.
I recommend Client(processes=False) to avoid the extra copy between client->scheduler->worker
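A minimal sketch of that suggestion (assuming the same distributed API shown above): processes=False keeps the scheduler and workers as threads inside the client process, so the dataset is not serialized between processes.
from distributed import Client

# Threads-only local setup: scheduler and workers live in this process,
# so there is no client -> scheduler -> worker copy of the training data.
client = Client(processes=False)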
@TomAugspurger and @mrocklin thanks for your suggestions. I've tried both, and unfortunately neither is working for me. The frustrating part of all this is that the exceptions are being thrown from scikit-learn and not dask-searchcv.
I believe both of these might be to blame: scikit-learn/scikit-learn#2755 scikit-learn/scikit-learn#7346
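For context, a hypothetical minimal reproduction of the suspected race (not from this issue; it assumes a scikit-learn version such as 0.19.0 where get_params temporarily modifies the global warnings.filters list, as in the tracebacks below):
from concurrent.futures import ThreadPoolExecutor
from sklearn.ensemble import ExtraTreesRegressor

est = ExtraTreesRegressor()

def hammer(_):
    # get_params pushes and pops a warnings filter; concurrent calls can pop
    # each other's entries and eventually raise "IndexError: pop from empty list".
    for _ in range(1000):
        est.get_params(deep=True)

with ThreadPoolExecutor(max_workers=8) as pool:
    list(pool.map(hammer, range(8)))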
Or if you provide a reproducible example I can take a deeper look.
I'd like to do this, but I've got quite a bit of proprietary customer data/code that I need to trim out before I can share this.
I've slightly updated my example in the OP to have a teeny bit more context which should hopefully help.
I've trimmed out the irrelevant bits of the pipeline; it's basically down to that example, and it still fails.
I'm sort of hoping that someone out there can confirm or deny that these suggestions/configurations work on "something else". Do test cases for pipelines going through GridSearchCV exist in dask-searchcv? If not, perhaps I can help contribute some?
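For reference, a hypothetical pytest-style sketch of such a test (names and structure are illustrative, not taken from the dask-searchcv test suite):
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from dask_searchcv import GridSearchCV

def test_grid_search_over_pipeline():
    # Tiny synthetic regression problem run through a two-step pipeline.
    rng = np.random.RandomState(0)
    X = rng.randn(50, 4)
    y = X.sum(axis=1)
    pipe = Pipeline([('scale', StandardScaler()), ('reg', Ridge())])
    search = GridSearchCV(pipe, {'reg__alpha': [0.1, 1.0]}, cv=3)
    search.fit(X, y)
    assert hasattr(search, 'best_params_')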
You could try...You won't get any parallelism, but this will confirm whether it's a memory issue or not.
I didn't reach the OOM error, but am getting a similar error (slightly different stack trace):
File "MY_VIRTUALENV_PATH/lib/python3.6/site-packages/sklearn/ensemble/forest.py", line 315, in fit
random_state=random_state)
File "MY_VIRTUALENV_PATH/lib/python3.6/site-packages/sklearn/ensemble/base.py", line 125, in _make_estimator
estimator = clone(self.base_estimator_)
File "MY_VIRTUALENV_PATH/lib/python3.6/site-packages/sklearn/base.py", line 63, in clone
params_set = new_object.get_params(deep=False)
File "MY_VIRTUALENV_PATH/lib/python3.6/site-packages/sklearn/base.py", line 240, in get_params
warnings.filters.pop(0)
IndexError: pop from empty list
I recommend Client(processes=False) to avoid the extra copy between client->scheduler->worker
I tried that but got the same stack trace as the non-distributed use case.
File "MY_VIRTUALENV_PATH/lib/python3.6/site-packages/sklearn/ensemble/forest.py", line 315, in fit
random_state=random_state)
File "MY_VIRTUALENV_PATH/lib/python3.6/site-packages/sklearn/ensemble/base.py", line 130, in _make_estimator
_set_random_states(estimator, random_state)
File "MY_VIRTUALENV_PATH/lib/python3.6/site-packages/sklearn/ensemble/base.py", line 57, in _set_random_states
estimator.set_params(**to_set)
File "MY_VIRTUALENV_PATH/lib/python3.6/site-packages/sklearn/base.py", line 264, in set_params
valid_params = self.get_params(deep=True)
File "MY_VIRTUALENV_PATH/lib/python3.6/site-packages/sklearn/base.py", line 240, in get_params
warnings.filters.pop(0)
IndexError: pop from empty list
FWIW, I'm running an Ubuntu Guest on a Windows 10 machine.
$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 14.04 LTS
Release: 14.04
Codename: trusty
Went ahead and distilled my code sample as much as I could and it's fairly simple now:
import multiprocessing

import sklearn
import sklearn.pipeline
import sklearn.preprocessing
from sklearn.ensemble import ExtraTreesRegressor

from dask.diagnostics import ProgressBar
from dask_searchcv import GridSearchCV

# train_x, train_y are loaded from the (proprietary) dataset described above.

pipe = sklearn.pipeline.Pipeline(
    [
        ('impute', sklearn.preprocessing.Imputer(strategy='median')),
        ('base_features', sklearn.preprocessing.StandardScaler()),
        ('base_regressor', ExtraTreesRegressor(random_state=42, n_jobs=1)),
    ],
)

param_grid = [
    {
        "base_regressor__min_samples_leaf": [3],
        "base_regressor__min_samples_split": [6, 8],
        "base_regressor__max_features": [0.25, 0.5, 0.75, 1.0],
        "base_regressor__n_estimators": [250],
    }
]

cpu_count = multiprocessing.cpu_count()

# Use dask in distributed mode
from distributed import Client, LocalCluster
# c = Client()
# c = Client(processes=False)
cluster = LocalCluster(n_workers=1)
c = Client(cluster)

best_model = GridSearchCV(
    pipe,
    param_grid=param_grid,
    cv=10,
    n_jobs=cpu_count,
)

with ProgressBar():
    best_model.fit(train_x, train_y)
This is the stacktrace:
File "/home/vagrant/.pyenv/versions/miniconda3-4.1.11/envs/gambit/lib/python3.6/site-packages/dask_searchcv/model_selection.py", line 793, in fit
out = scheduler(dsk, keys, num_workers=n_jobs)
File "/home/vagrant/.pyenv/versions/miniconda3-4.1.11/envs/gambit/lib/python3.6/site-packages/distributed/client.py", line 1923, in get
results = self.gather(packed, asynchronous=asynchronous)
File "/home/vagrant/.pyenv/versions/miniconda3-4.1.11/envs/gambit/lib/python3.6/site-packages/distributed/client.py", line 1368, in gather
asynchronous=asynchronous)
File "/home/vagrant/.pyenv/versions/miniconda3-4.1.11/envs/gambit/lib/python3.6/site-packages/distributed/client.py", line 540, in sync
return sync(self.loop, func, *args, **kwargs)
File "/home/vagrant/.pyenv/versions/miniconda3-4.1.11/envs/gambit/lib/python3.6/site-packages/distributed/utils.py", line 239, in sync
six.reraise(*error[0])
File "/home/vagrant/.pyenv/versions/miniconda3-4.1.11/envs/gambit/lib/python3.6/site-packages/six.py", line 686, in reraise
raise value
File "/home/vagrant/.pyenv/versions/miniconda3-4.1.11/envs/gambit/lib/python3.6/site-packages/distributed/utils.py", line 227, in f
result[0] = yield make_coro()
File "/home/vagrant/.pyenv/versions/miniconda3-4.1.11/envs/gambit/lib/python3.6/site-packages/tornado/gen.py", line 1055, in run
value = future.result()
File "/home/vagrant/.pyenv/versions/miniconda3-4.1.11/envs/gambit/lib/python3.6/site-packages/tornado/concurrent.py", line 238, in result
raise_exc_info(self._exc_info)
File "<string>", line 4, in raise_exc_info
File "/home/vagrant/.pyenv/versions/miniconda3-4.1.11/envs/gambit/lib/python3.6/site-packages/tornado/gen.py", line 1063, in run
yielded = self.gen.throw(*exc_info)
File "/home/vagrant/.pyenv/versions/miniconda3-4.1.11/envs/gambit/lib/python3.6/site-packages/distributed/client.py", line 1246, in _gather
traceback)
File "/home/vagrant/.pyenv/versions/miniconda3-4.1.11/envs/gambit/lib/python3.6/site-packages/six.py", line 685, in reraise
raise value.with_traceback(tb)
File "/home/vagrant/.pyenv/versions/miniconda3-4.1.11/envs/gambit/lib/python3.6/site-packages/dask_searchcv/methods.py", line 216, in fit
est.fit(X, y, **fit_params)
File "/home/vagrant/.pyenv/versions/miniconda3-4.1.11/envs/gambit/lib/python3.6/site-packages/sklearn/ensemble/forest.py", line 315, in fit
random_state=random_state)
File "/home/vagrant/.pyenv/versions/miniconda3-4.1.11/envs/gambit/lib/python3.6/site-packages/sklearn/ensemble/base.py", line 130, in _make_estimator
_set_random_states(estimator, random_state)
File "/home/vagrant/.pyenv/versions/miniconda3-4.1.11/envs/gambit/lib/python3.6/site-packages/sklearn/ensemble/base.py", line 52, in _set_random_states
for key in sorted(estimator.get_params(deep=True)):
File "/home/vagrant/.pyenv/versions/miniconda3-4.1.11/envs/gambit/lib/python3.6/site-packages/sklearn/base.py", line 240, in get_params
warnings.filters.pop(0)
IndexError: pop from empty list
I'm trying to fix the problem on the scikit-learn side: https://github.com/scikit-learn/scikit-learn/issues/7346#issuecomment-327902191
But still wondering if there's a valid workaround by making dask-searchcv's GridSearchCV do whatever scikit-learn's GridSearchCV is doing to copy/preserve state.
Thanks for the example, you are indeed hitting the linked issue.
But still wondering if there's a valid workaround by making dask-searchcv's GridSearchCV do whatever scikit-learn's GridSearchCV is doing to copy/preserve state.
I don't think so. dask-searchcv isn't the one calling the unsafe get_params; that's coming from the estimator itself.
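To illustrate that point, a hypothetical sketch (not from this issue) that triggers the same call chain without dask-searchcv at all: the forest's own fit clones its base estimator, and that clone/_set_random_states path is what calls the non-thread-safe get_params, so fitting from several threads on scikit-learn 0.19.0 can hit the same IndexError.
from concurrent.futures import ThreadPoolExecutor
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor

X, y = make_regression(n_samples=100, n_features=5, random_state=0)

def fit_one(seed):
    # fit -> _make_estimator -> clone / _set_random_states -> get_params,
    # the same path shown in the tracebacks above.
    ExtraTreesRegressor(n_estimators=50, random_state=seed, n_jobs=1).fit(X, y)

with ThreadPoolExecutor(max_workers=8) as pool:
    list(pool.map(fit_one, range(8)))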
This is fixed in scikit-learn master now.
Confirmed that the following example no longer crashes. Thanks @amueller :)
import multiprocessing

import sklearn
import sklearn.pipeline
import sklearn.preprocessing
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.datasets import make_regression

from dask_searchcv import GridSearchCV

train_x, train_y = make_regression()

if __name__ == '__main__':
    pipe = sklearn.pipeline.Pipeline(
        [
            ('impute', sklearn.preprocessing.Imputer(strategy='median')),
            ('base_features', sklearn.preprocessing.StandardScaler()),
            ('base_regressor', ExtraTreesRegressor(random_state=42, n_jobs=1)),
        ],
    )

    param_grid = [
        {
            "base_regressor__min_samples_leaf": [3],
            "base_regressor__min_samples_split": [6, 8],
            "base_regressor__max_features": [0.25, 0.5, 0.75, 1.0],
            "base_regressor__n_estimators": [250],
        }
    ]

    cpu_count = multiprocessing.cpu_count()

    # Use dask in distributed mode
    from distributed import Client, LocalCluster
    # c = Client()
    # c = Client(processes=False)
    cluster = LocalCluster(n_workers=1)
    c = Client(cluster)

    best_model = GridSearchCV(
        pipe,
        param_grid=param_grid,
        cv=10,
        n_jobs=cpu_count,
    )

    best_model.fit(train_x, train_y)
I can't reliably get my model pipeline that runs under scikit-learn to work using dask-searchcv. Is there an expectation of how much memory overhead dask will require compared to scikit-learn? My test dataset is fairly small (< 10MB TSV) and my VM has something on the order of 8GB available to it.
I'm attempting to use dask-searchcv's GridSearchCV to build a model pipeline without redundantly executing a few expensive pipeline components. scikit-learn 0.19 has made strides in this area, but I'm looking to the future where I'll want to further parallelize my workload across more machines and/or handle larger data sets and more complicated pipelines. My dataset is fairly small (< 10MB TSV), so I'm trying to distribute the execution and avoid redundant computation, but I don't necessarily need distributed access to a dataframe at this time.
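For illustration, a hypothetical sketch (not the actual pipeline from this issue) of the kind of sharing dask-searchcv is meant to provide: only the final step's parameters vary, so the fitted earlier steps can be reused across candidates rather than recomputed.
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from dask_searchcv import GridSearchCV

X, y = make_regression(n_samples=200, n_features=20, random_state=0)

# The early steps (scaling, PCA) are identical for every candidate, so
# dask-searchcv's task graph can share them instead of refitting per alpha.
pipe = Pipeline([('scale', StandardScaler()),
                 ('pca', PCA(n_components=5)),
                 ('reg', Ridge())])

search = GridSearchCV(pipe, {'reg__alpha': [0.01, 0.1, 1.0, 10.0]}, cv=3)
search.fit(X, y)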
I modified my code to use dask-searchcv's GridSearchCV in "local distributed" mode as follows. I skimmed the documentation for distributed and it seems to support the idea that no additional setup is required. Since it didn't immediately die on me, I'm assuming my config/setup is correct.
With the default threaded_get, I've noticed that failures are more likely to appear if more than one core is available. I've also seen failures with distributed.scheduler.KilledWorker.
Example dask-searchcv failure in default mode
I believe this is related to https://github.com/scikit-learn/scikit-learn/issues/2755
Example failure in "local distributed" mode
I'm pretty sure this was the victim of the OOM killer.
My environment