dask / dask-xgboost

BSD 3-Clause "New" or "Revised" License

XGBoost works with local cluster, but fails with "no-workers" when using distributed. #43

Closed damienrj closed 5 years ago

damienrj commented 5 years ago

The Dask cluster is set up with:

- `dask-scheduler` running on a machine named `$MACHINE-dask-scheduler`
- `dask-worker` launched in a container, passed `--container-arg="$MACHINE-dask-scheduler:8786"` to point it at the scheduler
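For reference, a minimal un-containerized version of that launch sequence (assuming direct shell access to each machine; the hostname matches the setup above) would be:

```shell
# On the scheduler machine (named $MACHINE-dask-scheduler):
dask-scheduler

# On each worker machine, pointing at the scheduler's address:
dask-worker tcp://$MACHINE-dask-scheduler:8786
```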
from dask.distributed import Client
from dask_ml.linear_model import LogisticRegression
from dask_ml.xgboost import XGBClassifier
from dask_ml.datasets import make_classification
import dask.dataframe as dd
import dask_xgboost
from dask_ml.model_selection import train_test_split

client = Client('127.0.0.1:8786')

X, y = make_classification(n_samples=20000, n_features=20,
                           chunks=10000, n_informative=4,
                           random_state=0)

And this works fine with

from dask_ml.linear_model import LogisticRegression
lr = LogisticRegression()
model = lr.fit(X, y)

But this doesn't work with

import dask_xgboost
params = {'objective': 'binary:logistic',
          'max_depth': 4, 'eta': 0.01, 'subsample': 0.5, 
          'min_child_weight': 0.5}

bst = dask_xgboost.train(client, params, X, y, num_boost_round=10)

Or

from dask_ml.xgboost import XGBClassifier
est = XGBClassifier()
est.fit(X, y)

Any idea of where to find more logs or what is going on?

Task: train_part...

Status | no-worker
Priority | (0, 54, 0)
Worker Restrictions | set(['tcp://10.0.0.60:42869'])
Suspicious | 1

Worker Logs

distributed.worker - INFO - Start worker at: tcp://10.0.0.59:39471
distributed.worker - INFO - Listening to: tcp://10.0.0.59:39471
distributed.worker - INFO - bokeh at: 10.0.0.59:35569
distributed.worker - INFO - nanny at: 10.0.0.59:33889
distributed.worker - INFO - Waiting to connect to: tcp://damien-dask-scheduler:8786
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Threads: 8
distributed.worker - INFO - Memory: 16.82 GB
distributed.worker - INFO - Local Directory: /current/worker-TSgjNV
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Registered to: tcp://damien-dask-scheduler:8786
distributed.worker - INFO - -------------------------------------------------

Scheduler

distributed.scheduler - INFO - Clear task state
distributed.scheduler - INFO - Scheduler at: tcp://10.0.0.54:8786
distributed.scheduler - INFO - bokeh at: :8787
distributed.scheduler - INFO - Local Directory: /tmp/scheduler-zH1301
distributed.scheduler - INFO - -----------------------------------------------
distributed.scheduler - INFO - Register tcp://10.0.0.57:41765
distributed.scheduler - INFO - Starting worker compute stream, tcp://10.0.0.57:41765
distributed.scheduler - INFO - Register tcp://10.0.0.58:41237
distributed.scheduler - INFO - Starting worker compute stream, tcp://10.0.0.58:41237
distributed.scheduler - INFO - Receive client connection: Client-cea67f61-9127-11e9-8756-8c8590bca016
distributed.scheduler - INFO - Remove client Client-cea67f61-9127-11e9-8756-8c8590bca016
distributed.scheduler - INFO - Remove client Client-cea67f61-9127-11e9-8756-8c8590bca016
distributed.scheduler - INFO - Close client connection: Client-cea67f61-9127-11e9-8756-8c8590bca016
distributed.scheduler - INFO - Receive client connection: Client-d7cc7a23-9127-11e9-9779-8c8590bca016
distributed.scheduler - INFO - Remove worker tcp://10.0.0.57:41765
distributed.scheduler - INFO - Remove worker tcp://10.0.0.58:41237
distributed.scheduler - INFO - Lost all workers
distributed.scheduler - INFO - Register tcp://10.0.0.58:44763
distributed.scheduler - INFO - Starting worker compute stream, tcp://10.0.0.58:44763
distributed.scheduler - INFO - Register tcp://10.0.0.57:45983
distributed.scheduler - INFO - Starting worker compute stream, tcp://10.0.0.57:45983
distributed.scheduler - INFO - Send lost future signal to clients
distributed.scheduler - INFO - Remove worker tcp://10.0.0.57:45983
distributed.scheduler - INFO - Remove worker tcp://10.0.0.58:44763
distributed.scheduler - INFO - Lost all workers
distributed.scheduler - INFO - Clear task state
distributed.scheduler - INFO - Register tcp://10.0.0.57:46453
distributed.scheduler - INFO - Starting worker compute stream, tcp://10.0.0.57:46453
distributed.scheduler - INFO - Register tcp://10.0.0.58:44711
distributed.scheduler - INFO - Starting worker compute stream, tcp://10.0.0.58:44711
distributed.scheduler - INFO - Clear task state
distributed.scheduler - INFO - Receive client connection: Client-cdb9a29e-9129-11e9-9779-8c8590bca016
distributed.scheduler - INFO - Remove worker tcp://10.0.0.58:44711
distributed.scheduler - INFO - Remove worker tcp://10.0.0.57:46453
distributed.scheduler - INFO - Lost all workers
distributed.scheduler - INFO - Register tcp://10.0.0.58:41009
distributed.scheduler - INFO - Starting worker compute stream, tcp://10.0.0.58:41009
distributed.scheduler - INFO - Register tcp://10.0.0.57:43319
distributed.scheduler - INFO - Starting worker compute stream, tcp://10.0.0.57:43319
distributed.scheduler - INFO - Receive client connection: Client-683c9678-912a-11e9-8023-42010a000036
distributed.scheduler - INFO - Remove worker tcp://10.0.0.58:41009
distributed.scheduler - INFO - Remove worker tcp://10.0.0.57:43319
distributed.scheduler - INFO - Lost all workers
distributed.scheduler - INFO - Register tcp://10.0.0.58:41003
distributed.scheduler - INFO - Starting worker compute stream, tcp://10.0.0.58:41003
distributed.scheduler - INFO - Register tcp://10.0.0.57:45287
distributed.scheduler - INFO - Starting worker compute stream, tcp://10.0.0.57:45287
distributed.scheduler - INFO - Remove worker tcp://10.0.0.57:45287
distributed.scheduler - INFO - Remove worker tcp://10.0.0.58:41003
distributed.scheduler - INFO - Lost all workers
distributed.scheduler - INFO - Remove client Client-cdb9a29e-9129-11e9-9779-8c8590bca016
distributed.scheduler - INFO - Remove client Client-d7cc7a23-9127-11e9-9779-8c8590bca016
distributed.scheduler - INFO - Close client connection: Client-cdb9a29e-9129-11e9-9779-8c8590bca016
distributed.scheduler - INFO - Close client connection: Client-d7cc7a23-9127-11e9-9779-8c8590bca016
distributed.scheduler - INFO - Register tcp://10.0.0.59:44611
distributed.scheduler - INFO - Starting worker compute stream, tcp://10.0.0.59:44611
distributed.scheduler - INFO - Register tcp://10.0.0.60:45183
distributed.scheduler - INFO - Starting worker compute stream, tcp://10.0.0.60:45183
distributed.scheduler - INFO - Receive client connection: Client-10ed5adc-9130-11e9-9779-8c8590bca016
distributed.scheduler - INFO - Send lost future signal to clients
distributed.scheduler - INFO - Remove worker tcp://10.0.0.59:44611
distributed.scheduler - INFO - Remove worker tcp://10.0.0.60:45183
distributed.scheduler - INFO - Lost all workers
distributed.scheduler - INFO - Clear task state
distributed.scheduler - INFO - Register tcp://10.0.0.59:37507
distributed.scheduler - INFO - Starting worker compute stream, tcp://10.0.0.59:37507
distributed.scheduler - INFO - Register tcp://10.0.0.60:46703
distributed.scheduler - INFO - Starting worker compute stream, tcp://10.0.0.60:46703
distributed.scheduler - INFO - Clear task state
distributed.scheduler - INFO - Remove client Client-10ed5adc-9130-11e9-9779-8c8590bca016
distributed.scheduler - INFO - Close client connection: Client-10ed5adc-9130-11e9-9779-8c8590bca016
distributed.scheduler - INFO - Receive client connection: Client-b8b1ff42-913b-11e9-b1c6-8c8590bca016
distributed.scheduler - INFO - Remove client Client-b8b1ff42-913b-11e9-b1c6-8c8590bca016
distributed.scheduler - INFO - Close client connection: Client-b8b1ff42-913b-11e9-b1c6-8c8590bca016
distributed.scheduler - INFO - Receive client connection: Client-22a417cc-913d-11e9-b635-8c8590bca016
distributed.scheduler - INFO - Remove worker tcp://10.0.0.59:37507
distributed.scheduler - INFO - Remove worker tcp://10.0.0.60:46703
distributed.scheduler - INFO - Lost all workers
distributed.scheduler - INFO - Register tcp://10.0.0.59:38681
distributed.scheduler - INFO - Starting worker compute stream, tcp://10.0.0.59:38681
distributed.scheduler - INFO - Register tcp://10.0.0.60:46663
distributed.scheduler - INFO - Starting worker compute stream, tcp://10.0.0.60:46663
distributed.scheduler - INFO - Send lost future signal to clients
distributed.scheduler - INFO - Remove worker tcp://10.0.0.59:38681
distributed.scheduler - INFO - Remove worker tcp://10.0.0.60:46663
distributed.scheduler - INFO - Lost all workers
distributed.scheduler - INFO - Clear task state
distributed.scheduler - INFO - Register tcp://10.0.0.59:35635
distributed.scheduler - INFO - Starting worker compute stream, tcp://10.0.0.59:35635
distributed.scheduler - INFO - Register tcp://10.0.0.60:39269
distributed.scheduler - INFO - Starting worker compute stream, tcp://10.0.0.60:39269
distributed.scheduler - INFO - Clear task state
distributed.scheduler - INFO - Receive client connection: Client-84114e0a-913e-11e9-b635-8c8590bca016
distributed.scheduler - INFO - Send lost future signal to clients
distributed.scheduler - INFO - Remove worker tcp://10.0.0.59:35635
distributed.scheduler - INFO - Remove worker tcp://10.0.0.60:39269
distributed.scheduler - INFO - Lost all workers
distributed.scheduler - INFO - Clear task state
distributed.scheduler - INFO - Register tcp://10.0.0.59:45541
distributed.scheduler - INFO - Starting worker compute stream, tcp://10.0.0.59:45541
distributed.scheduler - INFO - Register tcp://10.0.0.60:43083
distributed.scheduler - INFO - Starting worker compute stream, tcp://10.0.0.60:43083
distributed.scheduler - INFO - Clear task state
distributed.scheduler - INFO - Remove worker tcp://10.0.0.59:45541
distributed.scheduler - INFO - Remove worker tcp://10.0.0.60:43083
distributed.scheduler - INFO - Lost all workers
distributed.scheduler - INFO - Register tcp://10.0.0.59:37297
distributed.scheduler - INFO - Starting worker compute stream, tcp://10.0.0.59:37297
distributed.scheduler - INFO - Register tcp://10.0.0.60:43473
distributed.scheduler - INFO - Starting worker compute stream, tcp://10.0.0.60:43473
distributed.scheduler - INFO - Remove client Client-84114e0a-913e-11e9-b635-8c8590bca016
distributed.scheduler - INFO - Remove client Client-22a417cc-913d-11e9-b635-8c8590bca016
distributed.scheduler - INFO - Close client connection: Client-84114e0a-913e-11e9-b635-8c8590bca016
distributed.scheduler - INFO - Close client connection: Client-22a417cc-913d-11e9-b635-8c8590bca016
distributed.scheduler - INFO - Receive client connection: Client-a070d614-913f-11e9-b68e-8c8590bca016
distributed.scheduler - INFO - Send lost future signal to clients
distributed.scheduler - INFO - Remove worker tcp://10.0.0.59:37297
distributed.scheduler - INFO - Remove worker tcp://10.0.0.60:43473
distributed.scheduler - INFO - Lost all workers
distributed.scheduler - INFO - Clear task state
distributed.scheduler - INFO - Register tcp://10.0.0.59:33001
distributed.scheduler - INFO - Starting worker compute stream, tcp://10.0.0.59:33001
distributed.scheduler - INFO - Register tcp://10.0.0.60:42869
distributed.scheduler - INFO - Starting worker compute stream, tcp://10.0.0.60:42869
distributed.scheduler - INFO - Clear task state
distributed.scheduler - INFO - Remove worker tcp://10.0.0.59:33001
distributed.scheduler - INFO - Remove worker tcp://10.0.0.60:42869
distributed.scheduler - INFO - Lost all workers
distributed.scheduler - INFO - Register tcp://10.0.0.59:39471
distributed.scheduler - INFO - Starting worker compute stream, tcp://10.0.0.59:39471
distributed.scheduler - INFO - Register tcp://10.0.0.60:36841
distributed.scheduler - INFO - Starting worker compute stream, tcp://10.0.0.60:36841
The environment's pinned dependencies:

sq-blocks>=0.6.1
bokeh
dask-ml[complete]
distributed >= 1.15.2
gcsfs
jupyterlab
pyarrow==0.12.1 # 0.13.0 has a bug affecting dask
dask_xgboost
xgboost==0.81.0
damienrj commented 5 years ago

Okay, it looks like XGBoost is what is failing:

distributed.nanny - INFO -         Start Nanny at: 'tcp://172.17.0.3:34153'
distributed.worker - INFO -       Start worker at:     tcp://172.17.0.3:39939
distributed.worker - INFO -          Listening to:     tcp://172.17.0.3:39939
distributed.worker - INFO -              bokeh at:           172.17.0.3:43413
distributed.worker - INFO -              nanny at:           172.17.0.3:34153
distributed.worker - INFO - Waiting to connect to:       tcp://10.0.0.62:8786
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          8
distributed.worker - INFO -                Memory:                   16.82 GB
distributed.worker - INFO -       Local Directory:     /current/worker-BI7YgD
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Waiting to connect to:       tcp://10.0.0.62:8786
distributed.worker - INFO - Waiting to connect to:       tcp://10.0.0.62:8786
distributed.worker - INFO - Waiting to connect to:       tcp://10.0.0.62:8786
distributed.worker - INFO - Waiting to connect to:       tcp://10.0.0.62:8786
distributed.worker - INFO - Waiting to connect to:       tcp://10.0.0.62:8786
distributed.worker - INFO - Waiting to connect to:       tcp://10.0.0.62:8786
distributed.worker - INFO - Waiting to connect to:       tcp://10.0.0.62:8786
distributed.worker - INFO -         Registered to:       tcp://10.0.0.62:8786
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
retry connect to ip(retry time 1): [127.0.0.1]
retry connect to ip(retry time 2): [127.0.0.1]
retry connect to ip(retry time 3): [127.0.0.1]
retry connect to ip(retry time 4): [127.0.0.1]
connect to (failed): [127.0.0.1]
Socket Connect Error:Connection refused
distributed.nanny - WARNING - Worker process 18 was killed by unknown signal
distributed.nanny - WARNING - Restarting worker

It is localhost because in this example I am running from the scheduler machine.

damienrj commented 5 years ago

It turns out the problem is that XGBoost cares about the address given in Client('IP:port'). The built-in estimators like LogisticRegression work even if you connect with Client('localhost:port') via port forwarding, because they never require the workers to connect back to that address. But dask-xgboost reads the address from the client, and Rabit (XGBoost's tracker) then tries to talk to localhost instead of the actual scheduler address, so the workers cannot reach it. The address passed to Client needs to be the scheduler's real IP (or hostname).
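A minimal sketch of the fix (hostname and port are assumptions matching the setup above): build the client address from the scheduler's real, worker-reachable hostname rather than a forwarded localhost, so the address Rabit hands to the workers is one they can actually reach.

```python
import socket

# Use the scheduler's real hostname (or IP). Since the client in this
# report was run on the scheduler machine itself, this machine's own
# hostname illustrates the point -- it must resolve from the workers.
scheduler_host = socket.gethostname()          # e.g. 'damien-dask-scheduler'
scheduler_address = f'tcp://{scheduler_host}:8786'

# from dask.distributed import Client
# client = Client(scheduler_address)   # NOT Client('127.0.0.1:8786')
print(scheduler_address)
```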