Closed damienrj closed 5 years ago
Okay, it looks like XGBoost is what is failing:
distributed.nanny - INFO - Start Nanny at: 'tcp://172.17.0.3:34153'
distributed.worker - INFO - Start worker at: tcp://172.17.0.3:39939
distributed.worker - INFO - Listening to: tcp://172.17.0.3:39939
distributed.worker - INFO - bokeh at: 172.17.0.3:43413
distributed.worker - INFO - nanny at: 172.17.0.3:34153
distributed.worker - INFO - Waiting to connect to: tcp://10.0.0.62:8786
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Threads: 8
distributed.worker - INFO - Memory: 16.82 GB
distributed.worker - INFO - Local Directory: /current/worker-BI7YgD
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Waiting to connect to: tcp://10.0.0.62:8786
distributed.worker - INFO - Waiting to connect to: tcp://10.0.0.62:8786
distributed.worker - INFO - Waiting to connect to: tcp://10.0.0.62:8786
distributed.worker - INFO - Waiting to connect to: tcp://10.0.0.62:8786
distributed.worker - INFO - Waiting to connect to: tcp://10.0.0.62:8786
distributed.worker - INFO - Waiting to connect to: tcp://10.0.0.62:8786
distributed.worker - INFO - Waiting to connect to: tcp://10.0.0.62:8786
distributed.worker - INFO - Registered to: tcp://10.0.0.62:8786
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
retry connect to ip(retry time 1): [127.0.0.1]
retry connect to ip(retry time 2): [127.0.0.1]
retry connect to ip(retry time 3): [127.0.0.1]
retry connect to ip(retry time 4): [127.0.0.1]
connect to (failed): [127.0.0.1]
Socket Connect Error:Connection refused
distributed.nanny - WARNING - Worker process 18 was killed by unknown signal
distributed.nanny - WARNING - Restarting worker
It is localhost because in this example I am running from the scheduler.
It turns out that the problem is that XGBoost cares about the IP given in client(IP:Port). The built-in libraries like logistic regression will work if you do client(localhost:Port) via port forwarding, because they don't need the workers to connect to localhost. But XGBoost seems to read in the IP address, and Rabit tries to talk to localhost instead of the actual worker IP address. So the IP address given to the client needs to be the actual IP (or hostname) of the scheduler.
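For anyone hitting the same thing, here is a rough sketch of a pre-flight check that flags a loopback scheduler address before it is handed to the client. The helper name `is_loopback_address` is made up for illustration and is not part of dask or XGBoost. It only inspects the address string, does not resolve hostnames, and does not handle IPv6 literals.

```python
import ipaddress


def is_loopback_address(address: str) -> bool:
    """Return True if a scheduler address such as 'tcp://127.0.0.1:8786'
    points at the local machine (hypothetical helper, string check only)."""
    # Drop an optional scheme ("tcp://") and an optional trailing ":port".
    host = address.split("://", 1)[-1].rsplit(":", 1)[0]
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Some other hostname; it is not resolved here, so assume non-loopback.
        return False
```

If this returns True for the address you are about to use, the workers will probably end up trying to reach 127.0.0.1 as in the log above, so pass the scheduler's real address instead, e.g. `Client("10.0.0.62:8786")` with `Client` from `dask.distributed`.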
The dask cluster is set up as:
And this works fine with
But this doesn't work with
Or
Any idea of where to find more logs or what is going on?
Task: train_part...
Status | no-worker
Priority | (0, 54, 0)
Worker Restrictions | set(['tcp://10.0.0.60:42869'])
Suspicious | 1
Worker Logs
Scheduler