dask / dask-xgboost

BSD 3-Clause "New" or "Revised" License

Use Rabit tracker get_host_ip('auto') to pick best tracker IP address #40

Closed: javabrett closed this 4 years ago

javabrett commented 5 years ago

Discussion

Best to also review the notes in #23.

Currently, when starting XGBoost (which has its own cluster/tracker/worker network), dask-xgboost passes the hostname of the Client's scheduler (e.g. dask-scheduler) as the tracker host. The IP/adapter for this hostname is not always available in the container that is actually running the scheduler. This happens when there is a service reverse-proxy, such as when deploying in k8s using the current stable/dask Helm chart, where dask-scheduler and its address point to service/dask-scheduler, not pod/dask-scheduler....

The simplest fix is to let the Rabit tracker code choose the local adapter/IP to bind the tracker to (in the container/host running the scheduler). That address is then advertised to the XGBoost Rabit workers via env.
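To illustrate the idea of letting the tracker side choose its own bindable address, here is a minimal sketch. It is not the Rabit tracker's get_host_ip code; pick_local_ip is a hypothetical helper name, and the probe address 10.255.255.255 is an arbitrary assumption (a UDP connect sends no packets, it only asks the OS which local address it would route from).

```python
import socket


def pick_local_ip() -> str:
    """Hypothetical helper: best-effort guess at a local IPv4 address.

    Assumption: resolving the machine's own hostname is a better starting
    point than a hostname handed in from outside the container.
    """
    try:
        ip = socket.gethostbyname(socket.getfqdn())
    except OSError:
        try:
            ip = socket.gethostbyname(socket.gethostname())
        except OSError:
            ip = "127.0.0.1"  # nothing resolved; fall back to loopback

    if ip.startswith("127."):
        # Loopback is useless to remote workers, so ask the OS which local
        # address it would use for an outbound route. No traffic is sent.
        probe = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        try:
            probe.connect(("10.255.255.255", 1))
            ip = probe.getsockname()[0]
        except OSError:
            pass  # no usable route; keep the loopback address
        finally:
            probe.close()
    return ip
```

Calling pick_local_ip() inside the scheduler pod would be expected to yield the pod's own address rather than the service address, which is the property the fix relies on.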

Downsides:

Changes

Testing

To perform a manual test of the bug/fix, you will need:

During testing I found EXTRA_PIP_PACKAGES to be a double-edged sword: it is convenient, but the pip installs are long-running and repeated on each node, the service doesn't detect when they complete, and the Helm chart doesn't have readiness probes, so the service looks dead until the installs finish on the Jupyter node. I preferred to build and tag pairs of local images with dask-xgboost and its deps pre-installed: daskdev/dask-notebook and daskdev/dask, each in a pre-fix and a post-fix version of dask-xgboost. To do that, create the Dockerfile below and build/tag four images:

Dockerfile:

ARG BASE_IMAGE
FROM ${BASE_IMAGE}
ARG DASK_XGBOOST_VERSION=master
RUN pip install -U pip && \
    pip install dask-ml git+https://github.com/javabrett/dask-xgboost@${DASK_XGBOOST_VERSION} --upgrade

Run:

You can now deploy the Helm chart to test pre/post fix:

Pre-fix:

export DASK_TAG=xgboost
helm upgrade --install dask stable/dask --set scheduler.image.tag=${DASK_TAG} --set worker.image.tag=${DASK_TAG} --set jupyter.image.tag=${DASK_TAG} --recreate-pods

Once the cluster is up, go to http://localhost, start a new notebook and run:

from distributed import Client
import dask
from dask_xgboost import XGBClassifier
from dask_ml.datasets import make_classification

client = Client()
X, y = make_classification(chunks=20)
X, y = dask.persist(X, y)
XGBClassifier().fit(X, y)

This will fail with:

/opt/conda/lib/python3.7/site-packages/dask_xgboost/tracker.py in __init__()
    166         for port in range(port, port_end):
    167             try:
--> 168                 logging.error('sock.bind %s:%d', hostIP, port)
    169                 sock.bind((hostIP, port))
    170                 self.port = port

OSError: [Errno 99] Cannot assign requested address
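The error above can be reproduced outside the cluster. The following is a minimal sketch, assuming a typical Linux host: binding a socket to an address that no local interface owns fails with errno.EADDRNOTAVAIL (99 on Linux, "Cannot assign requested address"). can_bind is a hypothetical helper, and 192.0.2.1 is a documentation-only address (TEST-NET-1) chosen on the assumption that it is not assigned locally.

```python
import errno
import socket


def can_bind(host: str, port: int = 0) -> bool:
    """Hypothetical helper: True if a TCP socket can bind to (host, port) here.

    Port 0 lets the OS pick any free port, so only the address is tested.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
            return True
        except OSError as exc:
            if exc.errno == errno.EADDRNOTAVAIL:
                # The address is not owned by any local interface, which is
                # the same failure the tracker hits in the traceback.
                return False
            raise  # any other failure (e.g. unresolvable name) is unrelated
```

Run from inside the scheduler pod, can_bind("dask-scheduler") would be a quick way to check whether the service hostname maps to an address the pod itself owns; the name must resolve for the check to be meaningful.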

Post-fix:

export DASK_TAG=xgboost-fixed
helm upgrade --install dask stable/dask --set scheduler.image.tag=${DASK_TAG} --set worker.image.tag=${DASK_TAG} --set jupyter.image.tag=${DASK_TAG} --recreate-pods

Allow all pods to restart, then repeat the notebook test above, which will now pass and return a classifier result.

Fixed #23.

mrocklin commented 5 years ago

cc @RAMitchell you may want to be aware of issues / feature requests like this

@javabrett note that there is a move to push some of this code into XGBoost itself. Your engagement comes at a good time to influence that effort.

gforsyth commented 4 years ago

@TomAugspurger -- any objections to merging this in? I've tested it on a few deployments and it fixes network connectivity issues for the Rabit network.
I've opened an issue on dmlc/xgboost (referenced above) to raise this with them. They seem to be aware of it, with a fix possibly slated for the next release, but I think this would be a good stopgap.

TomAugspurger commented 4 years ago

Sure, thanks for checking. Sorry I missed this originally!