javabrett commented 5 years ago

Discussion

Best to also review the notes in #23.

Currently when starting XGBoost (which has its own cluster/tracker/worker network), dask-xgboost is feeding the hostname of the Client scheduler - e.g. dask-scheduler. The IP/adapter for this hostname is not always available in the container that is actually running the scheduler. This is true in cases where there is a service reverse-proxy, such as when deploying in k8s using the current stable/dask Helm chart, when dask-scheduler and its address point to service/dask-scheduler not pod/dask-scheduler....

The simplest approach to fix is to just allow the Rabit tracker code to choose the local adapter/IP to bind the tracker to (in the container/host running scheduler), which is then advertised to XGBoost Rabit workers via env.

Downsides:

In k8s case, the XGBoost Rabit tracker is accessed directly between Dask scheduler/worker nodes, i.e. not via the longer-running service/dask-scheduler. Probably not a big concern given the Rabit network should be short-lived, and restartable on any new scheduler/worker pods.
Bad host networking configuration could conceivably choose a bad adapter for the Rabit tracker to bind to - but it should choose an adapter which matches the Client scheduler hostname anyway.

Changes

start_tracker now accepts host=None and in that case calls Rabit code get_host_ip('auto'), which attempts to find the best local adapter address
client._run_on_scheduler(start_tracker passes host=None to trigger this logic
Added some logging

Testing

To perform a manual test of the bug/fix, you will need:

Docker
A k8s cluster - probably any will do, certainly either Docker Desktop for Mac 2.0.0.3 with local k8s, or AWS EKS will work. minicube will also probably work but not tested.
Helm client
Tiller installed, helm --init.

During testing I found EXTRA_PIP_PACKAGES a two-edged sword - convenient, but the pip installs are long-running and repetitive on each node, and the service doesn't detect when they complete, and the Helm chart doesn't have readiness probes, so the service looks dead until this completes on the Jupyter node. I preferred to build and tag a couple of pairs of local images with dask-xgboost and deps pre-installed - daskdev/dask-notebook and daskdev/dask for pre/post-fix versions of dask-xgboost. To do that, create the Dockerfile below and build/tag four images:

Dockerfile:

ARG BASE_IMAGE
FROM ${BASE_IMAGE}
ARG DASK_XGBOOST_VERSION=master
RUN pip install -U pip && \
    pip install dask-ml git+https://github.com/javabrett/dask-xgboost@${DASK_XGBOOST_VERSION} --upgrade

Run:

docker build --build-arg BASE_IMAGE="daskdev/dask:1.2.0" -t daskdev/dask:xgboost .
docker build --build-arg BASE_IMAGE="daskdev/dask:1.2.0" --build-arg DASK_XGBOOST_VERSION="23-rabit-tracker-bind-address" -t daskdev/dask:xgboost-fixed .
docker build --build-arg BASE_IMAGE="daskdev/dask-notebook:1.2.0" -t daskdev/dask-notebook:xgboost-fixed .
docker build --build-arg BASE_IMAGE="daskdev/dask-notebook:1.2.0" --build-arg DASK_XGBOOST_VERSION="23-rabit-tracker-bind-address" -t daskdev/dask-notebook:xgboost-fixed .

You can now deploy the Helm chart to test pre/post fix:

Pre-fix:

export DASK_TAG=xgboost
helm upgrade --install dask stable/dask --set scheduler.image.tag=${DASK_TAG} --set worker.image.tag=${DASK_TAG} --set jupyter.image.tag=${DASK_TAG} --recreate-pods

Once the cluster is up, go to http://localhost, start a new notebook and run:

from distributed import Client
import dask
from dask_xgboost import XGBClassifier
from dask_ml.datasets import make_classification

client = Client()
X, y = make_classification(chunks=20)
X, y = dask.persist(X, y)
XGBClassifier().fit(X, y)

This will fail with:

/opt/conda/lib/python3.7/site-packages/dask_xgboost/tracker.py in __init__()
    166         for port in range(port, port_end):
    167             try:
--> 168                 logging.error('sock.bind %s:%d', hostIP, port)
    169                 sock.bind((hostIP, port))
    170                 self.port = port

OSError: [Errno 99] Cannot assign requested address

Post-fix:

export DASK_TAG=xgboost-fixed
helm upgrade --install dask stable/dask --set scheduler.image.tag=${DASK_TAG} --set worker.image.tag=${DASK_TAG} --set jupyter.image.tag=${DASK_TAG} --recreate-pods

Allow all pods to restart, then repeat the notebook test above, which will now pass and return a classifier result.

Fixed #23.

mrocklin commented 5 years ago

cc @RAMitchell you may want to be aware of issues / feature requests like this

@javabrett note that there is a move to push some of this code into XGBoost itself. Your engagement comes at a good time to influence that effort.

gforsyth commented 4 years ago

@TomAugspurger -- any objections to merging this in? I've tested it on a few deployments and it fixes network connectivity issues for the rabit network.
I've opened an issue on dmlc/xgboost referenced above to raise this with them and they seem to be aware of it, with possibly a fix slated for next release, but I think this would be a good stopgap.

TomAugspurger commented 4 years ago

Sure, thanks for checking. Sorry I missed this originally!

dask / dask-xgboost

Use Rabit tracker get_host_ip('auto') to pick best tracker IP address #40

Discussion

Changes

Testing

Pre-fix:

Post-fix: