Closed: javabrett closed this 4 years ago
cc @RAMitchell you may want to be aware of issues / feature requests like this
@javabrett note that there is a move to push some of this code into XGBoost itself. Your engagement comes at a good time to influence that effort.
@TomAugspurger -- any objections to merging this in? I've tested it on a few deployments and it fixes network connectivity issues for the rabit network.
I've opened an issue on dmlc/xgboost (referenced above) to raise this with them; they seem to be aware of it, with a fix possibly slated for the next release, but I think this would be a good stopgap.
Sure, thanks for checking. Sorry I missed this originally!
Discussion
Best to also review the notes in #23.
Currently when starting XGBoost (which has its own cluster/tracker/worker network), dask-xgboost is feeding it the hostname of the `Client` scheduler - e.g. `dask-scheduler`. The IP/adapter for this hostname is not always available in the container that is actually running the scheduler. This is true in cases where there is a service reverse-proxy, such as when deploying in k8s using the current `stable/dask` Helm chart, where `dask-scheduler` and its address point to `service/dask-scheduler`, not `pod/dask-scheduler...`.

The simplest fix is to allow the Rabit tracker code to choose the local adapter/IP to bind the tracker to (in the container/host running the scheduler), which is then advertised to the XGBoost Rabit workers via `env`.
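For context, the auto-detection works like the `get_host_ip` helper in the vendored Rabit tracker code: "connecting" a UDP socket picks the outbound adapter without sending any packets. A minimal stdlib-only sketch (names mirror the tracker helper, but this is illustrative, not the actual diff):

```python
import socket

def get_host_ip(host=None):
    """Return an explicit host unchanged, or auto-detect the local adapter IP."""
    if host is None or host == "auto":
        s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        try:
            # Routing trick: connect() on a UDP socket only selects a route
            # and source address; no traffic is actually sent.
            s.connect(("10.255.255.255", 1))
            host = s.getsockname()[0]
        except OSError:
            # No route available (e.g. offline); fall back to loopback.
            host = "127.0.0.1"
        finally:
            s.close()
    return host

print(get_host_ip("192.0.2.1"))  # -> 192.0.2.1 (explicit host passes through)
```

With `host=None`, the tracker binds to whatever local IP the routing table would actually use, which is reachable from worker pods even when the service hostname is not.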
Downsides:

- Workers reach the Rabit tracker by its local IP rather than via `service/dask-scheduler`. Probably not a big concern given the Rabit network should be short-lived, and restartable on any new scheduler/worker pods.
- … the `Client` scheduler hostname anyway.

Changes
- `start_tracker` now accepts `host=None`, and in that case calls the Rabit code `get_host_ip('auto')`, which attempts to find the best local adapter address.
- The `client._run_on_scheduler(start_tracker, ...)` call passes `host=None` to trigger this logic.

Testing
To perform a manual test of the bug/fix, you will need:

- a k8s cluster with Helm initialized (`helm init`).
During testing I found `EXTRA_PIP_PACKAGES` a two-edged sword - convenient, but the `pip` installs are long-running and repetitive on each node, the service doesn't detect when they complete, and the Helm chart doesn't have readiness probes, so the service looks dead until the install completes on the Jupyter node. I preferred to build and tag a couple of pairs of local images with `dask-xgboost` and deps pre-installed - `daskdev/dask-notebook` and `daskdev/dask` for pre/post-fix versions of `dask-xgboost`. To do that, create the `Dockerfile` below and build/tag four images.

`Dockerfile`:
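The original `Dockerfile` is not shown on this page; a minimal sketch consistent with the `BASE_IMAGE` and `DASK_XGBOOST_VERSION` build args used in the commands below (the git URL and default branch are assumptions - the fix branch presumably lives on a fork):

```
# BASE_IMAGE selects daskdev/dask or daskdev/dask-notebook (assumption).
ARG BASE_IMAGE
FROM ${BASE_IMAGE}
# DASK_XGBOOST_VERSION is a git ref for dask-xgboost; defaults to master,
# and is overridden with the fix branch for the "-fixed" images (assumption).
ARG DASK_XGBOOST_VERSION=master
RUN pip install xgboost \
    "git+https://github.com/dask/dask-xgboost.git@${DASK_XGBOOST_VERSION}"
```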
Run:
docker build --build-arg BASE_IMAGE="daskdev/dask:1.2.0" -t daskdev/dask:xgboost .
docker build --build-arg BASE_IMAGE="daskdev/dask:1.2.0" --build-arg DASK_XGBOOST_VERSION="23-rabit-tracker-bind-address" -t daskdev/dask:xgboost-fixed .
docker build --build-arg BASE_IMAGE="daskdev/dask-notebook:1.2.0" -t daskdev/dask-notebook:xgboost .
docker build --build-arg BASE_IMAGE="daskdev/dask-notebook:1.2.0" --build-arg DASK_XGBOOST_VERSION="23-rabit-tracker-bind-address" -t daskdev/dask-notebook:xgboost-fixed .
You can now deploy the Helm chart to test pre/post fix:
Pre-fix:
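The original deploy commands are not preserved here; with Helm 2 and the images tagged above, a pre-fix deployment might look like the following (the release name and value keys follow the `stable/dask` chart's `scheduler`/`worker`/`jupyter` image layout, but are assumptions):

```
helm install stable/dask --name dask-test \
  --set scheduler.image.tag=xgboost \
  --set worker.image.tag=xgboost \
  --set jupyter.image.tag=xgboost
```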
Once the cluster is up, go to http://localhost, start a new notebook and run:
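The notebook snippet is also not preserved; a minimal reproduction in the spirit of the dask-xgboost README (the data source, column names, and parameters are placeholders, not the original code, and this needs a live cluster to run):

```python
from dask.distributed import Client
import dask.dataframe as dd
import dask_xgboost as dxgb

# Scheduler address assumed from the Helm chart's service name.
client = Client("dask-scheduler:8786")

# Placeholder training data: any dask dataframe with features and a label column.
df = dd.read_csv("s3://path/to/training-data/*.csv")
labels = df["target"]
data = df.drop("target", axis=1)

params = {"objective": "binary:logistic", "max_depth": 4}
# dxgb.train starts the Rabit tracker on the scheduler; pre-fix, this is
# where the unbindable scheduler hostname causes the failure.
bst = dxgb.train(client, params, data, labels)
```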
This will fail with a connection error from the Rabit tracker network.
Post-fix:
Allow all pods to restart, then repeat the notebook test above, which will now pass and return a classifier result.
Fixed #23.