Currently we run Rabit's central process (the tracker) on the scheduler and its worker processes alongside the dask workers. This has caused issues in two cases:
- Sometimes the scheduler has a more stripped-down environment and doesn't have all of the libraries that the workers do.
- Sometimes the scheduler's networking position is different from that of the workers (#23, #40).
We might consider instead running the tracker on a worker. This would also keep the scheduler more isolated. It's awkward if there is data on the worker where we want to run the tracker, but if we're comfortable moving data (as is the case in @RAMitchell 's rewrite) then maybe this doesn't matter.
@RAMitchell thought I'd bring this up now rather than later in case it affects things.
Are we currently fault-tolerant in any way should a single worker die? And if so, is worker death likely enough that a tracker failure would occur more often on a worker than on the scheduler, which is presumably running less code under less load?
Are there any time-sensitive Rabit tracker tasks that would cause problems if the worker hosting the tracker were under load or resource pressure?
So for my xgboost integration (https://github.com/dmlc/xgboost/pull/4473) I will try running the tracker on worker zero and assume that the tracker's performance overhead is negligible.
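A minimal sketch of what "tracker on worker zero" could look like, assuming the set of worker addresses is known up front. `pick_tracker_worker` and `rabit_worker_env` are hypothetical helpers (not actual dask-xgboost API); the environment variable names `DMLC_TRACKER_URI`, `DMLC_TRACKER_PORT`, and `DMLC_NUM_WORKER` are the standard dmlc-core ones that Rabit workers read to find the tracker.

```python
def pick_tracker_worker(worker_addresses):
    """Deterministically choose "worker zero" to host the Rabit tracker.

    Picking the lexicographically first address is an assumed convention;
    any stable rule works, as long as every participant agrees on it.
    """
    if not worker_addresses:
        raise ValueError("no workers available to host the tracker")
    return sorted(worker_addresses)[0]


def rabit_worker_env(tracker_host, tracker_port, n_workers):
    """Environment each Rabit worker needs in order to reach the tracker.

    The host/port values would come from the tracker process started on
    the worker chosen above.
    """
    return {
        "DMLC_TRACKER_URI": tracker_host,
        "DMLC_TRACKER_PORT": str(tracker_port),
        "DMLC_NUM_WORKER": str(n_workers),
    }


# Example: three dask workers; the tracker lands on the "smallest" address.
workers = ["tcp://10.0.0.7:34567", "tcp://10.0.0.3:40123", "tcp://10.0.0.5:38999"]
print(pick_tracker_worker(workers))
```

The point of the deterministic choice is that the scheduler stays out of the rendezvous entirely: every worker can compute the same answer from the same membership list, and the tracker's environment is then broadcast to the rest of the workers before training starts.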