dmlc / dmlc-core

A common bricks library for building scalable and portable distributed machine learning.
Apache License 2.0

Restore slurm tracker? #572

Open thvasilo opened 5 years ago

thvasilo commented 5 years ago

Code exists for using the SLURM scheduler as a tracker for distributed training, but SLURM was removed as an option from submit.py some time ago.

Lately I've been training XGBoost on an MPI cluster. I haven't been able to get the MPI tracker to work, but reinstating the SLURM tracker seems to work after I made some changes to the command being called.

So would the community consider adding SLURM back as an option, or is it supposed to be superseded by the MPI tracker now? If it is, has anyone gotten the MPI tracker to train XGBoost recently?