There is code for using the SLURM scheduler as a tracker for distributed training, but it was removed as an option from `submit.py` some time ago.
Lately I've been training XGBoost on an MPI cluster. I haven't been able to get the `mpi` tracker to work, but reinstating the SLURM tracker does work, after some changes to the command being called.
So would the community consider adding SLURM back as an option, or is it meant to be superseded by the `mpi` tracker now? If the latter, has anyone gotten the MPI tracker to train XGBoost recently?