I'm currently running Cactus to align multiple genomes on my local SLURM cluster, and I seem to be having the same issue. I'm using the Evolver example file for an initial test; however, SLURM is getting bogged down by the many job submissions Cactus creates. It looks like it's jumping from node to node but not actually performing the alignment. I'm currently running this on an IDEV session with 20 CPUs.
Initial input:
module load cactus
cactus /data/cornejo/projects/nelson_projects/anopheles_DFE/Anopheles_phylo/cactus evolverMammals.txt temp.hal --binariesMode singularity --batchSystem slurm --maxCores 1
Error code:
RuntimeError: Detected the error jobStoreID has been removed so exiting with an error
ERROR:toil.worker:Exiting the worker because of a failed job on host cn122
WARNING:toil.jobGraph:Due to failure we are reducing the remaining retry count of job 'KtServerService' c/p/jobq5Nuns with ID c/p/jobq5Nuns to 5
Any thoughts?
Unfortunately, the ktserver program often fails to check I/O error status results. The first thing is to avoid "disk full" or "quota exceeded" errors, because ktserver will not handle them correctly, or even necessarily notice them at all.
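(For anyone who wants a quick sanity check before launching, something along these lines, run against whatever filesystem will hold the jobStore, can catch the disk-full/quota case early; the path here is just a placeholder:)

# Pre-flight sketch: confirm free space and quota on the jobStore filesystem
# before starting cactus. JOBSTORE_DIR is a placeholder; point it at the real
# location of your Toil jobStore / ktserver working directory.
JOBSTORE_DIR=/path/to/jobStore
df -h "$(dirname "$JOBSTORE_DIR")"   # free space on that filesystem
quota -s 2>/dev/null || true         # per-user quota, where quotas are enabled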
Also, toil had a race condition that you might be hitting. The symptom we've seen is an error when trying to load a previously saved ktserver database. I don't see it in your messages, but perhaps you don't have enough debug logging available. See https://github.com/DataBiosphere/toil/issues/2897
The problem with the Toil SLURM batch system, and all of the other Grid Engine-like batch systems, is that they create a separate batch job per Toil job rather than using array jobs.
Unless someone contributes to maintaining the Toil SLURM batch system, this isn't going to change, as we don't have the resources to support it.
Have you tried the Mesos batch system?
The Toil atomic file creation issue that impacted the ktserver database has been fixed in the latest Toil release.
@diekhans Indeed, the code snippet I included shows how to use Mesos within SLURM. In this pattern, all of the load of handling the millions of Toil tasks is removed from SLURM and imposed on the Mesos processes.
Regarding the initial bug report, note that I'm not asking for any code changes. Rather, I think a warning ought to be placed in the docs (and maybe also the --help text) to ensure that users don't accidentally crush their cluster (likely annoying their colleagues and admins) by assuming that Cactus and SLURM (etc) will "just work" together.
For better or worse, Cactus starts an immense number of extremely short jobs, and many scheduling systems don't handle this well. (Using array jobs might help some, but I suspect it still wouldn't work well.)
(As an optimization, it would be interesting to see whether those ultra-short tasks could be anticipated in advance. In many cases, I suspect it takes longer to round-trip the tasks through toil and onto a remote node and back than it would take to simply execute them within the main Cactus process itself.)
@michaelkarlcoleman yes! I already committed the suggested warning to the README. I haven't closed this ticket yet because I want to add your Mesos suggestion to the docs.
Thanks!!!
done
Excellent guide @michaelkarlcoleman! Everything worked except that the PATH variable in the environment of the mesos slave wasn't inherited from the shell that executed it, so _toil_mesos_executor wasn't found.
We solved it by adding --executor_environment_variables=$(echo {\"PATH\":\"$PATH\"}) when we start mesos-agent.sh.
This is how we run it, using your example:
mesos-slave --master=$master:$port --port=$(( $port + $n )) \
--credential=$credential \
--work_dir=$TMPDIR/slave-work-dir-$n-$USER-$SLURM_JOB_ID \
--no-switch_user \
--resources="cpus:${SLURM_CPUS_PER_TASK};mem:${memory}" \
--no-systemd_enable_support \
--executor_environment_variables=$(echo {\"PATH\":\"$PATH\"})
Regards, Björn and Martin
@N0s3n Very happy to hear it! I'm guessing the $PATH thing is due to an idiosyncrasy of our environment, or else a change between versions in the way that SLURM propagates environment variables.
In any case, I hope that our collective breadcrumbs will be helpful to future SLURM users!
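(If anyone wants to check this on their own cluster before adding the workaround, a quick probe of what an srun step actually inherits is enough; this is just an illustrative snippet:)

# Illustrative check: print the PATH seen by an srun-launched step and compare
# it with the submission shell before starting the mesos agents.
echo "submit shell PATH: $PATH"
srun -N 1 -n 1 bash -c 'echo "srun step PATH: $PATH"'
# If they differ, either export the environment explicitly (sbatch/srun
# --export=ALL) or pass PATH to the executors with
# --executor_environment_variables, as N0s3n did above.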
The new cactus_prepare facility should be useful for some alignments being run on HPC clusters with SLURM and other schedulers.
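(Roughly, the idea is that it decomposes the alignment into a plan of larger per-step commands that can be submitted as ordinary scheduler jobs. A minimal sketch follows; the entry point is spelled cactus-prepare in recent releases, and the exact flags may differ in yours, so check cactus-prepare --help:)

# Sketch only: generate a step-by-step command plan for the Evolver example.
cactus-prepare evolverMammals.txt \
    --outDir steps \
    --outSeqFile steps/evolverMammals.txt \
    --outHal steps/evolverMammals.hal \
    --jobStore jobstore > steps.sh
# steps.sh then lists the individual preprocess/blast/align/append commands,
# each of which can be wrapped in a regular sbatch submission.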
@michaelkarlcoleman and @N0s3n - I wanted to thank you for your script and the additional pointers above. Taking this approach has allowed us to run cactus on our infrastructure.
In case anyone finds additional changes to this script useful, I've modified it to use heterogeneous resources with the SLURM job-pack approach. This way, you can crank up leader and follower Mesos instances in different queues, with different resource allocations for each. In the following example, the lead Mesos instance and the lead cactus process run using fewer resources (our single queue, meant for jobs using < 1 entire node), while the follower Mesos instance(s) doing the work receive more substantial resources (our checkpt queue, meant for jobs using > 1 node(s)). CPUs and nodes allocated to each "job pack" can be adjusted up/down as needed.
#!/bin/bash
#SBATCH --job-name cactus-mesos
#SBATCH --account=loni_bf_qb_03
#SBATCH --time=04:00:00
# --- mesos lead resources ---
#SBATCH --nodes=1
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=1
#SBATCH -p single
#SBATCH -o slurm-%x-%j.out-%N
#SBATCH -e slurm-%x-%j.err-%N
# --- mesos follower resources ---
#SBATCH packjob
#SBATCH --nodes=2
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=48
#SBATCH -p checkpt
# Modified from script by Michael Coleman posted as an issue to github
# https://github.com/ComparativeGenomicsToolkit/cactus/issues/128
#
# e.g., sbatch run-cactus.sbatch evolverMammals.txt
# Modified to use slurm heterogenous/packed jobs. Essentially, we request
# nodes/resources suited to the jobs we are running - fewer cores for the
# mesos lead process and more for the mesos follower processes. The cactus
# and mesos lead process operate on the same node, while you can start any
# number of follower nodes.
# create a temporary directory within our working directory
local_tmpdir=$PWD/tmpdir
rm -rf $local_tmpdir
mkdir $local_tmpdir
# set an environment variable to mount this local dir in the cactus
# container. Although, in my case, the tmpdir is nested in /work,
# it is not mounted appropriately there, so I mount it explicitly
export SINGULARITY_BINDPATH="/work,/usr/lib64,$local_tmpdir"
# check to ensure the configuration file is passed as the only arg
die () {
echo 1>&2 "$0: fatal: $@"
exit 1
}
test "$#" -ne 1 && die 'exactly one arg required (name of tree file)'
# number of tasks is somewhat confusing in this context, so I've removed
# the check for an appropriate task number
# SETUP MESOS LEAD - PACK GROUP 0
# get the lead hostname
master=$(hostname)
# set a port on lead
port=5050
#setup credentials to log into the lead from the follower(s)
echo "### mesos master on $master:$port"
secret=$(date --iso-8601=ns | md5sum | cut -f 1 -d ' ')
# those need different formats
credentials=$PWD/mesos.credentials
credential=$PWD/mesos.credential
# avoid exposing our secret in a race
rm -f $credentials $credential
touch $credentials $credential
chmod go-rwx $credentials $credential
echo '{ "credentials": [ { "principal": "'${USER}'", "secret": "'${secret}'" } ] }' >> $credentials
echo '{ "principal": "'${USER}'", "secret": "'${secret}'" }' >> $credential
# start the mesos lead in --pack-group=0 and let it get setup
srun --pack-group=0 -N 1 -n 1 -c $SLURM_CPUS_PER_TASK_PACK_GROUP_0 --nodelist=$master \
mesos-master --port=$port \
--authenticate --authenticate_agents --credentials=$credentials \
--registry=in_memory \
> mesos-master.log 2>&1 &
# Wait a bit for lead to start
sleep 5
echo "LEAD NODE IS SETUP AND RUNNING"
# SETUP MESOS FOLLOWERS - PACK GROUP 1
# compute memory resources for each node
memory=$(( $SLURM_CPUS_PER_TASK_PACK_GROUP_1 * $SLURM_MEM_PER_CPU_PACK_GROUP_1 ))
# set up follower nodes - note change here from original - since we're using
# pack groups, we don't need to subtract anything from the NTASKS of the pack
# group
#
# also note that in my installation I have set:
#
# export MESOS_SYSTEMD_ENABLE_SUPPORT=false
#
# in my ~/.bashrc, and passing --no-systemd_enable_support in addition
# to this environment variable causes a failure, so that flag is removed. I have
# also included --executor_environment_variables=$(echo {\"PATH\":\"$PATH\"})
# as noted by N0s3n down the thread at
# https://github.com/ComparativeGenomicsToolkit/cactus/issues/128
#
for n in $(seq 1 $SLURM_NTASKS_PACK_GROUP_1); do
echo starting follower $n
srun --pack-group=1 -N 1 -n 1 -c $SLURM_CPUS_PER_TASK_PACK_GROUP_1 --exclusive --job-name=follower-$n-mesos \
mesos-slave --master=$master:$port --port=$(( $port + $n )) \
--credential=$credential \
--work_dir=$local_tmpdir/follower-work-dir-$n-$USER-$SLURM_JOB_ID \
--no-switch_user \
--resources="cpus:${SLURM_CPUS_PER_TASK_PACK_GROUP_1};mem:${memory}" \
--executor_environment_variables=$(echo {\"PATH\":\"$PATH\"}) \
> mesos-slave-$n.log 2>&1 &
done
sleep 15
echo "FOLLOWER NODE(s) SETUP AND RUNNING"
echo "ACTIVATING CACTUS"
source activate cactus
echo "STARTING CACTUS"
# our system has 48 core nodes w/ 192 GB RAM. adjust as needed.
cactus jobStore "$1" "$1".hal \
--batchSystem Mesos \
--mesosMaster $master:$port \
--stats \
--disableCaching \
--containerImage /home/admin/singularity/cactus-1.0.0-ubuntu-16.04.simg \
--binariesMode singularity \
--setEnv SINGULARITY_BINDPATH=$SINGULARITY_BINDPATH \
--clean never \
--cleanWorkDir never \
--maxCores 48 \
--maxMemory 192G
seff $SLURM_JOB_ID
exit 0
For all but the smallest alignments, SLURM doesn't work well as a --batchSystem choice. Cactus will generate hundreds of thousands of Toil tasks, many quite short (<1 s), and in our hands the SLURM scheduler is rapidly swamped by such a workload. (Other sites might limit the job submission rate to keep this from happening, but that would cause Cactus to run extremely slowly.) It would be useful to document this.
A far better choice is to run Mesos within a single SLURM job (or small set of SLURM jobs), since it is better at dealing with large numbers of tiny jobs.
Here's a model of how this might be done, to assist anyone attempting this. It will have to be adapted somewhat for a particular site. It's probably useful to start some slaves in separate SLURM jobs, so that they can easily be added and removed over the course of a long run.
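(For readers without the original attachment, the bare skeleton of the pattern looks roughly like the following; the complete job-pack version posted earlier in this thread is the better starting point, since it also handles credentials, work directories, and per-node resources. Hostnames, ports, and the memory figure are placeholders:)

# Bare-bones sketch, inside a single SLURM allocation; values are placeholders.
master=$(hostname)
port=5050
mesos-master --port=$port --work_dir=$PWD/master-work --registry=in_memory \
    > mesos-master.log 2>&1 &
sleep 5
mesos-slave --master=$master:$port --work_dir=$PWD/slave-work --no-switch_user \
    --resources="cpus:${SLURM_CPUS_PER_TASK};mem:16384" \
    > mesos-slave.log 2>&1 &
sleep 10
cactus jobStore evolverMammals.txt evolverMammals.hal \
    --batchSystem Mesos --mesosMaster $master:$port --binariesMode singularity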