ComparativeGenomicsToolkit / cactus

Official home of genome aligner based upon notion of Cactus graphs

document SLURM as poor --batchSystem choice #128

Closed michaelkarlcoleman closed 4 years ago

michaelkarlcoleman commented 4 years ago

For all but the smallest alignments, SLURM doesn't work well as a --batchSystem choice. Cactus will generate hundreds of thousands of toil tasks, many quite short (<1s), and in our hands, the SLURM scheduler is rapidly swamped by such a workload. (Other sites might limit the job submission rate to keep this from happening, but this would cause Cactus to run extremely slowly.)

It would be useful to document this.
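
To get a sense of whether this is happening on your own cluster, something like the following rough sketch (assuming SLURM accounting is enabled so that sacct works) counts how many of today's jobs finished in two seconds or less:

# Count today's sub-2-second jobs for the current user (allocation lines only,
# no header).  A huge number here is the symptom described above.
sacct -u $USER -S today -X -n --format=Elapsed,State \
    | awk '$2 == "COMPLETED" && $1 ~ /^00:00:0[0-2]$/' | wc -l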

A far better choice is to run Mesos within a single SLURM job (or small set of SLURM jobs), since it is better at dealing with large numbers of tiny jobs.

Here's a model of how this might be done, to assist anyone attempting this. It will have to be adapted some for a particular site. It's probably useful to start some slaves in separate SLURM jobs, so that they can easily be added and removed over the course of a long run.

#!/bin/bash

# e.g., sbatch ./run-cactus.sbatch seqFile.txt

#SBATCH whatever...

# Use sbatch parameters to request N nodes, with other parameters chosen so
# that one task runs on each node and probably requests all of the node's CPU
# and memory resources.  The cactus script and mesos master will run on one
# node and one mesos slave will run on each additional node.

module purge
module load slurm

# singularity 2 seems not to work due to a subtlety in the way bind mounts work.
# singularity 3 will give an error due to 'pull --size', but apparently works anyway.
# module load singularity/2.6.1
module load singularity/3.3.0

# Cactus passes physical paths to singularity in some cases, so /gpfs must be
# bound.  Unsure whether the other two are required, but may as well.  This
# variable must also be passed to the mesos-slave processes (srun does this by default).
export SINGULARITY_BIND='/gpfs,/projects,/packages'

module load mesos/1.9.0

module load miniconda/20190102
conda activate cactus

die () {
    echo 1>&2 "$0: fatal: $@"
    exit 1
}

test "$#" -ne 1 && die 'exactly one arg required (name of tree file)'

# this script, the mesos-master, and at least one mesos-slave
min_ntasks=2
if (( $SLURM_NTASKS < $min_ntasks )); then
    die "must specify at least ${min_ntasks} SLURM tasks"
fi

master=$(hostname)
port=$(shuf -i 10001-64999 -n 1)
echo "### mesos master on $master:$port"

secret=$(date --iso-8601=ns | md5sum | cut -f 1 -d ' ')
# different formats, ugh
credentials=$PWD/mesos.credentials
credential=$PWD/mesos.credential

# avoid exposing our secret in a race
rm -f $credentials $credential
touch $credentials $credential
chmod go-rwx $credentials $credential
echo '{ "credentials": [ { "principal": "'${USER}'", "secret": "'${secret}'" } ] }' >> $credentials
echo                    '{ "principal": "'${USER}'", "secret": "'${secret}'" }'     >> $credential

srun -N 1 -n 1 -c $SLURM_CPUS_PER_TASK --nodelist=$master --job-name=master-mesos \
     mesos-master --port=$port \
     --authenticate --authenticate_agents --credentials=$credentials \
     --registry=in_memory \
     > mesos-master.log 2>&1 &

# not clear whether master must start before slaves, but wait a bit anyway
sleep 5

# in MB
memory=$(( $SLURM_CPUS_PER_TASK * $SLURM_MEM_PER_CPU ))

for n in $(seq 1 $(( $SLURM_NTASKS - 1 )) ); do
    echo starting slave $n
    mkdir mesos-logs-$n
    srun -N 1 -n 1 -c $SLURM_CPUS_PER_TASK --job-name=slave-$n-mesos \
         mesos-slave --master=$master:$port --port=$(( $port + $n )) \
         --credential=$credential \
         --work_dir=$TMPDIR/slave-work-dir-$n-$USER-$SLURM_JOB_ID \
         --no-switch_user \
         --resources="cpus:${SLURM_CPUS_PER_TASK};mem:${memory}" \
         --no-systemd_enable_support \
         > mesos-slave-$n.log 2>&1 &
done

sleep 60

rm -fr workDir
mkdir workDir

# Config files copied from installed source:
#
#   cp $CONDA_PREFIX/lib/python2.7/site-packages/cactus/cactus_*config.xml .
#
# Note that there's a bug (fixed locally) where Cactus tries to *write* the
# installed config files, even if --configFile is specified:
#
#  https://github.com/ComparativeGenomicsToolkit/cactus/issues/102

# Using --containerImage= for reproducibility/convenience.  Omit to pull from
# repo on each run.

cactus file:./jobStore "$1" "$1".hal \
       --batchSystem Mesos \
       --mesosMaster $master:$port \
       --binariesMode singularity \
       --stats \
       --disableCaching \
       \
       --containerImage /packages/cactus/cactus-20191011.img \
       \
       --setEnv SINGULARITY_BIND=$SINGULARITY_BIND \
       \
#

exit 0
JTNelsonWSU commented 4 years ago

I'm currently running Cactus to align multiple genomes on our local SLURM cluster, and I seem to be having the same issue. I'm using the Evolver example file for an initial test, but SLURM is getting bogged down by the many job submissions that Cactus creates. It looks like it's jumping from node to node without actually performing the alignment. I'm currently running this in an IDEV session with 20 CPUs.

Initial input: module load cactus

cactus /data/cornejo/projects/nelson_projects/anopheles_DFE/Anopheles_phylo/cactus evolverMammals.txt temp.hal --binariesMode singularity --batchSystem slurm --maxCores 1

Error output:

RuntimeError: Detected the error jobStoreID has been removed so exiting with an error
ERROR:toil.worker:Exiting the worker because of a failed job on host cn122
WARNING:toil.jobGraph:Due to failure we are reducing the remaining retry count of job 'KtServerService' c/p/jobq5Nuns with ID c/p/jobq5Nuns to 5

Any thoughts?

michaelkarlcoleman commented 4 years ago

Unfortunately, the ktserver program often fails to check I/O error status results. First thing is to avoid "disk full" or "quota exceeded" errors, because ktserver will not handle them correctly, or even necessarily notice them at all.
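
As a rough pre-flight check (paths and quota tools vary by site; the ones below are just examples), make sure the filesystem holding the jobStore and work directories has plenty of headroom before starting a long run:

# Free space on the filesystem holding the jobStore/workDir (example path).
df -h /gpfs
# Per-user quota; use whatever your site provides, e.g. quota(1),
# 'lfs quota' on Lustre, or 'mmlsquota' on GPFS.
quota -s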

Also, toil had a race that you might be hitting. The symptom we've seen is an error when trying to load a previously saved ktserver database. I don't see it in your messages, but perhaps you don't have enough debug logging available. See https://github.com/DataBiosphere/toil/issues/2897

diekhans commented 4 years ago

The problem with the Toil SLURM batch system, and with all of the other Grid Engine-like systems, is that they create a separate batch job per Toil job rather than using array jobs.
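
Schematically, the difference is something like this (a sketch, not Toil's actual code): one sbatch call per Toil job versus a single array submission that the scheduler can handle far more cheaply.

# What the current SLURM batch system effectively does: one submission per job.
for i in $(seq 1 1000); do
    sbatch --wrap "run_toil_task $i"     # hypothetical worker command
done

# What an array-job-based batch system could do instead: one submission
# covering all 1000 tasks, indexed inside the script via $SLURM_ARRAY_TASK_ID.
sbatch --array=1-1000 run_toil_tasks.sbatch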

Unless someone contributes to maintaining the Toil SLURM batch system, this isn't going to change, as we don't have the resources to support it.

Have you tried the Mesos batch system?

diekhans commented 4 years ago

The Toil atomic file creation issue that impacted the ktserver database has been fixed in the latest Toil release.

michaelkarlcoleman commented 4 years ago

@diekhans Indeed--the code snippet I included shows how to use Mesos within SLURM. In this pattern, all of the load of handling the millions of Toil tasks is removed from SLURM and imposed on the mesos processes.

Regarding the initial bug report, note that I'm not asking for any code changes. Rather, I think a warning ought to be placed in the docs (and maybe also the --help text) to ensure that users don't accidentally crush their cluster (likely annoying their colleagues and admins) by assuming that Cactus and SLURM (etc) will "just work" together.

For better or worse, Cactus starts an immense number of extremely short jobs, and many scheduling systems don't handle this well. (Using array jobs might help some, but I suspect it still wouldn't work well.)

(As an optimization, it would be interesting to see whether those ultra-short tasks could be anticipated in advance. In many cases, I suspect it takes longer to round-trip the tasks through toil and onto a remote node and back than it would take to simply execute them within the main Cactus process itself.)
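
Schematically, the dispatch I have in mind is no more than this (a hypothetical shell sketch, not anything Toil currently exposes):

# If a task's estimated runtime is below some small threshold, just run it in
# the leader process; otherwise hand it to the batch system as usual.
estimated_seconds=$1
if (( estimated_seconds < 2 )); then
    run_task_locally "$2"               # hypothetical helper
else
    submit_task_to_batch_system "$2"    # hypothetical helper
fi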

diekhans commented 4 years ago

@michaelkarlcoleman yes! I already committed the suggested warning to the README. I have not closed this ticket yet because I want to add your mesos suggestion to the doc.

thanks!!!

diekhans commented 4 years ago

done

N0s3n commented 4 years ago

Excellent guide @michaelkarlcoleman! Everything worked except that the PATH variable in the mesos slave's environment wasn't inherited from the shell that executed it, so _toil_mesos_executor wasn't found. We solved it by adding --executor_environment_variables=$(echo {\"PATH\":\"$PATH\"}) when starting mesos-agent.sh.

This is how we run it, using your example:

mesos-slave --master=$master:$port --port=$(( $port + $n )) \
         --credential=$credential \
         --work_dir=$TMPDIR/slave-work-dir-$n-$USER-$SLURM_JOB_ID \
         --no-switch_user \
         --resources="cpus:${SLURM_CPUS_PER_TASK};mem:${memory}" \
         --no-systemd_enable_support \
         --executor_environment_variables=$(echo {\"PATH\":\"$PATH\"})

Regards, Björn and Martin

michaelkarlcoleman commented 4 years ago

@N0s3n Very happy to hear it! I'm guessing the $PATH thing is due to an idiosyncrasy of our environment, or else a change between versions in the way that SLURM propagates environment variables.
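
If anyone else hits this, another workaround (assuming your SLURM's srun supports --export, which recent versions do) is to forward PATH explicitly when launching the agents, e.g.:

# Forward PATH (on top of the default environment) to the srun'd mesos-slave.
srun --export=ALL,PATH="$PATH" -N 1 -n 1 -c $SLURM_CPUS_PER_TASK --job-name=slave-$n-mesos \
     mesos-slave --master=$master:$port --credential=$credential ...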

In any case, I hope that our collective breadcrumbs will be helpful to future SLURM users!

diekhans commented 4 years ago

The new cactus_prepare facility should be useful for some alignments being run on HPC clusters with SLURM and other schedulers.
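
For example, something along these lines (check cactus-prepare --help for the exact option names in your release; the flags below follow the README and may differ):

# Split one alignment into a plan of smaller commands that can be submitted
# as ordinary cluster jobs.
cactus-prepare evolverMammals.txt --outDir steps \
    --outSeqFile steps/evolverMammals.txt --outHal steps/evolverMammals.hal \
    --jobStore jobstore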


brantfaircloth commented 4 years ago

@michaelkarlcoleman and @N0s3n - I wanted to thank you for your script and the additional pointers above. Taking this approach has allowed us to run cactus on our infrastructure.

In case anyone finds additional changes to this script useful, I've modified it to use heterogeneous resources with the SLURM job-pack approach. In this way, you can crank up leader mesos instances and follower mesos instances in different queues, with different resource allocations for each. In the following example, the lead mesos instance and lead cactus process run using fewer resources (our single queue, meant for jobs using < 1 entire node), while the follower mesos instance(s) doing the work receive more substantial resources (our checkpt queue, meant for jobs using > 1 node). CPUs and nodes allocated to each "job-pack" can be adjusted up or down as needed.

#!/bin/bash
#SBATCH --job-name cactus-mesos
#SBATCH --account=loni_bf_qb_03
#SBATCH --time=04:00:00
# --- mesos lead resources ---
#SBATCH --nodes=1
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=1
#SBATCH -p single
#SBATCH -o slurm-%x-%j.out-%N
#SBATCH -e slurm-%x-%j.err-%N
# --- mesos follower resources ---
#SBATCH packjob
#SBATCH --nodes=2
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=48
#SBATCH -p checkpt

# Modified from script by Michael Coleman posted as an issue to github
# https://github.com/ComparativeGenomicsToolkit/cactus/issues/128
#
# e.g., sbatch run-cactus.sbatch evolverMammals.txt

# Modified to use slurm heterogeneous/packed jobs.  Essentially, we request
# nodes/resources suited to the jobs we are running - fewer cores for the
# mesos lead process and more for the mesos follower processes.  The cactus
# and mesos lead process operate on the same node, while you can start any
# number of follower nodes.

# create a temporary directory within our working directory
local_tmpdir=$PWD/tmpdir
rm -rf $local_tmpdir
mkdir $local_tmpdir

# set an environment variable to mount this local dir in the cactus 
# container.  Although, in my case, the tmpdir is nested in /work,
# it is not mounted appropriately, so I mount it explicitly.
export SINGULARITY_BINDPATH="/work,/usr/lib64,$local_tmpdir"

# check to ensure the configuration file is passed as the only arg
die () {
    echo 1>&2 "$0: fatal: $@"
    exit 1
}
test "$#" -ne 1 && die 'exactly one arg required (name of tree file)'

# number of tasks is somewhat confusing in this context, so i've removed
# the check of appropriate task number

# SETUP MESOS LEAD - PACK GROUP 0
# get the lead hostname
master=$(hostname)
# set a port on lead
port=5050
#setup credentials to log into the lead from the follower(s)
echo "### mesos master on $master:$port"
secret=$(date --iso-8601=ns | md5sum | cut -f 1 -d ' ')
# those need different formats
credentials=$PWD/mesos.credentials
credential=$PWD/mesos.credential
# avoid exposing our secret in a race
rm -f $credentials $credential
touch $credentials $credential
chmod go-rwx $credentials $credential
echo '{ "credentials": [ { "principal": "'${USER}'", "secret": "'${secret}'" } ] }' >> $credentials
echo                    '{ "principal": "'${USER}'", "secret": "'${secret}'" }'     >> $credential
# start the mesos lead in --pack-group=0 and let it get setup
srun --pack-group=0 -N 1 -n 1 -c $SLURM_CPUS_PER_TASK_PACK_GROUP_0 --nodelist=$master \
     mesos-master --port=$port \
     --authenticate --authenticate_agents --credentials=$credentials \
     --registry=in_memory \
     > mesos-master.log 2>&1 &

# Wait a bit for lead to start
sleep 5
echo "LEAD NODE IS SETUP AND RUNNING"

# SETUP MESOS FOLLOWERS - PACK GROUP 1
# compute memory resources for each node 
memory=$(( $SLURM_CPUS_PER_TASK_PACK_GROUP_1 * $SLURM_MEM_PER_CPU_PACK_GROUP_1 ))

# set up follower nodes - note change here from original - since we're using
# pack groups, we don't need to subtract anything from the NTASKS of the pack
# group
#
# also note that in my installation I have set:
#
# export MESOS_SYSTEMD_ENABLE_SUPPORT=false
#
# in my ~/.bashrc, and passing --no-systemd_enable_support in addition
# to this environment variable causes a failure, so flag removed.  I have
# also included --executor_environment_variables=$(echo {\"PATH\":\"$PATH\"})
# as noted by N0s3n down the thread at 
# https://github.com/ComparativeGenomicsToolkit/cactus/issues/128
#
for n in $(seq 1 $SLURM_NTASKS_PACK_GROUP_1); do
    echo starting follower $n
    srun --pack-group=1 -N 1 -n 1 -c $SLURM_CPUS_PER_TASK_PACK_GROUP_1 --exclusive --job-name=follower-$n-mesos \
         mesos-slave --master=$master:$port --port=$(( $port + $n )) \
         --credential=$credential \
         --work_dir=$local_tmpdir/follower-work-dir-$n-$USER-$SLURM_JOB_ID \
         --no-switch_user \
         --resources="cpus:${SLURM_CPUS_PER_TASK_PACK_GROUP_1};mem:${memory}" \
         --executor_environment_variables=$(echo {\"PATH\":\"$PATH\"}) \
         > mesos-slave-$n.log 2>&1 &
done
sleep 15
echo "FOLLOWER NODE(s) SETUP AND RUNNING"
echo "ACTIVATING CACTUS"
source activate cactus
echo "STARTING CACTUS"
# our system has 48 core nodes w/ 192 GB RAM.  adjust as needed.
cactus jobStore "$1" "$1".hal \
       --batchSystem Mesos \
       --mesosMaster $master:$port \
       --stats \
       --disableCaching \
       --containerImage /home/admin/singularity/cactus-1.0.0-ubuntu-16.04.simg \
       --binariesMode singularity \
       --setEnv SINGULARITY_BINDPATH=$SINGULARITY_BINDPATH \
       --clean never \
       --cleanWorkDir never \
       --maxCores 48 \
       --maxMemory 192G \
#
seff $SLURM_JOB_ID
exit 0