PacificBiosciences / FALCON

FALCON: experimental PacBio diploid assembler -- Out-of-date -- Please use a binary release: https://github.com/PacificBiosciences/FALCON_unzip/wiki/Binaries

issue with slurm #655

Closed fmarletaz closed 6 years ago

fmarletaz commented 6 years ago

Hi - I am trying to install and run FALCON on a SLURM system. Installation goes well and the test run looks fine. I read the configuration document and tried the blocking option for pwatcher_type. I then launch the assembly with an sbatch script containing a single command, fc_run fc_run_3.cfg. However, there are two problems: (1) the configuration page advises using a -W option (I assume for 'blocking', although I am not completely sure how it works), but in the latest SLURM version this option expects a time: -W, --wait=<seconds>. I put 2 seconds in my config, but I am not sure this is still the right option. (2) The FALCON run starts by building the database, but the script launched for this step uses only 1 CPU and 4 GB of memory, which does not match my specifications. That is too little memory and the job crashes at some point. Moreover, this job does not appear in the scheduler (when checking squeue) although it seems to be running. I would really appreciate some help with all of this. Thanks!

[General]
input_fofn = input.fofn

input_type = raw

length_cutoff = 10000

length_cutoff_pr = 10000

pwatcher_type = blocking

pa_HPCdaligner_option =  -v -dal128 -t16 -e0.75 -M32 -l3200 -k18 -h480 -w8 -s100
ovlp_HPCdaligner_option = -v -dal128 -M32 -k24 -h1024 -e.96 -l2500 -s100

pa_DBsplit_option = -a -x500 -s200
ovlp_DBsplit_option = -s200

# error correction consensus options
falcon_sense_option = --output-multi --output-dformat --min-idt 0.70 --min-cov 4 --max-n-read 200 --n-core 8

# overlap filtering options
overlap_filtering_setting = --max-diff 120 --max-cov 120 --min-cov 2 --n-core 12

[job.defaults]
njobs = 32
job_type = string
pwatcher_type = blocking

submit = srun --wait 2 \
    -p ${JOB_QUEUE}  \
    -J ${JOB_NAME}             \
    -o ${JOB_STDOUT}        \
    -e ${JOB_STDERR}        \
    --mem-per-cpu=${MB}M     \
    --cpus-per-task=${NPROC} \
    ${JOB_SCRIPT}

JOB_QUEUE = compute
MB = 8000
NPROC = 4

[job.step.da]
NPROC = 4

[job.step.la]
NPROC = 16
MB = 4000

[job.step.pda]
NPROC = 8

[job.step.pla]
NPROC = 16
MB = 4000

[job.step.cns]
NPROC = 8 # also to pass --n-core=6 to falcon_sense

[job.step.asm]
NPROC = 24 # also to pass --n-core=24 to ovlp_filter
MB = 4000

The end of the all.log file:

2018-07-12 10:39:54,980 - pwatcher.blocking:463 - DEBUG - query(which='list', jobids=<1>)
2018-07-12 10:40:04,990 - pypeflow.simple_pwatcher_bridge:321 - DEBUG - N in queue: 1 (max_jobs=32)
2018-07-12 10:40:04,990 - pwatcher.blocking:463 - DEBUG - query(which='list', jobids=<1>)
2018-07-12 10:40:15,000 - pypeflow.simple_pwatcher_bridge:321 - DEBUG - N in queue: 1 (max_jobs=32)
2018-07-12 10:40:15,000 - pwatcher.blocking:463 - DEBUG - query(which='list', jobids=<1>)
2018-07-12 10:40:25,010 - pypeflow.simple_pwatcher_bridge:321 - DEBUG - N in queue: 1 (max_jobs=32)
2018-07-12 10:40:25,010 - pwatcher.blocking:463 - DEBUG - query(which='list', jobids=<1>)
2018-07-12 10:40:35,020 - pypeflow.simple_pwatcher_bridge:321 - DEBUG - N in queue: 1 (max_jobs=32)
2018-07-12 10:40:35,020 - pwatcher.blocking:463 - DEBUG - query(which='list', jobids=<1>)
2018-07-12 10:40:45,030 - pypeflow.simple_pwatcher_bridge:321 - DEBUG - N in queue: 1 (max_jobs=32)
2018-07-12 10:40:45,030 - pwatcher.blocking:463 - DEBUG - query(which='list', jobids=<1>)
2018-07-12 10:40:55,040 - pypeflow.simple_pwatcher_bridge:321 - DEBUG - N in queue: 1 (max_jobs=32)
2018-07-12 10:40:55,040 - pwatcher.blocking:463 - DEBUG - query(which='list', jobids=<1>)
2018-07-12 10:40:59,069 - pwatcher.blocking:235 - DEBUG - rc: 137
2018-07-12 10:40:59,069 - pwatcher.blocking:87 - DEBUG - Thread notify_exited(Pf01c81757be2e7->137).
2018-07-12 10:41:05,050 - pypeflow.simple_pwatcher_bridge:321 - DEBUG - N in queue: 1 (max_jobs=32)
2018-07-12 10:41:05,050 - pwatcher.blocking:463 - DEBUG - query(which='list', jobids=<1>)
2018-07-12 10:41:05,050 - pypeflow.simple_pwatcher_bridge:94 - ERROR - Task Node(0-rawreads/build) failed with exit-code=137
2018-07-12 10:41:05,050 - pypeflow.simple_pwatcher_bridge:339 - DEBUG - recently_done: [(Node(0-rawreads/build), False)]
2018-07-12 10:41:05,051 - pypeflow.simple_pwatcher_bridge:340 - DEBUG - Num done in this iteration: 1
2018-07-12 10:41:05,051 - pypeflow.simple_pwatcher_bridge:354 - ERROR - Some tasks are recently_done but not satisfied: set([Node(0-rawreads/build)])
2018-07-12 10:41:05,051 - pypeflow.simple_pwatcher_bridge:355 - ERROR - ready: set([])
    submitted: set([])
2018-07-12 10:41:05,051 - pwatcher.blocking:467 - DEBUG - delete(which='known', jobids=<0>)
2018-07-12 10:41:05,051 - pwatcher.blocking:431 - ERROR - Noop. We cannot kill blocked threads. Hopefully, everything will die on SIGTERM.
2018-07-12 10:41:05,051 - pypeflow.simple_pwatcher_bridge:189 - DEBUG - In notifyTerminate(), result of delete:None
zrlewis commented 6 years ago

Have you had any luck running on SLURM? I am also having trouble, so I can't offer any real solutions, but I have a couple of suggestions to try.

Depending on your configuration, changing srun to sbatch may help. Also, you could try hard-coding your resources and partition in the submit call and see if that helps. Our SLURM scheduler does not have the --wait option enabled, so removing that may also help.
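
For what it's worth, here is a sketch of a submit string along those lines. It is only an illustration, assuming your Slurm version supports sbatch --wait (which makes sbatch block until the submitted job finishes, as the blocking pwatcher requires); the compute partition and the 4 CPUs / 8000 MB per CPU from the config above are hard-coded instead of coming from the ${JOB_QUEUE}/${NPROC}/${MB} variables:

[job.defaults]
njobs = 32
job_type = string
pwatcher_type = blocking
# hypothetical: sbatch --wait blocks until the job ends; partition and
# resources are hard-coded rather than taken from ${JOB_QUEUE}/${NPROC}/${MB}
submit = sbatch --wait       \
    -p compute               \
    -J ${JOB_NAME}           \
    -o ${JOB_STDOUT}         \
    -e ${JOB_STDERR}         \
    --mem-per-cpu=8000M      \
    --cpus-per-task=4        \
    ${JOB_SCRIPT}

Also note that a plain srun call already blocks until the step finishes, so if sbatch --wait is not available, simply dropping --wait 2 from the original srun submit string is another thing to try: srun's --wait=<seconds> only controls how long to wait after the first task exits before killing the rest, and it is not what makes the call blocking.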

pb-cdunn commented 6 years ago

Please let us know if you are able to find a reliable way to submit blocking calls to Slurm. We currently have no way to test that.

Note that this is a general Slurm problem, not a pypeFLOW (or Falcon) problem. If you are completely unable to get a blocking call to work (and you should test this in your shell, not via Falcon/pypeflow), you can try the old pwatcher_type = fs_based. That expects non-blocking calls, but it is far more complex because it watches the filesystem to learn when jobs have finished.
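
For example, a quick way to check whether a submit string really blocks is to run it by hand in a shell; the partition, resources, and sleep duration below are placeholders:

# hypothetical manual test of a blocking call, outside Falcon/pypeflow
srun -p compute -J blocktest --cpus-per-task=1 --mem-per-cpu=1000M \
    bash -c 'sleep 60; echo finished'
echo "srun returned with exit code $?"   # should print only after ~60 s

If the echo line prints immediately, the call is not blocking and pwatcher_type = blocking will not work with it. In that case a minimal fs_based setup might look like the sketch below, assuming your FALCON/pypeFLOW version supports job_type = slurm for the fs_based watcher (the exact keys can differ between releases):

[job.defaults]
pwatcher_type = fs_based
job_type = slurm
njobs = 32
NPROC = 4
MB = 8000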