Hi! Any progress on this? We have InterProScan on our SLURM cluster and I'm trying to figure out how to set up cluster mode, so I'm interested in your solution. I just took a quick look; sbatch seems to lack an equivalent of the qsub "-b" option, but maybe the "--wrap" option can be used?
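Something along these lines, perhaps (untested on my side; the resource values and paths are just placeholders):

# Submit the interproscan.sh command directly via sbatch --wrap,
# as a rough analogue of qsub -b y (no wrapper script needed).
sbatch --time=24:00:00 -c 16 --mem=32G \
    --wrap="path-to-interproscan/interproscan.sh -i proteins.fa -f tsv --goterms"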
When I opened this, the most recent release was 5.17. I ran into some problems and opened a ticket. Got the following reply:
Thanks for the email. There is a bug when you run InterProScan and use '-mode cluster' on SGE/SLURM. We are working on fixing this for the next release. In the meantime, you will have to submit the jobs to the cluster without using the cluster mode.
However, I see in the 5.18 changelog that there is a fix:
- Fixed issues encountered when running InterProScan in cluster mode
(--mode cluster) on SGE/SLURM.
I haven't looked into it further.
I found that I could do a bacterial genome in standalone mode in 2-4 hours on a node. I used snakemake to submit jobs on slurm which worked quite well.
I asked InterProScan support how best to modify the interproscan.properties file for 16 cores:
In standalone mode, interproscan will use one node. If you have 16 processors on the node, then the best configuration would be to change the property
maxnumber.of.embedded.workers=14
and use 1 processor for each analysis, e.g.
hmmer3.hmmsearch.cpu.switch.tigrfam=--cpu 1
This will run a maximum of 14 analyses in parallel.
This allowed me to annotate ~1000 genomes in a day on the cluster, without having to use cluster mode.
I've made a gist showing the three files I used to run interproscan using snakemake on the slurm cluster:
https://gist.github.com/elsherbini/ef74373839588f2a1ba3fd5d5b8ab0d6
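For anyone not using snakemake, the rough shape of the per-genome submission in plain bash would be something like this (a simplified sketch, not the actual contents of the gist; paths and resource values are placeholders):

# Submit one standalone-mode InterProScan job per genome FASTA.
for fasta in proteins/*.faa; do
    name=$(basename "$fasta" .faa)
    sbatch -c 16 --mem=32G --time=4:00:00 \
        --wrap="path-to-interproscan/interproscan.sh -i $fasta -o interproscan/$name.tsv -f tsv --goterms"
done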
Thanks! Seems like standalone mode works fine for me too.
Thank you. I am using SLURM with:
sbatch --time=7-00:00:00 -c20 -n1 --mem-per-cpu 5000 mybash.sh
where mybash.sh contains:
path-to-interproscan/interproscan-5.23-62.0/interproscan.sh -mode cluster -i $SLURM_ARRAY_TASK_ID.fa -o $SLURM_ARRAY_TASK_ID.proscan -f tsv --goterms
I edited interproscan.properties to set maxnumber.of.embedded.workers=19.
Everything looks like it is running fine, but it has already been 24 hours and some of my jobs seem to keep running with no updates to their tmp files (some jobs finished in less than 2 hours). I am also not getting any *.proscan output files, only files in the 'tmp' folder, and as I said, nothing there has been updated for almost 12 hours.
Is something wrong, or should I just wait longer? Each of my inputs has ~800 sequences.
Thank you
I would also appreciate a native SLURM backend implementation, and if it is already supported, please update the documentation accordingly. In general, it feels awkward to set the number of processes in the properties file for runs with different numbers of cores. There should be an autodetection routine that adapts to the number of assigned CPUs/cores, or an appropriate command line option.
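Until something like that exists, a rough workaround is to patch the property from the job script itself (a sketch only; it assumes each user or job works from its own copy of the InterProScan installation, so editing its interproscan.properties is safe):

# Derive the embedded worker count from the CPUs SLURM assigned to this job,
# leaving a couple of cores free for the master process.
WORKERS=$(( ${SLURM_CPUS_PER_TASK:-4} - 2 ))
# Patch the per-copy properties file in place before launching.
sed -i "s/^maxnumber.of.embedded.workers=.*/maxnumber.of.embedded.workers=${WORKERS}/" \
    my-interproscan-copy/interproscan.properties
my-interproscan-copy/interproscan.sh -i proteins.fa -f tsv --goterms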
Thanks all!
@fungs I have no news to report on SLURM here, but we plan to add a command line option to override the "maxnumber.of.embedded.workers" property.
@gushiro It sounds like this is running. If you find jobs are getting stuck, it's possibly because of the Gene3D post-processing memory requirements; please see issue 27.
As for your 800 sequences: are they protein sequences? I would assume yes, as you didn't use the "-t n" option. For 800 protein sequences you should be fine with the default standalone mode; the overhead of cluster mode only pays off on larger inputs.
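Concretely, that would just mean dropping "-mode cluster" from the command in mybash.sh, e.g.:

# Standalone mode (the default); everything else stays the same.
path-to-interproscan/interproscan-5.23-62.0/interproscan.sh -i $SLURM_ARRAY_TASK_ID.fa -o $SLURM_ARRAY_TASK_ID.proscan -f tsv --goterms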
Does anyone know the current state?
Unfortunately we don't have a SLURM environment to test with.
@gsn7: try something like https://hub.docker.com/r/giovtorres/docker-centos7-slurm
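For a quick local test, starting an interactive container should be enough (from memory; check the image's README for the recommended flags):

# Single-node SLURM inside a container, for testing submission scripts.
docker run -it giovtorres/docker-centos7-slurm:latest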
Thanks, we will try it out.
I plan to use InterProScan on a Slurm cluster and will be editing interproscan.properties to make it work there. I can submit a pull request with the edited interproscan.properties so that others planning to use Slurm won't have to do that step themselves.