ebi-pf-team / interproscan

Genome-scale protein function classification
Apache License 2.0

Support Slurm as a cluster option in interproscan.properties #3

Closed elsherbini closed 3 years ago

elsherbini commented 8 years ago

I plan to use interproscan on a Slurm cluster and I'll be editing the interproscan.properties to work on it. I can submit a pull request with the interproscan.properties edited to include those changes so others won't have to do that step if they plan on using Slurm.

argju commented 8 years ago

Hi! Any progress on this? We have interproscan on our Slurm cluster and I'm trying to figure out how to set up cluster mode, so I'm interested in your solution. From a quick look, sbatch seems to lack an equivalent of the qsub "-b" option; maybe its "--wrap" option can be used instead?

elsherbini commented 8 years ago

When I opened this, the most recent release was 5.17. I ran into some problems and opened a ticket. Got the following reply:

Thanks for the email. There is a bug when you run InterProScan and use '-mode cluster' on SGE/SLURM. We are working on fixing this for the next release. In the meantime, you will have to submit the jobs to the cluster without using the cluster mode.

However, I see in the 5.18 changelog that there is a fix:

- Fixed issues encountered when running InterProScan in cluster mode
(--mode cluster) on SGE/SLURM.

I haven't looked into it further.

elsherbini commented 8 years ago

I found that I could process a bacterial genome in standalone mode in 2-4 hours on a single node. I used Snakemake to submit jobs on Slurm, which worked quite well.

I asked InterProScan support how best to modify the interproscan.properties file for 16 cores and got this reply:

In standalone mode, interproscan will use one node. If you have 16 processors on the node, then the best configuration would be to change the property maxnumber.of.embedded.workers=14

and use 1 processor for each analysis, e.g. hmmer3.hmmsearch.cpu.switch.tigrfam=--cpu 1

This will run a maximum of 14 analyses in parallel.
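Collected into interproscan.properties, the support advice above for a 16-core node would read as follows (tigrfam is the one analysis named in the reply; I assume the other HMMER-based analyses have analogous `*.cpu.switch.*` properties):

```properties
# 16-core node: leave 2 cores free and run up to 14 embedded workers
maxnumber.of.embedded.workers=14
# give each analysis a single CPU (tigrfam shown as the example)
hmmer3.hmmsearch.cpu.switch.tigrfam=--cpu 1
```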

This allowed me to annotate ~1000 genomes in a day on the cluster, without having to use cluster mode.

I've made a gist showing the three files I used to run InterProScan with Snakemake on the Slurm cluster:

https://gist.github.com/elsherbini/ef74373839588f2a1ba3fd5d5b8ab0d6
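For readers not using Snakemake, the same per-genome fan-out can be sketched as a plain loop that writes one sbatch command per FASTA (this is not the gist's contents; paths, resources, and filenames are placeholders):

```shell
#!/bin/sh
# Sketch: build one standalone-mode InterProScan submission per genome
# FASTA into jobs.txt, inspect it, then run `sh jobs.txt` on the cluster.
# All paths and resource values here are illustrative assumptions.
mkdir -p genomes results
: > genomes/example.fa                       # stand-in input for the demo
: > jobs.txt
for fasta in genomes/*.fa; do
  name=$(basename "$fasta" .fa)
  printf 'sbatch -c 16 --time=04:00:00 --wrap "interproscan.sh -i %s -o results/%s.tsv -f tsv"\n' \
    "$fasta" "$name" >> jobs.txt
done
cat jobs.txt
```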

argju commented 8 years ago

Thanks! Seems like standalone mode works fine for me too.

gushiro commented 7 years ago

Thank you. I'm using Slurm with:

sbatch --time=7-00:00:00 -c20 -n1 --mem-per-cpu 5000 mybash.sh

where my mybash.sh has:

path-to-interproscan/interproscan-5.23-62.0/interproscan.sh -mode cluster -i $SLURM_ARRAY_TASK_ID.fa -o $SLURM_ARRAY_TASK_ID.proscan -f tsv --goterms

I edited the interproscan.properties with maxnumber.of.embedded.workers=19 .
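One detail worth flagging about the submission above: $SLURM_ARRAY_TASK_ID is only populated for array jobs, so the sbatch line needs an --array flag. A dry-run sketch (the 1-10 range is illustrative, not from the original command):

```shell
#!/bin/sh
# Sketch: the same submission as an array job so $SLURM_ARRAY_TASK_ID
# is set inside mybash.sh. Echoed rather than executed here; the range
# 1-10 is an assumed example.
CMD='sbatch --array=1-10 --time=7-00:00:00 -c20 -n1 --mem-per-cpu 5000 mybash.sh'
echo "$CMD"
```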

Everything looks like it is running fine, but it has been 24 hours already: some of my jobs seem to keep running but with no updates to their tmp files (other jobs finished in less than 2 hours). I am also not getting any *.proscan output files, only files in the 'tmp' folder, and as I said, nothing there has been modified for almost 12 hours.

Is something wrong, or should I just wait longer? Each of my inputs has ~800 sequences.

Thank you

fungs commented 7 years ago

I would also appreciate a native SLURM backend implementation; if it is already supported, please update the documentation accordingly. In general, it feels awkward to set the number of processes in the properties file for runs with different core counts. There should be an autodetection routine that adapts to the number of assigned CPUs/cores, or an appropriate command line option.
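Pending such an option, the autodetection described above can be approximated with a small per-job wrapper that derives the worker count from Slurm's environment and patches the property before launching (a sketch: the property name comes from this thread, the "reserve 2 cores" heuristic mirrors the 16-core/14-worker advice, and the file name is a stand-in):

```shell
#!/bin/sh
# Sketch: set maxnumber.of.embedded.workers from the CPUs Slurm
# assigned to this job. SLURM_CPUS_PER_TASK is set by Slurm inside a
# job; 4 is an assumed fallback for running outside one.
CPUS="${SLURM_CPUS_PER_TASK:-4}"
WORKERS=$(( CPUS > 2 ? CPUS - 2 : 1 ))    # leave headroom, as advised above

# my.properties stands in for a per-job copy of interproscan.properties.
printf 'maxnumber.of.embedded.workers=6\n' > my.properties
sed -i "s/^maxnumber\.of\.embedded\.workers=.*/maxnumber.of.embedded.workers=${WORKERS}/" my.properties
cat my.properties
```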

mifraser commented 7 years ago

Thanks all!

@fungs I have no news to report on SLURM here, but we plan to add a command line option to override the "maxnumber.of.embedded.workers" property.

@gushiro It sounds like this is running. If you find jobs are getting stuck it's possibly because of the Gene3D post processing memory requirements, please see Issue 27.

Regarding your 800 sequences: are they protein sequences? I assume so, since you didn't use the "-t n" option. For 800 protein sequences you should be fine with the default standalone mode; the overhead of cluster mode only pays off on larger inputs.

ps-account commented 6 years ago

Anyone know the current state of this?

gsn7 commented 6 years ago

Unfortunately, we don't have a SLURM environment to test on.

fungs commented 6 years ago

@gsn7: try something like https://hub.docker.com/r/giovtorres/docker-centos7-slurm

gsn7 commented 6 years ago

thanks. we will try it out.