cdanielmachado / carveme

CarveMe: genome-scale metabolic model reconstruction

Running CarveMe on HPC cluster #34

Closed rdmtinez closed 5 years ago

rdmtinez commented 5 years ago

Greetings,

I noticed that when submitting a batch job to our cluster like:

bsub -q queue carve -r --dna ./data/input_*.fna -o ./output/

where queue is a queue with limits (e.g. on memory, threads, processes), CarveMe does not obey the imposed limits and spawns too many processes (sorry, my lingo in this field is lacking), so the entire job is killed by the management system.

Errors look like this:

Your job looked like:

------------------------------------------------------------
# LSBATCH: User input
carve -r -v --dna ./assemblies/LjRoot16.fna ./assemblies/LjRoot160.fna ./assemblies/LjRoot161.fna ./assemblies/LjRoot162.fna ./assemblies/LjRoot163.fna ./assemblies/LjRoot164.fna ./assemblies/LjRoot165.fna ./assemblies/LjRoot166.fna ./assemblies/LjRoot167.fna ./assemblies/LjRoot168.fna ./assemblies/LjRoot169.fna -o ./carve_output/
------------------------------------------------------------

TERM_THREADLIMIT: job killed after reaching LSF thread limit.
Exited with exit code 143.

Resource usage summary:

    CPU time   :      8.07 sec.
    Max Memory :      5719 MB
    Max Swap   :    422528 MB

    Max Processes  :       142
    Max Threads    :       744

IBM says the following about the exit code:

Exit codes less than 128 relate to application exit values, while exit codes greater than 128 relate to system signal exit values (LSF adds 128 to system values).

I talked to IT and they said it was probably because CarveMe does not support communicating with LSF... is this something that is planned for a future release? Just curious.

p.s. I also tried running the job that produced the above error on my Lenovo P910 (8 GB RAM, 8 GB swap) running Ubuntu... is this also an example of CarveMe not taking the limits of my machine into consideration? And is that something that can be fixed and contributed to?

Best regards,

Ricardo Martinez

cdanielmachado commented 5 years ago

The -r option uses the multiprocessing module (https://docs.python.org/2/library/multiprocessing.html), which is not very sophisticated: it simply launches multiple processes at once (as many as possible, I suppose).

We don't have any support for specific cluster systems. I have experience running CarveMe with LSF and Slurm (and migrating between the two).

The most efficient way to run CarveMe on a cluster is to submit your jobs as a job array (most cluster systems support job arrays) instead of using the -r option. That allows the cluster system to allocate each job efficiently and manage the number of jobs running at any given time. Just use the job-index environment variable to select the genome file each job will use.
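As an illustration only (queue name, paths, and output naming are placeholders, not an official recipe), an LSF job array over the genomes above might look like this. LSF exposes the array index of each task as `$LSB_JOBINDEX`:

```shell
#!/bin/bash
# carve_array.sh -- one CarveMe run per genome via an LSF job array.
# Submit with (11 genomes in this sketch):
#   bsub -q queue -J "carve[1-11]" < carve_array.sh

FILES=(./assemblies/*.fna)                  # one genome per array task
GENOME=${FILES[$((LSB_JOBINDEX - 1))]}      # LSB_JOBINDEX is 1-based

carve -v --dna "$GENOME" -o "./carve_output/$(basename "$GENOME" .fna).xml"
```

This way the scheduler starts each task within the queue's limits, instead of one job forking a pool of workers on a single node.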

rdmtinez commented 5 years ago

Thanks for that info!