clark-lab / ngsane

Analysis Framework for Biological Data from High Throughput Sequencing Experiments

SGE Job Arrays #13

Open noncodo opened 11 years ago

noncodo commented 11 years ago

I would like to test/implement an enhancement where, instead of submitting (for example) 12 jobs to the SGE queue when NGSANE'ing 12 fastq files, the homologous jobs are launched in an array. One array per fastq/subdir would be a much cleaner and more efficient way of executing this. Any thoughts/concerns?

allPowerde commented 11 years ago

Yes, it would be cleaner. However, while for some tasks (e.g. mapping) not much changes between files, for other tasks (e.g. snp-calling) there might be multiple variables that need to be passed to the actual program call. What would your approach be for that? E.g. are you planning to list all the necessary variables in the qout/task/runnow.txt file, which is then resolved in each submitted array job?


noncodo commented 11 years ago

Instead of qsub'ing each sample in the project folder independently, you just do something like "qsub -t 1-$(wc -l < runnow.tmp) script.sh" and use "head -n $SGE_TASK_ID runnow.tmp | tail -n 1" in the script to pick out the file or the specific parameters for that task. Ideally, identical or highly similar tasks should be performed in an array job, i.e. each sample within the project folder. I'll try to implement something in a branch of 0.0.1.
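A minimal sketch of the array-job idea described here, assuming SGE's `-t` array option and the `SGE_TASK_ID` variable that SGE sets for each array task. The file name `runnow.tmp` is taken from the comment; the tab-separated layout (input file first, task-specific parameters after) and the helper name `task_line` are hypothetical, chosen only to show how several variables per task could be carried in one manifest line.

```shell
#!/bin/sh
# script.sh: hypothetical array-job wrapper, one manifest line per task.
# Submit with something like:
#   qsub -t 1-$(wc -l < runnow.tmp) script.sh

# Print line number $2 (1-based) of manifest file $1.
task_line() {
    sed -n "${2}p" "$1"
}

# SGE sets SGE_TASK_ID inside each array task; fall back to 1 for a dry run
# outside the queue. MANIFEST is an assumed override hook, not an NGSANE variable.
MANIFEST="${MANIFEST:-runnow.tmp}"
TASK_ID="${SGE_TASK_ID:-1}"

if [ -r "$MANIFEST" ]; then
    LINE=$(task_line "$MANIFEST" "$TASK_ID")
    # Assumed layout: <input file><TAB><extra parameters...>
    INPUT=$(printf '%s\n' "$LINE" | cut -f1)
    PARAMS=$(printf '%s\n' "$LINE" | cut -s -f2-)
    echo "task $TASK_ID: input=$INPUT params=$PARAMS"
fi
```

Using `sed -n "${N}p"` is equivalent to the `head | tail` pipeline and reads the manifest once; either form would do.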
For true MPI jobs with independent input, low RAM and high CPU demands, you can leverage your $MAX_CPU quota on a cluster by splitting your input file (e.g. fastq) into ≥$MAX_CPU files, then running an array job with one task per file per CPU. However, this would drastically reduce your coffee break / sword fighting / NGSANE mod development time.
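The splitting step could look roughly like the sketch below. It assumes an unwrapped FASTQ file (exactly four lines per record) and uses the standard `split -l`; the helper name `chunk_lines` and the file names in the usage comment are hypothetical. Rounding the records per chunk down means the file is cut into at least as many chunks as requested (provided it holds at least that many records), matching the "≥$MAX_CPU files" suggestion.

```shell
#!/bin/sh
# Hypothetical helper: given the total line count of a FASTQ file ($1) and the
# desired minimum number of chunks ($2), print a lines-per-chunk value that is
# a multiple of 4, so no 4-line record is cut in half.
chunk_lines() {
    records=$(( $1 / 4 ))
    per_chunk=$(( records / $2 ))        # floor: yields at least $2 chunks
    if [ "$per_chunk" -lt 1 ]; then
        per_chunk=1                      # fewer records than chunks requested
    fi
    echo $(( per_chunk * 4 ))
}

# Usage (placeholder file names; MAX_CPU is whatever your quota allows):
#   total=$(wc -l < sample.fastq)
#   split -l "$(chunk_lines "$total" "$MAX_CPU")" sample.fastq sample.chunk.
#   ls sample.chunk.* > runnow.tmp
#   qsub -t 1-$(wc -l < runnow.tmp) script.sh
```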