OARS-SAFS / resources

Open and Reproducible Science Resources

Parallelizing jobs on Hyak #18

Closed: EleniLPetrou closed this issue 3 years ago

EleniLPetrou commented 3 years ago

Hi everybody!

I am taking my first baby steps into parallelizing jobs on Hyak (Klone). I have read Vince Buffalo's chapter on parallelizing tasks using the xargs command (pages 419-421 in Bioinformatics Data Skills), but I am still a bit confused about how to properly use xargs within an sbatch script. When I run a program that supports multithreading (like bowtie2), should I also use the xargs -P command? And do I ever need to specify #SBATCH --ntasks-per-node? I am having trouble seeing how all of these different puzzle pieces fit together. I would appreciate any guidance or learning resources that people might have. Thank you very much!!

I attach an example of the sbatch script I am working on, in case that helps:


#!/bin/bash
#SBATCH --job-name=elp_bowtie_V2
#SBATCH --account=merlab
#SBATCH --partition=compute-hugemem
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=20
## Walltime (hours:minutes:seconds or days-hours:minutes:seconds format)
#SBATCH --time=2:00:00
## Memory per node
#SBATCH --mem=80G
#SBATCH --mail-type=ALL
#SBATCH --mail-user=elpetrou@uw.edu

##### ENVIRONMENT SETUP ##########
DATADIR=/mmfs1/gscratch/scrubbed/elpetrou/test #directory containing the trimmed fastq files
GENOMEDIR=/gscratch/merlab/genomes/atlantic_herring #directory containing the genome
GENOME_PREFIX=GCF_900700415.1_Ch_v2.0.2 #prefix of .bt2 files made by bowtie2
SUFFIX1=_R1_001.trim.fastq # Suffix of the trimmed fastq files containing the forward (R1) reads of each paired-end sample
SUFFIX2=_R2_001.trim.fastq # Suffix of the trimmed fastq files containing the reverse (R2) reads of each paired-end sample
OUTDIR=/mmfs1/gscratch/scrubbed/elpetrou/test #where to store output (sam) files

############################################################################
cd $DATADIR

## I am trying to use xargs to parallelize this task. Some notes:
## find - search for files in a directory hierarchy
##  xargs - build and execute command lines from standard input
## basename --suffix=SUFFIX: remove a trailing SUFFIX
## xargs -I: Replaces occurrences of replace-str in the initial-arguments with names read from standard input

find *$SUFFIX1 | xargs basename --suffix=$SUFFIX1 | xargs -I{} bowtie2 \
-x $GENOMEDIR'/'$GENOME_PREFIX \
--phred33 -q \
-1 {}$SUFFIX1 \
-2 {}$SUFFIX2 \
-S {}.sam \
--very-sensitive \
--minins 0 --maxins 1500 --fr \
--threads 20 \
--rg-id {} --rg SM:{} --rg LB:{} --rg PU:Lane1 --rg PL:ILLUMINA
github-actions[bot] commented 3 years ago

Thanks so much for posting your first issue in this repo!

kubu4 commented 3 years ago

Hey, nice work!

Here're some quick answers/suggestions:

should I also use the xargs -P command?

After reading the description in the xargs manual, I would probably avoid -P (for now), since Bowtie2 can already handle multithreading on its own. It might still work, though. Feel free to try one run with it and one run without and see how the outputs compare.
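If you do experiment with -P, the main thing to keep in mind is that (number of simultaneous bowtie2 processes) x (threads per process) should not exceed the CPUs you requested from SLURM. Here is a rough sketch of the two approaches, reusing the variables from your ENVIRONMENT SETUP section (the 4 x 5 split in the second version is just an illustration, not a recommendation):

# Option A (what your script already does): one bowtie2 process at a time, 20 threads each.
find *$SUFFIX1 | xargs basename --suffix=$SUFFIX1 | xargs -I{} bowtie2 \
-x $GENOMEDIR'/'$GENOME_PREFIX \
-1 {}$SUFFIX1 -2 {}$SUFFIX2 -S {}.sam \
--threads 20

# Option B: up to four bowtie2 processes at once (-P 4), with 5 threads each,
# so the total still matches the 20 CPUs requested in the #SBATCH header.
find *$SUFFIX1 | xargs basename --suffix=$SUFFIX1 | xargs -P 4 -I{} bowtie2 \
-x $GENOMEDIR'/'$GENOME_PREFIX \
-1 {}$SUFFIX1 -2 {}$SUFFIX2 -S {}.sam \
--threads 5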

And do I ever need to specify #SBATCH --ntasks-per-node?

The simplest answer is no.

The complicated answer is that it depends on how your group uses your computing nodes. As an example, the Roberts Lab is very "selfish": we always request the maximum number of CPUs and the maximum memory available on our nodes. This is for a number of reasons, but primarily because we rarely have a lot of users submitting jobs simultaneously, so there's no real need to "share" the resources with other lab members. We also have two nodes available, which usually ensures that everyone's jobs get run, even if there's a queue for one of them.
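For reference, a common SLURM pattern for a single multithreaded program like bowtie2 is to request one task and give that task many CPUs, keeping --cpus-per-task in sync with bowtie2's --threads. A sketch of what that header could look like for your job (the CPU and memory numbers are placeholders; check what the merlab nodes actually provide):

#!/bin/bash
#SBATCH --job-name=elp_bowtie_V2
#SBATCH --account=merlab
#SBATCH --partition=compute-hugemem
#SBATCH --nodes=1
## One task (this script), allowed to use 20 CPUs.
## Keep --cpus-per-task equal to the --threads value passed to bowtie2.
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=20
#SBATCH --time=2:00:00
#SBATCH --mem=80G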

Unrelated, but here's a minor suggestion for your script above: try to put any/all inputs/outputs/parameters into variables. This will make re-using the script for other tasks later much easier (and safer). For example, add THREADS=20 to your ENVIRONMENT SETUP section and then replace --threads 20 with --threads $THREADS.
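In other words, just the lines that change (a sketch):

##### ENVIRONMENT SETUP ##########
THREADS=20 # one place to change the CPU count; keep it in sync with the #SBATCH request

# ...then, in the bowtie2 call, use the variable instead of a hard-coded number:
#    --threads $THREADS \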

EleniLPetrou commented 3 years ago

Thank you so much for your help and guidance, @kubu4 ! I appreciate it a lot! :) I will go try out some of your suggestions.