dib-lab / farm-notes

notes on the farm cluster

automating and running things in parallel on farm #50

Open ctb opened 2 years ago

ctb commented 2 years ago

hackmd for editing here: https://hackmd.io/E8EgmtZoSe-lou4ZJnpiFw?both

A brief introduction to automation and parallelization on farm, our HPC


Download some files:

mkdir queries/
cd queries/
curl -JLO https://osf.io/download/q8h97/
curl -JLO https://osf.io/download/7bzrc/
curl -JLO https://osf.io/download/3kgvd/
cd ..
mkdir -p database/
cd database/
curl -JLO https://osf.io/download/4kfv9/
cd ../

Now you should have three files in queries/

ls -1 queries/

idba.scaffold.fa.gz
megahit.final.contigs.fa.gz
spades.scaffolds.fasta.gz

and one file in database/

ls -1 database/


Let's sketch the queries with sourmash:

for i in queries/*.gz
do
    sourmash sketch dna -p k=31,scaled=10000 $i -o $i.sig
done

Next, unpack the database and create database.zip:

cd database/
tar xzf podar*.tar.gz
sourmash sketch dna -p k=31,scaled=10000 *.fa --name-from-first -o ../database.zip
cd ../

Finally, make all your inputs read-only:

chmod a-w queries/* database.zip database/*

This protects the files against accidental overwriting.

Running your basic queries

You could now do:

sourmash gather queries/idba.scaffold.fa.gz.sig database.zip -o idba.scaffold.fa.gz.csv

sourmash gather queries/megahit.final.contigs.fa.gz.sig database.zip -o megahit.final.contigs.fa.gz.csv

sourmash gather queries/spades.scaffolds.fasta.gz.sig database.zip -o spades.scaffolds.fasta.gz.csv

but what if each query is super slow and/or big, and you have dozens or hundreds of them?

Read on!

Automation and parallelization

1. Write a shell script.

Create the following shell script, run1.sh:


sourmash gather queries/idba.scaffold.fa.gz.sig database.zip -o idba.scaffold.fa.gz.csv

sourmash gather queries/megahit.final.contigs.fa.gz.sig database.zip -o megahit.final.contigs.fa.gz.csv

sourmash gather queries/spades.scaffolds.fasta.gz.sig database.zip -o spades.scaffolds.fasta.gz.csv

and run it:

bash run1.sh


2. Add a for loop to your shell script.

There's a lot of duplication in the script above. Duplication leads to typos, which leads to fear, anger, hatred, and suffering.

Make a script run2.sh that contains a for loop instead.


for query in queries/*.sig
do
    output=$(basename $query .sig).csv
    sourmash gather $query database.zip -o $output
done
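Before running the loop for real, it can help to preview the output names that basename produces. The sketch below is only an illustration: the function name gather_output_name is made up here, and the loop prints names rather than running sourmash.

```shell
# Hypothetical helper (not part of the original scripts): derive the CSV
# name the loop would write for a given signature file.
gather_output_name() {
    # basename strips the directory and the trailing .sig suffix
    echo "$(basename "$1" .sig).csv"
}

# Preview the query -> output mapping without running sourmash.
for query in queries/*.sig
do
    [ -e "$query" ] || continue   # skip the literal pattern if nothing matches
    echo "$query -> $(gather_output_name "$query")"
done
```

If the printed names look right, the same basename expression can be trusted inside run2.sh.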


3. Write a for loop that creates a shell script.

Sometimes it's nice to generate a file that you can edit to fine-tune and customize. Let's do that.

At the shell prompt, run

for query in queries/*.sig
do
    output=$(basename $query .sig).csv
    echo sourmash gather $query database.zip -o $output
done > run3.sh

This creates a file run3.sh that contains the commands to run. Neato! You could now edit this file if you wanted to change up the commands.
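A generated (and possibly hand-edited) command file is worth a quick sanity check before it is run. Here is a minimal sketch, assuming bash and grep are available; the helper name check_generated is invented for this example.

```shell
# Hypothetical helper: sanity-check a generated command file before running it.
check_generated() {
    # bash -n parses the file without executing any of its commands
    bash -n "$1" || return 1
    # print how many non-empty lines (commands) the file contains
    grep -c . "$1"
}

# Usage, once run3.sh exists:
#   check_generated run3.sh
```

The count should match the number of .sig files in queries/.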


4. Use parallel to run the commands instead.

Let's run the commands in run3.sh in parallel, instead of in serial, using GNU parallel:

parallel -j 2 < run3.sh
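If GNU parallel is not installed, xargs -P can serve as a rough stand-in. To be clear, this is a different tool and not what the step above uses. The function name below is made up, and the sketch assumes each line of the command file is a simple command without quotes or backslashes, since xargs interprets those itself.

```shell
# Hypothetical fallback: run each line of a command file as its own job,
# using xargs -P instead of GNU parallel.
#   $1 = command file, $2 = number of simultaneous jobs (default 2)
run_lines_in_parallel() {
    # -I{} hands one whole input line to each bash -c invocation
    xargs -P "${2:-2}" -I{} bash -c '{}' < "$1"
}

# Usage:
#   run_lines_in_parallel run3.sh 2
```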


5. Write a second shell script that takes a parameter.

Let's switch things up - let's write a generic shell script, do-gather.sh, that does the gather for a single query file passed as its first argument ($1). Note that it's the same set of commands as in the for loops above!


output=$(basename $1 .sig).csv
sourmash gather $1 database.zip -o $output

Now you can run this in a loop like so:

for i in queries/*.sig
do
    bash do-gather.sh $i
done
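A script that takes a parameter fails in confusing ways when the parameter is missing. One possible guard is sketched below; the function name and messages are illustrative, not part of the original script.

```shell
# Sketch of an argument check that could sit at the top of do-gather.sh.
check_query_arg() {
    # exactly one argument is expected: the query signature file
    if [ "$#" -ne 1 ]; then
        echo "usage: do-gather.sh <query.sig>" >&2
        return 1
    fi
    # the file must exist and be readable
    if [ ! -r "$1" ]; then
        echo "cannot read query file: $1" >&2
        return 1
    fi
}

# Inside the script it would be called as:
#   check_query_arg "$@" || exit 1
```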


6. Change the second shell script to be an sbatch script.

Make do-gather.sh look like the following.

#!/bin/bash
#SBATCH -c 1     # cpus per task
#SBATCH --mem=5Gb     # memory needed
#SBATCH --time=00-00:05:00     # time needed
#SBATCH -p med2

output=$(basename $1 .sig).csv
sourmash gather $1 database.zip -o $output

This is now a script you can send to the HPC to run, using sbatch:

for i in queries/*.sig
do
    sbatch do-gather.sh $i
done
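With many jobs in flight it helps to give each one its own name and log file. The sketch below only prints the sbatch commands so they can be inspected first. --job-name and --output are standard sbatch options, and %j expands to the Slurm job ID; the gather- prefix, the log file naming, and the helper name submit_cmd are assumptions made for this example.

```shell
# Hypothetical helper: print the sbatch command for one query, with a
# per-query job name and log file.
submit_cmd() {
    job_name=$(basename "$1" .sig)
    echo "sbatch --job-name=gather-$job_name --output=$job_name.%j.log do-gather.sh $1"
}

# Preview the submissions; pipe this loop into bash to actually submit.
for i in queries/*.sig
do
    [ -e "$i" ] || continue   # skip the literal pattern if nothing matches
    submit_cmd "$i"
done
```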


7. Write a snakemake file.

An alternative to all of the above is to have snakemake run things for you. Here's a simple Snakefile (save it under that name so snakemake finds it by default) to run things in parallel:


# find every signature in queries/ and capture its name as {q}
QUERY, = glob_wildcards("queries/{q}.sig")

rule all:
    input:
        expand("{q}.csv", q=QUERY)

rule run_query:
    input:
        sig = "queries/{q}.sig",
    output:
        csv = "{q}.csv"
    shell: """
        sourmash gather {input.sig} database.zip -o {output.csv}
    """

and run it in parallel:

snakemake -j 2


Strategies for testing and evaluation

1. Build around an existing example.

2. Subsample your query data.

3. Test on a smaller version of your problem.
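As one way to subsample, the first few records of a FASTA file can be cut out with awk. This is a sketch under the assumption that a leading slice of the file is representative enough for a smoke test. The helper name first_n_records and the output file small.fa.gz are made up for this example.

```shell
# Hypothetical helper: keep only the first N records of a FASTA stream.
# A record starts at a line beginning with ">".
first_n_records() {
    # count header lines; print lines only while the count is within N
    awk -v n="$1" '/^>/ { count++ } count <= n'
}

# Usage sketch (writes outside queries/, which is read-only):
#   gunzip -c queries/megahit.final.contigs.fa.gz | first_n_records 1000 | gzip > small.fa.gz
```

The smaller file can then be sketched and gathered exactly like the full queries.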

Appendix: making your shell script(s) nicer

1. Make them runnable without an explicit bash

Put #! /bin/bash on the first line of the shell script and run chmod +x <scriptname>; you can then run the script directly, for example with ./run1.sh instead of bash run1.sh.


2. Set error exit

Add set -e to the top of your shell script and it will stop running when there's an error.
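To see the effect, here is a small throwaway demonstration. The script contents are a made-up example, with false standing in for a command that fails.

```shell
# Write a tiny throwaway script to a temporary file.
demo=$(mktemp)
cat > "$demo" <<'EOF'
#! /bin/bash
set -e
echo "step 1"
false            # stands in for a command that fails
echo "step 2"    # never reached because of set -e
EOF

# Run it and capture the exit status; only "step 1" is printed.
status=0
bash "$demo" || status=$?
echo "exit status: $status"
```

Without the set -e line, the script would carry on to "step 2" and exit with status 0, hiding the failure.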