dib-lab / farm-notes

notes on the farm cluster

automating and running things in parallel on farm #50


ctb commented 2 years ago

hackmd for editing here: https://hackmd.io/E8EgmtZoSe-lou4ZJnpiFw?both

A brief introduction to automation and parallelization on farm, our HPC cluster.

Setup

Download some files:

mkdir queries/
cd queries/
curl -JLO https://osf.io/download/q8h97/
curl -JLO https://osf.io/download/7bzrc/
curl -JLO https://osf.io/download/3kgvd/
cd ..
mkdir -p database/
cd database/
curl -JLO https://osf.io/download/4kfv9/
cd ../

Now you should have three files in queries/

ls -1 queries/

idba.scaffold.fa.gz
megahit.final.contigs.fa.gz
spades.scaffolds.fasta.gz

and one file in database/

ls -1 database/

podar-complete-genomes-17.2.2018.tar.gz

Let's sketch the queries with sourmash:

for i in queries/*.gz
do
    sourmash sketch dna -p k=31,scaled=10000 $i -o $i.sig
done
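
Each query should now have a matching signature file next to it; a quick check:

ls -1 queries/*.sig

should list queries/idba.scaffold.fa.gz.sig, queries/megahit.final.contigs.fa.gz.sig, and queries/spades.scaffolds.fasta.gz.sig.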

Next, unpack the database and create database.zip:

cd database/
tar xzf podar*.tar.gz
sourmash sketch dna -p k=31,scaled=10000 *.fa --name-from-first -o ../database.zip
cd ../
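
If you want to double-check what went into the database, recent versions of sourmash (4.x) can summarize it:

sourmash sig summarize database.zip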

Finally, make all your inputs read-only:

chmod a-w queries/* database.zip database/*

This protects against accidentally overwriting the files.
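
If you do need to rebuild one of them later, give yourself write permission back first, e.g.:

chmod u+w database.zip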

Running your basic queries

You could now do:

sourmash gather queries/idba.scaffold.fa.gz.sig database.zip -o idba.scaffold.fa.gz.csv

sourmash gather queries/megahit.final.contigs.fa.gz.sig database.zip -o megahit.final.contigs.fa.gz.csv

sourmash gather queries/spades.scaffolds.fasta.gz.sig database.zip -o spades.scaffolds.fasta.gz.csv

but what if each query is super slow and/or big, and you have dozens or hundreds of them?

Read on!

Automation and parallelization

1. Write a shell script.

Create the following shell script:

run1.sh:

sourmash gather queries/idba.scaffold.fa.gz.sig database.zip -o idba.scaffold.fa.gz.csv

sourmash gather queries/megahit.final.contigs.fa.gz.sig database.zip -o megahit.final.contigs.fa.gz.csv

sourmash gather queries/spades.scaffolds.fasta.gz.sig database.zip -o spades.scaffolds.fasta.gz.csv

and run it:

bash run1.sh

Notes:

2. Add a for loop to your shell script.

There's a lot of duplication in the script above. Duplication leads to typos, which lead to fear, anger, hatred, and suffering.

Make a script run2.sh that contains a for loop instead.

run2.sh:

for query in queries/*.sig
do
    output=$(basename $query .sig).csv
    sourmash gather $query database.zip -o $output
done
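
If basename is unfamiliar: it strips the directory part and an optional trailing suffix, which is how each output CSV name is derived from the signature path. For example,

basename queries/idba.scaffold.fa.gz.sig .sig

prints idba.scaffold.fa.gz, so the output file becomes idba.scaffold.fa.gz.csv.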

Notes:

3. Write a for loop that creates a shell script.

Sometimes it's nice to generate a file that you can edit to fine-tune and customize things. Let's do that.

At the shell prompt, run

for query in queries/*.sig
do
    output=$(basename $query .sig).csv
    echo sourmash gather $query database.zip -o $output
done > run3.sh

This creates a file run3.sh that contains the commands to run. Neato! You could now edit this file if you wanted to change up the commands.
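
With the three queries above, run3.sh should contain something like:

sourmash gather queries/idba.scaffold.fa.gz.sig database.zip -o idba.scaffold.fa.gz.csv
sourmash gather queries/megahit.final.contigs.fa.gz.sig database.zip -o megahit.final.contigs.fa.gz.csv
sourmash gather queries/spades.scaffolds.fasta.gz.sig database.zip -o spades.scaffolds.fasta.gz.csv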

Notes:

4. Use parallel to run the commands instead.

Let's run the commands in run3.sh in parallel, instead of in serial, using GNU parallel:

parallel -j 2 < run3.sh
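
Here -j 2 tells parallel to run at most two commands at a time. Once it finishes, you should have one CSV per query:

ls -1 *.csv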

Notes:

5. Write a second shell script that takes a parameter.

Let's switch things up - let's write a generic shell script that does the gather. Note that it's the same set of commands as in the for loops above!

do-gather.sh:

output=$(basename $1 .sig).csv
sourmash gather $1 database.zip -o $output
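
Inside the script, $1 is the first argument you pass on the command line, so you can try it out on a single query first:

bash do-gather.sh queries/idba.scaffold.fa.gz.sig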

Now you can run this in a loop like so:

for i in queries/*.sig
do
    bash do-gather.sh $i
done

Notes:

6. Change the second shell script to be an sbatch script.

Make do-gather.sh look like the following. (Note that sbatch requires the first line of the script to be a #! line.)

#!/bin/bash
#SBATCH -c 1     # cpus per task
#SBATCH --mem=5Gb     # memory needed
#SBATCH --time=00-00:05:00     # time needed
#SBATCH -p med2 

output=$(basename $1 .sig).csv
sourmash gather $1 database.zip -o $output

This is now a script you can send to the HPC to run, using sbatch:

for i in queries/*.sig
do
    sbatch do-gather.sh $i
done
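
Each sbatch call submits one independent job, so the three gathers can run at the same time if the cluster has free resources. You can watch your jobs with:

squeue -u $USER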

Notes:

7. Write a snakemake file.

An alternative to all of the above is to have snakemake run things for you. Here's a simple snakefile to run things in parallel:

Snakefile:

QUERY, = glob_wildcards("queries/{q}.sig")

rule all:
    input:
        expand("{q}.csv", q=QUERY)

rule run_query:
    input:
        sig = "queries/{q}.sig",
    output:
        csv = "{q}.csv"
    shell: """
        sourmash gather {input.sig} database.zip -o {output.csv}
    """

and run it in parallel:

snakemake -j 2
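
As with parallel, -j 2 limits snakemake to two jobs at a time. A dry run is a handy way to see what snakemake would do without actually running anything:

snakemake -n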

Notes:

Strategies for testing and evaluation

1. Build around an existing example.

2. Subsample your query data.
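
For example, one rough way to build a small test query (hypothetical file names; the exact amount of data doesn't matter much for a smoke test) is to take just the first chunk of one of the FASTA files and sketch that:

mkdir -p test-queries/
zcat queries/idba.scaffold.fa.gz | head -n 10000 | gzip > test-queries/subset.fa.gz
sourmash sketch dna -p k=31,scaled=10000 test-queries/subset.fa.gz -o test-queries/subset.fa.gz.sig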

3. Test on a smaller version of your problem.

Appendix: making your shell script(s) nicer

1. Make them runnable without an explicit bash

Put #!/bin/bash at the top of your shell scripts and run chmod +x <scriptname>, and now you will be able to run them directly:

./run1.sh

2. Set error exit

Add set -e to the top of your shell script and it will stop running as soon as any command fails.
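
Putting both together, the top of run1.sh might look like this:

#!/bin/bash
set -e

sourmash gather queries/idba.scaffold.fa.gz.sig database.zip -o idba.scaffold.fa.gz.csv
# ...and so on for the other queries.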