dib-lab / farm-notes

notes on the farm cluster

automating and running things in parallel on farm #50


ctb commented 2 years ago

hackmd for editing here: https://hackmd.io/E8EgmtZoSe-lou4ZJnpiFw?both

A brief introduction to automation and parallelization on farm, our HPC cluster.

Setup

Download some files:

mkdir queries/
cd queries/
curl -JLO https://osf.io/download/q8h97/
curl -JLO https://osf.io/download/7bzrc/
curl -JLO https://osf.io/download/3kgvd/
cd ..
mkdir -p database/
cd database/
curl -JLO https://osf.io/download/4kfv9/
cd ../

Now you should have three files in queries/

ls -1 queries/

idba.scaffold.fa.gz
megahit.final.contigs.fa.gz
spades.scaffolds.fasta.gz

and one file in database/

ls -1 database/

podar-complete-genomes-17.2.2018.tar.gz

Let's sketch the queries with sourmash:

for i in queries/*.gz
do
    sourmash sketch dna -p k=31,scaled=10000 $i -o $i.sig
done
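
Each query should now have a matching signature file next to it; a quick check:

ls -1 queries/*.sig

should list queries/idba.scaffold.fa.gz.sig, queries/megahit.final.contigs.fa.gz.sig, and queries/spades.scaffolds.fasta.gz.sig.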

Next, unpack the database and create database.zip:

cd database/
tar xzf podar*.tar.gz
sourmash sketch dna -p k=31,scaled=10000 *.fa --name-from-first -o ../database.zip
cd ../
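
If you want to double-check what went into the database, recent versions of sourmash (4.x) can summarize it:

sourmash sig summarize database.zip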

Finally, make all your inputs read-only:

chmod a-w queries/* database.zip database/*

This protects against accidentally overwriting the files.
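
If you do need to rebuild one of them later, give yourself write permission back first, e.g.:

chmod u+w database.zip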

Running your basic queries

You could now do:

sourmash gather queries/idba.scaffold.fa.gz.sig database.zip -o idba.scaffold.fa.gz.csv

sourmash gather queries/megahit.final.contigs.fa.gz.sig database.zip -o megahit.final.contigs.fa.gz.csv

sourmash gather queries/spades.scaffolds.fasta.gz.sig database.zip -o spades.scaffolds.fasta.gz.csv

but what if each query is super slow and/or big, and you have dozens or hundreds of them?

Read on!

Automation and parallelization

1. Write a shell script.

Create the following shell script:

run1.sh:

sourmash gather queries/idba.scaffold.fa.gz.sig database.zip -o idba.scaffold.fa.gz.csv

sourmash gather queries/megahit.final.contigs.fa.gz.sig database.zip -o megahit.final.contigs.fa.gz.csv

sourmash gather queries/spades.scaffolds.fasta.gz.sig database.zip -o spades.scaffolds.fasta.gz.csv

and run it:

bash run1.sh

Notes:

2. Add a for loop to your shell script.

There's a lot of duplication in the script above. Duplication leads to typos, which lead to fear, anger, hatred, and suffering.

Make a script run2.sh that contains a for loop instead.

run2.sh:

for query in queries/*.sig
do
    output=$(basename $query .sig).csv
    sourmash gather $query database.zip -o $output
done
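
If basename is unfamiliar: it strips the directory part and an optional trailing suffix, which is how each output CSV name is derived from the signature path. For example,

basename queries/idba.scaffold.fa.gz.sig .sig

prints idba.scaffold.fa.gz, so the output file becomes idba.scaffold.fa.gz.csv.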

Notes:

3. Write a for loop that creates a shell script.

Sometimes it's nice to generate a file that you can edit to fine-tune and customize things. Let's do that.

At the shell prompt, run

for query in queries/*.sig
do
    output=$(basename $query .sig).csv
    echo sourmash gather $query database.zip -o $output
done > run3.sh

This creates a file run3.sh that contains the commands to run. Neato! You could now edit this file if you wanted to change up the commands.
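
With the three queries above, run3.sh should contain something like:

sourmash gather queries/idba.scaffold.fa.gz.sig database.zip -o idba.scaffold.fa.gz.csv
sourmash gather queries/megahit.final.contigs.fa.gz.sig database.zip -o megahit.final.contigs.fa.gz.csv
sourmash gather queries/spades.scaffolds.fasta.gz.sig database.zip -o spades.scaffolds.fasta.gz.csv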

Notes:

4. Use parallel to run the commands instead.

Let's run the commands in run3.sh in parallel, instead of in serial, using GNU parallel:

parallel -j 2 < run3.sh
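
Here -j 2 tells parallel to run at most two commands at a time. Once it finishes, you should have one CSV per query:

ls -1 *.csv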

Notes:

5. Write a second shell script that takes a parameter.

Let's switch things up - let's write a generic shell script that does the gather. Note that it's the same set of commands as in the for loops above!

do-gather.sh:

output=$(basename $1 .sig).csv
sourmash gather $1 database.zip -o $output
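
Inside the script, $1 is the first argument you pass on the command line, so you can try it out on a single query first:

bash do-gather.sh queries/idba.scaffold.fa.gz.sig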

Now you can run this in a loop like so:

for i in queries/*.sig
do
    bash do-gather.sh $i
done

Notes:

6. Change the second shell script to be an sbatch script.

Make do-gather.sh look like the following. (Note that sbatch requires the first line of the script to be a #! line.)

#!/bin/bash
#SBATCH -c 1     # cpus per task
#SBATCH --mem=5Gb     # memory needed
#SBATCH --time=00-00:05:00     # time needed
#SBATCH -p med2 

output=$(basename $1 .sig).csv
sourmash gather $1 database.zip -o $output

This is now a script you can send to the HPC to run, using sbatch:

for i in queries/*.sig
do
    sbatch do-gather.sh $i
done
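
Each sbatch call submits one independent job, so the three gathers can run at the same time if the cluster has free resources. You can watch your jobs with:

squeue -u $USER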

Notes:

7. Write a snakemake file.

An alternative to all of the above is to have snakemake run things for you. Here's a simple snakefile to run things in parallel:

Snakefile:

QUERY, = glob_wildcards("queries/{q}.sig")

rule all:
    input:
        expand("{q}.csv", q=QUERY)

rule run_query:
    input:
        sig = "queries/{q}.sig",
    output:
        csv = "{q}.csv"
    shell: """
        sourmash gather {input.sig} database.zip -o {output.csv}
    """

and run it in parallel:

snakemake -j 2
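
As with parallel, -j 2 limits snakemake to two jobs at a time. A dry run is a handy way to see what snakemake would do without actually running anything:

snakemake -n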

Notes:

Strategies for testing and evaluation

1. Build around an existing example.

2. Subsample your query data.
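
For example, one rough way to build a small test query (hypothetical file names; the exact amount of data doesn't matter much for a smoke test) is to take just the first chunk of one of the FASTA files and sketch that:

mkdir -p test-queries/
zcat queries/idba.scaffold.fa.gz | head -n 10000 | gzip > test-queries/subset.fa.gz
sourmash sketch dna -p k=31,scaled=10000 test-queries/subset.fa.gz -o test-queries/subset.fa.gz.sig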

3. Test on a smaller version of your problem.

Appendix: making your shell script(s) nicer

1. Make them runnable without an explicit bash

Put #!/bin/bash at the top of your shell scripts and run chmod +x <scriptname>, and now you will be able to run them directly:

./run1.sh

2. Set error exit

Add set -e to the top of your shell script and it will stop running as soon as any command fails.
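
Putting both together, the top of run1.sh might look like this:

#!/bin/bash
set -e

sourmash gather queries/idba.scaffold.fa.gz.sig database.zip -o idba.scaffold.fa.gz.csv
# ...and so on for the other queries.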