# A brief introduction to automation and parallelization on farm, our HPC
hackmd for editing here: https://hackmd.io/E8EgmtZoSe-lou4ZJnpiFw?both
## Setup

Download some files:
```
mkdir queries/
cd queries/
curl -JLO https://osf.io/download/q8h97/
curl -JLO https://osf.io/download/7bzrc/
curl -JLO https://osf.io/download/3kgvd/
cd ..

mkdir -p database/
cd database/
curl -JLO https://osf.io/download/4kfv9/
cd ../
```
Now you should have three files in `queries/`:

```
ls -1 queries/
```

which should show:

```
idba.scaffold.fa.gz
megahit.final.contigs.fa.gz
spades.scaffolds.fasta.gz
```
and one file in `database/`:

```
ls -1 database/
```

which should show:

```
podar-complete-genomes-17.2.2018.tar.gz
```
Let's sketch the queries with sourmash:
```
for i in queries/*.gz
do
    sourmash sketch dna -p k=31,scaled=10000 $i -o $i.sig
done
```
Next, unpack the database and create `database.zip`:

```
cd database/
tar xzf podar*.tar.gz
sourmash sketch dna -p k=31,scaled=10000 *.fa --name-from-first -o ../database.zip
cd ../
```
Finally, make all your inputs read-only:
```
chmod a-w queries/* database.zip database/*
```
This protects against accidentally overwriting the files.
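For example (a hypothetical check, not part of the original commands; the exact error text depends on your shell), trying to overwrite one of these files should now fail:

```
$ echo oops > database.zip
bash: database.zip: Permission denied
```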
## Running your basic queries

You could now do:
```
sourmash gather queries/idba.scaffold.fa.gz.sig database.zip -o idba.scaffold.fa.gz.csv
sourmash gather queries/megahit.final.contigs.fa.gz.sig database.zip -o megahit.final.contigs.fa.gz.csv
sourmash gather queries/spades.scaffolds.fasta.gz.sig database.zip -o spades.scaffolds.fasta.gz.csv
```
but what if each query is super slow and/or big, and you have dozens or hundreds of them?
Read on!
## Automation and parallelization

### 1. Write a shell script.

Create the following shell script, `run1.sh`:

```
sourmash gather queries/idba.scaffold.fa.gz.sig database.zip -o idba.scaffold.fa.gz.csv
sourmash gather queries/megahit.final.contigs.fa.gz.sig database.zip -o megahit.final.contigs.fa.gz.csv
sourmash gather queries/spades.scaffolds.fasta.gz.sig database.zip -o spades.scaffolds.fasta.gz.csv
```

and run it:

```
bash run1.sh
```
### 2. Add a for loop to your shell script.

There's a lot of duplication in the script above. Duplication leads to typos, which lead to fear, anger, hatred, and suffering.

Make a script `run2.sh` that contains a for loop instead.

`run2.sh`:

```
for query in queries/*.sig
do
    output=$(basename $query .sig).csv
    sourmash gather $query database.zip -o $output
done
```
Notes:

- this does the same thing as `run1.sh` - any `for` loop would work :)
- the `output=` line uses `basename` to remove the `queries/` prefix and `.sig` suffix from each query filename.
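For instance, here is a quick sketch of what `basename` does with one of the query files above:

```
$ basename queries/idba.scaffold.fa.gz.sig .sig
idba.scaffold.fa.gz
```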
### 3. Write a for loop that creates a shell script.

Sometimes it's nice to generate a file that you can edit to fine-tune and customize. Let's do that.
At the shell prompt, run
```
for query in queries/*.sig
do
    output=$(basename $query .sig).csv
    echo sourmash gather $query database.zip -o $output
done > run3.sh
```
This creates a file run3.sh that contains the commands to run. Neato! You could now edit this file if you wanted to change up the commands.
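You can sanity-check the generated file before running it; with the three query sketches above, the output should look roughly like this (a sketch - the exact lines depend on what's in `queries/`):

```
$ cat run3.sh
sourmash gather queries/idba.scaffold.fa.gz.sig database.zip -o idba.scaffold.fa.gz.csv
sourmash gather queries/megahit.final.contigs.fa.gz.sig database.zip -o megahit.final.contigs.fa.gz.csv
sourmash gather queries/spades.scaffolds.fasta.gz.sig database.zip -o spades.scaffolds.fasta.gz.csv
```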
### 4. Use `parallel` to run the commands instead.
Let's run the commands in run3.sh in parallel, instead of in serial, using GNU parallel:
```
parallel -j 2 < run3.sh
```
Notes:

- depending on `-j`, this can be much faster - here, twice as fast!
- `parallel` runs each line on its own. So if you have multiple things you want to run in each parallel session, you need to do something different (see the sketch below).
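One common workaround (an illustrative sketch, not from the original tutorial; `run4.sh` and the `gzip` step are made up here) is to chain the commands with `&&`, so that each line handed to `parallel` is still a single unit:

```
for query in queries/*.sig
do
    output=$(basename $query .sig).csv
    # each echo'd line is one parallel job: run gather, then compress the result
    echo "sourmash gather $query database.zip -o $output && gzip $output"
done > run4.sh

parallel -j 2 < run4.sh
```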
### 5. Write a second shell script that takes a parameter.

Let's switch things up - let's write a generic shell script that does the gather. Note that it's the same set of commands as in the for loops above!
`do-gather.sh`:

```
output=$(basename $1 .sig).csv
sourmash gather $1 database.zip -o $output
```
Now you can run this in a loop like so:
```
for i in queries/*.sig
do
    bash do-gather.sh $i
done
```
Notes:

- `$1` is the first command-line parameter after the shell script name.
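For example, running it by hand on a single query (a sketch using one of the files downloaded above):

```
bash do-gather.sh queries/idba.scaffold.fa.gz.sig
# => writes idba.scaffold.fa.gz.csv to the current directory
```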
### 6. Change the second shell script to be an sbatch script.

Make `do-gather.sh` look like the following:
```
#SBATCH -c 1                 # cpus per task
#SBATCH --mem=5Gb            # memory needed
#SBATCH --time=00-00:05:00   # time needed
#SBATCH -p med2

output=$(basename $1 .sig).csv
sourmash gather $1 database.zip -o $output
```
This is now a script you can send to the HPC to run, using `sbatch`:

```
for i in queries/*.sig
do
    sbatch do-gather.sh $i
done
```

Notes:

- `do-gather.sh` is still a bash script, so you can still run it that way.
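Once the jobs are submitted, you can check on them (a sketch; `squeue` is standard SLURM, though your cluster's output columns may differ, and by default each job's output lands in a `slurm-<jobid>.out` file):

```
squeue -u $USER
```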
### 7. Write a snakemake file.

An alternative to all of the above is to have snakemake run things for you. Here's a simple snakefile to run things in parallel:
`Snakefile`:

```
QUERY, = glob_wildcards("queries/{q}.sig")

rule all:
    input:
        expand("{q}.csv", q=QUERY)

rule run_query:
    input:
        sig = "queries/{q}.sig",
    output:
        csv = "{q}.csv"
    shell: """
        sourmash gather {input.sig} database.zip -o {output.csv}
    """
```
and run it in parallel:
```
snakemake -j 2
```
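Before running for real, a dry run is often handy (a sketch; `-n` asks snakemake to print what it would do without doing it):

```
snakemake -n
```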
## Strategies for testing and evaluation

1. Build around an existing example.
2. Subsample your query data.
3. Test on a smaller version of your problem (see the sketch below for one way to do that here).
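For instance (an illustrative sketch, not from the original tutorial), one way to shrink this particular problem is to sketch at a coarser `scaled` value - the signatures are much smaller and `gather` runs much faster, which is plenty for checking that your scripts work:

```
# coarser test sketches; discard these once the pipeline works
for i in queries/*.gz
do
    sourmash sketch dna -p k=31,scaled=100000 $i -o $i.test.sig
done
```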
## Appendix: making your shell script(s) nicer

### 1. Make them runnable without an explicit `bash`

Put `#! /bin/bash` at the top of the shell script and run `chmod +x <scriptname>`, and now you will be able to run them directly:

```
./run1.sh
```

### 2. Set error exit

Add `set -e` to the top of your shell script and it will stop running when there's an error.
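A minimal sketch of what that buys you (hypothetical filename; the point is that the second command never runs once the first one fails):

```
#! /bin/bash
set -e

sourmash gather queries/does-not-exist.sig database.zip -o out.csv   # fails, so the script exits here
echo "this line is never reached"
```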