ctb / magsearch

Workflow and config files for searching (very) large public databases with sourmash sketches
GNU Affero General Public License v3.0

notes: running magsearch for christy #2

Open ctb opened 2 years ago

editing here: https://hackmd.io/EQG9YLZwQGOeoKWjy-fHFg

Running MAGsearch for Christy

Christy G. asked me to run MAGsearch for her, and I thought I'd document it this time!

1. sketch the genomes

I grabbed all of her genomes and then ran:

sourmash sketch dna -p k=31,scaled=1000 *

in the directory containing the FASTA files.

I then put them in a zip file:

zip -r christy-2022.09.25.zip *.sig

and transferred them to farm (our HPC).

2. unpack the sketches and generate a list

On farm, I went to my MAGsearch directory:

cd ~ctbrown/scratch/magsearch
mkdir query.christy-2022.09.25

and unzipped the sketches into the new directory:

unzip ~/transfer/christy-2022.09.25.zip -d query.christy-2022.09.25

and made a list of the files relative to the base MAGsearch directory:

ls -1 query.christy-2022.09.25/* > query.christy-2022.09.25.txt
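
Before going further it can be worth confirming that every path in the list actually points at a non-empty file, since a wrong path would only surface once the search starts. A minimal sketch of such a check follows; `check_siglist` is a hypothetical helper name, not part of MAGsearch or sourmash.

```shell
# check_siglist LISTFILE (hypothetical helper, not part of MAGsearch):
# report every path in LISTFILE that is missing or empty.
# Prints a one-line summary and returns non-zero if any path is bad.
# The body runs in a subshell, so its variables do not leak.
check_siglist() (
    total=0
    bad=0
    while IFS= read -r path || [ -n "$path" ]; do
        # skip blank lines
        [ -z "$path" ] && continue
        total=$((total + 1))
        # -s: file exists and has a size greater than zero
        if [ ! -s "$path" ]; then
            echo "missing or empty: $path" >&2
            bad=$((bad + 1))
        fi
    done < "$1"
    echo "$total paths listed, $bad missing or empty"
    [ "$bad" -eq 0 ]
)
```

Run from the base MAGsearch directory, this would be `check_siglist query.christy-2022.09.25.txt`.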

3. make a configuration file

I made a new copy of the config file:

cp config.yml config-christy-2022.09.25.yml

and then added the search-specific things:

# unique query name
query_name: christy-2022.09.25

# list of paths of query signatures - 1 or more.
query_sigs: query.christy-2022.09.25.txt

# catalog to search - list of paths of subject signatures
#catalog: /group/ctbrowngrp/sra_search/catalogs/metagenomes
catalog: catalog.sub

# containment threshold to use
threshold: 0.01

# k-mer size to use
ksize: 31

# scaled to use
scaled: 1000

# where to put the results
out_dir: "output.magsearch"

4. start an srun session

Next I started screen and ran a beefy srun:

screen -S magsearch-christy
srun -p high2 --time=48:00:00 --nodes=1 --cpus-per-task 32 --mem 50GB --pty /bin/bash

and ran a test:

snakemake -s magsearch.snakefile --configfile config-christy-2022.09.25.yml -j 32

Note that this is only a test: I'm searching a small catalog, catalog.sub, to make sure the queries etc. can all be loaded before we run the real thing for a day or two!

5. check logs for test

It looks like all went well:

% cat output.magsearch/logs/sra_search.k31.log
[2022-09-25T12:56:54Z INFO  sra_search] Loading queries
[2022-09-25T12:56:54Z INFO  sra_search] Loaded 27 query signatures
[2022-09-25T12:56:54Z INFO  sra_search] Loading siglist
[2022-09-25T12:56:54Z INFO  sra_search] Loaded 14 sig paths in siglist
[2022-09-25T12:56:54Z INFO  sra_search] Processed 0 search sigs

(The "Processed N search sigs" line is only printed every so often, so the 0 here doesn't mean nothing was searched; more than 0 search sigs were processed.)
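
One more thing the log can be used for is checking that the number of queries loaded matches the number of paths in the query list. The sketch below assumes the log line format shown above and one sketch per signature file (which should hold for the single `-p k=31,scaled=1000` parameter string used earlier); `loaded_queries` and `check_query_count` are hypothetical helper names.

```shell
# loaded_queries LOGFILE (hypothetical helper):
# print the N from the last "Loaded N query signatures" line in LOGFILE.
loaded_queries() {
    sed -n 's/.*Loaded \([0-9][0-9]*\) query signatures.*/\1/p' "$1" | tail -n 1
}

# check_query_count LOGFILE LISTFILE (hypothetical helper):
# compare the logged query count with the number of non-empty lines
# in the query list; returns non-zero if they differ.
check_query_count() (
    logged=$(loaded_queries "$1")
    listed=$(grep -c . "$2")
    echo "log: ${logged:-none} queries loaded, list: $listed paths"
    [ "$logged" = "$listed" ]
)
```

For this run that would be `check_query_count output.magsearch/logs/sra_search.k31.log query.christy-2022.09.25.txt`.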

6. run for realz

Remove test output,

rm output.magsearch/results/christy-2022.09.25.csv 

edit the config file like so:

# catalog to search - list of paths of subject signatures
catalog: /group/ctbrowngrp/sra_search/catalogs/metagenomes
#catalog: catalog.sub
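
Since the only difference between the test and the real run is which `catalog:` line is commented out, it may be worth printing the active value before starting a multi-day job. This is a minimal sketch that assumes simple top-level `key: value` lines as in the config above; `active_setting` is a hypothetical helper name.

```shell
# active_setting CONFIG KEY (hypothetical helper):
# print the value of each uncommented top-level "KEY: value" line.
# Lines starting with '#' do not match; quotes around values are kept as-is.
active_setting() {
    sed -n "s/^$2:[[:space:]]*//p" "$1"
}
```

Here `active_setting config-christy-2022.09.25.yml catalog` should print the full metagenomes catalog path rather than catalog.sub. If it prints two lines, both catalog lines are uncommented.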

and run!

snakemake -s magsearch.snakefile --configfile config-christy-2022.09.25.yml -j 32