a-h-b / dadasnake

Amplicon sequencing workflow heavily using DADA2 and implemented in snakemake
GNU General Public License v3.0

NCBI BLAST nt database configuration with dadasnake: Example config.yaml files for use with BLAST #35

Open jonwhit opened 1 year ago

jonwhit commented 1 year ago

Hi Anna and coauthors, thanks in advance for any advice. I really like the pipeline and could use some help getting it to work with BLAST and NCBI's nt database. I am having trouble finding the correct config settings for using the NCBI nt database and taxdb as reference databases for COI.

What are the appropriate config parameters to use NCBI's nt database and taxonomy (taxdb) as reference for a marker like COI? Could you provide an example config.yaml file that uses Blast nt database as the reference db?

I am able to run the pipeline, but I am getting errors at the blastn_cluster step. The BLAST database is named 'nt', but because the NCBI nt database is so big, there is no single file named 'nt'; instead there are many files named nt.XXX. The error is reported in logs/blastn_cluster.log and appears to come from the makeblastdb step in blastn_cluster, even though the database is already built and in a local directory. I have the NCBI nt and taxdump databases installed locally, following the installation instructions from BASTA as linked in the dadasnake installation instructions.

Here are the errors I'm getting.

BLAST options error: File /home/jwhitney/dadasnake/DBs/blastdbs/nt does not exist.

log: logs/blastn_cluster.log (check log file(s) for error message)

conda-env: /home/jwhitney/programs/dadasnake/conda/66132e6a149ec730ec4c2d24861f8d4c

shell:

    if [ -s clusteredTables/consensus.fasta ]; then
      if [ ! -f "/home/jwhitney/dadasnake/DBs/blastdbs/nt.nin" ]
      then
        makeblastdb -dbtype nucl -in /home/jwhitney/dadasnake/DBs/blastdbs/nt -out /home/jwhitney/dadasnake/DBs/blastdbs/nt &> logs/blastn_cluster.log
      fi
      blastn -db /home/jwhitney/dadasnake/DBs/blastdbs/nt -query clusteredTables/consensus.fasta -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore staxids stitle" -out clusteredTables/blast_results.tsv -max_target_seqs 10 &>> logs/blastn_cluster.log
    else
      touch clusteredTables/blast_results.tsv
    fi

(one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)


And here are the relevant parts of the config.yaml

    # SETTINGS FOR TAXONOMIC ANNOTATION
    taxonomy:
      dada:
        do: TRUE
    # classification is only done, if do_taxonomy is true
    taxonomy:
      mothur:
        do: FALSE
        db_path: "/home/jwhitney/.basta/taxonomy"
        tax_db: ""
    blast:
      do: true
      # blast is only done, if do_taxonomy is true
      run_on:


Thanks in advance for any advice.

a-h-b commented 1 year ago

Hi Jonathan - sorry for the delayed answer. Can you please try checking out the rule file workflow/rules/taxonomy.smk from the GitHub repo? Up to now, dadasnake only checked for an un-chunked DB (.../nt.nin). Since you already have a BLAST DB (including .../nt.xx.nin and .../nt.nal), the new rule should now find .../nt.nal and not attempt to make a new one. Let me know if it works - ahb
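
For illustration only, the check described above can be sketched in bash roughly as follows. This is not the actual code from workflow/rules/taxonomy.smk. The helper name blast_db_present and the DB variable are hypothetical. The sketch assumes that a single-volume nucleotide database has an index file at <prefix>.nin and that a multi-volume database such as nt has an alias file at <prefix>.nal.

```shell
#!/usr/bin/env bash
# Sketch only: blast_db_present is a hypothetical helper, not dadasnake code.
# It succeeds if a nucleotide BLAST database with the given prefix exists,
# either as a single volume (<prefix>.nin) or as a multi-volume database
# described by an alias file (<prefix>.nal).
blast_db_present() {
    local prefix="$1"
    [ -f "${prefix}.nin" ] || [ -f "${prefix}.nal" ]
}

# Example guard: report whether a database would have to be built.
# DB is an assumed variable holding the database prefix (path without extension).
DB="${DB:-/home/jwhitney/dadasnake/DBs/blastdbs/nt}"
if blast_db_present "$DB"; then
    echo "found existing BLAST database at $DB"
else
    echo "no BLAST database at $DB; makeblastdb would be needed"
fi
```

With a guard like this, a pre-built chunked nt database is recognised through nt.nal, so makeblastdb is never called on a FASTA file that does not exist.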

a-h-b commented 1 year ago

Oh, and a smaller COI reference database might be an alternative, see e.g. https://onlinelibrary.wiley.com/doi/10.1111/1755-0998.13756