Joseph7e / Assign-Taxonomy-with-BLAST

Assign taxonomy with blast, can be used for qiime

provide an example of customized database #2

Closed by bioinfo17 4 years ago

bioinfo17 commented 4 years ago

Hi,

The link to the customized NCBI taxonomy database for this script (http://cobb.unh.edu/ncbi_taxonomy_expanded.tsv.gz) is broken and the file cannot be downloaded. Could you please give an example of a customized database? What format should a customized database have so that it works with the assign_taxonomy_with_blast.py script?

Many thanks

Joseph7e commented 4 years ago

Hello, I added a script with directions to construct an updated taxonomy lookup file. Basically download the script "genbank_nodes_and_names_to_taxonomy.py" and run these commands from a terminal.

mkdir ncbi_taxonomy/ && cd ncbi_taxonomy/
wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/new_taxdump/new_taxdump.tar.gz
tar xvzf *.tar.gz
python3 genbank_nodes_and_names_to_taxonomy.py names.dmp nodes.dmp

The resulting lookup file should be good for any sequence data on NCBI.

If you construct your reference database from NCBI sequences, this file will work, but any custom database that satisfies the same data structure should also work with the script.

The custom database format is one entry per line: a seq_id, followed by a tab, followed by a semicolon-separated taxonomy string, like this:

2315627	Viruses;unknown_kingdom;unknown_phylum;unknown_subphylum;unknown_superclass;unknown_class;unknown_subclass;unknown_superorder;Caudovirales;unknown_superfamily;Siphoviridae;unknown_subfamily;unknown_genus;Bacillus phage Ray17

The sequence identifier (2315627 in the above example) should match the headers in the input reference fasta.
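
If it helps, here is a minimal sketch for checking a custom lookup file against a reference FASTA before running the script. It is not part of the repository. The function names (load_taxonomy_lookup, fasta_ids, missing_ids) are made up for illustration, and it assumes the identifier is the first whitespace-delimited token of each FASTA header, which may not be exactly how assign_taxonomy_with_blast.py matches them.

```python
import io


def load_taxonomy_lookup(handle):
    """Parse '<seq_id><TAB><rank1>;<rank2>;...' lines into {seq_id: [ranks]}."""
    lookup = {}
    for line_number, line in enumerate(handle, start=1):
        line = line.rstrip("\r\n")
        if not line:
            continue  # ignore blank lines
        seq_id, sep, taxonomy = line.partition("\t")
        if not sep or not seq_id or not taxonomy:
            raise ValueError(
                f"line {line_number}: expected '<seq_id><TAB><taxonomy>'"
            )
        # Split on ';' only, since taxon names can contain spaces.
        lookup[seq_id] = taxonomy.split(";")
    return lookup


def fasta_ids(handle):
    """Yield the first token of each FASTA header (assumed to be the seq_id)."""
    for line in handle:
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:
                yield tokens[0]


def missing_ids(fasta_handle, lookup):
    """Return FASTA identifiers that have no entry in the lookup."""
    return [seq_id for seq_id in fasta_ids(fasta_handle) if seq_id not in lookup]


# Small in-memory demo; real use would pass open() file handles instead.
demo_lookup = load_taxonomy_lookup(
    io.StringIO("2315627\tViruses;unknown_kingdom;Bacillus phage Ray17\n")
)
demo_fasta = io.StringIO(">2315627 example record\nACGT\n>999 not in lookup\nTTGA\n")
demo_missing = missing_ids(demo_fasta, demo_lookup)
```

Any identifier reported by missing_ids would have no taxonomy string available, so it is worth fixing those entries (or the FASTA headers) first.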