JensUweUlrich / Taxor

Fast and space-efficient taxonomic classification of long reads
BSD 3-Clause "New" or "Revised" License

indexing NCBI NT database? #8

Open mthang opened 1 month ago

mthang commented 1 month ago

Great tool! I wonder whether Taxor can be used to index NCBI nt sequences. Based on the documentation, taxor build uses the FTP link in the third column of the input file to download the reference sequences from NCBI. What I am trying to find out is whether local NCBI nt sequences can be indexed by Taxor without the download step. This would be very useful for taxa that only have nucleotide sequences (not every taxon has whole-genome reference data).

JensUweUlrich commented 2 weeks ago

Hi @mthang

Sorry for not responding earlier. I was on vacation.

Generally, you can use any kind of reference sequence data as long as you can provide a decent taxonomy. The only thing you need to do is download the FASTA files, store them in a dedicated directory, and provide the correct file name in the metadata file. Taxor does not download the files via FTP itself; it uses the file name from the FTP path to identify the correct file in the given directory. Besides that, the metadata file has to have the correct format as described here. That means you also need to provide taxonomic information for each sequence.
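For anyone preparing local sequences this way, here is a minimal Python sketch (not part of Taxor) that checks whether every file name referenced in the metadata file is actually present in the sequence directory before building the index. It assumes the metadata file is tab-separated, that the path sits in the third column, and that the last component of that path is the file name Taxor looks for. The helper names `expected_filename` and `find_missing_files` are made up for illustration.

```python
import csv
import posixpath
from pathlib import Path
from urllib.parse import urlparse


def expected_filename(ftp_path: str) -> str:
    """Return the last path component of an FTP URL (or of a plain path)."""
    path = urlparse(ftp_path.strip()).path
    return posixpath.basename(path.rstrip("/"))


def find_missing_files(metadata_file, sequence_dir, path_column=2):
    """List file names referenced in the metadata but absent from sequence_dir.

    Assumption: tab-separated metadata, path in column index 2 (third column).
    """
    sequence_dir = Path(sequence_dir)
    missing = []
    with open(metadata_file, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            # Skip blank or short lines that carry no path field.
            if len(row) <= path_column or not row[path_column].strip():
                continue
            name = expected_filename(row[path_column])
            if not (sequence_dir / name).is_file():
                missing.append(name)
    return missing
```

Calling `find_missing_files("metadata.tsv", "refseqs/")` (both names are placeholders) returns the file names that would not be found. An empty list means every entry in the third column has a matching file in the directory.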