bioinformatics-centre / kaiju

Fast taxonomic classification of metagenomic sequencing reads using a protein reference database
http://kaiju.binf.ku.dk
GNU General Public License v3.0

custom database creation question #204

Open Valentin-Bio-zz opened 2 years ago

Valentin-Bio-zz commented 2 years ago

Hello, I want to build my own database (GTDB + my own MAGs). I used Prodigal to convert my nucleotide FASTA files to protein FASTA files. As far as I can see, Prodigal puts the contig name output by the assembler in the first column of each FASTA header, followed by information produced by Prodigal itself. The issue is that building a custom Kaiju database requires replacing the protein FASTA headers with NCBI taxon identifier numbers. Should I write my own script to assign the NCBI taxon identifiers?

Let's say that one protein FASTA header is the following:

k141_811263_4 # 1653 # 1775 # -1 # ID=1_4;partial=00;start_type=ATG;rbs_motif=AGGA;rbs_spacer=5-10bp;gc_cont=0.228

Here k141_811263 corresponds to the genome identifier, and the "_4" suffix is the contig number within the draft genome.

The genome k141_811263 has previously been classified with GTDB, so taxonomic information is available for it (domain, phylum, class, order, family, genus, species).

So I have to extract that classification info and match it with the NCBI taxon identifier number?

pmenzel commented 2 years ago

Yes, Kaiju expects the NCBI taxon identifier in the sequence name. Maybe there is already a mapping somewhere between GTDB taxon names and NCBI taxon IDs that can be used.
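
For illustration, the header rewriting could be sketched in Python roughly as follows. This is a minimal sketch, not part of Kaiju. It assumes you already have a lookup table from the part of the Prodigal sequence name before the last underscore (k141_811263 in the example above) to an NCBI taxon ID. The function names `name_prefix` and `relabel` are hypothetical, as is the `seqname_to_taxid` dictionary. The output header form `>counter_taxid` follows my reading of the custom database section of the Kaiju README, so please check it against the version you are using.

```python
def name_prefix(header: str) -> str:
    """Return the Prodigal sequence name without its trailing '_<number>' suffix."""
    seq_id = header.lstrip(">").split()[0]   # e.g. 'k141_811263_4'
    return seq_id.rsplit("_", 1)[0]          # e.g. 'k141_811263'


def relabel(lines, seqname_to_taxid):
    """Yield FASTA lines with headers rewritten to '>counter_taxid'.

    seqname_to_taxid is a hypothetical dict, e.g. built from your GTDB
    classification plus a GTDB-to-NCBI mapping. Records whose prefix is
    missing from the dict are skipped entirely.
    """
    counter = 0
    keep = False
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            taxid = seqname_to_taxid.get(name_prefix(line))
            keep = taxid is not None
            if keep:
                counter += 1
                yield f">{counter}_{taxid}"
        elif keep and line:
            yield line


if __name__ == "__main__":
    # 1358 is only a placeholder taxon ID for this demonstration.
    demo = [
        ">k141_811263_4 # 1653 # 1775 # -1 # ID=1_4;partial=00",
        "MKKLLAV",
    ]
    for out_line in relabel(demo, {"k141_811263": 1358}):
        print(out_line)
```

Skipping unmapped records rather than guessing a taxon ID is a deliberate choice here; you may prefer to collect the unmapped names and review them separately.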