bioinformatics-centre / kaiju

Fast taxonomic classification of metagenomic sequencing reads using a protein reference database
http://kaiju.binf.ku.dk
GNU General Public License v3.0

Custom DB Issue #230

Open shaylashahar opened 2 years ago

shaylashahar commented 2 years ago

I want to create a custom database from GTDB (Genome Taxonomy Database), where the protein sequences are identified by their accession numbers. I wrote a Python script that looks up the Tax ID for each accession number and rewrites the file to fit kaiju's requirements. The resulting file has the form ">Tax ID \n [protein sequence]", and I then followed kaiju's directions to run mkbwt and mkfmi.
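For readers following along, the header rewrite described above might look roughly like the sketch below. It is not the poster's actual script. The function name `rewrite_headers`, the `acc_to_taxid` dictionary, and the example accession are all hypothetical, and it assumes the header should contain nothing but the numeric tax ID, with no spaces or label text.

```python
def rewrite_headers(fasta_lines, acc_to_taxid):
    """Yield FASTA lines with each header replaced by '>' + numeric tax ID.

    Records whose accession has no entry in acc_to_taxid are dropped
    entirely (header and sequence lines). Both names are hypothetical.
    """
    keep = False
    for line in fasta_lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            fields = line[1:].split()
            accession = fields[0] if fields else ""
            taxid = acc_to_taxid.get(accession)
            keep = taxid is not None
            if keep:
                # int() rejects anything that is not a plain number
                yield f">{int(taxid)}"
        elif keep and line:
            yield line


# Made-up example: one accession with a known tax ID, one without.
example = [">WP_000001.1 some protein\n", "MKV\n", ">UNKNOWN_1\n", "AAA\n"]
print(list(rewrite_headers(example, {"WP_000001.1": "562"})))
# prints ['>562', 'MKV']
```

Dropping unmapped records is only one possible choice. Logging them instead would make it easier to see how much of the database was lost in the mapping step.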

I previously ran my metagenomes against nr_euk and ran kaiju2table with no issues. When I ran the same metagenomes against my GTDB database, everything came out with zeros. I looked back at my .fmi and compared it to the .fmi from nr_euk, and they look very different: my .fmi looks like protein sequences one after the next. I'm not sure what went wrong, but I'm pretty sure it happened when creating the .fmi file. Any ideas?

pmenzel commented 2 years ago

Are those tax IDs also contained in the nodes.dmp/names.dmp files?

Are you sure there wasn't an issue with the mkbwt/mkfmi steps? It's difficult for me to see what went wrong from your description.
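The tax ID question above can be checked mechanically. The following is a minimal sketch, assuming nodes.dmp is in the standard NCBI taxdump layout (fields separated by "|", tax ID first) and that every FASTA header is meant to be a bare numeric tax ID. The function names `load_node_ids` and `find_bad_headers` are made up for illustration.

```python
def load_node_ids(nodes_lines):
    """Collect tax IDs (first '|'-separated field) from nodes.dmp lines."""
    return {line.split("|", 1)[0].strip() for line in nodes_lines if line.strip()}


def find_bad_headers(fasta_lines, node_ids):
    """Return headers that are not purely numeric or are missing from node_ids."""
    bad = []
    for line in fasta_lines:
        if line.startswith(">"):
            header = line[1:].strip()
            if not header.isdigit() or header not in node_ids:
                bad.append(header)
    return bad


# Made-up example data; with real files, pass open("nodes.dmp") and
# open("proteins.faa") instead of these lists.
nodes = ["1\t|\t1\t|\tno rank\t|\n", "562\t|\t561\t|\tspecies\t|\n"]
fasta = [">562\n", "MKV\n", ">Tax ID\n", "AAA\n", ">999\n", "CCC\n"]
print(find_bad_headers(fasta, load_node_ids(nodes)))
# prints ['Tax ID', '999']
```

An empty result means every header is numeric and present in nodes.dmp. A literal label or stray whitespace in the header, or an ID absent from the taxonomy files, would show up in the list.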

shaylashahar commented 2 years ago

Yes, Tax IDs are contained in nodes.dmp.

For the next step, I ran:

kaiju-mkbwt -n 25 -a ACDEFGHIKLMNPQRSTVWY -o proteins proteins.faa

and then:

kaiju-mkfmi proteins

pmenzel commented 2 years ago

Looks alright. Not sure how I can help you without your faa file.