bbuchfink / diamond

Accelerated BLAST compatible local sequence aligner.
GNU General Public License v3.0

Retrieving more taxonomic IDs than the one present in "prot.accession2taxid.FULL" #599

Open eray-sahin opened 2 years ago

eray-sahin commented 2 years ago

Hello,

I built a DIAMOND database using:

diamond makedb --in nr.gz --db nr_diamond --taxonmap prot.accession2taxid.FULL --taxonnodes nodes.dmp --taxonnames names.dmp --threads 72

and ran diamond blastp to get a tabular file with subject sequence IDs and their matching taxonomic IDs. When I inspected the results, I found more than one taxonomic ID for some proteins, even though "prot.accession2taxid.FULL" and the NCBI website list only one. For example, 'WP_119979703.1' maps only to tax ID '2292949' there, but the DIAMOND output reports both "29523" and "2292949" for it.

When I load the results into MEGAN, the LCA algorithm may assign most of these entries to the root, losing taxonomic resolution. I cannot search "prot.accession2taxid.FULL" manually, because it would take ages. Can you help me understand the issue?

Best regards.
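
To illustrate the concern about LCA placement, here is a minimal sketch of a naive lowest-common-ancestor computation over a parent-pointer tree. It is not MEGAN's actual implementation, and the tree and taxon IDs below are invented for illustration. The point is only that when a hit carries two taxon IDs from distant branches, their common ancestor can sit far above either one.

```python
def lineage(taxid, parent):
    """Return the path from taxid up to the root (a node that is its own parent)."""
    path = [taxid]
    while parent[taxid] != taxid:
        taxid = parent[taxid]
        path.append(taxid)
    return path


def lca(taxids, parent):
    """Naive lowest common ancestor of one or more taxids."""
    common = None
    for t in taxids:
        path = lineage(t, parent)
        if common is None:
            common = path
        else:
            members = set(path)
            # Keep the order of the first lineage (deepest node first).
            common = [n for n in common if n in members]
    return common[0]


# Hypothetical toy taxonomy: 1 is the root, 2 and 5 are its children,
# 3 and 4 are children of 2.
toy_parent = {1: 1, 2: 1, 3: 2, 4: 2, 5: 1}

close_pair = lca([3, 4], toy_parent)    # siblings: ancestor stays specific
distant_pair = lca([3, 5], toy_parent)  # different branches: collapses to root
```

In this toy tree the sibling pair resolves to node 2, while the pair from different branches resolves to the root.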

bbuchfink commented 2 years ago

If you look up this protein on NCBI, you can see under identical proteins that there's an entry (MBO4974725.1) with taxon ID 29523. These entries are merged if you use the NR database. I don't have a good solution for this at the moment. Would an option to ignore all taxids above species rank help you?
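
Until such an option exists, the idea could be approximated as a post-processing step on the tabular output. The following is a rough sketch, not a DIAMOND feature: it assumes the taxon IDs of a hit arrive as a semicolon-separated string and that ranks are read from nodes.dmp (fields separated by "|", with the taxid, parent taxid, and rank first). The function names, the set of ranks counted as "species or below", and the toy data are all assumptions made for illustration.

```python
import io

# Ranks treated as "species or below"; this set is an assumption of the sketch.
SPECIES_OR_BELOW = {"species", "subspecies", "strain", "varietas", "forma"}


def load_ranks(handle):
    """Parse nodes.dmp-style lines into a {taxid: rank} dictionary."""
    ranks = {}
    for line in handle:
        fields = [f.strip() for f in line.split("|")]
        if len(fields) >= 3 and fields[0]:
            ranks[fields[0]] = fields[2]
    return ranks


def filter_staxids(staxids, ranks):
    """Drop taxids above species rank; keep the input unchanged if none remain."""
    ids = [t for t in staxids.split(";") if t]
    kept = [t for t in ids if ranks.get(t) in SPECIES_OR_BELOW]
    return ";".join(kept if kept else ids)


# Toy nodes.dmp content with invented taxids and ranks.
toy_nodes = io.StringIO(
    "1\t|\t1\t|\tno rank\t|\n"
    "1001\t|\t1\t|\tgenus\t|\n"
    "1002\t|\t1001\t|\tspecies\t|\n"
)
toy_ranks = load_ranks(toy_nodes)
filtered = filter_staxids("1001;1002", toy_ranks)
```

With the toy data, the genus-level ID is dropped and only the species-level ID is kept. Nodes ranked "no rank" are dropped here as well, which may be too strict for real data, since that label can also appear below species level.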