dib-lab / genome-grist

map Illumina metagenomes to genomes!
https://dib-lab.github.io/genome-grist/

thoughts on taxonomic comparisons, protein searches, and challenges therein #49

Open ctb opened 3 years ago

ctb commented 3 years ago

I'm becoming rather fond of the protein search (#40) and want to combine DNA and protein searches (#48), but I keep running into conceptual problems about how to deal with taxonomy. This issue is here to explain those problems to future me.

In https://github.com/dib-lab/genome-grist/issues/40, I explored comparison of DNA vs protein taxonomy.

However, the challenge is that I used different databases! I used the complete GenBank database (700k+ genomes) for DNA searching, and the GTDB genus-level protein database for protein searching. The comparison still works, because both databases have GenBank identifiers that let me pull out NCBI taxonomy.
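As a rough sketch of why shared GenBank identifiers make the comparison possible: if both sets of matches can be keyed by accession, they can be pushed through one accession-to-lineage table and compared at a chosen rank. Everything below is illustrative. The accessions, lineages, and function names are made up for the example and are not genome-grist code, and the lineage table is assumed to have been built separately from NCBI taxonomy.

```python
def strip_version(accession):
    """Drop the trailing '.N' version, e.g. 'GCA_000001.1' -> 'GCA_000001'."""
    return accession.split(".")[0]


def names_at_rank(accessions, acc_to_lineage, rank_index):
    """Collect the taxon names found at one rank for a set of matched accessions.

    Accessions missing from the table, or with lineages too short to reach
    the requested rank, are skipped.
    """
    names = set()
    for acc in accessions:
        lineage = acc_to_lineage.get(strip_version(acc))
        if lineage and len(lineage) > rank_index:
            names.add(lineage[rank_index])
    return names


def compare_at_rank(dna_hits, protein_hits, acc_to_lineage, rank_index):
    """Compare DNA and protein matches after mapping both to one taxonomy."""
    dna = names_at_rank(dna_hits, acc_to_lineage, rank_index)
    prot = names_at_rank(protein_hits, acc_to_lineage, rank_index)
    return {
        "shared": dna & prot,
        "dna_only": dna - prot,
        "protein_only": prot - dna,
    }


# Hypothetical lineage table: (superkingdom, phylum, genus) per accession.
ACC_TO_LINEAGE = {
    "GCA_000001": ("Bacteria", "Proteobacteria", "Escherichia"),
    "GCA_000002": ("Bacteria", "Firmicutes", "Bacillus"),
    "GCA_000003": ("Bacteria", "Proteobacteria", "Salmonella"),
}

GENUS = 2  # index of the genus entry in the toy lineage tuples
result = compare_at_rank(
    dna_hits=["GCA_000001.1", "GCA_000002.1"],
    protein_hits=["GCA_000001.2", "GCA_000003.1"],
    acc_to_lineage=ACC_TO_LINEAGE,
    rank_index=GENUS,
)
```

The comparison only needs the two databases to agree on the identifier space, not on their contents, which is why searching different databases is tolerable here.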

I do prefer the GTDB taxonomy for things, but the majority of GenBank genomes don't have GTDB taxonomy assigned! I suppose we could relabel all of GenBank with GTDB...

In addition, GTDB doesn't have euk sequences in it, so we'd have to figure out how to add euk in. This now makes the question relevant to charcoal https://github.com/dib-lab/charcoal/issues/30, proving that ultimately all of our software gets stuck on the same set of hard problems :).

Anyway, for now I'm stuck using (a) NCBI taxonomy with (b) a protein database that doesn't contain euk sequences.

ctb commented 3 years ago

We could potentially do something where we use GTDB for bacteria and archaea, and then weld the NCBI euk taxonomy onto that. Seems ugly.
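A minimal sketch of what that welding might look like, assuming two lookup tables keyed by accession: prefer the GTDB lineage when one exists, and fall back to NCBI only for eukaryotes. The function name, table layout, and accessions are hypothetical and chosen for illustration only.

```python
def hybrid_lineage(accession, gtdb_lineages, ncbi_lineages):
    """Return (source, lineage) using GTDB where available and NCBI for euks.

    Returns None when the accession has no GTDB lineage and is not a
    eukaryote under NCBI, which covers bacterial and archaeal genomes
    that have no GTDB taxonomy assigned.
    """
    if accession in gtdb_lineages:
        return ("gtdb", gtdb_lineages[accession])

    ncbi = ncbi_lineages.get(accession)
    if ncbi and ncbi[0] == "Eukaryota":
        return ("ncbi", ncbi)

    return None


# Hypothetical tables; real ones would be loaded from GTDB and NCBI metadata.
GTDB = {
    "GCA_000001": ("d__Bacteria", "p__Proteobacteria"),
}
NCBI = {
    "GCA_000001": ("Bacteria", "Proteobacteria"),
    "GCA_000002": ("Eukaryota", "Ascomycota"),
    "GCA_000003": ("Bacteria", "Firmicutes"),  # no GTDB lineage in this toy data
}

bacterial = hybrid_lineage("GCA_000001", GTDB, NCBI)
euk = hybrid_lineage("GCA_000002", GTDB, NCBI)
unassigned = hybrid_lineage("GCA_000003", GTDB, NCBI)
```

The sketch also shows where the ugliness comes from: the two sources name ranks differently, so downstream code has to track which taxonomy each lineage came from, and genomes without a GTDB assignment still fall through.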