I'm becoming rather fond of the protein search (#40) and want to combine DNA and protein searches (#48), but I keep running into mental blocks about how to deal with taxonomy. This issue is here to explain the problems to future me.
In https://github.com/dib-lab/genome-grist/issues/40, I explored comparison of DNA vs protein taxonomy.
However, the challenge is that I used different databases! I used the complete GenBank database (700k+ genomes) for DNA searching, and the GTDB genus-level protein database for protein searching. That's OK, because both databases have GenBank identifiers that let me pull out NCBI taxonomy.
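For future me, here is a minimal sketch of that shared-identifier idea. It is not how genome-grist actually does it: the CSV columns, the accessions, and the helper names (`load_lineages`, `annotate`) are all made up for illustration. The only assumption carried over from above is that matches from both searches carry GenBank identifiers that can be looked up in one NCBI lineage table.

```python
import csv
import io

# Made-up lineage table for illustration; real accessions and columns will differ.
LINEAGES_CSV = """ident,superkingdom,phylum,genus
GCA_000000001.1,Bacteria,Proteobacteria,Escherichia
GCA_000000002.1,Bacteria,Firmicutes,Bacillus
"""


def load_lineages(fp):
    """Map each GenBank identifier to its NCBI lineage tuple."""
    return {
        row["ident"]: (row["superkingdom"], row["phylum"], row["genus"])
        for row in csv.DictReader(fp)
    }


def annotate(idents, lineages):
    """Pair each matched identifier with its lineage, or None if unknown."""
    return {ident: lineages.get(ident) for ident in idents}


lineages = load_lineages(io.StringIO(LINEAGES_CSV))

# Hypothetical matches from the two searches (different databases, same ID scheme).
dna_matches = ["GCA_000000001.1", "GCA_000000002.1"]
protein_matches = ["GCA_000000001.1", "GCA_000000003.1"]

dna_tax = annotate(dna_matches, lineages)
protein_tax = annotate(protein_matches, lineages)

# Genera seen by both searches; identifiers without a lineage are skipped.
dna_genera = {lin[2] for lin in dna_tax.values() if lin}
protein_genera = {lin[2] for lin in protein_tax.values() if lin}
shared_genera = dna_genera & protein_genera
```

Because the lookup is keyed only on the identifier, it doesn't matter which database a match came from. Anything missing from the lineage table comes back as `None` rather than being silently dropped.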
I do prefer the GTDB taxonomy for things, but the majority of GenBank genomes don't have GTDB taxonomy assigned! I suppose we could relabel all of GenBank with GTDB...
In addition, GTDB doesn't have eukaryotic sequences in it, so we'd have to figure out how to add them. This makes the question relevant to charcoal (https://github.com/dib-lab/charcoal/issues/30), proving that ultimately all of our software gets stuck on the same set of hard problems :).
Anyway, for now I'm stuck using (a) NCBI taxonomy with (b) a protein database that doesn't contain eukaryotic sequences.