Open ctb opened 3 years ago
so the question becomes, how do we evaluate this? random thoughts and notes.
and just to be clear, this is all @bluegenes' fault.
Ended up running this on podar, with some pretty good results. I've put the specific data sets and notebooks here, for now: https://github.com/ctb/2021-podar-gathertax
I think paladin would be good! Paladin is basically a drop-in replacement for BWA, but instead of mapping nucleotide reads to nucleotide genomes/references, it maps nucleotide reads to protein genomes/references. I'm not sure what makes the most sense for getting the protein reference. I'm pretty sure some genbank genomes have it, but to play nice with genome-grist I think running prokka on the nucleotide genome and mapping against the resulting annotated .faa files would probably make the most sense.
Thanks Taylor!
This gels with some thoughts I had the other day - thought process,
so the best is to map reads to a proteome with (e.g.) paladin, and estimate fraction covered that way. We still face the challenge that the proteome is less than 100% of the genome (obviously :) but we could either simply state that or correct for it in our output.
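A minimal sketch of the fraction-covered idea above, in Python. It assumes the alignments have already been parsed into 0-based half-open `(start, end)` intervals on a reference of known length; the function names and the `coding_fraction` parameter are made up for illustration and are not part of paladin or genome-grist. The "correction" shown is only one possible reading: scale the proteome fraction by the share of the genome that is coding, which gives a lower bound on the genome fraction supported by protein mapping.

```python
def fraction_covered(intervals, ref_length):
    """Fraction of a reference covered by at least one alignment.

    intervals: iterable of (start, end) pairs, 0-based and half-open.
    ref_length: total length of the reference, in the same units.
    """
    if ref_length <= 0:
        raise ValueError("ref_length must be positive")

    covered = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        # Clip each interval to the reference bounds.
        start, end = max(0, start), min(end, ref_length)
        if start >= end:
            continue
        if cur_end is None or start > cur_end:
            # Disjoint from the current merged run: bank it, start a new one.
            if cur_end is not None:
                covered += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlapping or adjacent: extend the current merged run.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start
    return covered / ref_length


def genome_scale_fraction(proteome_fraction, coding_fraction):
    """Assumed correction: proteome coverage scaled by the coding share of the genome."""
    if not 0.0 <= coding_fraction <= 1.0:
        raise ValueError("coding_fraction must be between 0 and 1")
    return proteome_fraction * coding_fraction
```

Simply stating the proteome fraction as-is (the other option mentioned) would mean reporting `fraction_covered(...)` directly and noting what share of the genome the proteome represents.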
@bluegenes suggests that we try lenient mapping in DNA space as well!
So, I did a thing, and I don't really know how to evaluate it. Thoughts welcome!
summary: DNA and protein
so I gristed a bunch of the Northen data sets, using both DNA searching (against all of genbank, ~700k) and protein searching (against @bluegenes' genus-level GTDB databases).
Then I intersected with Genbank taxonomy and did taxonomic aggregation and summarization (#35).
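For anyone following along, here is a rough sketch of what aggregation by rank can look like. This is not the genome-grist code; it assumes each match has already been joined to a semicolon-separated lineage string and carries some per-match fraction, and the rank list and the function name are hypothetical.

```python
from collections import defaultdict

# Assumed rank order for the semicolon-separated lineage strings.
RANKS = ["superkingdom", "phylum", "class", "order", "family", "genus", "species"]


def aggregate_by_rank(matches, rank):
    """Sum per-match fractions for every lineage truncated at `rank`.

    matches: iterable of (lineage, fraction) pairs, where lineage is a
             semicolon-separated string ordered as in RANKS.
    Returns a dict of {truncated lineage: summed fraction}, largest first.
    """
    idx = RANKS.index(rank)
    totals = defaultdict(float)
    for lineage, fraction in matches:
        names = lineage.split(";")
        if len(names) <= idx:
            # Lineage does not reach the requested rank.
            key = "unassigned"
        else:
            key = ";".join(names[: idx + 1])
        totals[key] += fraction
    return dict(sorted(totals.items(), key=lambda kv: -kv[1]))
```

Calling it once per rank (`"phylum"`, `"genus"`, `"species"`) gives the kind of per-rank tables that plots like the ones below would be drawn from.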
taxonomy report on DNA search
default config, against genbank.
phylum level DNA:![dna gathergram-phylum-SRR5855411](https://user-images.githubusercontent.com/51016/103100511-c1f80f00-45c7-11eb-8a5f-55c58cdc071a.png)
genus level DNA:![dna gathergram-genus-SRR5855411](https://user-images.githubusercontent.com/51016/103100949-f7056100-45c9-11eb-89b4-bd989e46e1f0.png)
species level DNA:![dna gathergram-species-SRR5855411](https://user-images.githubusercontent.com/51016/103100523-d20fee80-45c7-11eb-9171-cac1dd476d11.png)
taxonomy report on protein search
config:
phylum level protein:![prot gathergram-phylum-SRR5855411](https://user-images.githubusercontent.com/51016/103100545-f23fad80-45c7-11eb-9b19-df1a1c1dd029.png)
genus level protein:![prot gathergram-genus-SRR5855411](https://user-images.githubusercontent.com/51016/103100963-05ec1380-45ca-11eb-8633-be0e15db1650.png)
species level protein:![prot gathergram-species-SRR5855411](https://user-images.githubusercontent.com/51016/103100548-f2d84400-45c7-11eb-9283-0381214d8ecd.png)