dib-lab / genome-grist

map Illumina metagenomes to genomes!
https://dib-lab.github.io/genome-grist/

comparing and evaluating DNA vs protein matches #40

Open ctb opened 3 years ago

ctb commented 3 years ago

So, I did a thing, and I don't really know how to evaluate it. Thoughts welcome!

summary: DNA and protein

so I gristed a bunch of the Northen data sets, using both DNA searching (against all of genbank, ~700k) and protein searching (against @bluegenes's genus-level GTDB databases).

Then I intersected with Genbank taxonomy and did taxonomic aggregation and summarization (#35).
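As a rough illustration of what "taxonomic aggregation and summarization" means here, below is a minimal Python sketch. It is not genome-grist's actual implementation. It assumes each match has already been joined to a lineage, and the `aggregate_by_rank` helper, the `RANKS` list, and the example lineages and fractions are all hypothetical.

```python
from collections import defaultdict

# Ranks from broadest to most specific (assumed ordering, for illustration only).
RANKS = ["superkingdom", "phylum", "class", "order", "family", "genus", "species"]


def aggregate_by_rank(matches, rank):
    """Sum the fraction assigned to each lineage, truncated at `rank`.

    `matches` is an iterable of (lineage, fraction) pairs, where `lineage` is a
    tuple of names ordered as in RANKS. Lineages that do not reach the
    requested rank are grouped under "unclassified".
    """
    depth = RANKS.index(rank) + 1
    totals = defaultdict(float)
    for lineage, fraction in matches:
        if len(lineage) >= depth:
            key = tuple(lineage[:depth])
        else:
            key = ("unclassified",)
        totals[key] += fraction
    # Largest contribution first.
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Made-up example data: (lineage, fraction of the metagenome matched).
example_matches = [
    (("Bacteria", "Proteobacteria", "Gammaproteobacteria"), 0.25),
    (("Bacteria", "Proteobacteria", "Alphaproteobacteria"), 0.125),
    (("Bacteria", "Firmicutes", "Bacilli"), 0.0625),
]

phylum_summary = aggregate_by_rank(example_matches, "phylum")
for lineage, fraction in phylum_summary:
    print(";".join(lineage), fraction)
```

Running the same function with a different `rank` gives the per-rank views (phylum, genus, species) that the reports below show.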

taxonomy report on DNA search

default config, against genbank.

phylum level DNA: dna gathergram-phylum-SRR5855411

genus level DNA: dna gathergram-genus-SRR5855411

species level DNA: dna gathergram-species-SRR5855411

taxonomy report on protein search

config:

outdir: outputs.roux.protein/
sourmash_scaled: 1000
sourmash_compute_ksizes:
- 11
sourmash_sigtype: protein
sourmash_database_glob_pattern: /home/ntpierce/thumper/databases/gtdb95-genus-n0.protein-k11-scaled100.sbt.zip
sourmash_database_ksize: 33
sourmash_database_threshold_bp: 50e3

phylum level protein: prot gathergram-phylum-SRR5855411

genus level protein: prot gathergram-genus-SRR5855411

species level protein: prot gathergram-species-SRR5855411

ctb commented 3 years ago

so the question becomes, how do we evaluate this? random thoughts and notes.

and just to be clear, this is all @bluegenes's fault.
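One possible starting point for the evaluation question, sketched in Python: compare the DNA and protein summaries at a single rank by looking at which taxa they share and how the assigned fractions differ. This is only an assumption about what a useful comparison might look like, not something proposed in the thread. The `compare_summaries` helper and the example taxa and fractions are hypothetical.

```python
def compare_summaries(dna, prot):
    """Compare two {taxon: fraction} summaries taken at the same rank."""
    dna_taxa, prot_taxa = set(dna), set(prot)
    shared = dna_taxa & prot_taxa
    union = dna_taxa | prot_taxa
    return {
        "shared": sorted(shared),
        "dna_only": sorted(dna_taxa - prot_taxa),
        "protein_only": sorted(prot_taxa - dna_taxa),
        # Jaccard similarity of the two taxon sets; 1.0 if both are empty.
        "jaccard": len(shared) / len(union) if union else 1.0,
        # Positive values mean protein search assigned more to that taxon.
        "fraction_diff": {t: prot[t] - dna[t] for t in sorted(shared)},
    }


# Made-up phylum-level summaries for one sample.
dna_phyla = {"Proteobacteria": 0.5, "Firmicutes": 0.25}
prot_phyla = {"Proteobacteria": 0.75, "Actinobacteria": 0.125}

report = compare_summaries(dna_phyla, prot_phyla)
print(report)
```

A comparison like this only says where the two searches agree. It does not say which one is right, which is why a data set with known composition is useful.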

ctb commented 3 years ago

Ended up running this on podar, with some pretty good results. I've put the specific data sets and notebooks here, for now: https://github.com/ctb/2021-podar-gathertax

taylorreiter commented 3 years ago

I think paladin would be good! Paladin is basically a drop-in replacement for BWA, but instead of mapping nucleotide reads to nucleotide genomes/references, it maps nucleotide reads to protein genomes/references. I'm not sure what makes the most sense for getting the protein reference. I'm pretty sure some genbank genomes have it, but to play nice with genome-grist I think running prokka on the nucleotide genome and mapping against the resulting .faa annotation files would probably make the most sense.

ctb commented 3 years ago

Thanks taylor!

This gels with some thoughts I had the other day. Thought process:

So the best option is to map reads to a proteome with (e.g.) paladin, and estimate the fraction covered that way. We still face the challenge that the proteome is less than 100% of the genome (obviously :) but we could either simply state that or correct for it in our output.
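To make the "state it or correct for it" choice concrete, here is a small arithmetic sketch in Python. It is one possible reading of "correct for it", not a method from the thread: if coverage is reported relative to the whole genome, dividing by the coding fraction rescales it to the proteome. The function names and all numbers are hypothetical, and nothing here is part of paladin or genome-grist.

```python
def coding_fraction(coding_bp, genome_bp):
    """Fraction of the genome that is protein-coding."""
    return coding_bp / genome_bp


def raw_covered_fraction(covered_coding_bp, genome_bp):
    """Covered coding bases relative to the whole genome.

    This can never exceed the coding fraction, because reads mapped to a
    proteome cannot cover noncoding sequence.
    """
    return covered_coding_bp / genome_bp


def corrected_covered_fraction(covered_coding_bp, coding_bp, genome_bp):
    """Rescale the raw value by the coding fraction (assumed correction)."""
    raw = raw_covered_fraction(covered_coding_bp, genome_bp)
    return raw / coding_fraction(coding_bp, genome_bp)


# Made-up numbers: a 4 Mbp genome, 3.5 Mbp coding, 1.75 Mbp of coding sequence covered.
genome_bp = 4_000_000
coding_bp = 3_500_000
covered_coding_bp = 1_750_000

print("coding fraction:", coding_fraction(coding_bp, genome_bp))
print("raw covered fraction:", raw_covered_fraction(covered_coding_bp, genome_bp))
print("corrected covered fraction:",
      corrected_covered_fraction(covered_coding_bp, coding_bp, genome_bp))
```

"Simply stating it" would mean reporting the raw value alongside the coding fraction, while "correcting" would mean reporting the rescaled value. The corrected number describes the proteome only and says nothing about whether noncoding regions are present.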

ctb commented 3 years ago

@bluegenes suggests that we try lenient mapping in DNA space as well!