dib-lab / charcoal

Remove contaminated contigs from genomes using k-mers and taxonomies.

use metagenome search for validating results? #145

Open ctb opened 4 years ago

ctb commented 4 years ago

there are reasons why this kind of thing wouldn't always work, but even that could be interesting.

we could use Luiz's MAG search (https://blog.luizirber.org/2020/07/24/mag-results/) to see if putative contaminant contigs don't fit in with metagenome search.

separately (and this is kind of related to some of the stuff we did in the first round of charcoal development, lo! these many moons ago) could certainly imagine doing large-scale contig abundance examination across metagenomes to see if contig abundances do not agree.
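A rough sketch of that abundance check, assuming we already have per-contig abundances across a set of metagenomes (the `abundances` dict, the function names, and the `min_corr` cutoff are all hypothetical, not anything charcoal does today). Each contig is compared against the mean abundance profile of the other contigs in the bin, and contigs that correlate poorly get flagged:

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists; 0.0 if either is constant."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def flag_discordant_contigs(abundances, min_corr=0.5):
    """abundances: contig name -> list of abundances, one per metagenome.

    Returns contigs whose abundance profile correlates below min_corr
    with the mean profile of all the *other* contigs in the bin.
    Needs at least two contigs. min_corr is an arbitrary placeholder.
    """
    names = sorted(abundances)
    flagged = []
    for name in names:
        others = [abundances[o] for o in names if o != name]
        # mean abundance of the rest of the bin in each metagenome
        profile = [sum(col) / len(col) for col in zip(*others)]
        if pearson(abundances[name], profile) < min_corr:
            flagged.append(name)
    return flagged
```

Correlation against the bin mean is just one possible choice; with a heavily contaminated bin the mean itself would be unreliable, so a median or pairwise comparison might behave better.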

last but not least, use metagenome search on contaminated bins -> spacegraphcats to see if neighborhoods overlap.

ctb commented 4 years ago

from slack chatter:

The basic idea is to find relevant metagenomes using MAG search x metagenomes (potentially on just a small collection of local metagenomes), then do an sgc query on each contig, and then ask which contigs have overlapping sgc neighborhoods; where they do, declare “ok well at least they’re graph proximal so maybe not contamination.”

(underlying logic: contigs that are contaminants probably belong to very separate parts of the graph)
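To make the overlap test concrete, here is a minimal sketch. It assumes the sgc query results have already been reduced to a set of cDBG node IDs per contig; the `neighborhoods` dict and both function names are hypothetical, and nothing here calls spacegraphcats itself:

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def graph_proximal_contigs(neighborhoods, min_overlap=0.0):
    """neighborhoods: contig name -> set of cDBG node IDs in its neighborhood.

    Returns the contigs whose neighborhood overlaps (Jaccard > min_overlap)
    with the neighborhood of at least one other contig.
    """
    proximal = set()
    for (n1, s1), (n2, s2) in combinations(neighborhoods.items(), 2):
        if jaccard(s1, s2) > min_overlap:
            proximal.update((n1, n2))
    return proximal


# toy example: "c" shares no nodes with anything else
neighborhoods = {"a": {1, 2, 3}, "b": {3, 4}, "c": {10, 11}}
isolated = set(neighborhoods) - graph_proximal_contigs(neighborhoods)
```

Here `isolated` holds the contigs that are not graph proximal to any other contig in the bin, i.e. the ones this logic would leave as contamination candidates.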

This ties in nicely with the very useful feature that taylor and I added to sgc, https://github.com/spacegraphcats/spacegraphcats/pull/282, where you can do a multifasta query mode and annotate hash values with the results. One could imagine applying this to charcoal results in some as-yet-dim-in-my-mind way, where you maybe annotate hash values with the contigs in their neighborhood and do some kind of post-processing on taxonomy.

OOH post processing on taxonomy!!!

OK, to be a bit more explicit:

  1. take contigs <-> hash value taxonomy annotations from gather x GTDB, per charcoal
  2. using multifasta query, transfer those taxonomic annotations into the graph neighborhoods
  3. count the number of times a hash is annotated as a particular tax, because that indicates there's a proximal contig with that annotation. use absence of that as ...a problem?
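The three steps above could look something like this. It is a sketch under assumptions: `contig_tax` stands in for the charcoal gather x GTDB assignments, `contig_neighborhood_hashes` stands in for what a multifasta query would give us (hash values in each contig's neighborhood), and both function names are made up for illustration:

```python
from collections import Counter, defaultdict


def annotate_hashes(contig_tax, contig_neighborhood_hashes):
    """Step 2: push each contig's taxonomy onto the hashes in its neighborhood.

    Returns hash value -> Counter of taxonomy -> number of contigs whose
    neighborhood contains that hash.
    """
    hash_tax = defaultdict(Counter)
    for contig, hashes in contig_neighborhood_hashes.items():
        tax = contig_tax.get(contig)
        if tax is None:
            continue  # contig has no taxonomic assignment
        for h in hashes:
            hash_tax[h][tax] += 1
    return hash_tax


def unsupported_contigs(contig_tax, contig_neighborhood_hashes, hash_tax):
    """Step 3: flag contigs with no proximal contig sharing their taxonomy.

    A count > 1 on a hash means some *other* contig with the same
    taxonomy also reaches that hash; no such hash means no support.
    """
    flagged = []
    for contig, hashes in contig_neighborhood_hashes.items():
        tax = contig_tax.get(contig)
        if tax is None:
            continue
        support = sum(1 for h in hashes if hash_tax[h][tax] > 1)
        if support == 0:
            flagged.append(contig)
    return sorted(flagged)
```

Whether "absence of support" is really a problem is the open question; a lone contig in a sparse part of the graph would get flagged here too.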

We could also look for "taxonomic confusion" where hashes get annotated with multiple confounding taxonomies due to their neighborhoods, but I think this isn't as useful as the above.

separately or together, one could certainly imagine annotating an entire sgc cDBG graph with taxonomies and examining that in some way; we have the power to do the annotation now. it's sort of an extension of the charcoal approach to unitigs rather than contigs. I guess we could look for dominators that have confounding taxonomic annotations, and ask at what level they are messed up?
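For the "at what level are they messed up" question, a small sketch: given the lineages annotated onto one dominator (or one hash), find the first rank where they disagree. The `RANKS` tuple and the lineage-as-tuple representation are assumptions for illustration, not the sgc or charcoal data model:

```python
# assumed rank order; lineages are tuples of names in this order
RANKS = ("superkingdom", "phylum", "class", "order", "family", "genus", "species")


def disagreement_rank(lineages):
    """Return the highest rank at which the lineages disagree, or None.

    lineages: iterable of tuples ordered like RANKS; shorter tuples
    (lineages not resolved to species) are simply ignored below
    their last assigned rank.
    """
    lineages = list(lineages)
    for i, rank in enumerate(RANKS):
        names = {lin[i] for lin in lineages if len(lin) > i}
        if len(names) > 1:
            return rank
    return None


# toy example: same superkingdom, different phylum
confused = disagreement_rank([
    ("d__Bacteria", "p__Bacteroidota", "c__Bacteroidia"),
    ("d__Bacteria", "p__Proteobacteria"),
])
```

Disagreement at phylum or above would presumably be much more suspicious than disagreement at genus or species.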

taylorreiter commented 4 years ago

I like the idea of digging deep into the GTDB contaminated genomes in particular, if possible, as well as a few other MAGs, using these techniques. I think it might also help us develop an intuition for when binning is most likely to fail. I know people have done this before, but using assembly graphs seems like it would add an extra layer of information.