dib-lab / charcoal

Remove contaminated contigs from genomes using k-mers and taxonomies.

can we leverage other MAGs in contamination analysis in any way? #36

Open ctb opened 4 years ago

ctb commented 4 years ago

e.g. for SRR4033069_bin.1.*report.txt we have a genome that is mostly unidentified, but some of those unidentified bits might belong to other MAGs in this same collection. this is kind of similar to what sourmash-oddify is doing, come to think of it.

this might be a v2 kind of thing.

ctb commented 4 years ago

ok, finally re-read my own blog post and realized this is exactly what sourmash oddify is doing. excellent.

this kind of analysis could be a follow-on module to just_taxonomy.py, in which we take the newly cleaned genome sequences & their (inferred or labeled) lineages, build an LCA database from them, and then decontaminate them further based on high-rank k-mers (k-mers whose LCA sits at a high taxonomic rank) indicative of cross-rank contamination.

alternatively, this is actually a kind of nice post-charcoal evaluation procedure for our initial publication - how much confusion did we remove from the per-genome approach, evaluated using a whole-dataset analysis? (per https://github.com/dib-lab/charcoal/issues/77#issuecomment-632701004)
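
a rough sketch of the follow-on idea above, to make the moving parts concrete. this is a toy under stated assumptions, not charcoal or sourmash code: it uses exact k-mers from plain strings instead of sketches, lineages are plain tuples, and every name here (`build_lca_db`, `confused_fraction`, `min_depth`) is hypothetical.

```python
from collections import defaultdict


def kmers(seq, k):
    """Return the set of all k-length substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def lca(lineages):
    """Lowest common ancestor of lineage tuples = their longest shared prefix."""
    first = lineages[0]
    depth = len(first)
    for lin in lineages[1:]:
        depth = min(depth, len(lin))
        for i in range(depth):
            if lin[i] != first[i]:
                depth = i
                break
    return first[:depth]


def build_lca_db(genomes, k):
    """Map each k-mer to the LCA of all lineages it was seen in.

    genomes: {name: (lineage_tuple, [contig_sequence, ...])}  (assumed layout)
    """
    seen = defaultdict(set)
    for lineage, contigs in genomes.values():
        for contig in contigs:
            for km in kmers(contig, k):
                seen[km].add(lineage)
    return {km: lca(sorted(lins)) for km, lins in seen.items()}


def confused_fraction(contig, db, k, min_depth):
    """Fraction of a contig's k-mers whose LCA is shallower than min_depth.

    A shallow (high-rank) LCA means the k-mer is shared across lineages,
    which is the signal of possible cross-lineage contamination.
    """
    kms = kmers(contig, k)
    if not kms:
        return 0.0
    confused = sum(1 for km in kms if km in db and len(db[km]) < min_depth)
    return confused / len(kms)
```

contigs with a high confused fraction would be the candidates for a second round of removal; the threshold and the choice of `min_depth` would have to be tuned on real data.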

ctb commented 4 years ago

we could focus in on the no-ident contigs - output them in just_taxonomy, and then analyze them in a separate cross-MAG integrative step for oddify-like results.
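
a minimal sketch of what that cross-MAG step could look like, assuming just_taxonomy has already written out the no-ident contigs. again a toy with exact k-mers and plain strings; `best_mag_match` and `min_containment` are hypothetical names, not existing charcoal or sourmash functions.

```python
def kmers(seq, k):
    """Return the set of all k-length substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def best_mag_match(contig, mags, k, own_mag, min_containment=0.5):
    """Find the other MAG that contains the most of this contig's k-mers.

    mags: {mag_name: [contig_sequence, ...]}  (assumed layout)
    Returns (mag_name, containment), or None if no other MAG reaches
    min_containment.
    """
    query = kmers(contig, k)
    if not query:
        return None
    best_name, best_score = None, 0.0
    for name, contigs in mags.items():
        if name == own_mag:
            continue  # only compare against *other* MAGs in the collection
        ref = set().union(*(kmers(c, k) for c in contigs))
        score = len(query & ref) / len(query)
        if score > best_score:
            best_name, best_score = name, score
    if best_name is not None and best_score >= min_containment:
        return best_name, best_score
    return None
```

a no-ident contig that is mostly contained in some other MAG is the oddify-like result: it either belongs to that other genome or marks a region shared between the two.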