Open ctb opened 4 years ago
ok, finally re-read my own blog post and realized this is exactly what sourmash oddify is doing. excellent.
this kind of analysis could be a follow-on module to just_taxonomy.py, in which we take the newly cleaned genome sequences & their (inferred or labeled) lineages, build an LCA database from them, and then decontaminate them further based on high-rank k-mers indicative of cross-rank contamination.
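a rough sketch of what that follow-on step might look like, in plain Python rather than the actual sourmash LCA code. everything here is hypothetical: the function names (`build_lca_db`, `confused_kmers`), the toy rank list, and the use of exact k-mer sets instead of sketches are all assumptions for illustration, not what just_taxonomy.py or sourmash-oddify does.

```python
from collections import defaultdict

# Toy rank list (assumption); lineages are tuples ordered from highest rank down.
RANKS = ["superkingdom", "phylum", "class", "order", "family", "genus", "species"]


def kmers(seq, k):
    """Return the set of all k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def lca(lineages):
    """Lowest common ancestor = longest shared prefix of the lineage tuples."""
    prefix = []
    for names in zip(*lineages):
        if len(set(names)) != 1:
            break
        prefix.append(names[0])
    return tuple(prefix)


def build_lca_db(genomes, k):
    """genomes: dict of name -> (lineage tuple, sequence).

    Returns a dict of k-mer -> set of lineages that contain it.
    """
    db = defaultdict(set)
    for _name, (lineage, seq) in genomes.items():
        for km in kmers(seq, k):
            db[km].add(lineage)
    return db


def confused_kmers(db, max_rank="phylum"):
    """Flag k-mers shared across lineages whose LCA is at or above max_rank."""
    cutoff = RANKS.index(max_rank) + 1
    flagged = {}
    for km, lineages in db.items():
        if len(lineages) > 1:
            ancestor = lca(lineages)
            if len(ancestor) <= cutoff:
                flagged[km] = ancestor
    return flagged
```

contigs carrying many flagged k-mers would then be candidates for removal; how many is "many" is a threshold we'd have to pick.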
alternatively, this is actually a kind of nice post-charcoal evaluation procedure for our initial publication - how much confusion did we remove from the per-genome approach, evaluated using a whole-data set analysis? (per https://github.com/dib-lab/charcoal/issues/77#issuecomment-632701004)
we could focus in on the no-ident contigs - output them in just_taxonomy, and then analyze them in a separate cross-MAG integrative step for oddify-like results.
e.g. for SRR4033069_bin.1.*report.txt we have a genome that is mostly unidentified. but we might see that some of these bits belong to other MAGs in this same collection. this is kind of similar to what sourmash-oddify is doing, come to think of it.

this might be a v2 kind of thing.
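for the cross-MAG step, a minimal sketch of the idea, assuming we have the no-ident contigs written out per MAG and the identified sequence for every MAG in the collection. the function `assign_noident_contigs`, the input layout, and the containment threshold are all made up for illustration; a real version would presumably use sourmash signatures rather than exact k-mer sets.

```python
def kmers(seq, k):
    """Return the set of all k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def containment(query, target):
    """Fraction of query k-mers that are also present in target."""
    if not query:
        return 0.0
    return len(query & target) / len(query)


def assign_noident_contigs(noident, mags, k=21, threshold=0.5):
    """Match no-ident contigs against the *other* MAGs in the collection.

    noident: dict of (source_mag, contig_name) -> contig sequence
    mags:    dict of mag_name -> list of identified sequences in that MAG
    Returns a dict of (source_mag, contig_name) -> (best_mag, containment)
    for contigs whose best match reaches the threshold.
    """
    mag_kmers = {
        name: set().union(*(kmers(s, k) for s in seqs))
        for name, seqs in mags.items()
    }
    hits = {}
    for (source, contig), seq in noident.items():
        query = kmers(seq, k)
        best_name, best_score = None, 0.0
        for name, target in mag_kmers.items():
            if name == source:
                continue  # only look for matches in other MAGs
            score = containment(query, target)
            if score > best_score:
                best_name, best_score = name, score
        if best_name is not None and best_score >= threshold:
            hits[(source, contig)] = (best_name, best_score)
    return hits
```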