dib-lab / charcoal

Remove contaminated contigs from genomes using k-mers and taxonomies.

charcoal use cases / ecosystem interactions #14

Open ctb opened 4 years ago

ctb commented 4 years ago

For the taxonomy filtering script that's the default action right now, just_taxonomy.py, the pipeline does its own database search to determine taxonomy by looking at the majority LCA match. If it can't get a genus-level assignment, it punts. This works OK on HMP MAGs, but is going to work... poorly on environmental genomes.
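To make the "majority LCA, punt below genus" idea concrete, here is a minimal sketch. It is not the code in just_taxonomy.py: the function names, the rank list, and the strict-majority rule are all assumptions for illustration, and each database hit is assumed to arrive as a tuple of names ordered from superkingdom down.

```python
from collections import Counter

# Assumed rank order; lineages are tuples of names in this order.
RANKS = ["superkingdom", "phylum", "class", "order", "family", "genus", "species"]


def majority_lineage(lineages):
    """Return the deepest lineage prefix shared by a strict majority of hits.

    Hypothetical helper: `lineages` is a list of tuples, one per database hit.
    """
    lineages = [tuple(lin) for lin in lineages if lin]
    if not lineages:
        return ()
    consensus = ()
    for depth in range(1, len(RANKS) + 1):
        # Count every distinct prefix of this depth among the hits.
        counts = Counter(lin[:depth] for lin in lineages if len(lin) >= depth)
        if not counts:
            break
        best, n = counts.most_common(1)[0]
        # Stop as soon as no prefix is held by more than half of all hits.
        if 2 * n <= len(lineages):
            break
        consensus = best
    return consensus


def genus_or_punt(lineages):
    """Return the genus-level lineage, or None if the majority stops above genus."""
    consensus = majority_lineage(lineages)
    genus_depth = RANKS.index("genus") + 1
    if len(consensus) < genus_depth:
        return None  # "punt": no genus-level assignment
    return consensus[:genus_depth]
```

Because a strict majority is required at every depth, the winning prefix at one rank always extends the winner at the rank above, so the result is a single consistent lineage. An environmental genome whose hits split evenly at phylum level would return `None` here, which is the failure mode described above.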

I think it'd probably be fine to allow people to provide a taxonomy spreadsheet of their own, based off of (e.g.) a GTDB-Tk run or something. I don't think we should put running GTDB-Tk itself into the charcoal pipeline, although I guess we could - sourmash-oddify already supports that. But I don't want to make it a requirement.
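A user-provided spreadsheet could be consumed along these lines. This is only a sketch: the column names `genome` and `lineage`, the semicolon-separated lineage format, and both function names are hypothetical, not an existing charcoal interface or the GTDB-Tk output format.

```python
import csv


def load_taxonomy(handle):
    """Read a CSV with assumed columns `genome` and `lineage`.

    `lineage` is assumed to be semicolon-separated, e.g.
    "Bacteria;Firmicutes;Bacilli". Returns {genome: tuple_of_names}.
    """
    taxonomy = {}
    for row in csv.DictReader(handle):
        names = [name.strip() for name in row["lineage"].split(";")]
        # Drop empty trailing ranks so partial lineages stay short.
        while names and not names[-1]:
            names.pop()
        taxonomy[row["genome"]] = tuple(names)
    return taxonomy


def lookup_lineage(genome, provided, fallback):
    """Prefer the user-provided lineage; otherwise call `fallback(genome)`.

    `fallback` stands in for the pipeline's own database search.
    """
    if genome in provided:
        return provided[genome]
    return fallback(genome)
```

With this shape the spreadsheet stays optional: genomes missing from it still go through the built-in search, so nobody is required to run GTDB-Tk first.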

ctb commented 4 years ago

I guess another question here is, where are we expecting people to use charcoal?

I think our main use cases internally to the lab are mostly, "I got a bunch of MAGs from somewhere else, and now I want to fix them." So this assumes they've been through at least some minimal pipeline (CheckM? Anvio? or whatever people use).

But we are also hearing from people that they want to apply this "de novo" to MAGs that they have just generated themselves and that maybe haven't been filtered by any other technique. This changes the nature of the game a fair bit.

Do we want to suggest that people use this only after other approaches? And how narrowly do we want to define those other approaches? And how does this change the value proposition of charcoal, which to me is this: "charcoal is a fast way to polish your MAGs by removing really obvious contamination."

taylorreiter commented 4 years ago

I think we should recommend that users always run CheckM. I think it's really valuable to know how many single-copy marker genes are in the MAG, and I think k-mers are too brittle to give a reliable estimate of completeness. I think BUSCO might also work for completeness estimation, as they now have both bacteria and archaea databases, but I think CheckM is generally accepted as the standard for MAG QC.

So to me the question becomes the order in which we recommend these things. I'm not sure I have a clear opinion there yet :) I like thinking of charcoal as "a fast way to polish your MAGs by removing really obvious contamination." Will charcoal still work when there is A LOT of contamination? Will charcoal still work on really incomplete MAGs?

ctb commented 4 years ago

do we want to include GTDB-Tk and checkm execution as part of the normal charcoal foo?

taylorreiter commented 4 years ago

I would rather not... they're cumbersome to manage with databases and stuff, and are a lot more computationally intensive than charcoal. I see charcoal as a separate but complementary tool. I'm open to other ideas though!

ctb commented 4 years ago

they could be options, though, and I already have code for GTDB-Tk. I suspect that in the process of evaluating and writing up charcoal, we'll find that it's simpler to embrace the whole thing...

Also! Think of the convenience for the user - you know, the users that actually get it all to work 😂

taylorreiter commented 4 years ago

I guess from that perspective that's true -- charcoal could be more of a MAG QC ecosystem, where the user has to think less about getting the thing to run and can think more about the results. And we're definitely not trying to replace checkm, just add other information to QC stats.

Here's the atlas checkm code if it's useful:

https://github.com/metagenome-atlas/atlas/blob/3931580fe1b08c0df77fd139b2c532436ee80221/atlas/rules/binning.snakefile#L310

And the download script for the checkm dbs:

https://github.com/metagenome-atlas/atlas/blob/bef1f78f3e1c745fd7df0577807a52f54a164e55/atlas/rules/download.snakefile#L152