AstrobioMike / GToTree

A user-friendly workflow for phylogenomics
GNU General Public License v3.0

Table with location(s) of SCG markers #69

Closed by gkatahon 1 year ago

gkatahon commented 1 year ago

Hi Mike,

Thanks again for all the work in updating and improving the tool. Still works like a charm.

I was wondering if there is a way to output the locations of the SCG markers in the genomes used. I frequently work with newly binned MAGs, and chances are these still contain contaminating contigs, so it would be great to know which contigs the SCGs are found on. Those contigs could then be marked as trusted and used in subsequent bin-improvement steps. Likewise, SCGs with multiple hits would hopefully sit on different contigs, so each could be checked separately to see whether some are potential contamination.

I tried to find this in the output files of GToTree, but was unable to find such a file. Is there an easy way this can be created?

Cheers, Guillaume

AstrobioMike commented 1 year ago

Hey there, Guillaume :)

Unfortunately, retaining this info would take quite a bit of changing things under the hood. There also isn't really any infrastructure in place to feed that info into binning or MAG-refinement programs afterwards, so it would really only benefit folks who are comfortable doing a lot of computational manipulation themselves. So for the typical user, I'd say it's better to stick with programs built for MAG refinement, particularly ones that let us visually inspect coverage (arguably the most useful metric, especially when coverage across multiple samples is available). For instance, I'd recommend anvi'o if you're not familiar with it yet (here's a page on MAG refinement with anvi'o).

Thanks for the suggestion, and I'll keep it in mind if/when I end up refactoring things. Sorry this kind of thing wasn't the goal at first, and as such would take a lot of work to implement now.

If you still want to pursue this for your input MAGs, I can pull out example code for how GToTree runs the relevant steps (gene calling, the HMM search) and share it here, so you could just run those steps in the GToTree environment.

Let me know :)
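For anyone wanting to try this outside of GToTree, here is a minimal sketch of the last step: turning HMM search results into a table of which contig each SCG hit sits on. It is not GToTree's own code. It assumes genes were called with Prodigal (which names proteins `<contig>_<gene number>`) and that the search was run with `hmmsearch --tblout`, so the first column of each line is the protein ID and the third is the HMM name. The function names, the SCG names, and the example lines are all made up for illustration.

```python
from collections import defaultdict


def contig_of(gene_id):
    """Strip the trailing gene index from a Prodigal-style ID ('<contig>_<n>')."""
    return gene_id.rsplit("_", 1)[0]


def scg_locations(tblout_lines):
    """Map each SCG (HMM name) to a list of (contig, gene_id, evalue) hits.

    Assumes hmmsearch --tblout layout: column 1 is the target (protein) name,
    column 3 is the query (HMM) name, column 5 is the full-sequence E-value.
    """
    hits = defaultdict(list)
    for line in tblout_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        fields = line.split()
        gene_id, scg, evalue = fields[0], fields[2], float(fields[4])
        hits[scg].append((contig_of(gene_id), gene_id, evalue))
    return dict(hits)


def multi_contig_scgs(hits):
    """Return SCGs whose hits fall on more than one contig, with those contigs."""
    flagged = {}
    for scg, scg_hits in hits.items():
        contigs = sorted({contig for contig, _, _ in scg_hits})
        if len(contigs) > 1:
            flagged[scg] = contigs
    return flagged


# Made-up example lines in tblout layout (hypothetical SCG and contig names).
example = """\
# target name  accession  query name  accession  E-value  score  bias
contig_12_3    -          SCG_A       -          1.2e-40  135.0  0.1
contig_12_7    -          SCG_B       -          3.0e-35  118.2  0.0
contig_40_1    -          SCG_A       -          5.0e-20   70.4  0.2
""".splitlines()

hits = scg_locations(example)
for scg, scg_hits in sorted(hits.items()):
    for contig, gene_id, evalue in scg_hits:
        print(f"{scg}\t{contig}\t{gene_id}\t{evalue:g}")

print("SCGs on multiple contigs:", multi_contig_scgs(hits))
```

Contigs carrying single-hit SCGs are the candidates to mark as trusted. SCGs flagged by `multi_contig_scgs` are the ones worth checking for contamination, ideally alongside coverage.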