elizabethmcd / metabolisHMM

Tool for constructing phylogenies and summarizing metabolic characteristics based on curated and custom profile HMMs
GNU General Public License v3.0

tfdA test case #13

Closed elizabethmcd closed 5 years ago

elizabethmcd commented 5 years ago

I'm a little too far down the road to fix some of this for my mehg analysis. That one is also tricky because of all the different datasets I pulled from and the comparative analyses I'm trying to make with HGT and whatnot.

A good thing to try would be to wrap up that project and then run the presence/absence analysis across the whole tree of screened genomes (TOL) for tfdA, since it seems to be more widespread and not as rare as the mehg stuff.

Steps that would have to be taken into account:

  1. Pulling down RefSeq genomes
  2. Pulling down large-scale, publicly available genome sets (Anantharaman, Woodcroft, Crits-Christoph, Tran, Parks) and then dereplicating by some threshold, so there aren't a bunch of duplicated bins across datasets that are probably very similar in sequence and would just add to the mess of the tree
  3. This is the genome database to screen from, so the nucleotide and protein files need to be stored somewhere (OSF?)
  4. Then can create all the presence/absence analyses with tfdA as a test case and look closer into that
  5. FastTree will be implemented in the pipeline as a test to look at what things pop out, but with strong recommendations to run RAxML on servers with more computing power, because FastTree isn't great
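
Steps 2 and 4 could be sketched roughly like this. This is only an illustration, not what metabolisHMM does: the function names, the genome names, the 0.99 similarity cutoff, and the idea of feeding in precomputed pairwise similarities (e.g. ANI from some other tool) are all assumptions, and the HMM hits are assumed to already be parsed into a dict.

```python
from typing import Dict, FrozenSet, List, Set


def dereplicate(genomes: List[str],
                similarity: Dict[FrozenSet[str], float],
                threshold: float = 0.99) -> List[str]:
    """Greedy dereplication (hypothetical helper).

    Walk the genomes in the given order and keep one only if it is below
    `threshold` similarity to every genome already kept. Pairs missing
    from `similarity` are treated as dissimilar (0.0).
    """
    kept: List[str] = []
    for genome in genomes:
        if all(similarity.get(frozenset((genome, k)), 0.0) < threshold
               for k in kept):
            kept.append(genome)
    return kept


def presence_absence(hits: Dict[str, Set[str]],
                     genomes: List[str],
                     markers: List[str]) -> Dict[str, Dict[str, int]]:
    """Build a genome x marker table of 1 (hit) / 0 (no hit)."""
    return {g: {m: int(m in hits.get(g, set())) for m in markers}
            for g in genomes}


# Toy data: all names and values below are made up for illustration.
genomes = ["refseq_A", "bin_1", "bin_2"]
similarity = {
    frozenset(("refseq_A", "bin_1")): 0.995,  # near-duplicate across datasets
    frozenset(("refseq_A", "bin_2")): 0.82,
    frozenset(("bin_1", "bin_2")): 0.80,
}
hits = {"refseq_A": {"tfdA"}, "bin_2": set()}

kept = dereplicate(genomes, similarity)
table = presence_absence(hits, kept, ["tfdA"])
print(kept)
print(table)
```

Because the dereplication is greedy, the input order decides which copy of a near-duplicate survives, so sorting the list by whatever quality measure is preferred before calling it would matter.
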
elizabethmcd commented 5 years ago

I believe I fixed this in my genomes-MAGs-database repository, so that with my condor pipeline it isn't biased by specific datasets. I'm just not putting it on a TOL for presence/absence, unless I pull down a representative from each phylum, but it would take a while to get something that looks pretty.
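
For the "one representative per phylum" idea, a minimal sketch might look like the following. The taxonomy and quality dicts are hypothetical inputs (for example, parsed from a classifier and a completeness estimate), and picking the highest-scoring genome is just one possible criterion.

```python
from typing import Dict


def pick_representatives(taxonomy: Dict[str, str],
                         quality: Dict[str, float]) -> Dict[str, str]:
    """Return {phylum: genome}, keeping the highest-quality genome per phylum.

    Genomes missing from `quality` are scored 0.0 (an assumption).
    """
    best: Dict[str, str] = {}
    for genome, phylum in taxonomy.items():
        current = best.get(phylum)
        if current is None or quality.get(genome, 0.0) > quality.get(current, 0.0):
            best[phylum] = genome
    return best


# Toy data: genome names, phyla, and scores are made up for illustration.
taxonomy = {"g1": "Proteobacteria", "g2": "Proteobacteria", "g3": "Firmicutes"}
quality = {"g1": 90.0, "g2": 97.5, "g3": 85.0}

representatives = pick_representatives(taxonomy, quality)
print(representatives)
```
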