Pipeline for customized phylogeny based on user provided gene/protein sequence(s) using Open Tree of Life data

pandurang-kolekar commented 10 years ago

I would like to propose the idea for "Pipeline for customized phylogeny based on user provided gene/protein sequence(s) using Open Tree of Life data".

Suppose a researcher has newly sequenced a gene/protein from a known species and wishes to carry out phylogenetic analysis of these sequences with existing orthologs in nearby taxonomic ranks (genus, family, order, class etc.). This is a frequent lab exercise. In such cases, the researcher compiles and curates the ortholog sequence data from relevant databases (GenBank, ENA, Swissprot, Uniprot etc.), then adds the newly sequenced gene/protein to this data set and follows the molecular phylogeny analysis protocol. So every time a new gene/protein is sequenced, one has to repeat this time-consuming process of data curation, compilation and phylogeny.

I would like to propose an idea to expedite this process using the resources at Open Tree of Life.

1. Based on the gene/protein in question (e.g. 16S rRNA, gyrB etc.) and the taxonomic rank (name of species, genus, family etc.), a custom script will search the NeXML files in tree repositories (TreeBASE, DRYAD etc.) to fetch the related records (sequence alignment character matrix, OTUs, length, publication details etc.).
2. Extract the OTUs and sequence alignment data.
3. Add the user sequences to the fetched data and carry out molecular phylogeny analysis using a pre-designed/customized setup (Mesquite, PHYLIP etc.).
4. Output the phylogenetic tree for user interpretation.
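
As a rough illustration of the "add user sequences to the fetched data" step, here is a minimal Python sketch. It assumes the fetched records have already been exported as FASTA text; the function names (`parse_fasta`, `merge_sequences`, `to_fasta`) are hypothetical and not part of any Open Tree or TreeBASE tool. Note that this only combines the sequences: if the fetched data is an alignment, the merged set would still need to be (re)aligned before tree inference.

```python
from typing import Dict


def parse_fasta(text: str) -> Dict[str, str]:
    """Parse FASTA text into {identifier: sequence}.

    The identifier is the first whitespace-delimited token of the header.
    """
    records: Dict[str, str] = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].strip()
            if not header:
                raise ValueError("empty FASTA header")
            current = header.split()[0]
            if current in records:
                raise ValueError(f"duplicate identifier: {current}")
            records[current] = ""
        elif current is None:
            raise ValueError("sequence data found before the first header")
        else:
            records[current] += line
    return records


def merge_sequences(reference: Dict[str, str], user: Dict[str, str]) -> Dict[str, str]:
    """Return reference records followed by user records.

    Refuses to merge when identifiers clash, so a lab sequence can never
    silently overwrite a fetched one.
    """
    clash = sorted(set(reference) & set(user))
    if clash:
        raise ValueError(f"identifiers present in both sets: {clash}")
    merged = dict(reference)
    merged.update(user)
    return merged


def to_fasta(records: Dict[str, str], width: int = 60) -> str:
    """Serialize records to FASTA text, wrapping sequences at `width` columns."""
    lines = []
    for name, seq in records.items():
        lines.append(f">{name}")
        for start in range(0, len(seq), width):
            lines.append(seq[start:start + width])
    return "\n".join(lines) + "\n"
```

The combined FASTA text could then be handed to whichever alignment and tree-building programs the lab already uses.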

This will help to characterize & annotate lab sequences and may even help in species assignment or the discovery of new species. It will save time on such routine analyses.

Resources needed: TreeBASE, DRYAD, Arbor, phylogeny packages (Bio::Phylo) etc. Experts may recommend a few more.

kcranston commented 10 years ago

Open Tree doesn't have very many alignments at the moment - we have been focusing on trees. But, you could find trees with maximum overlap to your list of species and use that tree as a constraint in downstream analyses. @mtholder is also interested in updating existing phylogenies with new sequence data.
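
To make the "maximum overlap" idea concrete, here is a small Python sketch that ranks candidate trees by how many of the query species appear among their tip labels. It assumes the tip labels of each tree have already been collected into a dictionary; the tree identifiers and the function names (`normalize`, `rank_trees_by_overlap`) are hypothetical, and no Open Tree API call is shown.

```python
from typing import Dict, Iterable, List, Tuple


def normalize(name: str) -> str:
    """Make names comparable: underscores to spaces, collapse whitespace, lowercase."""
    return " ".join(name.replace("_", " ").split()).lower()


def rank_trees_by_overlap(
    query_species: Iterable[str],
    tree_tips: Dict[str, Iterable[str]],
) -> List[Tuple[str, int]]:
    """Return (tree_id, shared_tip_count) pairs, best overlap first.

    Ties are broken alphabetically by tree_id so the order is deterministic.
    """
    query = {normalize(name) for name in query_species}
    scores = []
    for tree_id, tips in tree_tips.items():
        shared = query & {normalize(tip) for tip in tips}
        scores.append((tree_id, len(shared)))
    scores.sort(key=lambda item: (-item[1], item[0]))
    return scores
```

The top-ranked tree would be the candidate to use as a constraint. Plain string matching like this misses synonyms and misspellings, so in practice the names would first need to be resolved against a common taxonomy.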

pandurang-kolekar commented 10 years ago

Thanks for the suggestions! Trees with maximum overlap would be a good starting point. Based on the taxonomic proximity of the input species, other OTUs can be removed or retained in downstream analyses. I would like to discuss with @mtholder his strategy for this.
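
A minimal sketch of that filtering step, assuming each OTU already has a lineage stored as a rank-to-name dictionary (where those lineages come from is left open). The function name `retain_by_rank` and the data layout are illustrative assumptions.

```python
from typing import Dict, List

Lineage = Dict[str, str]  # e.g. {"genus": "Escherichia", "family": "Enterobacteriaceae"}


def retain_by_rank(
    query_lineage: Lineage,
    otu_lineages: Dict[str, Lineage],
    rank: str,
) -> List[str]:
    """Return the OTUs that share the query's taxon at the given rank.

    OTUs with no entry for that rank are dropped rather than guessed at.
    """
    target = query_lineage.get(rank)
    if target is None:
        raise ValueError(f"query lineage has no entry for rank '{rank}'")
    return [otu for otu, lineage in otu_lineages.items() if lineage.get(rank) == target]
```

Calling it with `rank="family"` keeps a tighter set of OTUs than `rank="class"`, so the rank acts as the knob for how much of the starting tree is retained.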

chinchliff commented 10 years ago

I think this might be out of scope for opentree. There are some other tools that could be useful for this, like PHLAWD, phylota, and others you've mentioned such as treebase, etc.

pandurang-kolekar commented 10 years ago

Thanks for the information about PHLAWD, @chinchliff. As far as I know, Phylota archives only eukaryotic genera. If it can be linked to TreeBASE and other bacterial and viral databases, that will help to broaden its scope.

alexharkess commented 10 years ago

@pandurang-kolekar You might be interested in PUmPER (http://sco.h-its.org/exelixis/web/software/put/index.html) from the Exelixis lab that seems to do just this.

mtholder commented 10 years ago

Sorry. I had missed this thread. I'm happy to chat about this. I won't be at the hackathon in person, but will be participating remotely (from the Exelixis lab, in fact).

pandurang-kolekar commented 10 years ago

PUmPER (http://sco.h-its.org/exelixis/web/software/put/index.html) works on a similar principle. Thanks @alexharkess ! @mtholder I will explore PUmPER, and then we can chat about this.

pandurang-kolekar commented 10 years ago

@alexharkess @mtholder I read the application note on PUmPER (http://bioinformatics.oxfordjournals.org/content/30/10/1476.long).

To summarize, it allows the user to create a multiple sequence alignment (MSA) from scratch or extend an existing MSA using PHLAWD. This step requires the gene name(s) and NCBI taxonomic group as input in the configuration file for PHLAWD.

The MSA is then given as input to ExaML or RAxML-Light to infer a phylogenetic tree. The program can be run in standalone or remote mode from the command line.

But I don't know whether it accepts user-provided sequence(s) that are not available in GenBank. I have sent an email to the corresponding author of PUmPER to inquire about this.

It's available for Linux only. A user-friendly server would be helpful for researchers with little or no computational background.

pandurang-kolekar commented 10 years ago

I didn't get any reply from the authors of PUmPER. @mtholder What are your views on this project idea?

daisieh commented 10 years ago

Our aTRAM pipeline might be helpful for this too: it can generate multiple gene alignments across multiple taxa from whole genome shotgun reads.

OpenTreeOfLife / hackathon, issue #19