Phylo submodule review and roadmap

blahah commented 10 years ago

Phylo code review - 20th April 2014

General notes

Overall very good, and a great start for the Bio.jl phylo module. Here I have attempted to map out what would be required to get to a version 1, high quality phylogenetics module to ship with Bio.jl that covers the major use cases.

This review covers:

the API
design of the module
minor style points
testing
repetition
documentation

Most of the review is a to-do list, so that as points are addressed they can be checked off.

1. API

Currently there are methods for the following operations on trees:

parsing/lexing newick format
parsing/lexing PhyloXML (in progress)
getting the root node
getting the children of all nodes in a tree simultaneously
test whether two trees are equal
produce a unique hash from a tree

We should define the minimal API for this library. I suggest it should include (in addition to the existing functionality):

Node/tree manipulation

[x] get and set any property of a node
[x] add child nodes
[x] add sibling nodes
[x] remove child nodes
[x] delete nodes (with or without connecting their descendents to the parent)
[x] detach the node (delete it from the original tree and return it as a new tree)
[x] re-root a (sub)tree
Tree exploration
[x] get the children of a node
[x] check if the node is a leaf (has no children)
[x] get the parent of a node
[x] check if the node is root
[x] get the siblings of a node
[x] get all the descendents of a node
[x] get all the leaves of a (sub)tree (i.e. all terminal descendents of a node)
[x] depth-first search of a (sub)tree
[x] breadth-first search of a (sub)tree
Tree measurement
[x] Get the shortest path between two nodes
[x] Get the distance (aggregate branch length) between two nodes
[ ] Check whether a (sub)tree is monophyletic for a given taxonomic rank
[x] Find the midpoint outgroup
Tree visualisation
[x] print a text representation
[x] print an image representation
[ ] produce an interactive representation
[ ] annotate these visualisations with arbitrary information from the tree
IO
[ ] parse/lex nexus
[ ] parse/lex NexML
2. Design

The module is structured around the abstract type Phylogeny, representing a phylogeny. It has subtypes Clado and Phylo representing Cladograms and Phylograms respectively.

These types represent the tree as an adjacency list with associated arrays of properties such as node names and edge lengths.

I would like to query two aspects of this design:

For many of the features listed above as desirable to include in the API, a natural representation of trees is as heirarchies of nodes. In this format, a variable containing a tree is really a pointer to the root node, which knows that it is a root and what its children are. Each child node knows it parent, might know how long the branch is from its parent, and knows what its children are. This is how the tree representations in BioPython/ETE and BioRuby work. Benefits of this system are that it is very easy to insert or remove parts of a tree, split a tree into subtrees, and recursively represent a tree in a text format. I can't think of any downsides (but happy to be corrected). We should consider adopting this representation unless there are major constraints on doing so in Julia.
Is it necessary to have separate Cladogram and Phylogram types? It seems to me that a Cladogram is easily represented by a tree where each node has a null branch length. A method on the root node could tell us whether the tree is a cladogram.

I think these two queries should be resolved before any of the remaining sections are addressed.

3. Style points

[x] Method names should be consistent with one another and with the Standard Library. STL uses "$(verb)$(noun)", e.g. joinpath. readtree is consistent whereas treebuild and treewrite are not. Should be buildtree and writetree. Similarly, cladobuild should be buildclado, etc.
[x] Why are file paths restricted to being ASCIIStrings? I think all operating systems now allow unicode paths. If this is true, we should be more liberal with the String type we allow.
[x] Style guide suggests variables should be all lower case, with underscore separators if absolutely necessary. We have, e.g. nodeLabel, tipLabel, treeNames, etc. These should be moved to all-lowercase.
[x] Style guide suggests function names should be lowercase - the names of internal functions of cladobuild, treebuild, newick, etc. should be lowercased. e.g. AddInteral -> addinternal.
4. Testing

Test coverage is quite good - around 95% of the code is hit in tests. --code-coverage shows one constructor and several parts of functions that are not covered - in particular conditional statements inside large functions. We should aim for 100% coverage.

[x] ReducedTopology is not tested.
[x] In cladobuild, the conditional block (if search(tp, ",") == 0:-1) to handle the case where a tree has no commas is not tested.
[x] In GoDown, the same conditionl block (if search(tp, ",") == 0:-1) is not tested.
[x] In treewrite, the case where append == true is not tested.
[x] In newick, the conditional if j > N is not tested.
[x] In addTerminal, the case where name == false is not tested.
5. Repetition
[x] There is some code repetition in the multiple versions of the build methods. Eliminating repetition by moving repeated code to functions makes maintenance easier and reduces method complexity. Functions or blocks that are either completely or mostly repeated should be broken out into functions with higher scope so all methods with the shared code can call them.
[x] AddInternal is exactly the same in cladobuild and treebuild.
[x] could treewrite for the case of a single tree simply add the single tree to an array and call the Array version?
[x] the two versions of newick() share a lot of code
6. Documentation

Currently, the documentation is in the form of comments on the main functions. This is sufficient to use the code, but we should expand this to full standalone documentation for all types, functions and methods and a tutorial highlighting common use-cases.

sdwfrost commented 10 years ago

I use a lot of trees in which tips are associated with a time/date; in some cases, these trees may be ultrametric (after taking into account the tip dates; for example, the output from BEAST). Any ideas on what the API should be for these? Additional methods for these types could include extracting coalescent/sampling intervals.

TransGirlCodes commented 10 years ago

Hi,

Recently I've been trying to build a type for representing trees that can be annotated with all manner of data the user might want to annotate them with - spurred on by efforts to read in the PhyloXML format which itself also allows extension for user defined data and annotations. The result is intended to be types for flexibly and easily extending trees in Julia, that could then be used in the creation of specific tree types. I haven't used BEAST for a while so don't remember the output files and formats, but working with popular programs like BEAST is definitely something I would like Phylo to be able to do - perhaps it can be the first practical application of the extendible format once I'm done with it (won't be long now).

Best,

Ben J. Ward PhD Student, UEA(ENV) & The Sainsbury Laboratory, UEA: b.ward@uea.ac.uk TSL: ben.ward@sainsbury-laboratory.ac.uk Github: https://github.com/Ward9250

From: sdwfrost [notifications@github.com] Sent: 07 July 2014 09:34 To: BioJulia/Bio.jl Cc: Ben Ward (TSL) Subject: Re: [Bio.jl] Phylo submodule review and roadmap (#13)

I use a lot of trees in which tips are associated with a time/date; in some cases, these trees may be ultrametric (after taking into account the tip dates; for example, the output from BEAST). Any ideas on what the API should be for these? Additional methods for these types could include extracting coalescent/sampling intervals.

— Reply to this email directly or view it on GitHubhttps://github.com/BioJulia/Bio.jl/issues/13#issuecomment-48152859.

hlapp commented 10 years ago

Since PhyloXML is getting a lot of mention here, I was wondering whether you are all aware of NeXML. It's the format used in Phenoscape and Open Tree of Life.

Vos et al. 2012. “NeXML: Rich, Extensible, and Verifiable Representation of Comparative Data and Metadata.” Systematic Biology 61 (4): 675–89. http://dx.doi.org/10.1093/sysbio/sys025

TransGirlCodes commented 10 years ago

Hi,

I don't know about everyone else, but I am aware of NeXML but I haven't yet used any programs for which it is used. I will make a note that people want it.

Best,

Ben J. Ward PhD Student, UEA(ENV) & The Sainsbury Laboratory, UEA: b.ward@uea.ac.uk TSL: ben.ward@sainsbury-laboratory.ac.uk Github: https://github.com/Ward9250

From: Hilmar Lapp [notifications@github.com] Sent: 07 July 2014 18:42 To: BioJulia/Bio.jl Cc: Ben Ward (TSL) Subject: Re: [Bio.jl] Phylo submodule review and roadmap (#13)

Since PhyloXML is getting a lot of mention here, I was wondering whether you are all aware of NeXMLhttp://www.nexml.org. It's the format used in Phenoscapehttp://phenoscape.org and Open Tree of Lifehttp://opentreeoflife.org.

Vos et al. 2012. “NeXML: Rich, Extensible, and Verifiable Representation of Comparative Data and Metadata.” Systematic Biology 61 (4): 675–89. http://dx.doi.org/10.1093/sysbio/sys025

— Reply to this email directly or view it on GitHubhttps://github.com/BioJulia/Bio.jl/issues/13#issuecomment-48211512.

blahah commented 10 years ago

I chose PhyloXML over NexML when I started on a relevant project because it was easier to understand and manipulate in the libraries I found. But I have to say neither standard has well-documented libraries. The schema are documented and the libraries I know of have some class/method documentation, but there's a big conceptual leap missing. I think that's something we should be careful to address in our documentation. And I think we should support both PhyloXML and NexML.

hlapp commented 10 years ago

Whether NeXML has well documented libraries depends on the language and purpose you're interested in. For example RNeXML is reasonably well documented. So is Bio::Phylo in Perl. I would argue, however, that most library-level projects are better off coding their APIs to XML formats from scratch using the available XML toos as their interface. This is what RNeXML, Phenoscape, and Open Tree did. Layering "generic" libraries on top of XML libraries typically results either in a product that is not generic, or doesn't perform well, and generally gains little.

So in short, all that's needed for a library is a XML parsing library. Julia has that, no?

TransGirlCodes commented 10 years ago

Julia does indeed have a XML parsing library - LightXML, this is already utilized in Phylo for the parsing of PhyloXML, I see no reason NeXML can't be used - I believe parsing NeXML is one of the targets on the road-map issue and so I will get to it in time. This week I plan to make several commits and update with the targets I've had a crack at, ticked off.

Best,

Ben J. Ward PhD Student, UEA(ENV) & The Sainsbury Laboratory, UEA: b.ward@uea.ac.uk TSL: ben.ward@sainsbury-laboratory.ac.uk Github: https://github.com/Ward9250

From: Hilmar Lapp [notifications@github.com] Sent: 08 July 2014 14:38 To: BioJulia/Bio.jl Cc: Ben Ward (TSL) Subject: Re: [Bio.jl] Phylo submodule review and roadmap (#13)

Whether NeXML has well documented libraries depends on the language and purpose you're interested in. For example RNeXMLhttps://github.com/rOpenSci/RNeXML is reasonably well documented. So is Bio::Phylohttps://github.com/rvosa/bio-phylo in Perl. I would argue, however, that most library-level projects are better off coding their APIs to XML formats from scratch using the available XML toos as their interface. This is what RNeXML, Phenoscape, and Open Tree did. Layering "generic" libraries on top of XML libraries typically results either in a product that is not generic, or doesn't perform well, and generally gains little.

So in short, all that's needed for a library is a XML parsing library. Julia has that, no?

— Reply to this email directly or view it on GitHubhttps://github.com/BioJulia/Bio.jl/issues/13#issuecomment-48336511.

blahah commented 10 years ago

@hlapp I agree - I was explaining why I personally chose to use PhyloXML rather than raising a barrier to implementing NexML in Bio.jl. RNeXML is nice - hadn't seen that before.

blahah commented 10 years ago

And as @Ward9250 points out, it's listed as a to-do in the review above, under IO.

TransGirlCodes commented 10 years ago

The thing I'm currently thinking about is the PhyX format of tree and your comments @Blahah ". In this format, a variable containing a tree is really a pointer to the root node, which knows that it is a root and what its children are. Each child node knows it parent, might know how long the branch is from its parent, and knows what its children are." This sounds like a doubly linked list the type you might be taught about in C++ texts. I've not seen many examples in Julia of people programming with pointers in this sort of way. See Ward9250/Phylogenetics.jl PhyX branch for how PhyXElements are built - each element simply occupies one place in an array, and each has an integer variable that is parent, such that to get the parent of a node you'd do something like this ArrayOfNodes[ArrayOfNodes[5].Parent]. I'm not sure I like this so much and would like to try and arrange things so each element points to its parent and it's children. I'm confident the current recursive building function will easily adapt to this. I just haven't seen pointers used explicitly in Julia code yet - does anyone else know how?

TransGirlCodes commented 10 years ago

@Blahah I've been going over the TODO list more and the comments re having a recursive structure for trees instead of the adjacency lists of Clado and Phylo. I've made a prototype file for coding the extensible trees I've been building type as a recursive structure instead of a flat array. It is built depth first. It is in this Gist: https://gist.github.com/Ward9250/f5442187da3d379b5345 Let me know what you think of this prototype - the recursive building code is a lot simpler than for the flat form. I'm interested to know what you think. AFAIK defined types are passed by reference, so I think I can have it so as the nodes point to their parents as well as their children. By incompletely initializing a node, and then passing it to the recursive function that builds its children.

dcjones commented 9 years ago

@Ward9250 Should we keep this issue open, or is it redundant with #17?

TransGirlCodes commented 9 years ago

I thin this is ok to close. Because this was a review of the very first pass at a Phylo module. And the current one is now waaaaay different to the original.

BioJulia / Bio.jl