Closed blahah closed 9 years ago
I use a lot of trees in which tips are associated with a time/date; in some cases, these trees may be ultrametric (after taking into account the tip dates; for example, the output from BEAST). Any ideas on what the API should be for these? Additional methods for these types could include extracting coalescent/sampling intervals.
Hi,
Recently I've been trying to build a type for representing trees that can be annotated with all manner of data the user might want to annotate them with - spurred on by efforts to read in the PhyloXML format which itself also allows extension for user defined data and annotations. The result is intended to be types for flexibly and easily extending trees in Julia, that could then be used in the creation of specific tree types. I haven't used BEAST for a while so don't remember the output files and formats, but working with popular programs like BEAST is definitely something I would like Phylo to be able to do - perhaps it can be the first practical application of the extendible format once I'm done with it (won't be long now).
Best,
Ben J. Ward PhD Student, UEA(ENV) & The Sainsbury Laboratory, UEA: b.ward@uea.ac.uk TSL: ben.ward@sainsbury-laboratory.ac.uk Github: https://github.com/Ward9250
From: sdwfrost [notifications@github.com] Sent: 07 July 2014 09:34 To: BioJulia/Bio.jl Cc: Ben Ward (TSL) Subject: Re: [Bio.jl] Phylo submodule review and roadmap (#13)
I use a lot of trees in which tips are associated with a time/date; in some cases, these trees may be ultrametric (after taking into account the tip dates; for example, the output from BEAST). Any ideas on what the API should be for these? Additional methods for these types could include extracting coalescent/sampling intervals.
— Reply to this email directly or view it on GitHubhttps://github.com/BioJulia/Bio.jl/issues/13#issuecomment-48152859.
Since PhyloXML is getting a lot of mention here, I was wondering whether you are all aware of NeXML. It's the format used in Phenoscape and Open Tree of Life.
Vos et al. 2012. “NeXML: Rich, Extensible, and Verifiable Representation of Comparative Data and Metadata.” Systematic Biology 61 (4): 675–89. http://dx.doi.org/10.1093/sysbio/sys025
Hi,
I don't know about everyone else, but I am aware of NeXML but I haven't yet used any programs for which it is used. I will make a note that people want it.
Best,
Ben J. Ward PhD Student, UEA(ENV) & The Sainsbury Laboratory, UEA: b.ward@uea.ac.uk TSL: ben.ward@sainsbury-laboratory.ac.uk Github: https://github.com/Ward9250
From: Hilmar Lapp [notifications@github.com] Sent: 07 July 2014 18:42 To: BioJulia/Bio.jl Cc: Ben Ward (TSL) Subject: Re: [Bio.jl] Phylo submodule review and roadmap (#13)
Since PhyloXML is getting a lot of mention here, I was wondering whether you are all aware of NeXMLhttp://www.nexml.org. It's the format used in Phenoscapehttp://phenoscape.org and Open Tree of Lifehttp://opentreeoflife.org.
Vos et al. 2012. “NeXML: Rich, Extensible, and Verifiable Representation of Comparative Data and Metadata.” Systematic Biology 61 (4): 675–89. http://dx.doi.org/10.1093/sysbio/sys025
— Reply to this email directly or view it on GitHubhttps://github.com/BioJulia/Bio.jl/issues/13#issuecomment-48211512.
I chose PhyloXML over NexML when I started on a relevant project because it was easier to understand and manipulate in the libraries I found. But I have to say neither standard has well-documented libraries. The schema are documented and the libraries I know of have some class/method documentation, but there's a big conceptual leap missing. I think that's something we should be careful to address in our documentation. And I think we should support both PhyloXML and NexML.
Whether NeXML has well documented libraries depends on the language and purpose you're interested in. For example RNeXML is reasonably well documented. So is Bio::Phylo in Perl. I would argue, however, that most library-level projects are better off coding their APIs to XML formats from scratch using the available XML toos as their interface. This is what RNeXML, Phenoscape, and Open Tree did. Layering "generic" libraries on top of XML libraries typically results either in a product that is not generic, or doesn't perform well, and generally gains little.
So in short, all that's needed for a library is a XML parsing library. Julia has that, no?
Julia does indeed have a XML parsing library - LightXML, this is already utilized in Phylo for the parsing of PhyloXML, I see no reason NeXML can't be used - I believe parsing NeXML is one of the targets on the road-map issue and so I will get to it in time. This week I plan to make several commits and update with the targets I've had a crack at, ticked off.
Best,
Ben J. Ward PhD Student, UEA(ENV) & The Sainsbury Laboratory, UEA: b.ward@uea.ac.uk TSL: ben.ward@sainsbury-laboratory.ac.uk Github: https://github.com/Ward9250
From: Hilmar Lapp [notifications@github.com] Sent: 08 July 2014 14:38 To: BioJulia/Bio.jl Cc: Ben Ward (TSL) Subject: Re: [Bio.jl] Phylo submodule review and roadmap (#13)
Whether NeXML has well documented libraries depends on the language and purpose you're interested in. For example RNeXMLhttps://github.com/rOpenSci/RNeXML is reasonably well documented. So is Bio::Phylohttps://github.com/rvosa/bio-phylo in Perl. I would argue, however, that most library-level projects are better off coding their APIs to XML formats from scratch using the available XML toos as their interface. This is what RNeXML, Phenoscape, and Open Tree did. Layering "generic" libraries on top of XML libraries typically results either in a product that is not generic, or doesn't perform well, and generally gains little.
So in short, all that's needed for a library is a XML parsing library. Julia has that, no?
— Reply to this email directly or view it on GitHubhttps://github.com/BioJulia/Bio.jl/issues/13#issuecomment-48336511.
@hlapp I agree - I was explaining why I personally chose to use PhyloXML rather than raising a barrier to implementing NexML in Bio.jl. RNeXML is nice - hadn't seen that before.
And as @Ward9250 points out, it's listed as a to-do in the review above, under IO.
The thing I'm currently thinking about is the PhyX format of tree and your comments @Blahah ". In this format, a variable containing a tree is really a pointer to the root node, which knows that it is a root and what its children are. Each child node knows it parent, might know how long the branch is from its parent, and knows what its children are." This sounds like a doubly linked list the type you might be taught about in C++ texts. I've not seen many examples in Julia of people programming with pointers in this sort of way. See Ward9250/Phylogenetics.jl PhyX branch for how PhyXElements are built - each element simply occupies one place in an array, and each has an integer variable that is parent, such that to get the parent of a node you'd do something like this ArrayOfNodes[ArrayOfNodes[5].Parent]. I'm not sure I like this so much and would like to try and arrange things so each element points to its parent and it's children. I'm confident the current recursive building function will easily adapt to this. I just haven't seen pointers used explicitly in Julia code yet - does anyone else know how?
@Blahah I've been going over the TODO list more and the comments re having a recursive structure for trees instead of the adjacency lists of Clado and Phylo. I've made a prototype file for coding the extensible trees I've been building type as a recursive structure instead of a flat array. It is built depth first. It is in this Gist: https://gist.github.com/Ward9250/f5442187da3d379b5345 Let me know what you think of this prototype - the recursive building code is a lot simpler than for the flat form. I'm interested to know what you think. AFAIK defined types are passed by reference, so I think I can have it so as the nodes point to their parents as well as their children. By incompletely initializing a node, and then passing it to the recursive function that builds its children.
@Ward9250 Should we keep this issue open, or is it redundant with #17?
I thin this is ok to close. Because this was a review of the very first pass at a Phylo module. And the current one is now waaaaay different to the original.
Phylo code review - 20th April 2014
General notes
Overall very good, and a great start for the Bio.jl phylo module. Here I have attempted to map out what would be required to get to a version 1, high quality phylogenetics module to ship with Bio.jl that covers the major use cases.
This review covers:
Most of the review is a to-do list, so that as points are addressed they can be checked off.
1. API
Currently there are methods for the following operations on trees:
We should define the minimal API for this library. I suggest it should include (in addition to the existing functionality):
Node/tree manipulation
Tree exploration
Tree measurement
Tree visualisation
IO
2. Design
The module is structured around the abstract type
Phylogeny
, representing a phylogeny. It has subtypes Clado and Phylo representing Cladograms and Phylograms respectively.These types represent the tree as an adjacency list with associated arrays of properties such as node names and edge lengths.
I would like to query two aspects of this design:
I think these two queries should be resolved before any of the remaining sections are addressed.
3. Style points
joinpath
.readtree
is consistent whereastreebuild
andtreewrite
are not. Should bebuildtree
andwritetree
. Similarly,cladobuild
should bebuildclado
, etc.ASCIIStrings
? I think all operating systems now allow unicode paths. If this is true, we should be more liberal with theString
type we allow.nodeLabel
,tipLabel
,treeNames
, etc. These should be moved to all-lowercase.cladobuild
,treebuild
,newick
, etc. should be lowercased. e.g.AddInteral
->addinternal
.4. Testing
Test coverage is quite good - around 95% of the code is hit in tests.
--code-coverage
shows one constructor and several parts of functions that are not covered - in particular conditional statements inside large functions. We should aim for 100% coverage.cladobuild
, the conditional block (if search(tp, ",") == 0:-1
) to handle the case where a tree has no commas is not tested.GoDown
, the same conditionl block (if search(tp, ",") == 0:-1
) is not tested.treewrite
, the case whereappend == true
is not tested.newick
, the conditionalif j > N
is not tested.addTerminal
, the case wherename == false
is not tested.5. Repetition
build
methods. Eliminating repetition by moving repeated code to functions makes maintenance easier and reduces method complexity. Functions or blocks that are either completely or mostly repeated should be broken out into functions with higher scope so all methods with the shared code can call them.AddInternal
is exactly the same incladobuild
andtreebuild
.treewrite
for the case of a single tree simply add the single tree to an array and call theArray
version?newick()
share a lot of code6. Documentation
Currently, the documentation is in the form of comments on the main functions. This is sufficient to use the code, but we should expand this to full standalone documentation for all types, functions and methods and a tutorial highlighting common use-cases.