Kitware / flow

Full data science workflows on the web
http://resonant-flow.readthedocs.org
Apache License 2.0

Read Nexus tree or tree/matrix files #89

Open curtislisle opened 9 years ago

curtislisle commented 9 years ago

Nexus files are used often in phylogenetics. Instead of maintaining our own parsers, we should adopt mature parsers where they exist. The package below reads Nexus and Newick files into R more reliably than ape, and it is built on the NCL (NEXUS Class Library).

http://francoismichonneau.net/2014/12/rncl/

curtislisle commented 8 years ago

In the attached ZIP is a simple tree in Nexus and a corresponding character matrix. We need to be able to add reading of this type to Arbor. A lot of existing packages will output in this format.

nexus_example_data.zip

curtislisle commented 8 years ago

I know we have simple Nexus tree reading, but this format is complex. There is a very complete C++ implementation of the NEXUS spec at the link below. Maybe we can use it to parse files into our intermediate tree representation:

https://github.com/mtholder/ncl

curtislisle commented 8 years ago

Flow currently assumes that any file with a Nexus extension contains trees. This is not correct. Nexus is a file format that can contain trees, matrices, or both in a single file, and it often does. Multiple trees and multiple matrices can be stored in a single Nexus file. Reading Nexus successfully is fairly critical for widespread adoption of Arbor.
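To make the point concrete, here is a minimal sketch of how a file could be inspected before deciding what it contains. It is not Flow's reader and not a full NEXUS parser: the function names and the sample file are made up for illustration, it only scans for `BEGIN ...; END;` blocks after stripping bracketed comments, and it ignores quoted tokens, which a mature parser such as NCL handles properly.

```python
import re

# Hypothetical sample file: one character matrix and two trees.
SAMPLE = """#NEXUS
[a small example]
BEGIN TAXA;
  DIMENSIONS NTAX=3;
  TAXLABELS A B C;
END;
BEGIN CHARACTERS;
  DIMENSIONS NCHAR=2;
  FORMAT DATATYPE=STANDARD;
  MATRIX
    A 01
    B 11
    C 00
  ;
END;
BEGIN TREES;
  TREE t1 = ((A,B),C);
  TREE t2 = (A,(B,C));
END;
"""

# A block runs from "BEGIN name;" to the next "END;" (or "ENDBLOCK;").
BLOCK_RE = re.compile(
    r"\bbegin\s+(\w+)\s*;(.*?)\bend(?:block)?\s*;",
    re.IGNORECASE | re.DOTALL,
)
# A tree statement starts a TREES block or follows a ';' and begins
# with TREE (or UTREE).
TREE_RE = re.compile(r"(?:^|;)\s*u?tree\s+", re.IGNORECASE)


def strip_comments(text):
    """Drop [bracketed] comments, including nested ones."""
    kept, depth = [], 0
    for ch in text:
        if ch == "[":
            depth += 1
        elif ch == "]" and depth > 0:
            depth -= 1
        elif depth == 0:
            kept.append(ch)
    return "".join(kept)


def summarize_nexus(text):
    """Report block names plus tree and matrix counts for one file."""
    body = strip_comments(text)
    blocks = [(name.lower(), content) for name, content in BLOCK_RE.findall(body)]
    trees = sum(
        len(TREE_RE.findall(content)) for name, content in blocks if name == "trees"
    )
    # Matrices live in CHARACTERS blocks or in the older DATA block.
    matrices = sum(1 for name, _ in blocks if name in ("characters", "data"))
    return {"blocks": [name for name, _ in blocks], "trees": trees, "matrices": matrices}


if __name__ == "__main__":
    print(summarize_nexus(SAMPLE))
```

A summary like this would let Flow choose between a tree type, a table type, or a combined type instead of deciding from the file extension alone.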

jeffbaumes commented 8 years ago

I can take a look. It seems that a new "trees_tables" type is appropriate for Nexus files.

curtislisle commented 8 years ago

Thanks. This isn’t urgent, but I’d like to work on this over the next few weeks/months.


jeffbaumes commented 8 years ago

It is clear that there can be zero, one, or more trees in a Nexus file, and it is clear that there can be zero or one matrices. What is not clear is whether there can be more than one matrix (or whether this ever happens in practice). This paper (http://sysbio.oxfordjournals.org/content/46/4/590.full.pdf) seems to document the Nexus format better than anything else I've seen. To do this right, we should collect Nexus files of all shapes and sizes and test against all of them to ensure they are supported.

If there can be any number of trees or tables, a few workflows might make sense. I prefer a trees_tables type which has a Nexus file format and one or more in-memory formats based on the library used to read it (such as ape or ncl). There could then be standard "Select Nexus Tree" and "Select Nexus Matrix" analyses in Flow which take a trees_tables (Nexus file) and a selector (tree/table index or name) as input and output a single tree/table in one of the supported formats. So a workflow may look something like this:

[Image: img_0897]
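The selector step could be sketched roughly as follows. This is only an illustration of the idea, not existing Flow code: `TreesTables`, `select_nexus_tree`, and `select_nexus_matrix` are hypothetical names, trees are assumed to be held as Newick strings, and each matrix is assumed to be a mapping from taxon to character row.

```python
from dataclasses import dataclass, field
from typing import Dict, Union

# A selector is either a position in file order or a name.
Selector = Union[int, str]


@dataclass
class TreesTables:
    """Hypothetical in-memory form of one Nexus file."""
    # name -> Newick string, in file order (dicts keep insertion order)
    trees: Dict[str, str] = field(default_factory=dict)
    # name -> {taxon: character row}, in file order
    tables: Dict[str, Dict[str, str]] = field(default_factory=dict)


def _select(items, selector, kind):
    """Look up one item by integer index or by name."""
    if isinstance(selector, int):
        names = list(items)
        if not 0 <= selector < len(names):
            raise IndexError(
                f"{kind} index {selector} out of range (file has {len(names)})"
            )
        return items[names[selector]]
    if selector not in items:
        raise KeyError(f"no {kind} named {selector!r}")
    return items[selector]


def select_nexus_tree(data: TreesTables, selector: Selector = 0) -> str:
    """Sketch of a "Select Nexus Tree" analysis: trees_tables in, one tree out."""
    return _select(data.trees, selector, "tree")


def select_nexus_matrix(data: TreesTables, selector: Selector = 0) -> Dict[str, str]:
    """Sketch of a "Select Nexus Matrix" analysis: trees_tables in, one table out."""
    return _select(data.tables, selector, "matrix")
```

Defaulting the selector to index 0 keeps the common single-tree, single-matrix file convenient, while files with several trees or matrices can be addressed by index or by name.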

curtislisle commented 8 years ago

I agree with this approach of having the combined format and selector steps in a workflow. I am working with David Maddison this week. I'll ask him for sample files and how many trees / matrices are allowed per file.
