EcoJulia / Phylo.jl

Simple phylogenetic trees in Julia to work with Diversity.jl - https://github.com/EcoJulia/Diversity.jl
BSD 2-Clause "Simplified" License
34 stars 13 forks source link

Newick format is general and permits multifurcations #6

Closed mashu closed 4 years ago

mashu commented 6 years ago

Hi,

  1. Sometimes gene genealogy cannot be fully determined from the data, so it is not uncommon to have multi-furactions in gene trees. It seems package does not support them ? Would be nice to have that.
using Phylo;
t= parsenewick("((A,B)AB,C)ABCD;")  #Works
t= parsenewick("((A,B)AB,C,D)ABCD;") # Does not work
  1. Newick format in it's basic form can vary depending if support or distance values are provided, names for internal nodes, or it can be just a topological tree. I link ETE3 documentation page which shows table with example trees. Would be nice to support these various formats.

  2. Different strategies for traversing and exploring

    • First root toward leafs
    • First leafs towards root
    • Return node with name
    • Return parent
    • Return children of a node
    • Check if node is internal or external (leaf)
    • Return current node attributes (name, distance, support)
richardreeve commented 6 years ago

Okay, great, thanks. My thoughts:

  1. Sometimes gene genealogy cannot be fully determined from the data, so it is not uncommon to have multi-furactions in gene trees. It seems package does not support them ? Would be nice to have that. Shouldn't be a big problem - I have to specifically check for binary trees at the moment, because that's what I needed, but allowing non-binary trees should be possible

  2. Newick format in it's basic form can vary depending if support or distance values are provided, names for internal nodes, or it can be just a topological tree. I link ETE3 documentation page which shows table with example trees. Would be nice to support these various formats. Can you tell me which I don't currently support? I thought it could handle all of these, but I haven't checked... it doesn't record support information (or anything else though apart from branch length), it just discards it - that's the hardest thing to fix

  3. Different strategies for traversing and exploring

    • [x] General iterator - nodenameiter() to iterate over node names, nodeiter() to iterate over nodes, likewise branch...() for branches and ...filter() (instead of ...iter()) to iterate over the same but while filtering (e.g. for leaves)
    • [x] First root toward leafs; First leafs towards root - can I ask why you need these? It's perfectly possible to ask for the roots (or leaves) and then to iterate yourself using getchildren() or getparent() (or possibly getancestors() and getdescendants() ), but I don't really know why people would want to do it in general...
    • [x] Return node with name - nodeiter() and nodenameiter() will guarantee to use the same ordering, so you could zip() the iterators.
    • [x] Return parent - getparent()
    • [x] Return children of a node - getchildren()
    • [x] Check if node is internal or external (leaf) - isleaf() and isroot(), can be combined with e.g. nodenameiter(tree) to make nodenamefilter(isleaf, tree) to make a filtering iterator
    • [x] Return current node attributes (name) - nodenameiter() generates a name iterator
    • [x] Return current node attributes (distance) - distance() gives you the distance between two nodes, is this what you mean?
    • [ ] Return current node attributes (support) - not currently stored information

Do let me know if any of this isn't what you mean...

mashu commented 6 years ago

Regarding

    • Trees 0 and 2 contain support values, so they fail.
      Node 1.000000 already present in tree
    • Trees 1,3,7,8 contain internal node with the same name as external node, so they fail. I can't think of a reason to name internal nodes the same as leaf, but maybe it can happen. Python package allows all nodes to have the same name even for all leafs and internal nodes.
      from ete3 import Tree
      print(Tree("((B:0.723274,B:0.567784)E:0.067192,(B:0.279326,B:0.756049)B:0.807788);",format=1).get_ascii(show_internal=True))
  1. Why people want to iterate tree ?

    • I might want to process leafs while traversing tree differently. For example I might have two genes A_X and B_Y from species X and Y that are connected by internal node (speciation event) XY. Closer to root appears another gene C_X, which is a duplication that happened before speciation. Genes A_X and B_Y would be orthologus, C_X won't be to any of them. While traversing the tree I can extract A_X and B_Y as orthologus genes, and ignore C_X and B_Y pair.

    Why iteration and not a recursion, I am not a programmer, but I remember from somewhere that every recursion can be turned into iteration (not simple) and iterations are more efficient.

richardreeve commented 6 years ago

Ah, okay. I used Wikipedia to determine newick format in fact (also described here) - it even provides a set of grammar rules, which my code follows (though at the moment only for binary trees, and with an extension to allow metadata - including support values - to be stored inside [&...] as per nexus trees). Do you believe those rules are wrong? The ETE parser that you reference breaks those rules by allowing a number after ) to refer to a support value, and a string to refer to an internal node name. This seems to mean that you need to be incredibly careful in naming nodes as there will be misparsing if a name only includes numbers. The newick format within nexus tree format also frequently uses numbers as nodenames, which would be incompatible with this ETE-variant newick format. This seems like a poor breakage of compatibility between the two supposedly identical formats, and I'm reluctant to change things to use it when the nexus variant allows any arbitrary metadata (including support values) to be included with any node or branch... do have a strong reason for preferring ETE's (limited, incompatible) version?

I also have to admit I'm not remotely keen on duplicate node names - the names are used to store and retrieve the nodes - to name them in fact! - so having duplicate names would require a shadow naming scheme that was unique, which rather defeats the purpose of names in the first place. It also makes retrieval of a named node from a tree something that may return one or many nodes, which doesn't seem like a good idea either - I'm not sure why that would be desirable, do you?

mashu commented 6 years ago

I see your points and I do agree with most of your comments with small exception.

The ETE parser that you reference breaks those rules by allowing a number after ) to refer to a support value, and a string to refer to an internal node name.

I am surprised that naming internal nodes is not defined by standard Newick format. Unfortunately, my application requires names for internal nodes.

It also makes retrieval of a named node from a tree something that may return one or many nodes The branching point in the tree is also a node, so it could be returned as one node that has the property of having children. I think in both cases you would return one node.

I agree that not unique names, are not a good idea, in fact I can't think of a reason when that would be useful, nodes in my application are unique. But I do require naming internal nodes. If internal node naming works, and trees allow multifurcation I would be happy :+1:

Big thanks, for nice package and fast response !

richardreeve commented 6 years ago

Sorry - I think my response was confusing. Naming internal nodes is no problem in my code or the standard reference. The problem is in the ETE standard, where if an internal node name is a number it will be interpreted as a support value instead of as a name (as I understand what I'm reading there?!).

richardreeve commented 6 years ago

@mashu - do you want to try the polytomies branch in #7 to see if it does what you need? It's rough and ready, but you should be able to parse a tree with polytomous branching using parsenewick(string, NamedPolytomousTree) and open(io -> parsenewick(io, NamedPolytomousTree), file).

mashu commented 6 years ago

It works with tree I make by hand, but does not work with my trees, for example this one and fails on parsing

polytreestr = "(((((Q89S90:1.000,((Q89S67:1.000,Q74FM6:1.000)geobra5:1.000,(Q8AA21:1.000,(Q8RFB7:1.000,A9WG08:1.000)chlofus8:1.000)cfbac9:1.000)proteobacteria7:0.252,Q7NH47:1.000)gpl1427:1.000,((O06579:1.000,Q9RI40:1.000,Q9L2C5:0.600)actinomycetales612:1.
   ...: 000,(Q9WYA9:1.000,(B5YL47:1.000,(O67899:1.000,(Q7UWJ7:1.000,B8E0E7:1.000)dicrho2:1.000)draqui482:1.000)drathe5:1.000)dratthema3:1.000)drattsmtub13:1.000)gamproleipse89:1.000,(Q8TT31:1.000,(Q97ZL2:1.000,B1L5D4:1.000)korsul405:1.000)archaea18:1.000
   ...: )bacteria7:1.000,((A9RR03:1.000,O23404:1.000)embryophyta108:1.000,(Q4QGX9:1.000,(A2DFI7:1.000,A2EEC2:0.918,A2E4Y2:0.327,A2GQS3:0.083,(A8BQ26:1.000,B8C332:1.000)dtpgia19:1.000)dtpgtri29:1.000)dtpgtlei37:1.000)eukaryota88:1.000)cellular_organisms10
   ...: :1.000)root;"
t = parsenewick(polytreestr, NamedPolytomousTree)
for (node,nodename) in zip(nodeiter(t),nodenameiter(t))
    println(nodename)
end

This tree is different from the one I use, only by -1.000 value after cellular_organisms10, as there score cannot be computed, so I used negatives for uninitialized nodes to mark them. Though it's not supported in Phylo, so I had to change it to positive, but even though parsing still fails. I am not sure why.

richardreeve commented 6 years ago

Hmm, that works fine for me once I take out the weird ...: formatting at the end of lines. So:

julia> polytreestr = "(((((Q89S90:1.000,((Q89S67:1.000,Q74FM6:1.000)geobra5:1.000,(Q8AA21:1.000,(Q8RFB7:1.000,A9WG08:1.000)chlofus8:1.000)cfbac9:1.000)proteobacteria7:0.252,Q7NH47:1.000)gpl1427:1.000,((O06579:1.000,Q9RI40:1.000,Q9L2C5:0.600)actinomycetales612:1.000,(Q9WYA9:1.000,(B5YL47:1.000,(O67899:1.000,(Q7UWJ7:1.000,B8E0E7:1.000)dicrho2:1.000)draqui482:1.000)drathe5:1.000)dratthema3:1.000)drattsmtub13:1.000)gamproleipse89:1.000,(Q8TT31:1.000,(Q97ZL2:1.000,B1L5D4:1.000)korsul405:1.000)archaea18:1.000)bacteria7:1.000,((A9RR03:1.000,O23404:1.000)embryophyta108:1.000,(Q4QGX9:1.000,(A2DFI7:1.000,A2EEC2:0.918,A2E4Y2:0.327,A2GQS3:0.083,(A8BQ26:1.000,B8C332:1.000)dtpgia19:1.000)dtpgtri29:1.000)dtpgtlei37:1.000)eukaryota88:1.000)cellular_organisms10:1.000)root;";

julia> t = parsenewick(polytreestr, NamedPolytomousTree)
Phylo.PolytomousTree{Phylo.LeafInfo,Void} phylogenetic tree with 49 nodes and 48 branches
Leaf names:
String["Q89S90", "Q89S67", "Q74FM6", "Q8AA21", "Q8RFB7", "A9WG08", "Q7NH47", "O06579", "Q9RI40", "Q9L2C5"  …  "B1L5D4", "A9RR03", "O23404", "Q4QGX9", "A2DFI7", "A2EEC2", "A2E4Y2", "A2GQS3", "A8BQ26", "B8C332"]

julia> for (node,nodename) in zip(nodeiter(t),nodenameiter(t))
           println(nodename)
       end
Q89S90
Q89S67
Q74FM6
geobra5
Q8AA21
Q8RFB7
A9WG08
chlofus8
cfbac9
proteobacteria7
Q7NH47
gpl1427
O06579
Q9RI40
Q9L2C5
actinomycetales612
Q9WYA9
B5YL47
O67899
Q7UWJ7
B8E0E7
dicrho2
draqui482
drathe5
dratthema3
drattsmtub13
gamproleipse89
Q8TT31
Q97ZL2
B1L5D4
korsul405
archaea18
bacteria7
A9RR03
O23404
embryophyta108
Q4QGX9
A2DFI7
A2EEC2
A2E4Y2
A2GQS3
A8BQ26
B8C332
dtpgia19
dtpgtri29
dtpgtlei37
eukaryota88
cellular_organisms10
root

julia> collect(branchiter(t))
48-element Array{Phylo.Branch{String},1}:
 [node "geobra5"]-->[1.0 length branch]-->[node "Q74FM6"]
 [node "gpl1427"]-->[1.0 length branch]-->[node "Q7NH47"]
 [node "dtpgtri29"]-->[0.327 length branch]-->[node "A2E4Y2"]
 [node "cellular_organisms10"]-->[1.0 length branch]-->[node "bacteria7"]
 [node "gamproleipse89"]-->[1.0 length branch]-->[node "gpl1427"]
 [node "dtpgtlei37"]-->[1.0 length branch]-->[node "Q4QGX9"]
 [node "archaea18"]-->[1.0 length branch]-->[node "Q8TT31"]
 [node "proteobacteria7"]-->[1.0 length branch]-->[node "cfbac9"]
 [node "drathe5"]-->[1.0 length branch]-->[node "draqui482"]
 [node "actinomycetales612"]-->[0.6 length branch]-->[node "Q9L2C5"]
 [node "bacteria7"]-->[1.0 length branch]-->[node "gamproleipse89"]
 [node "embryophyta108"]-->[1.0 length branch]-->[node "A9RR03"]
 [node "draqui482"]-->[1.0 length branch]-->[node "dicrho2"]
 [node "gamproleipse89"]-->[1.0 length branch]-->[node "drattsmtub13"]
 [node "dtpgia19"]-->[1.0 length branch]-->[node "A8BQ26"]
 [node "draqui482"]-->[1.0 length branch]-->[node "O67899"]
 ⋮
 [node "dtpgia19"]-->[1.0 length branch]-->[node "B8C332"]
 [node "root"]-->[1.0 length branch]-->[node "cellular_organisms10"]
 [node "actinomycetales612"]-->[1.0 length branch]-->[node "O06579"]
 [node "dicrho2"]-->[1.0 length branch]-->[node "B8E0E7"]
 [node "dratthema3"]-->[1.0 length branch]-->[node "Q9WYA9"]
 [node "gpl1427"]-->[0.252 length branch]-->[node "proteobacteria7"]
 [node "drathe5"]-->[1.0 length branch]-->[node "B5YL47"]
 [node "dratthema3"]-->[1.0 length branch]-->[node "drathe5"]
 [node "cfbac9"]-->[1.0 length branch]-->[node "chlofus8"]
 [node "drattsmtub13"]-->[1.0 length branch]-->[node "dratthema3"]
 [node "korsul405"]-->[1.0 length branch]-->[node "B1L5D4"]
 [node "cfbac9"]-->[1.0 length branch]-->[node "Q8AA21"]
 [node "drattsmtub13"]-->[1.0 length branch]-->[node "actinomycetales612"]
 [node "korsul405"]-->[1.0 length branch]-->[node "Q97ZL2"]
 [node "dtpgtri29"]-->[1.0 length branch]-->[node "dtpgia19"]
 [node "dicrho2"]-->[1.0 length branch]-->[node "Q7UWJ7"]

julia> collect(zip(nodenameiter(t), nodeiter(t)))
49-element Array{Tuple{Any,Any},1}:
 ("Q89S90", [branch 9]-->[leaf node])
 ("Q89S67", [branch 1]-->[leaf node])
 ("Q74FM6", [branch 2]-->[leaf node])
 ("geobra5", [branch 7]-->[internal node]-->[branches 1 and 2])
 ("Q8AA21", [branch 5]-->[leaf node])
 ("Q8RFB7", [branch 3]-->[leaf node])
 ("A9WG08", [branch 4]-->[leaf node])
 ("chlofus8", [branch 6]-->[internal node]-->[branches 3 and 4])
 ("cfbac9", [branch 8]-->[internal node]-->[branches 5 and 6])
 ("proteobacteria7", [branch 10]-->[internal node]-->[branches 7 and 8])
 ("Q7NH47", [branch 11]-->[leaf node])
 ("gpl1427", [branch 25]-->[internal node]-->[branches 9, 10 and 11])
 ("O06579", [branch 12]-->[leaf node])
 ("Q9RI40", [branch 13]-->[leaf node])
 ("Q9L2C5", [branch 14]-->[leaf node])
 ("actinomycetales612", [branch 23]-->[internal node]-->[branches 12, 13 and 14])
 ⋮
 ("A9RR03", [branch 33]-->[leaf node])
 ("O23404", [branch 34]-->[leaf node])
 ("embryophyta108", [branch 44]-->[internal node]-->[branches 33 and 34])
 ("Q4QGX9", [branch 42]-->[leaf node])
 ("A2DFI7", [branch 37]-->[leaf node])
 ("A2EEC2", [branch 38]-->[leaf node])
 ("A2E4Y2", [branch 39]-->[leaf node])
 ("A2GQS3", [branch 40]-->[leaf node])
 ("A8BQ26", [branch 35]-->[leaf node])
 ("B8C332", [branch 36]-->[leaf node])
 ("dtpgia19", [branch 41]-->[internal node]-->[branches 35 and 36])
 ("dtpgtri29", [branch 43]-->[internal node]-->[branches 37, 38, 39, 40 and 41])
 ("dtpgtlei37", [branch 45]-->[internal node]-->[branches 42 and 43])
 ("eukaryota88", [branch 47]-->[internal node]-->[branches 44 and 45])
 ("cellular_organisms10", [branch 48]-->[internal node]-->[branches 46 and 47])
 ("root", [root node]-->[branch 48])

Are you sure the problem isn't just how you're copying and pasting the string?

mashu commented 6 years ago

Sorry, you are right, it was copy/past error. I thought it's a part of Atom IDE, that it wraps the lines.

Because I am very new to Julia language, I was not sure how to read values associated with nodes. At least I haven't seen this in doc. Example would be awesome.

Also getchildren() function requires name of the node. Isn't it sufficient to provide the node itself and see the children directly ? or am I calling wrong function ?

richardreeve commented 6 years ago

Because I am very new to Julia language, I was not sure how to read values associated with nodes. At least I haven't seen this in doc. Example would be awesome.

You don't actually have any information associated with nodes at the moment - the only information in the newick tree is associated with the branches, which you access with getlength().

Also getchildren() function requires name of the node. Isn't it sufficient to provide the node itself and see the children directly ? or am I calling wrong function ?

The node only contains the names of its connections, so you can't actually retrieve the new nodes directly from that, you need the whole tree for context. I'm thinking about putting more information into the node types, but for the time being you need the tree and the name.

Thanks for your all of your comments by the way - they have given me some thoughts about how to improve the Phylo code, but in the meanwhile I'll merge this polytomies branch if it's useful for you?

mashu commented 6 years ago

Thanks. I think polytomies branch would be useful to play with, without it, I can't even load trees I am interested in. Though this is just for fun project, to speed up some parsing and learn Julia by doing in spare time.

richardreeve commented 6 years ago

Well, I've merged it, but I'm now working on something more general. We'll have to see how long it takes to get it finished.

richardreeve commented 4 years ago

I think this is all completed now.