Closed mashu closed 4 years ago
Okay, great, thanks. My thoughts:
Sometimes gene genealogy cannot be fully determined from the data, so it is not uncommon to have multi-furactions in gene trees. It seems package does not support them ? Would be nice to have that. Shouldn't be a big problem - I have to specifically check for binary trees at the moment, because that's what I needed, but allowing non-binary trees should be possible
Newick format in it's basic form can vary depending if support or distance values are provided, names for internal nodes, or it can be just a topological tree. I link ETE3 documentation page which shows table with example trees. Would be nice to support these various formats. Can you tell me which I don't currently support? I thought it could handle all of these, but I haven't checked... it doesn't record support information (or anything else though apart from branch length), it just discards it - that's the hardest thing to fix
Different strategies for traversing and exploring
nodenameiter()
to iterate over node names, nodeiter()
to iterate over nodes, likewise branch...()
for branches and ...filter()
(instead of ...iter()
) to iterate over the same but while filtering (e.g. for leaves)getchildren()
or getparent()
(or possibly getancestors()
and getdescendants()
), but I don't really know why people would want to do it in general...nodeiter()
and nodenameiter()
will guarantee to use the same ordering, so you could zip()
the iterators.getparent()
getchildren()
isleaf()
and isroot()
, can be combined with e.g. nodenameiter(tree)
to make nodenamefilter(isleaf, tree)
to make a filtering iteratornodenameiter()
generates a name iteratordistance()
gives you the distance between two nodes, is this what you mean?Do let me know if any of this isn't what you mean...
Regarding
Node 1.000000 already present in tree
from ete3 import Tree
print(Tree("((B:0.723274,B:0.567784)E:0.067192,(B:0.279326,B:0.756049)B:0.807788);",format=1).get_ascii(show_internal=True))
Why people want to iterate tree ?
Why iteration and not a recursion, I am not a programmer, but I remember from somewhere that every recursion can be turned into iteration (not simple) and iterations are more efficient.
Ah, okay. I used Wikipedia to determine newick format in fact (also described here) - it even provides a set of grammar rules, which my code follows (though at the moment only for binary trees, and with an extension to allow metadata - including support values - to be stored inside [&...]
as per nexus trees). Do you believe those rules are wrong? The ETE parser that you reference breaks those rules by allowing a number after )
to refer to a support value, and a string to refer to an internal node name. This seems to mean that you need to be incredibly careful in naming nodes as there will be misparsing if a name only includes numbers. The newick format within nexus tree format also frequently uses numbers as nodenames, which would be incompatible with this ETE-variant newick format. This seems like a poor breakage of compatibility between the two supposedly identical formats, and I'm reluctant to change things to use it when the nexus variant allows any arbitrary metadata (including support values) to be included with any node or branch... do have a strong reason for preferring ETE's (limited, incompatible) version?
I also have to admit I'm not remotely keen on duplicate node names - the names are used to store and retrieve the nodes - to name them in fact! - so having duplicate names would require a shadow naming scheme that was unique, which rather defeats the purpose of names in the first place. It also makes retrieval of a named node from a tree something that may return one or many nodes, which doesn't seem like a good idea either - I'm not sure why that would be desirable, do you?
I see your points and I do agree with most of your comments with small exception.
The ETE parser that you reference breaks those rules by allowing a number after ) to refer to a support value, and a string to refer to an internal node name.
I am surprised that naming internal nodes is not defined by standard Newick format. Unfortunately, my application requires names for internal nodes.
It also makes retrieval of a named node from a tree something that may return one or many nodes The branching point in the tree is also a node, so it could be returned as one node that has the property of having children. I think in both cases you would return one node.
I agree that not unique names, are not a good idea, in fact I can't think of a reason when that would be useful, nodes in my application are unique. But I do require naming internal nodes. If internal node naming works, and trees allow multifurcation I would be happy :+1:
Big thanks, for nice package and fast response !
Sorry - I think my response was confusing. Naming internal nodes is no problem in my code or the standard reference. The problem is in the ETE standard, where if an internal node name is a number it will be interpreted as a support value instead of as a name (as I understand what I'm reading there?!).
@mashu - do you want to try the polytomies
branch in #7 to see if it does what you need? It's rough and ready, but you should be able to parse a tree with polytomous branching using parsenewick(string, NamedPolytomousTree)
and open(io -> parsenewick(io, NamedPolytomousTree), file)
.
It works with tree I make by hand, but does not work with my trees, for example this one and fails on parsing
polytreestr = "(((((Q89S90:1.000,((Q89S67:1.000,Q74FM6:1.000)geobra5:1.000,(Q8AA21:1.000,(Q8RFB7:1.000,A9WG08:1.000)chlofus8:1.000)cfbac9:1.000)proteobacteria7:0.252,Q7NH47:1.000)gpl1427:1.000,((O06579:1.000,Q9RI40:1.000,Q9L2C5:0.600)actinomycetales612:1.
...: 000,(Q9WYA9:1.000,(B5YL47:1.000,(O67899:1.000,(Q7UWJ7:1.000,B8E0E7:1.000)dicrho2:1.000)draqui482:1.000)drathe5:1.000)dratthema3:1.000)drattsmtub13:1.000)gamproleipse89:1.000,(Q8TT31:1.000,(Q97ZL2:1.000,B1L5D4:1.000)korsul405:1.000)archaea18:1.000
...: )bacteria7:1.000,((A9RR03:1.000,O23404:1.000)embryophyta108:1.000,(Q4QGX9:1.000,(A2DFI7:1.000,A2EEC2:0.918,A2E4Y2:0.327,A2GQS3:0.083,(A8BQ26:1.000,B8C332:1.000)dtpgia19:1.000)dtpgtri29:1.000)dtpgtlei37:1.000)eukaryota88:1.000)cellular_organisms10
...: :1.000)root;"
t = parsenewick(polytreestr, NamedPolytomousTree)
for (node,nodename) in zip(nodeiter(t),nodenameiter(t))
println(nodename)
end
This tree is different from the one I use, only by -1.000 value after cellular_organisms10, as there score cannot be computed, so I used negatives for uninitialized nodes to mark them. Though it's not supported in Phylo, so I had to change it to positive, but even though parsing still fails. I am not sure why.
Hmm, that works fine for me once I take out the weird ...:
formatting at the end of lines. So:
julia> polytreestr = "(((((Q89S90:1.000,((Q89S67:1.000,Q74FM6:1.000)geobra5:1.000,(Q8AA21:1.000,(Q8RFB7:1.000,A9WG08:1.000)chlofus8:1.000)cfbac9:1.000)proteobacteria7:0.252,Q7NH47:1.000)gpl1427:1.000,((O06579:1.000,Q9RI40:1.000,Q9L2C5:0.600)actinomycetales612:1.000,(Q9WYA9:1.000,(B5YL47:1.000,(O67899:1.000,(Q7UWJ7:1.000,B8E0E7:1.000)dicrho2:1.000)draqui482:1.000)drathe5:1.000)dratthema3:1.000)drattsmtub13:1.000)gamproleipse89:1.000,(Q8TT31:1.000,(Q97ZL2:1.000,B1L5D4:1.000)korsul405:1.000)archaea18:1.000)bacteria7:1.000,((A9RR03:1.000,O23404:1.000)embryophyta108:1.000,(Q4QGX9:1.000,(A2DFI7:1.000,A2EEC2:0.918,A2E4Y2:0.327,A2GQS3:0.083,(A8BQ26:1.000,B8C332:1.000)dtpgia19:1.000)dtpgtri29:1.000)dtpgtlei37:1.000)eukaryota88:1.000)cellular_organisms10:1.000)root;";
julia> t = parsenewick(polytreestr, NamedPolytomousTree)
Phylo.PolytomousTree{Phylo.LeafInfo,Void} phylogenetic tree with 49 nodes and 48 branches
Leaf names:
String["Q89S90", "Q89S67", "Q74FM6", "Q8AA21", "Q8RFB7", "A9WG08", "Q7NH47", "O06579", "Q9RI40", "Q9L2C5" … "B1L5D4", "A9RR03", "O23404", "Q4QGX9", "A2DFI7", "A2EEC2", "A2E4Y2", "A2GQS3", "A8BQ26", "B8C332"]
julia> for (node,nodename) in zip(nodeiter(t),nodenameiter(t))
println(nodename)
end
Q89S90
Q89S67
Q74FM6
geobra5
Q8AA21
Q8RFB7
A9WG08
chlofus8
cfbac9
proteobacteria7
Q7NH47
gpl1427
O06579
Q9RI40
Q9L2C5
actinomycetales612
Q9WYA9
B5YL47
O67899
Q7UWJ7
B8E0E7
dicrho2
draqui482
drathe5
dratthema3
drattsmtub13
gamproleipse89
Q8TT31
Q97ZL2
B1L5D4
korsul405
archaea18
bacteria7
A9RR03
O23404
embryophyta108
Q4QGX9
A2DFI7
A2EEC2
A2E4Y2
A2GQS3
A8BQ26
B8C332
dtpgia19
dtpgtri29
dtpgtlei37
eukaryota88
cellular_organisms10
root
julia> collect(branchiter(t))
48-element Array{Phylo.Branch{String},1}:
[node "geobra5"]-->[1.0 length branch]-->[node "Q74FM6"]
[node "gpl1427"]-->[1.0 length branch]-->[node "Q7NH47"]
[node "dtpgtri29"]-->[0.327 length branch]-->[node "A2E4Y2"]
[node "cellular_organisms10"]-->[1.0 length branch]-->[node "bacteria7"]
[node "gamproleipse89"]-->[1.0 length branch]-->[node "gpl1427"]
[node "dtpgtlei37"]-->[1.0 length branch]-->[node "Q4QGX9"]
[node "archaea18"]-->[1.0 length branch]-->[node "Q8TT31"]
[node "proteobacteria7"]-->[1.0 length branch]-->[node "cfbac9"]
[node "drathe5"]-->[1.0 length branch]-->[node "draqui482"]
[node "actinomycetales612"]-->[0.6 length branch]-->[node "Q9L2C5"]
[node "bacteria7"]-->[1.0 length branch]-->[node "gamproleipse89"]
[node "embryophyta108"]-->[1.0 length branch]-->[node "A9RR03"]
[node "draqui482"]-->[1.0 length branch]-->[node "dicrho2"]
[node "gamproleipse89"]-->[1.0 length branch]-->[node "drattsmtub13"]
[node "dtpgia19"]-->[1.0 length branch]-->[node "A8BQ26"]
[node "draqui482"]-->[1.0 length branch]-->[node "O67899"]
⋮
[node "dtpgia19"]-->[1.0 length branch]-->[node "B8C332"]
[node "root"]-->[1.0 length branch]-->[node "cellular_organisms10"]
[node "actinomycetales612"]-->[1.0 length branch]-->[node "O06579"]
[node "dicrho2"]-->[1.0 length branch]-->[node "B8E0E7"]
[node "dratthema3"]-->[1.0 length branch]-->[node "Q9WYA9"]
[node "gpl1427"]-->[0.252 length branch]-->[node "proteobacteria7"]
[node "drathe5"]-->[1.0 length branch]-->[node "B5YL47"]
[node "dratthema3"]-->[1.0 length branch]-->[node "drathe5"]
[node "cfbac9"]-->[1.0 length branch]-->[node "chlofus8"]
[node "drattsmtub13"]-->[1.0 length branch]-->[node "dratthema3"]
[node "korsul405"]-->[1.0 length branch]-->[node "B1L5D4"]
[node "cfbac9"]-->[1.0 length branch]-->[node "Q8AA21"]
[node "drattsmtub13"]-->[1.0 length branch]-->[node "actinomycetales612"]
[node "korsul405"]-->[1.0 length branch]-->[node "Q97ZL2"]
[node "dtpgtri29"]-->[1.0 length branch]-->[node "dtpgia19"]
[node "dicrho2"]-->[1.0 length branch]-->[node "Q7UWJ7"]
julia> collect(zip(nodenameiter(t), nodeiter(t)))
49-element Array{Tuple{Any,Any},1}:
("Q89S90", [branch 9]-->[leaf node])
("Q89S67", [branch 1]-->[leaf node])
("Q74FM6", [branch 2]-->[leaf node])
("geobra5", [branch 7]-->[internal node]-->[branches 1 and 2])
("Q8AA21", [branch 5]-->[leaf node])
("Q8RFB7", [branch 3]-->[leaf node])
("A9WG08", [branch 4]-->[leaf node])
("chlofus8", [branch 6]-->[internal node]-->[branches 3 and 4])
("cfbac9", [branch 8]-->[internal node]-->[branches 5 and 6])
("proteobacteria7", [branch 10]-->[internal node]-->[branches 7 and 8])
("Q7NH47", [branch 11]-->[leaf node])
("gpl1427", [branch 25]-->[internal node]-->[branches 9, 10 and 11])
("O06579", [branch 12]-->[leaf node])
("Q9RI40", [branch 13]-->[leaf node])
("Q9L2C5", [branch 14]-->[leaf node])
("actinomycetales612", [branch 23]-->[internal node]-->[branches 12, 13 and 14])
⋮
("A9RR03", [branch 33]-->[leaf node])
("O23404", [branch 34]-->[leaf node])
("embryophyta108", [branch 44]-->[internal node]-->[branches 33 and 34])
("Q4QGX9", [branch 42]-->[leaf node])
("A2DFI7", [branch 37]-->[leaf node])
("A2EEC2", [branch 38]-->[leaf node])
("A2E4Y2", [branch 39]-->[leaf node])
("A2GQS3", [branch 40]-->[leaf node])
("A8BQ26", [branch 35]-->[leaf node])
("B8C332", [branch 36]-->[leaf node])
("dtpgia19", [branch 41]-->[internal node]-->[branches 35 and 36])
("dtpgtri29", [branch 43]-->[internal node]-->[branches 37, 38, 39, 40 and 41])
("dtpgtlei37", [branch 45]-->[internal node]-->[branches 42 and 43])
("eukaryota88", [branch 47]-->[internal node]-->[branches 44 and 45])
("cellular_organisms10", [branch 48]-->[internal node]-->[branches 46 and 47])
("root", [root node]-->[branch 48])
Are you sure the problem isn't just how you're copying and pasting the string?
Sorry, you are right, it was copy/past error. I thought it's a part of Atom IDE, that it wraps the lines.
Because I am very new to Julia language, I was not sure how to read values associated with nodes. At least I haven't seen this in doc. Example would be awesome.
Also getchildren() function requires name of the node. Isn't it sufficient to provide the node itself and see the children directly ? or am I calling wrong function ?
Because I am very new to Julia language, I was not sure how to read values associated with nodes. At least I haven't seen this in doc. Example would be awesome.
You don't actually have any information associated with nodes at the moment - the only information in the newick tree is associated with the branches, which you access with getlength()
.
Also getchildren() function requires name of the node. Isn't it sufficient to provide the node itself and see the children directly ? or am I calling wrong function ?
The node only contains the names of its connections, so you can't actually retrieve the new nodes directly from that, you need the whole tree for context. I'm thinking about putting more information into the node types, but for the time being you need the tree and the name.
Thanks for your all of your comments by the way - they have given me some thoughts about how to improve the Phylo code, but in the meanwhile I'll merge this polytomies
branch if it's useful for you?
Thanks. I think polytomies branch would be useful to play with, without it, I can't even load trees I am interested in. Though this is just for fun project, to speed up some parsing and learn Julia by doing in spare time.
Well, I've merged it, but I'm now working on something more general. We'll have to see how long it takes to get it finished.
I think this is all completed now.
Hi,
Newick format in it's basic form can vary depending if support or distance values are provided, names for internal nodes, or it can be just a topological tree. I link ETE3 documentation page which shows table with example trees. Would be nice to support these various formats.
Different strategies for traversing and exploring