OpenTreeOfLife / treemachine

Source tree graph database
Other
16 stars 6 forks source link

synthesis issue: only using mrca taxa from an edge ("relationship" in neo4j lingo) rather than node #157

Closed mtholder closed 8 years ago

mtholder commented 9 years ago

I just pushed https://github.com/OpenTreeOfLife/treemachine/tree/issue-157 with some very simple examples to help debug https://github.com/OpenTreeOfLife/treemachine/issues/156

But I'm not getting the tree that one would hope for when I run:

$ run-simpleprob-example-diff-order-synth.sh

and look a the result in simpleprob-diff-order-out/synth.tre

The inputs have not conflict and there is no evidence of non-monophyly of the taxa. So I would hope that the synthesis would return the tree in simpleprob/expected.tre

There is a (very) good chance that I'm just running treemachine incorrectly. If @blackrim @josephwb or @chinchliff could take a look at run-simpleprob-example-diff-order-synth.sh I would really appreciate it.

I pulled and built jade and ot-base before pulling and building treemachine.

mtholder commented 9 years ago

or I could have messed up my inputs (in simpleprob dir). But I don't see the error...

mtholder commented 9 years ago

To explain this test case...

The only thing that the taxonomy says is that there is a genus E. The taxonomic tree is (A,B,C,D,(E1,E2)E)

The first tree is (((A,B),C),D) this is input tree1.tre and it is the tree that I (tried to) give highest priority to in run-simpleprob-example-diff-order-synth.sh. Clearly this tree says nothing about E

The second tree (which is in tree2.tre) is ((A,E1),B). The only information here is that E is closer to A than it either is to B.

Taken together the inputs imply the supertree ((((A,E1),B),C),D) With the expansion based on taxonomy, one would hope that the synthetic tree would be ((((A,(E1,E2)),B),C),D). This tree is in expected.tre.

What is returned is ((E2,E1)E,D,(C,(B,A))) this is synth.tre in the output of the script.

Hopefully, I am just invoking treemachine incorrectly.

mtholder commented 9 years ago

Outputs of each treemachine step are stored in simpleprob-diff-order for the run-simpleprob-example-diff-order-synth.sh.

From looking at simpleprob-diff-order/synthesize.log it looks like the odd tree is returned because the synthesis starts at the root node. Based on that it decides that the 3 branches emanating from the root should be:

  1. D this makes sense
  2. the mrca node 11 which includes both STREECHILDOF from both tree1 and tree2. This node is the ancestor of A, B, C, and E. E is included because there appears to be a bumping of E1 in tree2 up to E. I base that on:

    attempting to remap tips to deepest exemplified taxa E1 was remapped to 'E' (id=7)

    from simpleprob-diff-order/addnewick-tree2.log.

  3. the node representing E. This is odd to me. This is the acceptance of relationship 17 during synthesis:

    testing rel 17 for conflicts E has 2 terminal descendants ++rel 17 passed, it will be added

    in the synthesize.log. It would seem to me that the acceptance of rel 33 (leading from node 11) would prevent the attachement of E here. because there is taxonomic overlap with the MRCA set in node 11 and E. Perhaps this is a result of the fact that the descendant E is added to the node by the second tree (not the tree that created the node). I suspect that it is because the synthesis traversal is breadth first and somehow not paying attention to the fact that one of the nodes that was just accepted conflicts with the placement of E at the same level.

This gives us: (D,(A,B,C),E) at the root - which is unsatisfying.

Another oddity is that when node 13 (the mrca of A, B, and E which one would like to see in the final tree) is rejected as a dauther of node 10, it appears that node 13 is rejected because synthesis has just moments earlier added A as a child of node 10:

testing rel 45 for conflicts
Node[13] has 2 terminal descendants
        conflict found! offending rel=23

rel 45 is the 13->11 edge and rel 23 is the node 3 (leaf A) -> node 10 edge.

One one level this make sense. node 13 and 10 have a leaf in common (leaf A). and rel 23 is coming from the tree with higher rank.

On the other hand, the connection of A to this subtree is trivial. tree1 could be satisfied by coming in from node 13 or coming in directly. So tree2 and tree1 say compatible things about the descendants of node 10. One would hope that while recognizing that tree1 should win when it conflicts with tree2, the synthesis would recognize that there is not conflict here -- at least in cases like this one when the high ranking edge is trivial.

To summarize:

  1. I don't see why we get "++rel 17 passed, it will be added"
  2. I don't understand how we can select node 11 to be included, but then select a set of edges for its children that omit some of the leaves listed in its MRCA. I suspect that this is what is causing issue https://github.com/OpenTreeOfLife/treemachine/issues/156. It is a violation of what I listed as principle #4:

    "#4. when an edge is chosen in the synthesis all of its descendant tips will remain below the edge - there won't be cherry picked into other groups. "

    I thought was true of synthesis - but it apparently isn't.

  3. I don't see why rel 23 should lead to the rejection of rel 45.

Many apologies if I'm just invoking treemachine incorrectly, or have a silly error in my inputs.

update "added A as a child of node 11" correct to "added A as a child of node 10" . thanks @kcranston

blackrim commented 9 years ago

I think it would be good if Cody also piped in because he knows a good amount about the synth algorithm. Nevertheless, I don't disagree that your example presents an interesting problem for synth. Basically coming out of the root there are these potential nodes for synth

11 - from tree1 13 - from tree2 and then the individual taxonomy ones as you mention, it goes breadth first and so E is added here also, it checks the relationship taxa not the node (that is why E is not seen as a conflict when it adds rel 15)

I am pretty sure that is what is resulting in the tree as it is. I am going to mess around a little on this branch and see if adding the node instead of relationships mrcas does in fact solve that.

apologies though as it has been 1) a long time and 2) i don't have a ton of time right now

more in a software email

On Sat, Jan 31, 2015 at 11:47 AM, Mark T. Holder notifications@github.com wrote:

Outputs of each treemachine step are stored in simpleprob-diff-order for the run-simpleprob-example-diff-order-synth.sh.

From looking at simpleprob-diff-order/synthesize.log it looks like the odd tree is returned because the synthesis starts at the root node. Based on that it decides that the 3 branches emanating from the root should be:

1.

D this makes sense 2.

the mrca node 11 which includes both STREECHILDOF from both tree1 and tree2. This node is the ancestor of A, B, C, and E. E is included because there appears to be a bumping of E1 in tree2 up to E. I base that on:

attempting to remap tips to deepest exemplified taxa E1 was remapped to 'E' (id=7)

from simpleprob-diff-order/addnewick-tree2.log. 3.

the node representing E. This is odd to me. This is the acceptance of relationship 17 during synthesis:

testing rel 17 for conflicts E has 2 terminal descendants ++rel 17 passed, it will be added

in the synthesize.log. It would seem to me that the acceptance of rel 33 (leading from node 11) would prevent the attachement of E here. because there is taxonomic overlap with the MRCA set in node 11 and E. Perhaps this is a result of the fact that the descendant E is added to the node by the second tree (not the tree that created the node). I suspect that it is because the synthesis traversal is breadth first and somehow not paying attention to the fact that one of the nodes that was just accepted conflicts with the placement of E at the same level.

This gives us: (D,(A,B,C),E) at the root - which is unsatisfying.

Another oddity is that when node 13 (the mrca of A, B, and E which one would like to see in the final tree) is rejected as a dauther of node 10, it appears that node 13 is rejected because synthesis has just moments earlier added A as a child of node 11:

testing rel 45 for conflicts Node[13] has 2 terminal descendants conflict found! offending rel=23

rel 45 is the 13->11 edge and rel 23 is the node 3 (leaf A) -> node 10 edge.

One one level this make sense. node 13 and 10 have a leaf in common (leaf A). and rel 23 is coming from the tree with higher rank.

On the other hand, the connection of A to this subtree is trivial. tree1 could be satisfied by coming in from node 13 or coming in directly. So tree2 and tree1 say compatible things about the descendants of node 10. One would hope that while recognizing that tree1 should win when it conflicts with tree2, the synthesis would recognize that there is not conflict here -- at least in cases like this one when the high ranking edge is trivial.

To summarize:

1.

I don't see why we get "++rel 17 passed, it will be added" 2.

I don't understand how we can select node 11 to be included, but then select a set of edges for its children that omit some of the leaves listed in its MRCA. I suspect that this is what is causing issue #156 https://github.com/OpenTreeOfLife/treemachine/issues/156. It is a violation of what I listed as principle #4 https://github.com/OpenTreeOfLife/treemachine/issues/4:

"#4 https://github.com/OpenTreeOfLife/treemachine/issues/4. when an edge is chosen in the synthesis all of its descendant tips will remain below the edge - there won't be cherry picked into other groups. "

I thought was true of synthesis - but it apparently isn't. 3.

I don't see why rel 23 should lead to the rejection of rel 45.

Many apologies if I'm just invoking treemachine incorrectly, or have a silly error in my inputs.

— Reply to this email directly or view it on GitHub https://github.com/OpenTreeOfLife/treemachine/issues/157#issuecomment-72325365 .

blackrim commented 9 years ago

Confirmed that the reason E is still added is because of the relationsihp mrca being checked (from tree 1 ) instead of the node. That isn't necessarily an error and we would need to bring Cody in on this (he wrote that code : RankResolutionMethod and the other resolution methods). when you relax that however, it doesn't give the tree you want anyway because it gets to node 10 and goes with tree 1 which has just a sister to b instead of going to node 13 which is a sister to e. you want going to b then node 13.

I am going to draw this out more but I see your point. It is however following the paths, it is just that an improved method would perhaps follow your route. this would simply be another resolutionmethod and could perhaps be mocked up if we could get the rules down.

On Sat, Jan 31, 2015 at 12:24 PM, Stephen Smith blackrim@gmail.com wrote:

I think it would be good if Cody also piped in because he knows a good amount about the synth algorithm. Nevertheless, I don't disagree that your example presents an interesting problem for synth. Basically coming out of the root there are these potential nodes for synth

11 - from tree1 13 - from tree2 and then the individual taxonomy ones as you mention, it goes breadth first and so E is added here also, it checks the relationship taxa not the node (that is why E is not seen as a conflict when it adds rel 15)

I am pretty sure that is what is resulting in the tree as it is. I am going to mess around a little on this branch and see if adding the node instead of relationships mrcas does in fact solve that.

apologies though as it has been 1) a long time and 2) i don't have a ton of time right now

more in a software email

On Sat, Jan 31, 2015 at 11:47 AM, Mark T. Holder <notifications@github.com

wrote:

Outputs of each treemachine step are stored in simpleprob-diff-order for the run-simpleprob-example-diff-order-synth.sh.

From looking at simpleprob-diff-order/synthesize.log it looks like the odd tree is returned because the synthesis starts at the root node. Based on that it decides that the 3 branches emanating from the root should be:

1.

D this makes sense 2.

the mrca node 11 which includes both STREECHILDOF from both tree1 and tree2. This node is the ancestor of A, B, C, and E. E is included because there appears to be a bumping of E1 in tree2 up to E. I base that on:

attempting to remap tips to deepest exemplified taxa E1 was remapped to 'E' (id=7)

from simpleprob-diff-order/addnewick-tree2.log. 3.

the node representing E. This is odd to me. This is the acceptance of relationship 17 during synthesis:

testing rel 17 for conflicts E has 2 terminal descendants ++rel 17 passed, it will be added

in the synthesize.log. It would seem to me that the acceptance of rel 33 (leading from node 11) would prevent the attachement of E here. because there is taxonomic overlap with the MRCA set in node 11 and E. Perhaps this is a result of the fact that the descendant E is added to the node by the second tree (not the tree that created the node). I suspect that it is because the synthesis traversal is breadth first and somehow not paying attention to the fact that one of the nodes that was just accepted conflicts with the placement of E at the same level.

This gives us: (D,(A,B,C),E) at the root - which is unsatisfying.

Another oddity is that when node 13 (the mrca of A, B, and E which one would like to see in the final tree) is rejected as a dauther of node 10, it appears that node 13 is rejected because synthesis has just moments earlier added A as a child of node 11:

testing rel 45 for conflicts Node[13] has 2 terminal descendants conflict found! offending rel=23

rel 45 is the 13->11 edge and rel 23 is the node 3 (leaf A) -> node 10 edge.

One one level this make sense. node 13 and 10 have a leaf in common (leaf A). and rel 23 is coming from the tree with higher rank.

On the other hand, the connection of A to this subtree is trivial. tree1 could be satisfied by coming in from node 13 or coming in directly. So tree2 and tree1 say compatible things about the descendants of node 10. One would hope that while recognizing that tree1 should win when it conflicts with tree2, the synthesis would recognize that there is not conflict here -- at least in cases like this one when the high ranking edge is trivial.

To summarize:

1.

I don't see why we get "++rel 17 passed, it will be added" 2.

I don't understand how we can select node 11 to be included, but then select a set of edges for its children that omit some of the leaves listed in its MRCA. I suspect that this is what is causing issue #156 https://github.com/OpenTreeOfLife/treemachine/issues/156. It is a violation of what I listed as principle #4 https://github.com/OpenTreeOfLife/treemachine/issues/4:

"#4 https://github.com/OpenTreeOfLife/treemachine/issues/4. when an edge is chosen in the synthesis all of its descendant tips will remain below the edge - there won't be cherry picked into other groups. "

I thought was true of synthesis - but it apparently isn't. 3.

I don't see why rel 23 should lead to the rejection of rel 45.

Many apologies if I'm just invoking treemachine incorrectly, or have a silly error in my inputs.

— Reply to this email directly or view it on GitHub https://github.com/OpenTreeOfLife/treemachine/issues/157#issuecomment-72325365 .

chinchliff commented 9 years ago

Hi Mark. I agree, those sound like some issues. The current synthesis method is very simplistic and clearly has some room for improvement. Some of the problems you mentioned may have solutions within the context of the current method, but it might make more sense to try and think about designing a better way to do synthesis overall.

The code is designed to be modular so we can easily swap out a different method, basically for this exact reason. I am sure there are better algorithms than the current one. This would be something good to talk about. We played with a method a while ago, for instance, that attempted to calculate best paths (instead of just making simple decisions at nodes) but ran into scalability problems and dropped the ball on that one.

I can look more into this example but it sounds like he tween you and Stephen this example seems pretty well worked out?

On Saturday, January 31, 2015, Stephen Smith notifications@github.com wrote:

Confirmed that the reason E is still added is because of the relationsihp mrca being checked (from tree 1 ) instead of the node. That isn't necessarily an error and we would need to bring Cody in on this (he wrote that code : RankResolutionMethod and the other resolution methods). when you relax that however, it doesn't give the tree you want anyway because it gets to node 10 and goes with tree 1 which has just a sister to b instead of going to node 13 which is a sister to e. you want going to b then node 13.

I am going to draw this out more but I see your point. It is however following the paths, it is just that an improved method would perhaps follow your route. this would simply be another resolutionmethod and could perhaps be mocked up if we could get the rules down.

On Sat, Jan 31, 2015 at 12:24 PM, Stephen Smith <blackrim@gmail.com javascript:_e(%7B%7D,'cvml','blackrim@gmail.com');> wrote:

I think it would be good if Cody also piped in because he knows a good amount about the synth algorithm. Nevertheless, I don't disagree that your example presents an interesting problem for synth. Basically coming out of the root there are these potential nodes for synth

11 - from tree1 13 - from tree2 and then the individual taxonomy ones as you mention, it goes breadth first and so E is added here also, it checks the relationship taxa not the node (that is why E is not seen as a conflict when it adds rel 15)

I am pretty sure that is what is resulting in the tree as it is. I am going to mess around a little on this branch and see if adding the node instead of relationships mrcas does in fact solve that.

apologies though as it has been 1) a long time and 2) i don't have a ton of time right now

more in a software email

On Sat, Jan 31, 2015 at 11:47 AM, Mark T. Holder < notifications@github.com javascript:_e(%7B%7D,'cvml','notifications@github.com');

wrote:

Outputs of each treemachine step are stored in simpleprob-diff-order for the run-simpleprob-example-diff-order-synth.sh.

From looking at simpleprob-diff-order/synthesize.log it looks like the odd tree is returned because the synthesis starts at the root node. Based on that it decides that the 3 branches emanating from the root should be:

1.

D this makes sense 2.

the mrca node 11 which includes both STREECHILDOF from both tree1 and tree2. This node is the ancestor of A, B, C, and E. E is included because there appears to be a bumping of E1 in tree2 up to E. I base that on:

attempting to remap tips to deepest exemplified taxa E1 was remapped to 'E' (id=7)

from simpleprob-diff-order/addnewick-tree2.log. 3.

the node representing E. This is odd to me. This is the acceptance of relationship 17 during synthesis:

testing rel 17 for conflicts E has 2 terminal descendants ++rel 17 passed, it will be added

in the synthesize.log. It would seem to me that the acceptance of rel 33 (leading from node 11) would prevent the attachement of E here. because there is taxonomic overlap with the MRCA set in node 11 and E. Perhaps this is a result of the fact that the descendant E is added to the node by the second tree (not the tree that created the node). I suspect that it is because the synthesis traversal is breadth first and somehow not paying attention to the fact that one of the nodes that was just accepted conflicts with the placement of E at the same level.

This gives us: (D,(A,B,C),E) at the root - which is unsatisfying.

Another oddity is that when node 13 (the mrca of A, B, and E which one would like to see in the final tree) is rejected as a dauther of node 10, it appears that node 13 is rejected because synthesis has just moments earlier added A as a child of node 11:

testing rel 45 for conflicts Node[13] has 2 terminal descendants conflict found! offending rel=23

rel 45 is the 13->11 edge and rel 23 is the node 3 (leaf A) -> node 10 edge.

One one level this make sense. node 13 and 10 have a leaf in common (leaf A). and rel 23 is coming from the tree with higher rank.

On the other hand, the connection of A to this subtree is trivial. tree1 could be satisfied by coming in from node 13 or coming in directly. So tree2 and tree1 say compatible things about the descendants of node 10. One would hope that while recognizing that tree1 should win when it conflicts with tree2, the synthesis would recognize that there is not conflict here -- at least in cases like this one when the high ranking edge is trivial.

To summarize:

1.

I don't see why we get "++rel 17 passed, it will be added" 2.

I don't understand how we can select node 11 to be included, but then select a set of edges for its children that omit some of the leaves listed in its MRCA. I suspect that this is what is causing issue #156 https://github.com/OpenTreeOfLife/treemachine/issues/156. It is a violation of what I listed as principle #4 https://github.com/OpenTreeOfLife/treemachine/issues/4:

"#4 https://github.com/OpenTreeOfLife/treemachine/issues/4. when an edge is chosen in the synthesis all of its descendant tips will remain below the edge - there won't be cherry picked into other groups. "

I thought was true of synthesis - but it apparently isn't. 3.

I don't see why rel 23 should lead to the rejection of rel 45.

Many apologies if I'm just invoking treemachine incorrectly, or have a silly error in my inputs.

— Reply to this email directly or view it on GitHub < https://github.com/OpenTreeOfLife/treemachine/issues/157#issuecomment-72325365>

.

— Reply to this email directly or view it on GitHub https://github.com/OpenTreeOfLife/treemachine/issues/157#issuecomment-72327726 .

blackrim commented 9 years ago

Yeah, I agree. I think this sounds like a new conflictresolution (the folder in the code) method that is simply less simplistic. We will need to work out what the rules are though (things like more taxa subtending doesn't work because of taxonomy nodes, perhaps more resolution subtending along with rank). Right now the rank only is just very very simple.

On Sat, Jan 31, 2015 at 1:41 PM, Cody Hinchliff notifications@github.com wrote:

Hi Mark. I agree, those sound like some issues. The current synthesis method is very simplistic and clearly has some room for improvement. Some of the problems you mentioned may have solutions within the context of the current method, but it might make more sense to try and think about designing a better way to do synthesis overall.

The code is designed to be modular so we can easily swap out a different method, basically for this exact reason. I am sure there are better algorithms than the current one. This would be something good to talk about. We played with a method a while ago, for instance, that attempted to calculate best paths (instead of just making simple decisions at nodes) but ran into scalability problems and dropped the ball on that one.

I can look more into this example but it sounds like he tween you and Stephen this example seems pretty well worked out?

On Saturday, January 31, 2015, Stephen Smith notifications@github.com wrote:

Confirmed that the reason E is still added is because of the relationsihp mrca being checked (from tree 1 ) instead of the node. That isn't necessarily an error and we would need to bring Cody in on this (he wrote that code : RankResolutionMethod and the other resolution methods). when you relax that however, it doesn't give the tree you want anyway because it gets to node 10 and goes with tree 1 which has just a sister to b instead of going to node 13 which is a sister to e. you want going to b then node 13.

I am going to draw this out more but I see your point. It is however following the paths, it is just that an improved method would perhaps follow your route. this would simply be another resolutionmethod and could perhaps be mocked up if we could get the rules down.

On Sat, Jan 31, 2015 at 12:24 PM, Stephen Smith <blackrim@gmail.com javascript:_e(%7B%7D,'cvml','blackrim@gmail.com');> wrote:

I think it would be good if Cody also piped in because he knows a good amount about the synth algorithm. Nevertheless, I don't disagree that your example presents an interesting problem for synth. Basically coming out of the root there are these potential nodes for synth

11 - from tree1 13 - from tree2 and then the individual taxonomy ones as you mention, it goes breadth first and so E is added here also, it checks the relationship taxa not the node (that is why E is not seen as a conflict when it adds rel 15)

I am pretty sure that is what is resulting in the tree as it is. I am going to mess around a little on this branch and see if adding the node instead of relationships mrcas does in fact solve that.

apologies though as it has been 1) a long time and 2) i don't have a ton of time right now

more in a software email

On Sat, Jan 31, 2015 at 11:47 AM, Mark T. Holder < notifications@github.com javascript:_e(%7B%7D,'cvml','notifications@github.com');

wrote:

Outputs of each treemachine step are stored in simpleprob-diff-order for the run-simpleprob-example-diff-order-synth.sh.

From looking at simpleprob-diff-order/synthesize.log it looks like the odd tree is returned because the synthesis starts at the root node. Based on that it decides that the 3 branches emanating from the root should be:

1.

D this makes sense 2.

the mrca node 11 which includes both STREECHILDOF from both tree1 and tree2. This node is the ancestor of A, B, C, and E. E is included because there appears to be a bumping of E1 in tree2 up to E. I base that on:

attempting to remap tips to deepest exemplified taxa E1 was remapped to 'E' (id=7)

from simpleprob-diff-order/addnewick-tree2.log. 3.

the node representing E. This is odd to me. This is the acceptance of relationship 17 during synthesis:

testing rel 17 for conflicts E has 2 terminal descendants ++rel 17 passed, it will be added

in the synthesize.log. It would seem to me that the acceptance of rel 33 (leading from node 11) would prevent the attachement of E here. because there is taxonomic overlap with the MRCA set in node 11 and E. Perhaps this is a result of the fact that the descendant E is added to the node by the second tree (not the tree that created the node). I suspect that it is because the synthesis traversal is breadth first and somehow not paying attention to the fact that one of the nodes that was just accepted conflicts with the placement of E at the same level.

This gives us: (D,(A,B,C),E) at the root - which is unsatisfying.

Another oddity is that when node 13 (the mrca of A, B, and E which one would like to see in the final tree) is rejected as a dauther of node 10, it appears that node 13 is rejected because synthesis has just moments earlier added A as a child of node 11:

testing rel 45 for conflicts Node[13] has 2 terminal descendants conflict found! offending rel=23

rel 45 is the 13->11 edge and rel 23 is the node 3 (leaf A) -> node 10 edge.

One one level this make sense. node 13 and 10 have a leaf in common (leaf A). and rel 23 is coming from the tree with higher rank.

On the other hand, the connection of A to this subtree is trivial. tree1 could be satisfied by coming in from node 13 or coming in directly. So tree2 and tree1 say compatible things about the descendants of node 10. One would hope that while recognizing that tree1 should win when it conflicts with tree2, the synthesis would recognize that there is not conflict here -- at least in cases like this one when the high ranking edge is trivial.

To summarize:

1.

I don't see why we get "++rel 17 passed, it will be added" 2.

I don't understand how we can select node 11 to be included, but then select a set of edges for its children that omit some of the leaves listed in its MRCA. I suspect that this is what is causing issue #156 https://github.com/OpenTreeOfLife/treemachine/issues/156. It is a violation of what I listed as principle #4 https://github.com/OpenTreeOfLife/treemachine/issues/4:

"#4 https://github.com/OpenTreeOfLife/treemachine/issues/4. when an edge is chosen in the synthesis all of its descendant tips will remain below the edge - there won't be cherry picked into other groups. "

I thought was true of synthesis - but it apparently isn't. 3.

I don't see why rel 23 should lead to the rejection of rel 45.

Many apologies if I'm just invoking treemachine incorrectly, or have a silly error in my inputs.

— Reply to this email directly or view it on GitHub <

https://github.com/OpenTreeOfLife/treemachine/issues/157#issuecomment-72325365>

.

— Reply to this email directly or view it on GitHub < https://github.com/OpenTreeOfLife/treemachine/issues/157#issuecomment-72327726>

.

— Reply to this email directly or view it on GitHub https://github.com/OpenTreeOfLife/treemachine/issues/157#issuecomment-72330073 .

kcranston commented 9 years ago

So, it seems that synthesis is working differently that at least some of us expect. I would have also expected ((((A,(E1,E2)),B),C),D) as the synthetic tree. It is probably ok that it works differently, but we need to make sure 1. we can clearly articulate what is happening and 2. that we aren't making statement in the paper or in the properties pane of the web application (i.e. tree X 'supports' this edge) that conflict with what is actually happening.

I would like to understand why we aren't getting the expected tree. Can we get a few diagrams of what is going on?

blackrim commented 9 years ago

Here is the diagram of what is going on (attached). It is basically traversing the blue relationships. It understands that the green are not in conflict, that isn't a problem. It just prefers the blue. This isn't an error in that sense. It is traversing the preferred tree in order of ranking. It could just be doing some of this smarter (as we have mentioned a number of times, improving synth methods is a goal). We have discussed solutions back when we were doing the developments, but as mentioned, it got sidetracked because of all the other issues that were coming to a head.

My lab has plans on discussing this next week and would love to have ideas if folks have any (basically rules). As Cody mentioned, implementing new methods of synth is quite easy (we have a few already in there) and so we could improve quickly if folks have ideas.

Take care

On Sat, Jan 31, 2015 at 4:07 PM, Karen Cranston notifications@github.com wrote:

So, it seems that synthesis is working differently that at least some of us expect. I would have also expected ((((A,(E1,E2)),B),C),D) as the synthetic tree. It is probably ok that it works differently, but we need to make sure 1. we can clearly articulate what is happening and 2. that we aren't making statement in the paper or in the properties pane of the web application (i.e. tree X 'supports' this edge) that conflict with what is actually happening.

I would like to understand why we aren't getting the expected tree. Can we get a few diagrams of what is going on?

— Reply to this email directly or view it on GitHub https://github.com/OpenTreeOfLife/treemachine/issues/157#issuecomment-72336681 .

kcranston commented 9 years ago

I don't see the attachment?

blackrim commented 9 years ago

Hm, here it is again.

On Sat, Jan 31, 2015 at 9:49 PM, Karen Cranston notifications@github.com wrote:

I don't see the attachment?

— Reply to this email directly or view it on GitHub https://github.com/OpenTreeOfLife/treemachine/issues/157#issuecomment-72348546 .

kcranston commented 9 years ago

I don't think it works to reply to the email notification with an attached file. You can drag and drop an image via GitHub, though.

kcranston commented 9 years ago

Pasting in the file from @blackrim example_tm

If ((E2,E1)E,D,(C,(B,A))) is the tree that results when the order of preference is blue (tree1) > green (tree2) > black (taxonomy), then I think we need to re-think how we are explaining the method in the paper. Would the edge from E to the root show in the browser as supported by taxonomy (which would be strange, given that we have tree2 with E) or would it show as supported by tree2 (which would be wrong, as tree2 doesn't have that relationship)?

blackrim commented 9 years ago

If would show as taxonomy. Not sure why the that would be strange though. On Jan 31, 2015 10:47 PM, "Karen Cranston" notifications@github.com wrote:

Pasting in the file from @blackrim https://github.com/blackrim [image: example_tm] https://cloud.githubusercontent.com/assets/312034/5990892/7e9f10e6-a997-11e4-83c4-f10aa639cc3a.png

If ((E2,E1)E,D,(C,(B,A))) is the tree that results when the order of preference is blue (tree1) > green (tree2) > black (taxonomy), then I think we need to re-think how we are explaining the method in the paper. Would the edge from E to the root show in the browser as supported by taxonomy (which would be strange, given that we have tree2 with E) or would it show as supported by tree2 (which would be wrong, as tree2 doesn't have that relationship)?

— Reply to this email directly or view it on GitHub https://github.com/OpenTreeOfLife/treemachine/issues/157#issuecomment-72350037 .

mtholder commented 9 years ago

housekeeping notes: I changed the name of the issue to be more informative. I also just pushed a run-reltaxa-example.sh to the https://github.com/OpenTreeOfLife/treemachine/tree/issue-157 branch. It is the example from this thread pruning off E1, E2, and D from the examples (in tree2, E is used instead of E1). I think that we should still discuss the run-simpleprob-example-diff-order-synth.sh in this thread. but the new example may be useful if you're looking for the simplest case of this phenomenon.

mtholder commented 9 years ago

@blackrim can you tell me what tool you use to make that graph? - it is very helpful (fortunately I could steal my daughter's colored markers to make one of my own, but it was tedious.) Thanks also for pointing me to the RankResoloutionMethod. It would have taken me a long time to find that.

The TAG paper says: "At each node, the procedure examines the subtending nodes, and determines if any of them conflict. For synthesis, downstream conflict is determined by comparing the LICAs for each child. If the LICAs from nodes subtending the current node overlap, then these descendant subgraphs define incompatible subtrees, and are said to be in conflict..." (and then it goes on to point to some other notions of conflict that you discuss in the paper which are based on neighbor count)

So I found the statement that "it checks the relationship taxa not the node" above quite surprising. I'm not sure that I know what "relationship taxa" means. I'm assuming that it means the "the taxa in the descendant set of the child node that were present in the edge."

If that is correct, then I think that this is a serious issue because it seems to undermine the justification for believing that we can resolve downstream conflicts by looking at edges entering a node. I thought that the advantage of the TAG was to highlight were conflict was and let you decide what sorts of conflict could be dealt with later in the algorithm.

I'd like to be able to think that the STREECHILDOF edge from node11 to the root implies that, if we select node11 to be part of the synthetic tree, then all of it descendants will be found under node 11. If we use the only the "relationship taxa", then we allow a lower ranking source (the taxonomy in this case) to "win" because its preference accepted deeper in the tree.

This makes it inaccurate to characterize our synthesis as: detecting conflict and resolving those conflicts by preferring higher ranked sources. And it becomes really hard to explain to any biologist how the synthesis works - and how to make the tree better.

I think that I have a sense of some of the downsides to using the node MRCA (as I would prefer to do) instead of the "relationship taxa". I'll try to come up with a simple example of those so that we can discuss them.

mtholder commented 9 years ago

So, I commented out the filtering down to just the list of descendants from the relationship. I think this works at one stage, but that E is not added to the tree because of https://github.com/OpenTreeOfLife/treemachine/issues/158

E is subsequently grafted back on at the root of the tree. I haven't looked for the code that does that. The output from the synthesis is:

Results of conflict resolution:
(result reporting feature not yet implemented)
have to add 1
taxaleft: 1 working with E

I think that adding a bit of logic to my pseudocode in https://github.com/OpenTreeOfLife/treemachine/issues/158 may be able get around this. Basically, we need to make sure that we attach all of the descendants of a node somewhere below that node.

mtholder commented 9 years ago

I should have mentioned that I tested this change on the RankNodeDesResolutionMethod branch that I just pushed.

blackrim commented 9 years ago

Cool, well check out in a bit. Also, just noting here something I mentioned in the other issue. We constructed a branch and bound ( though really mostly branch) synth that also is discussed in the paper and deals with this issue nicely. It simply doesn't scale well with the many millions nodes. It is in the code and can be switched on however for the big tree perhaps your solution will work better. On Feb 1, 2015 6:16 AM, "Mark T. Holder" notifications@github.com wrote:

I should have mentioned that I tested this change on the RankNodeDesResolutionMethod branch that I just pushed.

— Reply to this email directly or view it on GitHub https://github.com/OpenTreeOfLife/treemachine/issues/157#issuecomment-72360731 .

blackrim commented 9 years ago

The branch (and bound) method that I mentioned that addresses this problem is in there both in the conflict resolution folder and explorer in graph synthesis function. I am tracking down the main so I can do your example and will push that. However it uses very old code and doesn't seem to have been brought up to some of the newer classes that Cody wrote. I don't think it is worth it to do that and instead probably best to write some new ones including of the pseudo code you had (if you haven't already done that) On Feb 1, 2015 6:55 AM, "Stephen Smith" blackrim@gmail.com wrote:

Cool, well check out in a bit. Also, just noting here something I mentioned in the other issue. We constructed a branch and bound ( though really mostly branch) synth that also is discussed in the paper and deals with this issue nicely. It simply doesn't scale well with the many millions nodes. It is in the code and can be switched on however for the big tree perhaps your solution will work better. On Feb 1, 2015 6:16 AM, "Mark T. Holder" notifications@github.com wrote:

I should have mentioned that I tested this change on the RankNodeDesResolutionMethod branch that I just pushed.

— Reply to this email directly or view it on GitHub https://github.com/OpenTreeOfLife/treemachine/issues/157#issuecomment-72360731 .

blackrim commented 9 years ago

yeah, so something was broken with the graphsynthesis method when things were moved to the new system that isn't letting me do the branch and bound without the justtrees bit (that was the bit in the paper). i will see if i can fix that real quick but it won't address doing this at a larger scale.

On Sun, Feb 1, 2015 at 7:41 AM, Stephen Smith blackrim@gmail.com wrote:

The branch (and bound) method that I mentioned that addresses this problem is in there both in the conflict resolution folder and explorer in graph synthesis function. I am tracking down the main so I can do your example and will push that. However it uses very old code and doesn't seem to have been brought up to some of the newer classes that Cody wrote. I don't think it is worth it to do that and instead probably best to write some new ones including of the pseudo code you had (if you haven't already done that) On Feb 1, 2015 6:55 AM, "Stephen Smith" blackrim@gmail.com wrote:

Cool, well check out in a bit. Also, just noting here something I mentioned in the other issue. We constructed a branch and bound ( though really mostly branch) synth that also is discussed in the paper and deals with this issue nicely. It simply doesn't scale well with the many millions nodes. It is in the code and can be switched on however for the big tree perhaps your solution will work better. On Feb 1, 2015 6:16 AM, "Mark T. Holder" notifications@github.com wrote:

I should have mentioned that I tested this change on the RankNodeDesResolutionMethod branch that I just pushed.

— Reply to this email directly or view it on GitHub https://github.com/OpenTreeOfLife/treemachine/issues/157#issuecomment-72360731 .

mtholder commented 9 years ago

some more formalized discussion of the effects of this "relationship taxa" conflict detection are at https://github.com/OpenTreeOfLife/treemachine/blob/nonsense-1/nonsense/sted_support_theorem.md

mtholder commented 9 years ago

As hoped, It looks like https://github.com/OpenTreeOfLife/treemachine/commit/54fcad1e2bd72d20986f586cbab14d8e04832165 also fixes this issue. Still need to write more tests to make sure this isn't breaking other logic.

chinchliff commented 9 years ago

The recent edits to the loading procedure and synthesis seem to solve this problem as well. Here is the tag, and the results of the 'simpleprob' test (which I moved into test-synth dir on rootward-synth branch).

FAILING_TREEMACHINE_TEST=simpleprob ./run_synth_tests.sh

# [snipped]

((((A,(E1,E2)),B),C),D);
clade extracted: E1, E2
clade extracted: A, B, C, E1, E2
clade extracted: A, B, C, D, E1, E2
clade extracted: A, E1, E2
clade extracted: A, B, E1, E2
(((((E2_ottE2,E1_ottE1)E_ottE,A_ottA),B_ottB),C_ottC),D_ottD)life_ott805080;
clade extracted: E1, E2
clade extracted: A, B, C, E1, E2
clade extracted: A, B, C, D, E1, E2
clade extracted: A, E1, E2
clade extracted: A, B, E1, E2
SUCCESS! recovered expected tree.
Failed 0 out of 1 tests(s).

image