OpenTreeOfLife / treemachine

111 groupings in synthetic tree that are not supported by any input source tree #156

Closed mtholder closed 8 years ago

mtholder commented 9 years ago

Background

Issue #78 (https://github.com/OpenTreeOfLife/treemachine/issues/78) started because @ruchiherself's code identified cases in which a grouping in the synthetic tree conflicted with every tree in the input set. The definition of conflict is discussed in the "Conflict between trees and taxonomies" section of the supplemental material (https://docs.google.com/document/d/1qq9VZccfPMG9Xic0wmp5BXMur98KrjXOY3-ZVuKzz1U/edit#heading=h.l47v7xs1he4q).

I started pursuing this using code that applies a slightly different criterion for flagging groups. I think the flagged groups are indicative of bugs in treemachine, of our failure to capture the inputs precisely enough (such that the inputs we check actually differ from what was fed into treemachine), or of bugs in the checking tools.

This issue separates discussion of the problematic cases detected by the definition that I am using from the cases that Ruchi's code flags.

"unsupported"

UPDATE: I've revised this because Ruchi pointed out that I was not being consistent. The original text is now at the bottom of this post.

When I say that a group/node/edge in the synthetic tree is "unsupported" in this thread, I mean: if we were to collapse this group into its parent, then the Robinson-Foulds symmetric distance (RF) between the synthetic tree and each of the input trees would not increase.

For each input tree t in the set of input trees T (which includes the taxonomic tree):

  1. use S(t) to denote the synthetic tree S restricted to the leaf set of t
  2. let r(S, t) be the RF distance between S(t) and t

If Y is the synthetic tree with some edge y collapsed, then we say that y is supported if r(Y, t) > r(S, t) for at least one t in T. Otherwise y is unsupported.
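
The definition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the NCL-based tool: trees are written as nested tuples of leaf names, RF is computed on rooted clusters, every leaf of an input is assumed to occur in the synthetic tree, and the function names (`leaves`, `clusters`, `restrict`, `rf`, `is_supported`) are hypothetical.

```python
def leaves(tree):
    """Leaf labels of a nested-tuple tree such as ("A", ("B", "C"))."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def clusters(tree):
    """Leaf sets below every internal node of the tree (root included)."""
    if isinstance(tree, str):
        return set()
    found = {leaves(tree)}
    for child in tree:
        found |= clusters(child)
    return found


def restrict(cluster_set, keep):
    """Clusters induced on the leaf set `keep`, without trivial ones."""
    induced = {c & keep for c in cluster_set}
    return {c for c in induced if 1 < len(c) < len(keep)}


def rf(synth_clusters, t):
    """r(S, t): RF distance between S restricted to t's leaves and t."""
    keep = leaves(t)
    return len(restrict(synth_clusters, keep) ^ restrict(clusters(t), keep))


def is_supported(synth_clusters, y, inputs):
    """True if collapsing cluster y raises the RF distance to some input."""
    collapsed = synth_clusters - {y}  # cluster set of Y
    return any(rf(collapsed, t) > rf(synth_clusters, t) for t in inputs)


# Hypothetical data: one input that resolves C + D relative to B.
S = clusters(("A", ("B", ("C", "D"))))
t = ("B", ("C", "D"))
print(rf(S, t))                                      # 0
print(is_supported(S, frozenset({"C", "D"}), [t]))   # True
```

Working on cluster sets rather than on the trees themselves means that collapsing an edge is simply the removal of one cluster before restriction, which keeps the sketch short.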

Software

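I have written 2 tools to help find these cases:

  • checktaxonnodes checks all named nodes in the synthetic tree against their definition in OTT.
  • findunsupportededges looks for internal nodes in the synthetic tree that:
    • do not have a name and
    • which are not supported by any non-taxonomic input

These are in the examples subtree of NCL. I forked NCL to the Open Tree group (https://github.com/OpenTreeOfLife/ncl) to make it easier for any of us to modify it.

I've posted the contents of the standard output stream (http://phylo.bio.ku.edu/ot/findunsupportededges-out.txt) and the standard error stream (http://phylo.bio.ku.edu/ot/findunsupportededges-err.txt).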

There are 111 groupings that findunsupportededges found which are unsupported.

checktaxonnodes found 22 problems - those are reported on issue https://github.com/OpenTreeOfLife/treemachine/issues/154.

Differences from what Ruchi's code is calculating.

Under the Wilkinson terminology (if I'm understanding it correctly) if we had the synthetic tree of:

S = (A, (B, (C, D)))

from two inputs:

t_1 = (A, (B, C))
t_2 = (A, (B, D))

then I think the clade (C, D) would be considered irrelevant on both trees. Ruchi's code is reporting conflicting cases, so this would not be reported.

Under the "unsupported" definition that I am using, this grouping would be considered unsupported because the tree with the polytomy (A,(B,C,D)) fits the inputs just as well. Intuitively, there is no information in the inputs indicating that C is closer to D than it is to B, so it seems like we should be returning the polytomy.
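
The claim that the polytomy fits just as well can be checked with a small Python sketch. It assumes rooted-cluster RF on trees written as nested tuples; the helper names are hypothetical and this is not the code in NCL.

```python
def leaves(tree):
    """Leaf labels of a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def clusters(tree):
    """Leaf sets below every internal node of the tree (root included)."""
    if isinstance(tree, str):
        return set()
    found = {leaves(tree)}
    for child in tree:
        found |= clusters(child)
    return found


def restrict(cluster_set, keep):
    """Clusters induced on the leaf set `keep`, without trivial ones."""
    induced = {c & keep for c in cluster_set}
    return {c for c in induced if 1 < len(c) < len(keep)}


def rf(synth, t):
    """RF distance between `synth` restricted to t's leaves and t."""
    keep = leaves(t)
    return len(restrict(clusters(synth), keep) ^ restrict(clusters(t), keep))


S = ("A", ("B", ("C", "D")))
polytomy = ("A", ("B", "C", "D"))
inputs = {"t_1": ("A", ("B", "C")), "t_2": ("A", ("B", "D"))}

for name, t in inputs.items():
    print(name, rf(S, t), rf(polytomy, t))
# t_1 0 0
# t_2 0 0
```

Both the resolved tree and the polytomy are at distance 0 from each input, so neither input gives a reason to keep the (C, D) edge.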

This difference in evaluation explains why my software classifies this group (https://tree.opentreeoflife.org/opentree/argus/otol.draft.22@3840208) as unsupported, while Ruchi's code considers pg_2644_6164.tre (https://tree.opentreeoflife.org/curator/study/view/2644?tab=trees&tree=tree6164) to support it. The source tree does indeed have a grouping of (Aspidocarya + Parabaena) + (Tinomiscium + Tinospora). But if we back up one node in the synthetic tree (https://tree.opentreeoflife.org/opentree/otol.draft.22@3840209), we see that the sister group is Calycocarpum. Calycocarpum is not sampled in pg_2644_6164.tre. So, according to that source tree, there is no reason that you could not have any resolution of the 3-way polytomy: (Calycocarpum, (Aspidocarya, Parabaena), (Tinomiscium, Tinospora))

I think that Ruchi's list will be a subset of my list because of cases like this one. And this does not imply a bug in either - just different classification schemes.

Why I think this is a problem

All supertree methods have some quirks, so the presence of a few groupings that are not intuitive is not a problem per se. But I think these groups indicate that there is a bug in synthesis.

It could be that I am just misunderstanding the TAG procedure (http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003223). If that is the case, I would appreciate someone correcting me. I thought that a valid description of the synthesis procedure would be:

  1. Add inputs to the TAG one at a time.
  2. For each node in an input tree t_i, we create a set of edges to a LICA node. These nodes may include other taxa (because of other input trees). Crucially:

    A. This is the only operation that adds edges to the graph.

    B. The parent node of the edge will always be the MRCA of a larger set of leaves than the child node, even when restricted to the leaf set of t_i.

    C. Thus, t_i will support any edge that is created by its introduction into the TAG.

    D. Thus, every edge in the TAG will be supported by at least one input.

  3. The synthesis operation only decides which edges to "trace" to make a tree. It does not create new edges.

If all of that is correct, then every edge/grouping in the synthetic tree should be supported by at least one input. So my checktaxonnodes and findunsupportededges programs should also report no problems.

updated: typo in the first word of the description fixed. Doh!

Original incorrect definition of unsupported

Just for the record, here is the text that was originally above...

When I say that a group/node/edge in the synthetic tree is "unsupported" in this thread, I mean: If we were to collapse this group into its parent, then the total Robinson-Foulds symmetric distance (RF) between the synthetic tree and the set of inputs would not change.

We can calculate the total RF distance for the synthetic tree S as follows:

For each input tree t in the set of input trees T (which includes the taxonomic tree):

  1. use S(t) to denote the synthetic tree S restricted to the leaf set of t
  2. let r(S, t) be the RF distance between S(t) and t

Then the total RF distance R(S,T) is simply the sum of r(S,t) over all t in T.

blackrim commented 9 years ago

Just to be clear, tree6165 does support that (Aspidocarya + Parabaena) + (Tinomiscium + Tinospora) grouping with Calycocarpum though (those are all sampled in tree6165). So that isn't one of the unsupported nodes is it? Unless I am missing something there.

There were some others that were pointed out in the previous emails that I will check up on. Cody may need to explain how we place the non-monophyletic taxa, because that may be where this is coming up. Otherwise, there is likely a bug.

mtholder commented 9 years ago

But that source tree (6165) has Calycocarpum as sister to a group containing Aspidocarya, Parabaena, Tinomiscium, Tinospora but also Orthogynium.

In the synthetic tree, Orthogynium attaches well outside of this group (https://tree.opentreeoflife.org/opentree/argus/otol.draft.22@3573300/Orthogynium), which is why my findunsupportededges does not give that tree credit for supporting that split.

blackrim commented 9 years ago

OK. This one looks to me like it is likely a non-monophyly thing. I will check on that. The sentence needs to be deleted from the response to reviewers anyway, because I am pretty sure that with non-monophyly (regardless of whether this is a case of that) there are edges added. Cody will need to chime in on that (I will ask him to if he doesn't and I see him in a bit).

mtholder commented 9 years ago

I think Orthogynium is a monotypic taxon.

blackrim commented 9 years ago

Yeah, I don't mean within that group; I mean within the Menispermoideae.

blackrim commented 9 years ago

Or within the Menispermaceae rather

mtholder commented 9 years ago

I don't understand what you mean by "it is likely a non monophyly thing."

Menispermaceae is not a tip label in any of the input trees. So I don't understand how it being non-monophyletic is different from other cases of conflict between different sources of phylogenetic information.

Could you or @chinchliff or @josephwb confirm that the 3 numbered points that I list above are a correct characterization of the synthesis procedure?

I suppose I should add another statement:

4. When an edge is chosen in the synthesis, all of its descendant tips will remain below the edge; they won't be cherry-picked into other groups.

If that is not the case (or any of my previous 3 statements are incorrect), then the 111 groupings reported here may just be a wart of the procedure and not a bug.

Edit: markdown caused my #4 to show up as 1. Fixed.

ruchiherself commented 9 years ago

Mark, I want to add something here. For our input (taxonomy + 484 other trees), your definition of 'unsupported' and my definition of unsupported are (probably) the same. The example under "Differences from what Ruchi's code is calculating." is correct. But remember we have the taxonomy in the input. Both the synthetic tree and the taxonomy are of the same size (# of leaves), and the taxonomy can never be 'irrelevant' for any node of the synthetic tree. So the disagreement between our definitions (explained through this example) doesn't really apply in our case. For example, for the given tree S = (A, (B, (C, D))) we will always have an input tree of the same size. In this input tree, either C and D will make a clade or not, so support or no support, respectively.

I have already thoroughly studied some of your identified nodes (or groupings) that I didn't have in my list. I have this feeling that our lists of unsupported nodes will be identical. However, all my analysis is based on the Newick strings that I received from Joseph. If they are wrong, then I cannot guarantee anything.

mtholder commented 9 years ago

Ruchi, you are correct that we have the full taxonomy, but it is highly unresolved. You can easily expand the case that I gave earlier. Consider:

S = (A, (B, (C, D)))

from 3 inputs:

t_1 = (A, (B, C))
t_2 = (A, (B, D))
t_3 = (A, (B, C, D))

I think that your code would say that the (C, D) group is irrelevant with respect to the first 2 trees, and permitted by the third. So it is not "in conflict".

My code would call it "unsupported".

ruchiherself commented 9 years ago

My 87 nodes include this case too. I am counting all those nodes that have 0 support, but may have permit, conflict, or irrelevant from all the input trees.

In your example, (C,D) group will get irrelevant from first two input trees and permit from the last input tree as you said. So for (C,D) group, support = 0, permit = 1, conflict = 0, and irrelevant = 2. So (C,D) group must be in my list since support is 0 for it.
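
The tally above can be reproduced with a small Python sketch. The four labels come from this thread, but the rules used here to assign them are an assumption made for illustration; they are not a transcription of my code or of Wilkinson et al.'s definitions. Trees are nested tuples of leaf names, and `classify` is a hypothetical name.

```python
from collections import Counter


def leaves(tree):
    """Leaf labels of a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def clusters(tree):
    """Leaf sets below every internal node of the tree (root included)."""
    if isinstance(tree, str):
        return set()
    found = {leaves(tree)}
    for child in tree:
        found |= clusters(child)
    return found


def classify(clade, t):
    """Label the relation of a synthetic-tree clade to input tree t."""
    keep = leaves(t)
    induced = clade & keep  # the clade restricted to t's leaf set
    if not 1 < len(induced) < len(keep):
        return "irrelevant"  # restriction leaves a trivial group
    t_clusters = clusters(t)
    if induced in t_clusters:
        return "support"  # t has the identical clade
    overlaps = (induced & c and not (induced <= c or c <= induced)
                for c in t_clusters)
    return "conflict" if any(overlaps) else "permit"


inputs = [("A", ("B", "C")), ("A", ("B", "D")), ("A", ("B", "C", "D"))]
counts = Counter(classify(frozenset({"C", "D"}), t) for t in inputs)
for label in ("support", "permit", "conflict", "irrelevant"):
    print(label, counts[label])
# support 0
# permit 1
# conflict 0
# irrelevant 2
```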

mtholder commented 9 years ago

Ah, I see. Thanks for clarifying. But I think that our codes would diverge on:

S = (A, (B, (C, D)))

from 4 inputs:

t_1 = (A, (B, C))
t_2 = (A, (B, D))
t_3 = (A, (C, D))
t_4 = (A, (B, C, D))

My code would still call the (C,D) clade "unsupported" because none of the inputs say that C is closer to D than it is to B.

ruchiherself commented 9 years ago

Wait... but your initial definition of "unsupported" doesn't approve of that.

Unsupported: When I say that a group/node/edge in the synthetic tree is "unsupported" in this thread, I mean: If we were to collapse this group into its parent, then the total Robinson-Foulds symmetric distance (RF) between the synthetic tree and the set of inputs would not change.

So (C,D) is not an "unsupported" group by your definition, since if we collapse (C,D) into its parent (B,C,D) in S, then R(S,T) becomes 0, but it was 1 before collapsing.
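
These two totals can be checked with a short Python sketch. It assumes rooted-cluster RF on nested-tuple trees; the helper names, including `total_rf`, are hypothetical.

```python
def leaves(tree):
    """Leaf labels of a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def clusters(tree):
    """Leaf sets below every internal node of the tree (root included)."""
    if isinstance(tree, str):
        return set()
    found = {leaves(tree)}
    for child in tree:
        found |= clusters(child)
    return found


def restrict(cluster_set, keep):
    """Clusters induced on the leaf set `keep`, without trivial ones."""
    induced = {c & keep for c in cluster_set}
    return {c for c in induced if 1 < len(c) < len(keep)}


def rf(synth, t):
    """r(S, t): RF distance between S restricted to t's leaves and t."""
    keep = leaves(t)
    return len(restrict(clusters(synth), keep) ^ restrict(clusters(t), keep))


def total_rf(synth, inputs):
    """R(S, T): the sum of r(S, t) over all inputs t."""
    return sum(rf(synth, t) for t in inputs)


T = [("A", ("B", "C")), ("A", ("B", "D")), ("A", ("C", "D")),
     ("A", ("B", "C", "D"))]
S = ("A", ("B", ("C", "D")))
Y = ("A", ("B", "C", "D"))  # S with the (C, D) edge collapsed

print(total_rf(S, T))  # 1
print(total_rf(Y, T))  # 0
```

The only nonzero term before collapsing comes from t_4, which lacks the (C, D) cluster that S contains.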

mtholder commented 9 years ago

Good point. I should have said that the RF distance stays the same or decreases. So the unresolved form of the synthetic tree is at least as good as the resolved form when there is an "unsupported" node.

Sorry for the confusion.

mtholder commented 9 years ago

I hadn't been thinking of unresolved inputs clearly when I wrote this issue report. What I should have said was:

By "unsupported" I mean that if we collapse the edge, the RF distance for the restricted synthetic tree to each of the source trees is unchanged or decreases.

My code doesn't calculate the total RF. It just tries to find (for every edge in the synthetic tree) at least 1 input tree that supports the edge. If collapsing the edge causes the RF distance to any of the input trees to increase, then it calls the edge supported. Sorry again for misstating this earlier.
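
A rough Python sketch of that per-edge scan follows. It is not the NCL implementation: it assumes rooted-cluster RF on nested-tuple trees, it ignores node names and the taxonomy filter, and `unsupported_edges` is a hypothetical name.

```python
def leaves(tree):
    """Leaf labels of a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def clusters(tree):
    """Leaf sets below every internal node of the tree (root included)."""
    if isinstance(tree, str):
        return set()
    found = {leaves(tree)}
    for child in tree:
        found |= clusters(child)
    return found


def restrict(cluster_set, keep):
    """Clusters induced on the leaf set `keep`, without trivial ones."""
    induced = {c & keep for c in cluster_set}
    return {c for c in induced if 1 < len(c) < len(keep)}


def rf(synth_clusters, t):
    """RF distance between the restricted synthetic tree and input t."""
    keep = leaves(t)
    return len(restrict(synth_clusters, keep) ^ restrict(clusters(t), keep))


def unsupported_edges(synth, inputs):
    """Internal edges whose collapse raises the RF distance to no input."""
    synth_clusters = clusters(synth)
    flagged = []
    for y in synth_clusters - {leaves(synth)}:  # skip the root cluster
        collapsed = synth_clusters - {y}
        if not any(rf(collapsed, t) > rf(synth_clusters, t) for t in inputs):
            flagged.append(sorted(y))
    return sorted(flagged)


S = ("A", ("B", ("C", "D")))
T = [("A", ("B", "C")), ("A", ("B", "D")), ("A", ("C", "D")),
     ("A", ("B", "C", "D"))]
print(unsupported_edges(S, T))  # [['C', 'D']]
```

In this example the (B, C, D) edge is kept because collapsing it would increase the distance to t_1, while the (C, D) edge is flagged.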

ruchiherself commented 9 years ago

I think I understand it now. It's different from my count. I declare support for a node in the synthetic tree if there is at least one tree in the input that has an identical clade (after restricting, of course). But Mark's definition finds support for a node in the synthetic tree if the RF distance from at least one input tree goes up after collapsing this node. I think the RF distance can go up only for those input trees that originally had an identical clade (i.e., those that were supporting by my definition). In particular, the RF distance from those input trees that initially had identical clades can either stay the same or go up. The remaining trees are either irrelevant or their RF goes down (i.e., for permit or conflict cases).
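
The difference between the two criteria can be illustrated with a Python sketch. Both functions are simplified stand-ins written for this comparison (nested-tuple trees, rooted clusters, hypothetical names), not either of the actual programs.

```python
def leaves(tree):
    """Leaf labels of a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def clusters(tree):
    """Leaf sets below every internal node of the tree (root included)."""
    if isinstance(tree, str):
        return set()
    found = {leaves(tree)}
    for child in tree:
        found |= clusters(child)
    return found


def restrict(cluster_set, keep):
    """Clusters induced on the leaf set `keep`, without trivial ones."""
    induced = {c & keep for c in cluster_set}
    return {c for c in induced if 1 < len(c) < len(keep)}


def rf(synth_clusters, t):
    """RF distance between the restricted synthetic tree and input t."""
    keep = leaves(t)
    return len(restrict(synth_clusters, keep) ^ restrict(clusters(t), keep))


def has_identical_clade(y, t):
    """Identical-clade criterion: t contains y restricted to t's leaves."""
    keep = leaves(t)
    induced = y & keep
    return 1 < len(induced) < len(keep) and induced in clusters(t)


def rf_increases(synth_clusters, y, t):
    """RF criterion: collapsing y raises the distance to t."""
    return rf(synth_clusters - {y}, t) > rf(synth_clusters, t)


S = clusters(("A", ("B", ("C", "D"))))
y = frozenset({"C", "D"})
for t in [("A", ("C", "D")), ("B", ("C", "D"))]:
    print(t, has_identical_clade(y, t), rf_increases(S, y, t))
# ('A', ('C', 'D')) True False
# ('B', ('C', 'D')) True True
```

The input (A, (C, D)) has the identical clade, but collapsing (C, D) in S does not change the restricted tree, so only the identical-clade criterion counts it as support. The input (B, (C, D)) counts under both.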

My analysis should have a subset of Mark's nodes. I also think that these extra nodes (Mark's nodes - my nodes) can be computed using Wilkinson et al.'s strongest support (page 828 in that paper). I have computed those numbers but never included them in the Science or PNAS paper. I can provide them if they are useful.
