OpenTreeOfLife / otcetera

C++20 lib for manipulations of phylogenetic trees and supertree operations
Other
4 stars 3 forks source link

Decomposition should always return meaningful phylogenetic statements #14

Closed josephwb closed 9 years ago

josephwb commented 9 years ago

For input tree "pg_335_360":

((ott229542,ott348531),(ott625710,ott258235));

it is decomposed into:

1. subprob ott348529:
(ott258235,ott625710);
2. subprob ott42109:
(ott348529,ott5345854);

However, these are not phylogenetic statements. Treemachine will skip such trees. The result is that the sister relationship between "Magnaporthe" and "Magnaporthiopsis" (despite kind of being represented by subprob 2) does not make it into the DB.

Another (more important) example is "pg_2573_5959", which involves the placement of turtles. The labelled newick is:

(((Chrysemys_picta_ott587300,Pelomedusa_subrufa_ott775044),((Crocodylus_porosus_ott35872,Alligator_mississippiensis_ott335590),(Gallus_gallus_ott153563,Taeniopygia_guttata_ott708327))),(Sphenodon_punctatus_ott35890,(Anolis_carolinensis_ott705356,Pantherophis_guttatus_ott456236)));

The decomposition into separate subproblems is:

  1. Birds (2 tips)
  2. Crocs (2 tips)
  3. Archosauria (2 tips) i.e. roots of 1 & 2
  4. Turtles (2 tips)
  5. Testudines + Archosaurs (2 tips) i.e. roots of 3 & 4
  6. Lepidosauria (3 tips)
  7. Lepidosauria + Testudines-Archosaurs (2 tips) i.e. roots of 5 & 6

These newicks (except no. 6) are ignored by treemachine, so the placement of turtles is lost (fortunately it corresponds to taxonomy, but that is beside the point). I imagine this is more than just a problem working with treemachine.

What I think would be better: whenever such non-phylogenetic statements are made, move it to the parent subproblem.

I understand that the work involved is not insignificant :bowtie:

mtholder commented 9 years ago

The motivation for the subproblems decomposition is to support divide and conquer approaches which will treat the subproblems a set of problems that can be solved and then stitched back together by grafting the problem ID onto the deeper subproblem that has that ID as a leaf.

If I were to start emitting these statements in the parent subproblems, then the subproblems would not be independent. One would have to check the ancestral subproblem when solving. And the grafting step could be more complicated. So I really do not want to do that by default.

We could add another type of decomposition, but I'm afraid that I don't see the point.

I agree that there are trivial trees (trees which make no phylogenetic statements) written out to the subproblems. But this is useful if supertree tool constructs the full tree and then wants to "trace" the edges of the supertree that the inputs traverse. So I see a downside to suppressing this information or moving it deeper in the output.

I don't really see the advantage of removing this info (and letting client code ignore trivial statements). So I don't see a big positive.

The fact that the info about turtle+archosaurs agrees with taxonomy is not "beside the point"; it is critical to how the decomposition works. If there had not been a taxon (ottID=4947372 and number 5 in your list) which put turtles and Archosaurs as sister, then that subproblem (ott4947372) would not have been written. So the phylogenetic inputs favoring turtles+archosaurs would have shown up in the deeper subproblem (Sauria ott329823 = number 7 in your list).

Tt is not a lucky coincidence that the info is encoded in the taxonomy. If there is conflict with the taxonomy, the taxon will be "contested" and a deeper subproblem will be emitted. If the phylogenetic statement from a source tree is redundant with the taxonomy, then you'll get a subproblem with trivial groups.

I'm sorry if I am just misreading your request, I just don't see why the current behavior is a problem.

update: tiny edit to replace pound signs with "number" to avoid unintended markdown interpretation.

josephwb commented 9 years ago

I wasn't clear: I meant repackaging trees into subproblems such that the subproblems are still independent. So, for example, if we encounter a trivial newick, move that entire subproblem up into the parent subproblem; subproblems should still be independent.

Like I said: not an easy feature to pull off.

mtholder commented 9 years ago

But why? Fewer, bigger subproblems leads to slower running times (assuming that the supertree operation is worse than linear wrt number of tips - which it will be). The grafting operation could be linear wrt the number of subproblems (because one could uses something like a constant-time lookup hash table to find the subproblem's structure from its rootID).

Under such a situation, you typically want to divide into as many small problems as you can.

josephwb commented 9 years ago

Yeah, I understand all that. I am arguing from a practical standpoint with treemachine: we cannot represent the tree (A,B); or (A);

kcranston commented 9 years ago

Do we retain the information about trees that support a given edge even when the subproblems are trivial?

blackrim commented 9 years ago

Joseph and I just talked about this and I would say that I agree with Mark on this. Larger subproblems will just increase the run times and complexity. There are other ways to retain the information about the trivial trees (single tips and two tips).

On Wed, Apr 29, 2015 at 8:57 AM, Mark T. Holder notifications@github.com wrote:

But why? Fewer, bigger subproblems leads to slower running times (assuming that the supertree operation is worse than linear wrt number of tips - which it will be). The grafting operation could be linear wrt the number of subproblems (because one could uses something like a constant-time lookup hash table to find the subproblem's structure from its rootID).

Under such a situation, you typically want to divide into as many small problems as you can.

— Reply to this email directly or view it on GitHub.

josephwb commented 9 years ago

Talked this over with @blackrim. It is clear now that I am an idiot: the answer should be the same, but would take far greater comptational effort to get there. Request withdrawn.