OpenTreeOfLife / otcetera

C++20 lib for manipulations of phylogenetic trees and supertree operations

otcprunetaxonomy not working correctly(?) #3

Open josephwb opened 9 years ago

josephwb commented 9 years ago

I'm trying to test sub-synth trees for unsupported edges, but am having problems with "missing" OTT IDs (specifically, an OTT ID present in the taxonomy, but not the synth tree or input trees).

What I am doing: First: prune the taxonomy tree using the synthetic tree label set:

josephwb@WOPR:~/Desktop/test_synth$ otcprunetaxonomy Fungi_taxonomy.tre fungi.synth.tre > Pruned_Fungi_taxonomy.tre
2015-03-28 22:37:40,470 INFO  [default] reading "Fungi_taxonomy.tre"...
2015-03-28 22:37:46,285 INFO  [default] reading "fungi.synth.tre"...

Second: run the unsupported edge test:

josephwb@WOPR:~/Desktop/test_synth$ otcfindunsupportednodes Pruned_Fungi_taxonomy.tre fungi.synth.tre fungitrees/*.tre
2015-03-28 22:38:32,261 INFO  [default] reading "Pruned_Fungi_taxonomy.tre"...
2015-03-28 22:38:41,683 INFO  [default] reading "fungi.synth.tre"...
2015-03-28 22:38:49,124 INFO  [default] reading "fungitrees/ott1010493.tre"...
2015-03-28 22:38:49,124 INFO  [default] reading "fungitrees/ott1026597.tre"...
2015-03-28 22:38:49,126 INFO  [default] reading "fungitrees/ott103001.tre"...
2015-03-28 22:38:49,127 INFO  [default] reading "fungitrees/ott103002.tre"...
2015-03-28 22:38:49,132 INFO  [default] reading "fungitrees/ott1031212.tre"...
2015-03-28 22:38:49,132 INFO  [default] reading "fungitrees/ott104185.tre"...
2015-03-28 22:38:51,161 INFO  [default] reading "fungitrees/ott1082072.tre"...
2015-03-28 22:38:51,166 INFO  [default] reading "fungitrees/ott1098854.tre"...
ERROR. Exiting due to an exception:
OTT id not found 4085684

The OTT ID 4085684 appears both in my original "Fungi_taxonomy.tre" and in the otcetera-generated "Pruned_Fungi_taxonomy.tre". The taxon itself is barren. I can filter such "dubious" taxa from my taxonomy tree myself, but I thought otcprunetaxonomy would accomplish this.

mtholder commented 9 years ago

Can you post the trees somewhere? Also note that a recent commit added a "-" between the words in the names of the tools in the tools dir of otc. So these commands will be otc-prune-... and otc-find-un...

josephwb commented 9 years ago

Link here to the 3 input/output files mentioned above.

mtholder commented 9 years ago

Hmm. I don't see that taxon in either the fungi.synth.tre or the Fungi_taxonomy.tre that you posted. I do see it in the Pruned_Fungi_taxonomy.tre. I believe that otc-find-unsupported-nodes assumes that the leaf sets of the first 2 trees (the taxonomy and the synthetic tree) are identical. So perhaps the lack of that taxon in fungi.synth.tre is causing this.
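
One way to check that assumption before running otc-find-unsupported-nodes is to compare the leaf OTT IDs of the two trees directly. This is only a rough Python sketch and not part of otcetera: it assumes unquoted Newick labels that end in ott<ID>, the helper names (leaf_ott_ids, leaf_set_mismatch) are made up for illustration, and the toy trees mix IDs from this thread with the placeholder IDs ott1 and ott2.

```python
import re

# Assumption: unquoted Newick labels. A leaf label directly follows "(" or ",",
# whereas an internal-node label follows ")".
LEAF_LABEL = re.compile(r"[(,]\s*([^(),:;\s]+)")
OTT_ID = re.compile(r"ott(\d+)$")


def leaf_ott_ids(newick):
    """Return the set of OTT IDs found on the leaf labels of a Newick string."""
    ids = set()
    for label in LEAF_LABEL.findall(newick):
        match = OTT_ID.search(label)
        if match:
            ids.add(int(match.group(1)))
    return ids


def leaf_set_mismatch(taxonomy_newick, synth_newick):
    """Return (IDs only in the taxonomy leaves, IDs only in the synth leaves)."""
    tax = leaf_ott_ids(taxonomy_newick)
    synth = leaf_ott_ids(synth_newick)
    return tax - synth, synth - tax


# Toy trees (hypothetical): ott1 and ott2 are placeholders, not real taxa.
taxonomy = "((ott4931289,ott4085684)ott529465,ott1)ott2;"
synth = "(Phaffomycetaceae_ott529465,other_ott1)ott2;"

only_tax, only_synth = leaf_set_mismatch(taxonomy, synth)
print(sorted(only_tax))    # [4085684, 4931289]
print(sorted(only_synth))  # [529465]
```

Both lists being empty would mean the two leaf sets match. A single-leaf tree with no parentheses is not handled by this regex.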

josephwb commented 9 years ago

Ok, maybe that was an old file.

josephwb commented 9 years ago

Ok, yeah, I just included the wrong files in the link above. This one should work (or, er, shouldn't work. As expected, that is.).

grep 4085684 Unfiltered_Fungi_taxonomy.tre
# TRUE
grep 4085684 fungi.synth.tre
# FALSE
otcprunetaxonomy Unfiltered_Fungi_taxonomy.tre fungi.synth.tre > Pruned_Fungi_taxonomy.tre
grep 4085684 Pruned_Fungi_taxonomy.tre
# TRUE

mtholder commented 9 years ago

4085684 is not in fungi.synth.tre, but that tree does have a tip that is mapped to the parent of 4085684. I think that the prune taxonomy expands all non-terminal taxa that are assigned to tips to the full set of terminal taxa below them:

$ grep -o -P '.{0,40}529465.{0,20}' fungi.synth.tre
choascus_ott4085916,Phaffomycetaceae_ott529465,((Clavispora_opunti
$ grep -o -P '.{0,20}4085684.{0,20}' Unfiltered_Fungi_taxonomy.tre 
31290,ott4931289,ott4085684)ott529465,((ott4085

I think that is why this tip is not getting pruned.
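
To make the suspected behavior concrete, here is a toy Python sketch of that expansion step. It is an illustration of the hypothesis above, not the actual otcetera code: the function names are invented, and the child map is deliberately partial, listing only the two children of ott529465 whose IDs are fully visible in the grep output.

```python
def terminal_descendants(taxon, children):
    """Return the set of terminal taxa at or below `taxon`."""
    kids = children.get(taxon, [])
    if not kids:
        return {taxon}  # no children recorded: the taxon is itself terminal
    result = set()
    for kid in kids:
        result |= terminal_descendants(kid, children)
    return result


def expanded_keep_set(tip_ids, children):
    """Expand every tip mapped to a non-terminal taxon to the terminals below it."""
    keep = set()
    for tip in tip_ids:
        keep |= terminal_descendants(tip, children)
    return keep


# Hypothetical, partial taxonomy: parent ID -> child IDs.
children = {529465: [4931289, 4085684]}

# The synth tree has a tip mapped to the parent taxon, not to 4085684 itself.
synth_tips = {529465}

keep = expanded_keep_set(synth_tips, children)
print(sorted(keep))     # [4085684, 4931289]
print(4085684 in keep)  # True
```

Under this reading, 4085684 lands in the set of taxa to retain even though no tree mentions it directly.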

josephwb commented 9 years ago

So, that won't work, right? Here I pruned by synth tree and inputs:

otcprunetaxonomy Unfiltered_Fungi_taxonomy.tre fungi.synth.tre fungitrees/*.tre > Pruned_Fungi_taxonomy_all-inputs.tre

Still contains the taxon:

grep 4085684 Pruned_Fungi_taxonomy_all-inputs.tre
# TRUE

When I run otcfindunsupportednodes, it terminates as above because the synth tree does not contain the taxon.

BTW, the error reported above (note: the problematic taxon is different, because I used a different taxonomy tree):

josephwb@WOPR:~/Desktop/for_MTH2$ otcfindunsupportednodes Pruned_Fungi_taxonomy_all-inputs.tre fungi.synth.tre fungitrees/*.tre
2015-03-30 10:04:02,475 INFO  [default] reading "Pruned_Fungi_taxonomy_all-inputs.tre"...
2015-03-30 10:04:12,415 INFO  [default] reading "fungi.synth.tre"...
2015-03-30 10:04:19,975 INFO  [default] reading "fungitrees/ott1010493.tre"...
2015-03-30 10:04:19,975 INFO  [default] reading "fungitrees/ott1026597.tre"...
2015-03-30 10:04:19,977 INFO  [default] reading "fungitrees/ott103001.tre"...
2015-03-30 10:04:19,978 INFO  [default] reading "fungitrees/ott103002.tre"...
2015-03-30 10:04:19,983 INFO  [default] reading "fungitrees/ott1031212.tre"...
2015-03-30 10:04:19,984 INFO  [default] reading "fungitrees/ott104185.tre"...
ERROR. Exiting due to an exception:
OTT id not found 222914

is a little confusing: it dies in the middle of processing the inputs (i.e. not all inputs are listed above; many more to go). It seems like file "ott1098854.tre" is the problem, but I believe it is simply that "fungi.synth.tre" does not contain the taxon that is present in "Pruned_Fungi_taxonomy_all-inputs.tre". It seems like it starts processing the input trees before it has decided there is a conflict between the taxonomy and synth trees.

Anyway, unless I am off my rocker, this is not working as desired (i.e. to produce files conducive to downstream testing).

josephwb commented 9 years ago

Just a follow-up note on this.

The key to avoiding this problem (for me) is to first filter the taxonomy (currently using Python) by what treemachine skips, so that the taxonomy contains the same tip set as the synthetic tree. Problem is, especially for the proverbial Joe Schmoe, what treemachine actually uses to filter is not immediately obvious (e.g.).

So, yeah, it would be great if this method threw out such problematic taxa as above using just the inputs, so that a user wouldn't have to be aware of whatever taxonomy pruning has occurred elsewhere.
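
For what it's worth, the kind of pruning I mean could look roughly like the Python sketch below. It is not the script I actually use and not anything in otcetera, and all function names are made up. It assumes unquoted Newick labels ending in ott<ID> with no branch lengths, and it treats any taxon that is a tip in the synth tree as a tip in the taxonomy, dropping everything below it. The IDs ott1 and ott2 in the toy tree are placeholders.

```python
import re

TOKEN = re.compile(r"[(),;]|[^(),;]+")
PUNCT = {"(", ")", ",", ";"}


def parse_newick(text):
    """Parse unquoted Newick into nested (label, children) tuples."""
    tokens = TOKEN.findall(text.strip())
    pos = 0

    def node():
        nonlocal pos
        children = []
        if tokens[pos] == "(":
            pos += 1  # consume "("
            children.append(node())
            while tokens[pos] == ",":
                pos += 1  # consume ","
                children.append(node())
            pos += 1  # consume ")"
        label = ""
        if pos < len(tokens) and tokens[pos] not in PUNCT:
            label = tokens[pos].strip()
            pos += 1
        return label, children

    return node()


def ott_id(label):
    """Return the numeric OTT ID at the end of a label, or None."""
    match = re.search(r"ott(\d+)$", label)
    return int(match.group(1)) if match else None


def prune(node, keep_ids):
    """Return a pruned copy of `node`, or None if nothing below it is kept."""
    label, children = node
    if ott_id(label) in keep_ids:
        return label, []  # a tip in the synth tree: drop everything below it
    kept = [p for p in (prune(c, keep_ids) for c in children) if p is not None]
    return (label, kept) if kept else None


def to_newick(node):
    """Serialize a (label, children) tuple back to Newick, without the ';'."""
    label, children = node
    if children:
        return "(" + ",".join(to_newick(c) for c in children) + ")" + label
    return label


# Toy taxonomy (hypothetical): ott1 and ott2 are placeholders.
taxonomy = parse_newick("((ott4931289,ott4085684)ott529465,ott1)ott2;")
synth_tip_ids = {529465, 1}  # tip IDs of the synth tree

pruned = prune(taxonomy, synth_tip_ids)
print(to_newick(pruned) + ";")  # (ott529465,ott1)ott2;
```

Internal nodes left with a single child are kept as they are, and the parser is recursive, so a very deep tree could hit Python's recursion limit.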

mtholder commented 9 years ago

I'll certainly leave this open as a feature request. The otc-uncontested-decompose and otc-find-unsupported-nodes tools were written assuming that the leaf set of the taxonomy and the leaf set of the synth tree are the same. It shouldn't be too hard to deal with these non-terminal cases internally, but I'm afraid that I won't get to it soon. So the workaround of using a taxonomy and synth tree with the same leaf set will have to do for the time being (unless someone else wants to fix this).

It is far from complete, but I have been working on documenting the thinking behind otcetera in the doc subdir (PDF version posted at http://phylo.bio.ku.edu/ot/summarizing-taxonomy-plus-trees.pdf )

josephwb commented 9 years ago

Agreed. Not really an issue for me now that I've got an appropriately filtered taxonomy.