OpenTreeOfLife / feedback

No code -- just an issue tracker for general feedback (sent here via GitHub's issues API)
1 stars 0 forks source link

Changes in induced_subtree #423

Open miller34 opened 6 years ago

miller34 commented 6 years ago

This issue might be related to issue #419. It started a week or two ago. Using the induced_subtree option for the curl script, the returned newick tree is completely different than it used to be. In the wiki, the example command is as follows:

curl -X POST https://api.opentreeoflife.org/v2/tree_of_life/induced_subtree -H "content-type:application/json" -d '{"ott_ids":[292466, 501678, 267845, 666104, 316878, 102710, 176458]}'

with an expected output:

{
  "node_ids_not_in_tree" : [ ],
  "node_ids_not_in_graph" : [ ],
  "ott_ids_not_in_tree" : [ 501678, 176458 ],
  "tree_id" : "opentree4.0",
  "ott_ids_not_in_graph" : [ ],
  "newick" : "(Struthio_ott292466,((Clangula_ott316878,Perdix_ott102710)Galloanserae_ott5839486,(Cinclus_ott267845,Setophaga_ott666104))Neognathae_ott241846)Aves_ott81461;"
}

Now (after removing OTT ID 666104 because it no longer appears on the Tree of Life) the output is as follows:

{
 "newick": "(((((((((((((((((((((((((((((Cinclus_ott267845)Cinclidae_ott496027)mrcaott1566ott496009)mrcaott1566ott2175)mrcaott1566ott5934)mrcaott246ott1566)mrcaott246ott3364)mrcaott246ott1488)mrcaott246ott10351)mrcaott246ott193904)mrcaott246ott4820)mrcaott246ott191995)mrcaott246ott32658)mrcaott246ott5929)mrcaott246ott44866)mrcaott246ott428578)mrcaott246ott3212)Passeriformes_ott1041547)mrcaott246ott7113)mrcaott246ott47588)mrcaott246ott3600042)mrcaott246ott2907)mrcaott246ott1858)mrcaott246ott928360)mrcaott246ott5272)mrcaott246ott7145)mrcaott246ott5021,(((((((((((((((mrcaott164165ott164175)mrcaott164165ott810756)mrcaott164165ott759410)mrcaott11007ott164165)mrcaott11007ott141989)mrcaott5481ott11007)mrcaott5481ott141987)mrcaott5481ott11002)mrcaott5481ott18655)Trochilidae_ott810751)Trochiliformes_ott810738)mrcaott5481ott65384)mrcaott5481ott68112)mrcaott5481ott275128)mrcaott5481ott178241)mrcaott5481ott51270)mrcaott246ott5481,(((((((((((((Perdix_ott102710)mrcaott53700ott466627)mrcaott53700ott572162)mrcaott4765ott53700)mrcaott4765ott51354)mrcaott4765ott415487)mrcaott4765ott49310)mrcaott4765ott49319)mrcaott4765ott54193)mrcaott4765ott151684)mrcaott4765ott104461)mrcaott4765ott75785)Galliformes_ott837585,((((((Clangula_ott316878)mrcaott166683ott316879)mrcaott30843ott166683)mrcaott30843ott75874)Anatidae_ott765193)mrcaott30843ott714464)Anseriformes_ott241841)Galloanserae_ott5839486)Neognathae_ott241846,(((Struthio_ott292466)Struthionidae_ott857849)Struthioniformes_ott857847)Palaeognathae_ott81443)Aves_ott81461;",
 "supporting_studies": [
  "ot_425@tree1",
  "pg_2853@tree6624",
  "ot_290@tree1",
  "ot_534@tree1",
  "pg_420@tree522",
  "ot_500@tree1",
  "pg_2404@tree5068",
  "pg_2404@tree6388",
  "ot_111@tree1",
  "pg_2577@tree5980",
  "pg_2860@tree6646",
  "pg_2866@tree6656",
  "ot_314@tree1"
 ]
}
mtholder commented 6 years ago

Hi, Thanks for the report. This is related to us deprecating v2 of the API. There is some discussion of this on on our email list serve: https://groups.google.com/forum/#!topic/opentreeoflife-software/tF0cwvJ7nQA

We should be returning some 400-series http status code for v2 in the cases (like this one) in which the current version (v3) of the API differs from the v3.

If there is some particular bit of the info that you need that is no longer returned by the induced_tree call, let us know and we'll see if we can get that info back into the call. As you may have guessed, all of the nodes with names that start with "mrcaott..." are names that correspond to groups that are not known taxa in our taxonomy.

miller34 commented 6 years ago

Thanks for the quick reply. I understand how big of a headache it must be to roll out this new version. I like the addition of supporting studies.

It would be nice to have similar functionality as before where the unknown MRCA are collapsed and the returned newick tree is able to be used immediately. Right now there are a lot of single nodes [e.g., (Cinclus_ott267845)], which throws newick parsing errors in a lot of programs (such as ETE3).

bredelings commented 6 years ago

Hmm.... so the problem seems to be with out-degree=1 ancestor nodes. These really are part of the induced subtree, and there is nothing malformed about including them. So its not really a Newick error, although as you say a lot of programs can't handle them.

We could add a remove_monotypic_ancestors flag - it seems like that would be quite useful.

bredelings commented 6 years ago

Regarding the /v2/ api, I think there are just some apache routes rewriting these to /v3/. I suppose apache should probably pass these unchanged to the wrapper, and then handle them in the wrapper.

Hmm... I guess we could make the /v2/ calls errors only if the arguments have actually changed. On the other hand treating some /v2/ calls as errors and forwarding others to /v3/ could cause some surprise at a later point for /v2/ users.

bredelings commented 6 years ago

@miller34 It doesn't sound like you are missing any of the fields like ott_ids_not_in_tree?

I suppose if we put this functionality back in, we would report missing nodes through node_ids_not_in_tree.

jar398 commented 6 years ago

Re out-degree=1 ancestor nodes:

I thought about this a while back and got worried about the following issue: There could be N single-descendant nodes on a chain between two 2+-ancestor nodes, and then which one would you pick as the 'exemplar'? The one furthest from the root or closest to the root maybe, but either of these could be much less useful to the user than one of the intermediate nodes. E.g. the furthest could be Mjdhajsdhfjia, and the closest could be Mxkdfjksjia, neither of which anyone has ever heard of, while Mammalia might be somewhere in between. You could just leave it unlabeled, or provide some XmrcaY label computed in a predictable way.

miller34 commented 6 years ago

Yeah, I can definitely understand why you went this route, and I think that it is really useful for some applications. It might also be useful to have a flag to return the pruned unlabeled tree. If you're comparing species placement on trees using different tree reconstruction algorithms or just trying to see how species a related relative to each other, then you don't necessarily care about the intermediate nodes. I've made a wrapper on my end to do that, but I can imagine that a lot of systematists with limited programming experience might find the current output a bit overwhelming.

Also, I can understand why species names that appear multiple times in the Tree of Life are characterized differently, e.g., 'Nannospalax galili (species in domain Eukaryota) ott207281' because Nannospalax galili is also listed in bacteria, but it makes for ad hoc parsing, especially since with the ott id (and possible clade labels), you already know from which domain the species originated.

miller34 commented 6 years ago

@bredelings I think that I would prefer if the output formatting stayed consistent. So, if you decide to keep the current format, then I would prefer that the format stayed the same forever, with changes in the format being added as different options or different curl posts/responses at a later date.