OpenTreeOfLife / germinator

miscellaneous scripts and data for concerns that span more than one of the Open Tree code repositories: integration tests, system statistics, etc.
BSD 2-Clause "Simplified" License

API v4 tweaks for a synth tree that deals with _incertae sedis_ taxa #123

Open mtholder opened 7 years ago

mtholder commented 7 years ago

@bredelings and I are working on the propinquity and otcetera changes needed to support treating incertae sedis taxa correctly. One wrinkle is that the same node can be identified by multiple OTT IDs.

So if the taxonomy is:

((A1_ott1,A2_ott2)A_ott3,(B1_ott4,B2_ott5)B_ott6*,(C1_ott7,C2_ott8)C_ott9*); 

with asterisks denoting incertae sedis taxa, and the synth tree is:

((A1,A2)A,((B1,C1)mrcaB1C1,(B2,C2)mrcaB2C2)x);

then the node x could be labeled B_ott6 or C_ott9. We may not have any such cases in a synthetic tree, but we should probably figure out what we are going to do for when they start showing up.

It would be easy to list these synonymies in the annotations file produced by propinquity. It is less clear how they would be dealt with in web services. In particular, several tree-of-life calls return an OTT ID.

Should that field be expanded to be an array of integers, or should we just pick one (e.g. the one with the lowest number) and list the synonyms in an additional field?
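
To make the two options concrete, here is a rough sketch of what each response shape might look like. The field names `ott_ids`, `ott_id`, and `synonym_ott_ids` are placeholders made up for illustration, not names taken from the current API.

```python
def as_id_array(ott_ids):
    """Option 1: the field becomes an array of integers."""
    # "ott_ids" is a hypothetical field name.
    return {"ott_ids": sorted(ott_ids)}


def as_primary_plus_synonyms(ott_ids):
    """Option 2: pick the lowest ID and list the rest in an extra field."""
    ordered = sorted(ott_ids)
    # "ott_id" and "synonym_ott_ids" are hypothetical field names.
    return {"ott_id": ordered[0], "synonym_ott_ids": ordered[1:]}


# Node x from the example above could carry both 6 and 9.
print(as_id_array({9, 6}))
print(as_primary_plus_synonyms({9, 6}))
```

Option 2 keeps the existing field an integer, so clients that ignore the extra field would keep working, but they would silently miss the synonyms.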

The larger issue is that any naming scheme in the face of incertae sedis taxa requires some definition of what the OTT IDs mean. My gut instinct would be to say that:

  1. the IDs should be interpreted as versioned by the taxonomy version, and
  2. For any particular version of OTT, the definition of a taxon is taken to be "the clade rooted at MRCA of all of the included taxa (descendants of the taxon) as long as that node excludes the entire exclude set of the taxon." The exclude set of a focal taxon is "all of the taxa outside of the focal taxon with the exception of any taxa that are descendants of incertae sedis taxa which are children of any ancestor of the focal taxon." Those incertae sedis taxa represent a "nonexcluded" set for the focal taxon. They can be inside or outside without changing the taxon's name.
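
A minimal sketch of how that definition could be applied to the example above. This is not what propinquity or otcetera do; it is just a toy check of the definition. The parent-map representation, the `root`/`synth_root`/`nodeA` labels, and all function names are assumptions made for the sketch. It also assumes that the focal taxon and its own ancestors never contribute to the nonexcluded set, which is one reading of the definition.

```python
from collections import defaultdict

# Taxonomy from the example (child -> parent); "root" is an assumed label.
TAXONOMY_PARENT = {
    "A_ott3": "root", "B_ott6": "root", "C_ott9": "root",
    "A1_ott1": "A_ott3", "A2_ott2": "A_ott3",
    "B1_ott4": "B_ott6", "B2_ott5": "B_ott6",
    "C1_ott7": "C_ott9", "C2_ott8": "C_ott9",
}
INCERTAE_SEDIS = {"B_ott6", "C_ott9"}

# Synth tree from the example; "synth_root" and "nodeA" are assumed labels.
SYNTH_PARENT = {
    "nodeA": "synth_root", "x": "synth_root",
    "A1_ott1": "nodeA", "A2_ott2": "nodeA",
    "mrcaB1C1": "x", "mrcaB2C2": "x",
    "B1_ott4": "mrcaB1C1", "C1_ott7": "mrcaB1C1",
    "B2_ott5": "mrcaB2C2", "C2_ott8": "mrcaB2C2",
}


def children_of(parent):
    kids = defaultdict(list)
    for child, par in parent.items():
        kids[par].append(child)
    return dict(kids)


def tips_below(node, kids):
    stack, tips = [node], set()
    while stack:
        n = stack.pop()
        below = kids.get(n, [])
        if below:
            stack.extend(below)
        else:
            tips.add(n)
    return tips


def ancestors(node, parent):
    """Ancestors of node, ordered from its parent up to the root."""
    out = []
    while node in parent:
        node = parent[node]
        out.append(node)
    return out


def exclude_set(taxon, parent, kids, incertae_sedis):
    anc = ancestors(taxon, parent)
    root = anc[-1] if anc else taxon
    outside = tips_below(root, kids) - tips_below(taxon, kids)
    lineage = {taxon, *anc}  # assumption: the focal lineage is skipped
    nonexcluded = set()
    for a in anc:
        for child in kids[a]:
            if child in incertae_sedis and child not in lineage:
                nonexcluded |= tips_below(child, kids)
    return outside - nonexcluded


def mrca(tips, parent):
    tips = list(tips)
    common = [tips[0]] + ancestors(tips[0], parent)  # tip-to-root order
    for t in tips[1:]:
        lineage = {t, *ancestors(t, parent)}
        common = [n for n in common if n in lineage]
    return common[0]


def synth_labels(tax_parent, incertae_sedis, synth_parent):
    """Map each synth node to the sorted list of taxa whose definition it meets."""
    tax_kids = children_of(tax_parent)
    synth_kids = children_of(synth_parent)
    labels = defaultdict(list)
    for taxon in tax_kids:  # internal taxa only
        node = mrca(tips_below(taxon, tax_kids), synth_parent)
        excluded = exclude_set(taxon, tax_parent, tax_kids, incertae_sedis)
        if not (tips_below(node, synth_kids) & excluded):
            labels[node].append(taxon)
    return {n: sorted(t) for n, t in labels.items()}


print(synth_labels(TAXONOMY_PARENT, INCERTAE_SEDIS, SYNTH_PARENT)["x"])
```

With B and C flagged as incertae sedis, node x satisfies both definitions. With an empty incertae sedis set, x gets no label at all, because each of B and C then excludes the other's tips.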

kcranston commented 7 years ago

Couple of questions:

mtholder commented 7 years ago

We could use the MRCA notation, but I think we still have to communicate to the user that things have changed and that it is now possible for one node in the tree to match more than one OTT definition. Or, at least, it makes sense to me that we'd want to communicate that to users.

In answer to the second point: yes, I was thinking of both B_ott6 and C_ott9 as valid taxa, they just cannot be excluded from intruding on other taxa in the tree. So not tips labeled "unclassified blah".