OpenTreeOfLife / germinator

miscellaneous scripts and data for concerns that span more than one of the Open Tree code repositories: integration tests, system statistics, etc.
BSD 2-Clause "Simplified" License

make a decision on what to do with incertae sedis (etc.) taxa and their descendants #32

Open mtholder opened 8 years ago

mtholder commented 8 years ago

Some aspects of this problem (and some of the solutions discussed below) are discussed in more detail in reference-taxonomy issue 153. This issue was created to discuss not the taxonomic side of the phenomenon (the increased number of incertae sedis groups), but the architectural decision to be made.

We discussed this some on a call a few weeks ago. I prefer solution 4 below, fwiw.

Problem:

Groups that are garbage bins are often labelled incertae sedis. The OTT recognizes and flags incertae sedis groups, and it needs to keep doing this to correctly reflect the meaning of the input classifications.

However, taxa that are descendants of incertae sedis groups are currently suppressed from the synthetic tree. Unfortunately, a fair number of "good" taxa have some uncertainty with respect to what higher taxon they belong to, and these taxa are missing from the synth tree (despite the fact that they are in OTT and may even be mapped by some source trees).

Solution 0: the current system

Prune descendants of incertae sedis groups from synthesis inputs.

Pros:
  1. Web browsing views of large groups (e.g. Fungi) don't get cluttered with lots of taxa that are incertae sedis. I think this was the only motivation for this solution.
Cons:
  1. New users would expect (based on the descriptions of the OT project as being comprehensive) that the synth tree will be comprehensive with respect to the names in OTT that are thought to correspond to taxa.
  2. We don't display the taxonomy-without-incertae-sedis tree anywhere, so it is not easy for users to see why these taxa are missing.
  3. It seems ad hoc given that we know that a lot of the taxonomic groups in OTT are probably not monophyletic (and we are not really encouraging users to flag more groups as incertae sedis). In other words, it seems that the project does not really think that this is the correct solution for taxa for which we lack a good taxonomy.

Solution 1: prune taxonomy-only incertae sedis

Prune descendants of incertae sedis groups from synthesis inputs only if they are not used in an input tree.
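As a rough illustration of how Solution 1 differs from Solution 0, the sketch below filters a toy taxonomy. The data layout (a dict of taxon id to flag set), the ids, and the names `prune_solution_0`, `prune_solution_1`, and `mapped_ids` are assumptions made for this example; none of this is the actual synthesis code.

```python
# Hypothetical toy data: taxon id -> set of flags. Not the real OTT format.
INCERTAE_SEDIS_FLAGS = {"incertae_sedis", "incertae_sedis_inherited"}


def is_incertae_sedis(flags):
    """True if the taxon carries an incertae sedis flag (direct or inherited)."""
    return bool(flags & INCERTAE_SEDIS_FLAGS)


def prune_solution_0(taxa):
    """Solution 0: drop every flagged taxon from the synthesis inputs."""
    return {t for t, flags in taxa.items() if not is_incertae_sedis(flags)}


def prune_solution_1(taxa, mapped_ids):
    """Solution 1: drop a flagged taxon only if no input tree uses it."""
    return {t for t, flags in taxa.items()
            if not is_incertae_sedis(flags) or t in mapped_ids}


taxa = {
    "ott1": set(),
    "ott2": {"incertae_sedis"},
    "ott3": {"incertae_sedis_inherited"},
}
mapped_ids = {"ott3"}  # assumed: taxa that appear in at least one input tree

kept_0 = prune_solution_0(taxa)
kept_1 = prune_solution_1(taxa, mapped_ids)
```

In this toy case Solution 0 keeps only `ott1`, while Solution 1 also keeps `ott3` because an input tree maps to it.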

Pros:
  1. More taxa make it into the synth than when we use Solution 0.
  2. The poorly classified taxa that make it in will be placed (presumably fairly reliably) by phylogenetic statements.
  3. No changes to the front-end code are required.
  4. Still avoids huge polytomies at well-known groups (Fungi, Arthropoda...).
Cons:

Same 3 downsides as Solution 0.

Solution 2: don't prune

Allow all taxa inside incertae sedis groups into the input tree. Note that this does not mean that the incertae sedis flagged group will appear as a clade, merely that its descendants can be found in the tree.

Pros:
  1. Addresses the "cons" 1-3 of Solutions 0 and 1.
  2. Only a trivial change to the code of the synthesis procedure is needed.
Cons:
  1. Huge polytomies at well-known groups will make interactive browsing tedious and payloads of API calls huge (or result in errors because size limits are exceeded). This is sort of a straw man included for the sake of completeness.

Solution 3: don't prune, flag-aware serving

Allow incertae sedis groups into input tree; but retain their flagging in the tool that serves the synthetic tree. Add an argument to the relevant API calls to optionally include these taxa in responses.

Pros:
  1. Addresses the "cons" 1-3 of Solutions 0 and 1.
Cons:
  1. Requires some code changes on the front end and back end. The synth-server code changes and flagging should be trivial, but we would also need some GUI work (presumably, just a checkbox for "show incertae sedis").
  2. Some confusion is always possible when there are user-tweakable displays ("why does your browser show a different tree than mine? Oh, we had different filtering options set").

Solution 4: don't prune, flag-aware serving, always return mapped

Implement solution 3, but make the taxa that are mapped to inputs returned regardless of the "include incertae sedis" argument. That boolean argument would mean "include incertae sedis taxa that are not covered by any phylogenetic input" in this version.

This is just a slight tweak of the default behaviors relative to solution 3. Here the web app by default would show the same tree as you would get with solution 1, or a more cluttered tree if the user requests it.
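To make the difference in default behavior concrete, here is a minimal sketch of the two serving rules. The function names, the `include_incertae_sedis` parameter name, and the toy data are hypothetical; the real API argument was not specified in this discussion.

```python
def visible_solution_3(taxa, include_incertae_sedis=False):
    """Solution 3: flagged taxa are returned only when the caller asks for them."""
    return {t for t, flagged in taxa.items()
            if include_incertae_sedis or not flagged}


def visible_solution_4(taxa, mapped_ids, include_incertae_sedis=False):
    """Solution 4: flagged taxa mapped to an input are always returned.

    The boolean only controls flagged taxa that no phylogenetic input covers.
    """
    return {t for t, flagged in taxa.items()
            if include_incertae_sedis or not flagged or t in mapped_ids}


# Hypothetical toy data: taxon id -> "is flagged incertae sedis".
taxa = {"ott1": False, "ott2": True, "ott3": True}
mapped_ids = {"ott3"}  # assumed: flagged taxon that some input tree maps to

default_3 = visible_solution_3(taxa)
default_4 = visible_solution_4(taxa, mapped_ids)
full_4 = visible_solution_4(taxa, mapped_ids, include_incertae_sedis=True)
```

With the defaults, Solution 3 hides both flagged taxa, whereas Solution 4 still returns the mapped one; with the argument set, both return everything.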

edited last sentence to say " same tree as you would get with solution 1" in response to @hyanwong 's comment

hyanwong commented 8 years ago

In the last line "the same tree as you would get with solution 2" - do you mean solution 1?

mtholder commented 8 years ago

yup. thanks. I'll fix that.

mtholder commented 8 years ago

@hyanwong noted on the "Missing fossils from OTT" thread (https://groups.google.com/forum/?fromgroups&hl=en#!topic/opentreeoflife/QN2n17Gqylo) that the same arguments apply to fossil taxa as to incertae sedis taxa. I agree with that (and favor solution 4 for both).

edited: markdown

hyanwong commented 8 years ago

Solution 4 sounds good to me. Although unlike incertae sedis, for 'fossils=False' you would clearly want to omit fossil taxa even if they exist in a phylogenetic study.

kcranston commented 8 years ago

I favour solution 4 as well.

jar398 commented 8 years ago

I think it's misleading to describe this up front as a problem with incertae sedis taxa. We have many categories of taxa that this applies to. Yan mentions extinct, but there are others such as 'barren' (higher taxon containing no species) and hybrid. I don't know a good categorical name for these; I don't like 'dubious' for reasons I've stated elsewhere.

You didn't mention another con of all solutions except 0, which is that moving the taxa in question to locations given in source phylogenies changes the membership of all internal taxa from the correct location up to the common ancestor with the taxon's original place. That interacts with the way internal nodes are labeled in the output synthetic tree. If nodes are labeled, as they are now, by compatible membership, the labels will be lost because memberships won't be compatible. Other labeling strategies are possible, but we haven't talked about them.

I don't remember, but this problem may have been as much of a motivator for suppressing these things as the 'clutter' issue.

bomeara commented 8 years ago

I'd mildly prefer 4, but 1 and 3 are fine with me, too.

jar398 commented 8 years ago

I would like to hear from @blackrim since he was the one most recently looking at versions of the synthetic tree that contained these taxa. IIRC he had said that they "screw up synthesis" or wording to that effect. Or else I would like some experiments done to see whether they "screw up synthesis" or not. We can run with all suppressed, none suppressed, and taxonomy-only suppressed, and make lists of which internal taxonomic nodes are lost in the 2nd 2 cases relative to the 1st.

[added] well, obviously, the taxonomy-only nodes won't change the structure of the tree, so 1-3-4 are equivalent as far as what people will see in the UI. So there is really just one structural experiment to do, 0 vs. all the rest.
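The comparison proposed here amounts to a set difference over the labeled internal nodes of each run. A minimal sketch, assuming each synthesis run can be reduced to a set of taxon labels; the labels below are placeholders, not results from any real run.

```python
def lost_labels(baseline, other):
    """Labels of internal taxonomic nodes present in `baseline` but absent from `other`."""
    return sorted(baseline - other)


# Placeholder label sets for two hypothetical runs (not real results).
all_suppressed = {"taxon_A", "taxon_B", "taxon_C"}   # Solution 0 run
none_suppressed = {"taxon_B", "taxon_C"}             # run with nothing suppressed

lost = lost_labels(all_suppressed, none_suppressed)
```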

jar398 commented 8 years ago

Following today's discussion I would like to amend the issue description in the following ways:

There was rough consensus today on number 4, with the next big hurdle being what to do about the fact that putting these taxa back into synthesis causes OTT taxa like Eukaryota to become paraphyletic.

snacktavish commented 8 years ago

Following Slack discussion with @jar398, I think the flags that we are considering not suppressing in this issue are: "major_rank_conflict", "major_rank_conflict_inherited", "incertae_sedis", "incertae_sedis_inherited", "unplaced", "unplaced_inherited", "unclassified", "unclassified_inherited", which may together be considered 'incertae sedis' in the broad sense. Some taxa with these flags are also flagged "not_otu". These taxa will still be suppressed.
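A small sketch of that rule, using the flag names listed above. The function name and the representation of a taxon's flags as a Python set are assumptions for illustration only.

```python
# Flags treated as 'incertae sedis' in the broad sense (names from the list above).
BROAD_INCERTAE_SEDIS_FLAGS = frozenset({
    "major_rank_conflict", "major_rank_conflict_inherited",
    "incertae_sedis", "incertae_sedis_inherited",
    "unplaced", "unplaced_inherited",
    "unclassified", "unclassified_inherited",
})

# Flags that keep a taxon suppressed regardless of the ones above.
STILL_SUPPRESSED_FLAGS = frozenset({"not_otu"})


def candidate_for_unsuppression(flags):
    """True if the taxon is broad-sense incertae sedis and not otherwise suppressed."""
    flags = set(flags)
    return bool(flags & BROAD_INCERTAE_SEDIS_FLAGS) and not (flags & STILL_SUPPRESSED_FLAGS)
```

For example, `candidate_for_unsuppression({"unplaced"})` is true, while `candidate_for_unsuppression({"unplaced", "not_otu"})` is false.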

jar398 commented 8 years ago

Thinking about how this information is passed to tm-lite. Tm-lite reads the taxonomy, so it has the flags, and it has the filtered flag set from the propinquity annotations file, and the annotations file also says which taxa are taxonomy-only (no conflict/support properties, or something like that). So it could indeed do conditional suppression when it serves arguson or newick via the subtree or induced_subtree methods, with no further changes to the propinquity/tm-lite interface.

As we have discussed there are two related issues that need to be addressed in order to make this work:

  1. Propinquity has to be able to use incertae sedis taxa without losing the identity of the nodes that they are placed in (if (a,b,(c,d)f)e and a is incertae sedis and a is placed in f, that does not reflect a conflict with f, because disjointness of a and f was never claimed)
  2. Suppression flags need to percolate upwards, i.e. if all the children of x are suppressed, then x should be suppressed too (smasher cannot do this because it doesn't know which flags lead to suppression). This could be done by tm-lite, by someone who knows neo4j, by smasher (if smasher were told what the suppression flags were), or by propinquity.
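Both points can be sketched on the toy tree (a,b,(c,d)f)e from item 1. This is only an illustration under assumed data structures (a dict of parent to children and plain Python sets); it is not how propinquity, smasher, or tm-lite represent trees, and the function names are made up.

```python
def label_survives(taxon_members, clade, floating):
    """Item 1: a clade keeps the taxon's label if it contains all of the
    taxon's members and any extra tips are incertae sedis ("floating") taxa."""
    return taxon_members <= clade <= (taxon_members | floating)


def percolate_suppression(children, suppressed, root):
    """Item 2: return `suppressed` extended so that any node whose children
    are all suppressed is suppressed too (postorder traversal)."""
    result = set(suppressed)

    def visit(node):
        kids = children.get(node, [])
        for kid in kids:
            visit(kid)
        if kids and all(kid in result for kid in kids):
            result.add(node)

    visit(root)  # recursion is fine for a toy tree; a large tree would need an explicit stack
    return result


# Toy tree (a,b,(c,d)f)e, where a is incertae sedis within e.
children = {"e": ["a", "b", "f"], "f": ["c", "d"]}

a_in_f = label_survives({"c", "d"}, {"a", "c", "d"}, floating={"a"})
b_in_f = label_survives({"c", "d"}, {"b", "c", "d"}, floating={"a"})
after = percolate_suppression(children, {"c", "d"}, "e")
```

Placing a inside f leaves the label f intact, while placing b there does not, since b was explicitly classified outside f. Suppressing c and d also suppresses f, but not e, because a and b remain visible.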

jar398 commented 8 years ago

I changed the issue title because it is only about taxa that are 'flagged' incertae sedis (and its equivalents: unplaced, unclassified, major-rank-conflict, etc.) - it is not about flagged taxa in general.

I had tried searching for this issue several times and the issue title threw me off. I think it will be easier to find now.

jar398 commented 8 years ago

This issue came up again today, where a user was expressing disappointment that so many mapped taxa were failing to appear in the trees he was fetching through the API.

The issue here was to make a decision. I'm not sure who has the authority to make such a decision. Would someone care to assign this issue to themselves?

jar398 commented 5 years ago

Close this issue? https://gitter.im/OpenTreeOfLife/public?at=5ba031801ee2ca65022fa37c

bredelings commented 5 years ago

Hmm.... I kind of hope we move to something more like solution 4 eventually. In that case maybe we should keep the issue open.