OpenTreeOfLife / opentree

Opentree browsing and curation web site. For overarching or cross-repo concerns, please see the 'germinator' repo.
http://tree.opentreeoflife.org/
BSD 2-Clause "Simplified" License

Sorting children list in taxonomy browser #831

Open jar398 opened 8 years ago

jar398 commented 8 years ago

The categories "Children included in the synthetic tree" and "Children suppressed from the synthetic tree" don't make sense; different children are included in different synthetic trees, and this is not a function of the taxonomy.

May I propose we do away with discrete categorization of the children and instead just use sort order and let the reader figure out their own categories based on the flags?

Here is an idea for the sort order. Assign priority values to the following groups of flags:

Group 3 [typically suppressed from synthesis by curatorial discretion]:

"extinct",
"viral",
"hidden",
"barren",

Group 2 [special taxonomic status, i.e. incertae-sedis-like]:

"major_rank_conflict",
"unclassified", 
"unplaced",
"incertae_sedis",

Group 1 [not really proper OTUs]:

"merged",
"inconsistent",
"was_container", 
"environmental",
"hybrid", 
"not_otu",

Note that a single taxon can belong to multiple groups. Let P = the priority of the lowest priority group for taxon A, and Q = the priority of the lowest priority group for taxon B. If P < Q, then A appears lower than B in the taxon list. If P = Q, then exclude that group from consideration, and compare A and B in the same way using the priorities of the remaining groups. If all priorities are the same, then compare A and B alphabetically by name.

This can be implemented by computing a bit mask for the three groups:

mask = {if in group 3, then 4 else 0}
     + {if in group 2, then 2 else 0}
     + {if in group 1, then 1 else 0}

Then do a lexicographic comparison of the mask and the name. (Python is good at this kind of thing.)
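A minimal sketch of the mask-then-name comparison, assuming each child is available as a (name, flags) pair. That shape and the function names `flag_mask` and `sort_children` are hypothetical and not taken from the opentree code; the flag groups are copied from the lists above.

```python
# Flag groups as listed in the proposal; the bit values follow the mask formula.
GROUP_3 = {"extinct", "viral", "hidden", "barren"}
GROUP_2 = {"major_rank_conflict", "unclassified", "unplaced", "incertae_sedis"}
GROUP_1 = {"merged", "inconsistent", "was_container",
           "environmental", "hybrid", "not_otu"}


def flag_mask(flags):
    """Return the 3-bit mask for a taxon's flags (hypothetical helper)."""
    flags = set(flags)
    mask = 0
    if flags & GROUP_3:
        mask += 4
    if flags & GROUP_2:
        mask += 2
    if flags & GROUP_1:
        mask += 1
    return mask


def sort_children(children):
    """Sort (name, flags) pairs by mask first, then by name.

    Python compares tuples lexicographically, so the key (mask, name)
    gives exactly the two-level ordering. Name comparison here is plain
    case-sensitive string comparison.
    """
    return sorted(children, key=lambda child: (flag_mask(child[1]), child[0]))


children = [
    ("Zea", []),
    ("Abies", ["extinct"]),
    ("Bos", ["unplaced"]),
    ("Acer", []),
    ("Carex", ["hybrid", "unplaced"]),
]
for name, flags in sort_children(children):
    print(flag_mask(flags), name, flags)
```

Flags that belong to none of the three groups contribute nothing to the mask, so they do not affect the order. With three groups there are eight possible mask values, and unflagged children (mask 0) sort first.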

The following flags should not be considered in sorting, but could still be displayed. (For each of these flags, either every child will have it, or none will.)

"major_rank_conflict_inherited", 
"unclassified_inherited", 
"unplaced_inherited",
"incertae_sedis_inherited", 
"extinct_inherited",

jar398 commented 8 years ago

I'll resist temptation and wait for review & encouragement (or a volunteer) before implementing this.

kcranston commented 8 years ago

In the short term, I like seeing the taxa explicitly grouped based on some sort of categorization (instead of a single list with an undescribed sort order). I don't object to the current categories - the flags help us decide what goes in the tree, so that grouping makes sense. Alternatively, we could just use "unflagged" and "flagged", with a short note on the page about what can cause a taxon to be flagged. If we are going to sort based on different groups of flags, then I would prefer to name these groups to help people parse how certain flags are similar.

Longer term, I think we need:

  1. A different way of presenting information on the page (a sortable table, or a way of filtering the list)
  2. Better documentation about the meaning and usage of the OTT flags, and perhaps a reduction in the number of flags.

kcranston commented 8 years ago

I don't think we can expect users to parse all of the flags into their own categories - there are just too many of them.

jar398 commented 8 years ago

At the very least, if we keep the two current groups, the headings have to be changed. The categories actually have to do with inputs to synthesis, not what is in the tree. You can have a taxon in the included-in-synthesis category which does not end up in the tree because it is inferred to be paraphyletic - making the current heading a lie.

Having to change the taxonomy browser when the synthesis excluded-flag set changes seems strange to me, and I think it would muddy user understanding of the architecture. It also makes it impossible to simultaneously serve multiple synthetic trees made with different excluded-flag sets. I would rather there be no reference to synthesis in the taxonomy browser. This leaves the problem, if a two-category scheme is kept, of what to call the two categories, but that can probably be solved.

The in/not-in synthesis categorization is useless once one descends under a suppression flag, because all descendants will have some '-inherited' flag and the 'is an input to synthesis' category will be empty. Probably if either group is empty, the heading should simply be "Children".

We could certainly fold together all the incertae-sedis-like categories for presentation purposes. I think that would be an improvement.

"Flagged" is not meaningful as a category because many flags are semantically benign (e.g. edited, sibling_lower).

Dividing into eight groups as I proposed, rather than two, does not require that the groupings be undocumented. The eight groups could even have headings in the children list, just as the current two groups do.

There are many other options along these lines. For example, we could have three categories instead of two, e.g. not-otu, incertae-sedis-like-and-not-not-otu, neither.

mtholder commented 8 years ago

I agree with @jar398 that the current headings should change, and with @kcranston that (longer term) we are going to need to help users with understanding the flags rather than just listing them with the link to the doc.

FWIW I think that a reasonable presentation would be three groups. We could say (on the synth tree browser) that we only include the taxa in the first group which are not flagged as "hybrid", "viral", or "extinct" in the current version of the synthetic tree. The 3 groups would be:

  1. Placed children. Any child that lacks a flag mentioned below.
  2. incertae sedis-like taxa. These are taxa (i.e. lacking any flag from the next bullet item that would make them "Non-taxa" names) which are flagged with something that means the same thing as incertae sedis: "major_rank_conflict", "unclassified", "unplaced", and "incertae_sedis".
  3. Non-taxa: Names which have been flagged as not being real taxa during the construction of OTT, but would be placed as children of this parent taxon if they were legitimate (we should not use the word "legitimate" though, as it has a particular meaning in nomenclature). I think these flags would be: "merged", "inconsistent", "was_container", "environmental", "not_otu", and "barren".
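One possible reading of this three-way split, as a rough sketch. The function name `child_group` and the idea of passing a child's flags as a plain list are assumptions for illustration; the flag sets and group labels come from the list above.

```python
# Flag sets copied from the three-group proposal above.
INCERTAE_SEDIS_LIKE = {"major_rank_conflict", "unclassified",
                       "unplaced", "incertae_sedis"}
NON_TAXA = {"merged", "inconsistent", "was_container",
            "environmental", "not_otu", "barren"}


def child_group(flags):
    """Return the display group for one child (hypothetical helper).

    Non-taxa flags are checked first, because the incertae sedis-like
    group is defined as excluding anything that has a non-taxa flag.
    """
    flags = set(flags)
    if flags & NON_TAXA:
        return "Non-taxa"
    if flags & INCERTAE_SEDIS_LIKE:
        return "incertae sedis-like taxa"
    return "Placed children"


print(child_group(["extinct"]))             # no flag from groups 2 or 3
print(child_group(["unplaced"]))
print(child_group(["unplaced", "merged"]))  # non-taxa flag takes precedence
```

Under this reading, children flagged only "hybrid", "viral", or "extinct" stay in the first group, and any statement about their exclusion from a particular synthetic tree would live in the synthetic tree browser rather than here.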

jar398 commented 8 years ago

I'm fine with something similar to what Mark suggests.

"hidden" needs to be treated similarly to "extinct".

We could argue about whether the "barren" taxa are legit - in principle every "barren" taxon contains some described species, and the problem is not with the name per se but with our lack of knowledge about it. But I would defer on this, I don't think it's very important.

jar398 commented 8 years ago

See this thread started by Tony Rees: https://groups.google.com/d/msg/opentreeoflife/Bp61NfjhsT8/TFfQ9DX8BAAJ Maybe the order should be alphabetical or rank-then-alphabetical.