OpenTreeOfLife / otindex

opentree index using postgres and pyramid
BSD 2-Clause "Simplified" License

behaviour of find_trees based on OTT names or OTT IDs #24

Open kcranston opened 8 years ago

kcranston commented 8 years ago

In the find_trees method using OTT IDs or names, we return a tree in the results if there is at least one overlapping taxon in these two sets:

This ends up being very liberal. For example, when searching for trees matching Balaenopteridae, we get 14 matched trees. Ignoring two of the results (TimeTree, and a tree that appears to have been uploaded in error), we see:

Question for folks. What would you expect from this method? Trees with any amount of overlap? Only trees with more than one taxon? Only matches for taxa in the ingroup?

I am not sure how oti implemented this search, but the current v3 method seems to consistently return far fewer trees than the implementation here.

snacktavish commented 8 years ago

I lean towards the most liberal search approach: any overlap. I suppose that could get problematic as we have more trees, or with large or common taxa. People could get a lot of hits that are not what they are looking for. But I think the best approach is to return 'em all, and let the user decide.

jimallman commented 8 years ago

FWIW, we faced a similar question in the Fossil Calibrations website, and our choice was to provide multiple "cladistic search" tools for trees containing:

kcranston commented 8 years ago

Coming back to this issue, I find myself often wanting to ask the question "what trees might tell me something about the monophyly of taxon X" (for example, in investigating this PR about arachnids). What I want is "trees that contain at least N descendants of this taxon". Perhaps we need more than one method, or options on the method that allow for different ways of framing this query.

josephwb commented 7 years ago

I don't know if this is possible, but I'd like to perform a query including all of the following:

  1. Studies that include 3 or more taxa from my clade of interest (say, oh, Aves)
  2. Those taxa should all be in the ingroup
  3. The trees should be well-curated (including "preferred tree" annotation)

Getting a tree with only one exemplar is not useful, especially if it is in the outgroup. Having a "phylogenetically informative" search option (minimum of 3 taxa for rooted trees) seems like it would be useful.
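
A minimal sketch of what such a "phylogenetically informative" filter might look like, applied to tree records already in memory. Everything here is an assumption made for illustration: the record fields (`tree_id`, `ingroup_ott_ids`, `preferred`), the function name `informative_trees`, and the made-up OTT IDs are hypothetical and are not taken from the otindex schema or API.

```python
def informative_trees(trees, clade_ott_ids, min_taxa=3, require_preferred=True):
    """Return ids of trees with at least `min_taxa` ingroup taxa in the clade.

    `trees` is a list of dicts with hypothetical keys:
      tree_id          identifier of the tree
      ingroup_ott_ids  OTT ids of the tips that fall in the ingroup
      preferred        True if the tree carries a "preferred tree" annotation
    `clade_ott_ids` is the set of OTT ids belonging to the clade of interest.
    """
    hits = []
    for tree in trees:
        # Criterion 3: optionally keep only curated ("preferred") trees.
        if require_preferred and not tree["preferred"]:
            continue
        # Criteria 1 and 2: count clade members among the ingroup tips only.
        overlap = set(tree["ingroup_ott_ids"]) & clade_ott_ids
        if len(overlap) >= min_taxa:
            hits.append(tree["tree_id"])
    return hits


# Toy data with invented ids, for illustration only.
clade = {101, 102, 103, 104}
trees = [
    {"tree_id": "tree_a", "ingroup_ott_ids": [101, 102, 103], "preferred": True},
    {"tree_id": "tree_b", "ingroup_ott_ids": [101], "preferred": True},
    {"tree_id": "tree_c", "ingroup_ott_ids": [101, 102, 104], "preferred": False},
]
result = informative_trees(trees, clade)
relaxed = informative_trees(trees, clade, require_preferred=False)
```

With the defaults, only `tree_a` passes. Dropping the curation requirement also admits `tree_c`, while `tree_b` is excluded either way because it has a single exemplar.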

jar398 commented 7 years ago

Sketch of implementation: In create_otu_table.py, you modify parent_closure to maintain a counter for each ott id (from the study + all ancestors in OTT), and increment the counter at every step along the ancestor chain. So instead of returning a set, parent_closure now returns a map from OTT id to count. The caller writes an additional column to the csv file it's preparing, and this becomes an additional column of some table.
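
The counting idea above could be sketched roughly as follows. This is not the actual code in create_otu_table.py: the signature of `parent_closure`, the `parent_of` mapping (child OTT id to parent OTT id), the row layout, and the toy ids are all assumptions made for illustration.

```python
from collections import Counter


def parent_closure(ott_ids, parent_of):
    """Map each OTT id (study taxa plus all ancestors) to a count.

    The count for an id is the number of distinct study taxa that are
    equal to it or descend from it. `parent_of` is a hypothetical dict
    from child id to parent id, with the root mapped to None.
    """
    counts = Counter()
    for ott_id in set(ott_ids):
        node = ott_id
        # Walk up the ancestor chain, incrementing at every step.
        while node is not None:
            counts[node] += 1
            node = parent_of.get(node)
    return counts


# Toy taxonomy with invented ids: 1 is the root, 2 is a clade
# containing tips 3 and 4, and 5 is a tip directly under the root.
parent_of = {1: None, 2: 1, 3: 2, 4: 2, 5: 1}
counts = parent_closure([3, 4, 5], parent_of)

# The caller would write the count as an extra column, one row per
# (tree, OTT id) pair; the tree id "tree1" is a placeholder.
rows = [("tree1", ott_id, n) for ott_id, n in sorted(counts.items())]
```

With a count column in place, a query such as "trees that contain at least N descendants of this taxon" reduces to a threshold on that column for the taxon's OTT id.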