Open kcranston opened 8 years ago
I lean towards the most liberal search approach: any overlap. I suppose that could get problematic as we have more trees, or with large or common taxa. People could get a lot of hits that are not what they are looking for. But I think the best approach is to return 'em all, and let the user decide.
FWIW, we faced a similar question in the Fossil Calibrations website, and our choice was to provide multiple "cladistic search" tools for trees containing:
Coming back to this issue, I often find myself wanting to ask "what trees might tell me something about the monophyly of taxon X?" (for example, in investigating this PR about arachnids). What I want is "trees that contain at least N descendants of this taxon". Perhaps we need more than one method, or options on the method that allow for different ways of framing this query.
I don't know if this is possible, but I'd like to perform a query including all of the following:
Getting a tree with only one exemplar is not useful, especially if it is in the outgroup. Having a "phylogenetically informative" search option (minimum of 3 taxa for rooted trees) seems like it would be useful.
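A rough sketch of what such a "phylogenetically informative" filter could look like, assuming each tree already has a per-taxon count of mapped tips available. The `tree_counts` structure and the function name are hypothetical, not part of the existing code, and the sketch does not distinguish ingroup from outgroup tips.

```python
def informative_trees(taxon_id, tree_counts, min_taxa=3):
    """Return ids of trees with at least `min_taxa` tips inside `taxon_id`.

    `tree_counts` is a hypothetical dict: tree id -> {OTT id: number of
    tips in that tree at or below the OTT id}.
    """
    return sorted(
        tree_id
        for tree_id, counts in tree_counts.items()
        if counts.get(taxon_id, 0) >= min_taxa
    )


# Toy data: OTT id 100 stands in for the taxon being searched.
tree_counts = {
    "tree1": {100: 5, 200: 1},  # five tips inside the taxon
    "tree2": {100: 1, 200: 7},  # a single exemplar only
    "tree3": {200: 4},          # no overlap at all
}
hits = informative_trees(100, tree_counts)
```

Setting `min_taxa=1` in this sketch falls back to the liberal "any overlap" behaviour, so the threshold could be exposed as a search option rather than a separate method.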
Sketch of implementation: In create_otu_table.py, you modify parent_closure to maintain a counter for each OTT id (from the study + all ancestors in OTT), and increment the counter at every step along the ancestor chain. So instead of returning a set, parent_closure now returns a map from OTT id to count. The caller writes an additional column to the csv file it's preparing, and this becomes an additional column of some table.
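To make the idea concrete, here is a minimal sketch of a counting parent_closure. It assumes the taxonomy is available as a `parent` dict mapping each OTT id to its parent (None at the root); that dict and the exact signature are assumptions, and the real function in create_otu_table.py may look different.

```python
from collections import Counter


def parent_closure(ott_ids, parent):
    """Map each OTT id (study taxa plus all ancestors) to a count.

    The count is the number of distinct study taxa at or below that id.
    `parent` is a hypothetical dict: OTT id -> parent OTT id (None at root).
    """
    counts = Counter()
    for ott_id in set(ott_ids):  # count each study taxon once
        node = ott_id
        while node is not None:
            counts[node] += 1  # increment at every step up the chain
            node = parent.get(node)
    return counts


# Toy taxonomy: 1 is the root, 2 and 3 are its children, 4 and 5 sit under 2.
parent = {1: None, 2: 1, 3: 1, 4: 2, 5: 2}
counts = parent_closure([4, 5, 3], parent)
```

The keys of the returned map are the same set the current version returns, so existing callers that only need membership could keep working on `counts.keys()`.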
In the find_trees method using OTT IDs or names, we return a tree in the results if there is at least one overlapping taxon in these two sets:
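As an illustration of the "at least one overlapping taxon" rule, the sketch below assumes the two sets are (a) the OTT ids matched by the query and (b) the OTT ids associated with each tree. Both the data layout and the function name are hypothetical and are not taken from the actual find_trees code.

```python
def find_trees_any_overlap(query_ott_ids, tree_taxa):
    """Return ids of trees sharing at least one OTT id with the query.

    `tree_taxa` is a hypothetical dict: tree id -> iterable of OTT ids
    associated with that tree.
    """
    query = set(query_ott_ids)
    return sorted(
        tree_id
        for tree_id, taxa in tree_taxa.items()
        if query & set(taxa)  # non-empty intersection means a match
    )


tree_taxa = {
    "treeA": [10, 11, 12],
    "treeB": [12, 99],
    "treeC": [50, 51],
}
matches = find_trees_any_overlap([12, 13], tree_taxa)
```

A single shared id is enough for a match here, which is why this style of search returns trees where the taxon appears only as one exemplar.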
This ends up being very liberal. For example, when searching for trees matching Balaenopteridae, we get 14 matched trees. Ignoring two of the results (TimeTree, and a tree that appears to have been uploaded in error), we see:
Question for folks. What would you expect from this method? Trees with any amount of overlap? Only trees with more than one taxon? Only matches for taxa in the ingroup?
I am not sure how oti implemented this search, but the current v3 method seems to consistently return far fewer trees than the implementation here.