OpenTreeOfLife / taxomachine

taxonomy graphdb

improvements to contexts #134

Open chinchliff opened 8 years ago

chinchliff commented 8 years ago

Some feature requests from @uyedaj:

  1. Provide more contexts
  2. Support "synonyms" for contexts in the web services, e.g. make the "Land plants" context also accessible via the official taxon name "Embryophyta", and make context names case-insensitive (if they are currently case-sensitive)

jar398 commented 8 years ago

In theory we only need one context for each nomenclatural code. From a UI point of view we already seem to have too many. I'm sure @uyedaj is right, but I would like to know why we need them; that would also tell me which ones to add. Examples would help.

Maybe it would be more useful to allow an arbitrary taxon to be used as a context? I don't know if that's possible in taxomachine, though.

mtholder commented 8 years ago

I agree with the idea that every higher taxon in OTT should be usable as a context. For the sake of efficiency, the implementation of that might require:

  1. merging the results over a few pre-calculated contexts (each of which is too small for the request),
  2. filtering the results from a pre-calculated context (which is too large), or
  3. merging then filtering

But those would all be implementation details, and not visible to (or confusing to) the client.
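
The merge and filter options above can be sketched in a few lines. This is only an illustration under stated assumptions, not taxomachine's implementation: the toy taxonomy, the choice of pre-calculated contexts, and the names `PARENT`, `INDEXES`, `search_by_filtering`, and `search_by_merging` are all hypothetical, and exact-name lookup in a set stands in for a real search index.

```python
# Toy taxonomy (child -> parent). "Homo sapiens" hangs directly off the root
# only to keep the example small.
PARENT = {
    "Metazoa": None,
    "Homo sapiens": "Metazoa",
    "Mollusca": "Metazoa",
    "Cephalopoda": "Mollusca",
    "Gastropoda": "Mollusca",
    "Octopus vulgaris": "Cephalopoda",
    "Sepia officinalis": "Cephalopoda",
    "Helix pomatia": "Gastropoda",
}


def lineage(taxon):
    """Yield the taxon itself, then each ancestor up to the root."""
    while taxon is not None:
        yield taxon
        taxon = PARENT[taxon]


def build_index(context):
    """Set of all names at or below `context`; stands in for a real search index."""
    return {name for name in PARENT if context in lineage(name)}


# Hypothetical choice of pre-calculated contexts; "Mollusca" is deliberately absent.
INDEXES = {c: build_index(c) for c in ("Metazoa", "Cephalopoda", "Gastropoda")}


def search_by_filtering(names, context):
    """Option 2: query the nearest enclosing pre-calculated context, then filter."""
    enclosing = next((t for t in lineage(context) if t in INDEXES), None)
    if enclosing is None:
        raise KeyError(f"no pre-calculated context encloses {context!r}")
    hits = [n for n in names if n in INDEXES[enclosing]]
    return {n for n in hits if context in lineage(n)}


def search_by_merging(names, context):
    """Option 1: merge hits from pre-calculated contexts nested inside the request.

    Only complete when the nested contexts jointly cover the requested taxon.
    """
    nested = [c for c in INDEXES if context in lineage(c)]
    return {n for n in names for c in nested if n in INDEXES[c]}


query = ["Octopus vulgaris", "Helix pomatia", "Homo sapiens"]
filtered = search_by_filtering(query, "Mollusca")
merged = search_by_merging(query, "Mollusca")
```

Both calls return the two mollusc names and drop "Homo sapiens", even though no "Mollusca" index exists. Filtering is always complete but searches a larger index; merging searches less but misses any name that falls outside the nested contexts.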

mtholder commented 8 years ago

While I understand the benefit of tolerating synonyms, it actually seems much cleaner to me to require 2 calls:

  1. get an OTT ID for your context (which would support synonyms)
  2. then use that OTT ID to specify the context.
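
A minimal sketch of that two-call flow, assuming in-memory lookup tables in place of the real services. The function names and tables are hypothetical, and the ID is a made-up placeholder, not a real OTT ID.

```python
# Hypothetical tables: the IDs are made-up placeholders, not real OTT IDs.
NAME_TO_ID = {
    "land plants": 101,   # informal context name
    "embryophyta": 101,   # official taxon name for the same taxon
}
CONTEXT_BY_ID = {101: "Land plants"}


def resolve_context_id(name):
    """Call 1: map any known name or synonym to an ID, ignoring case."""
    return NAME_TO_ID.get(name.strip().lower())


def context_for_id(taxon_id):
    """Call 2: select the context by ID only; no name handling needed here."""
    return CONTEXT_BY_ID[taxon_id]


taxon_id = resolve_context_id("EMBRYOPHYTA")
context = context_for_id(taxon_id)
```

Synonym and case handling live entirely in the first call, so the second call never has to interpret a name.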

chinchliff commented 8 years ago

Well, besides name disambiguation, the other advantage of contexts is that they limit the search space and are thus faster and provide better fuzzy matches. For example, "Felis domestica" (an invalid name for the house cat) is a close fuzzy match to "Malus domestica" (apple). I note these are already separated by existing contexts, but it at least illustrates the advantage of using a more limited scope for fuzzy matching (and I will reiterate: the speed improvements for fuzzy matching could be significant).
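
The effect of scope on fuzzy matching can be reproduced with a small sketch. It uses the similarity ratio from Python's `difflib` as a stand-in, which is an assumption and not the matcher taxomachine uses; the two short name lists and the helper `best_fuzzy_match` are hypothetical.

```python
from difflib import SequenceMatcher

# Small hypothetical name lists standing in for two contexts.
ANIMALS = ["Felis catus", "Canis lupus"]
PLANTS = ["Malus domestica", "Quercus alba"]


def best_fuzzy_match(query, candidates):
    """Return the candidate with the highest difflib similarity ratio to `query`."""
    return max(candidates, key=lambda name: SequenceMatcher(None, query, name).ratio())


query = "Felis domestica"
unrestricted = best_fuzzy_match(query, ANIMALS + PLANTS)
within_animals = best_fuzzy_match(query, ANIMALS)
```

Searching all names picks the apple ("Malus domestica"), while restricting the candidates to the animal list picks "Felis catus". The restricted search also compares against fewer names, which is where the speed gain would come from.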

As for using an arbitrary taxon as a context: this is theoretically possible currently, but it would require quadratic space and runtime to store and build the indexes, since each one includes entries for all the descendants of the specified taxon. That seems prohibitive. Mark's ideas seem promising.

In the meantime, adding a handful more contexts at shallow levels in the taxonomy could be helpful and would require almost no effort and only a moderate amount of disk space, but I'm not sure how many to add or which taxa to use. Maybe @uyedaj could provide some thoughts.

jar398 commented 8 years ago

It's not quadratic, it's n log n. But I agree it's probably too big given the current prices for AWS instances.

Awaiting examples and/or criteria. They're not hard to add.

uyedaj commented 8 years ago

I don't have specific examples... I guess recently I was working with a cephalopod and an elasmobranch phylogeny. I was hoping that you could turn any higher taxon into a context, so that the user could just query whatever name they wanted (e.g. sharks, selachii, selachimorpha), get the ottid, and then use it as a context for querying TNRS.

Failing that, the standard textbook list of named animal clades would be useful. Some of these are already available, but others are not. e.g.:

Porifera, Ctenophora, Rotifera, Onychophora, Echinodermata, Brachiopoda, Bilateria, Lophotrochozoa, Ecdysozoa, Protostomes, Deuterostomes

Within larger groups, it would be useful to have things like: Gastropoda, Bivalvia, Cephalopoda, Crustacea, Chondrichthyes, Actinopterygii, Sarcopterygii, Coleoptera, Hymenoptera, etc.

This is by no means exhaustive. As Cody said, my main issue is not disambiguation but speed. Even querying OpenTree for ottids when the names are exact matches is slower than I would like it to be for large trees.

chinchliff commented 8 years ago

> It's not quadratic, it's n log n.

It depends on the shape of the tree, right? If the tree were fully imbalanced it would be n^2. If the tree were balanced it would be n log n. But that does suggest that adding a lot of contexts might not actually be that bad. Especially if they were limited to a minimum level of inclusivity... E.g. it might not really make sense to add contexts for small taxa—they won't be much faster than slightly larger contexts and since most taxa are relatively small this could save a lot of space. But I'm not sure how imbalanced the taxonomy actually is.
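
To make the dependence on tree shape concrete, here is a rough sketch that counts the total number of index entries if every taxon got its own index. It assumes each index holds one entry per descendant of its taxon, uses a simple chain as the fully imbalanced extreme, and uses a complete binary tree as the balanced case; the function names are hypothetical and real taxonomies will fall somewhere in between.

```python
def total_index_entries(children, root):
    """Sum, over every taxon, of its number of descendants.

    A node at depth d has d ancestors, so it appears in exactly d indexes;
    the total is therefore the sum of all node depths.
    """
    total = 0
    stack = [(root, 0)]
    while stack:
        node, depth = stack.pop()
        total += depth
        for child in children.get(node, []):
            stack.append((child, depth + 1))
    return total


def chain(n):
    """Fully imbalanced extreme: node i has the single child i + 1."""
    return {i: [i + 1] for i in range(n - 1)}


def balanced_binary(n):
    """Complete binary tree in heap numbering: children of i are 2i+1 and 2i+2."""
    return {i: [c for c in (2 * i + 1, 2 * i + 2) if c < n] for i in range(n)}


n = 1023
imbalanced_total = total_index_entries(chain(n), 0)          # n(n-1)/2, quadratic
balanced_total = total_index_entries(balanced_binary(n), 0)  # grows like n log n
```

For 1023 nodes the chain needs 522753 entries and the balanced tree 8194, a gap of roughly 64x. Running the same count on the actual taxonomy would show how imbalanced it really is and how much a minimum-size cutoff for contexts would save.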

Mark's suggestion certainly seems more space efficient (and time efficient for the initial indexing), but if it doesn't actually cost too much to just create lots of redundant indexes, that seems simpler and could potentially result in faster queries.