OpenTreeOfLife / taxomachine

taxonomy graphdb

does the context for contextQueryForNames default to eudicots? #51

Closed mtholder closed 10 years ago

mtholder commented 10 years ago

seems so based on the response from

$ curl -X POST -H "Content-Type":"application/json" -H "Accept":"application/json" http://api.opentreeoflife.org/taxomachine/ext/TNRS/graphdb/contextQueryForNames --data '{"names": ["Nandina"]}'

josephwb commented 10 years ago

Hmm. I think inferring a context from one name alone is tricky. I see two problems with this one. First, "Plants" should be selected over "Eudicots". Second, how to handle the animal hit reported here? Here is another example:

curl -X POST -H "Content-Type":"application/json" -H "Accept":"application/json" http://api.opentreeoflife.org/taxomachine/ext/TNRS/graphdb/contextQueryForNames --data '{"names": ["Drosophila"]}'

There is, of course, the friendly fruit fly (which is returned above), but also valid fungal taxa. Why is one selected over the other? If there is a tie (even a tie between a valid name and a synonym), I don't think a choice can be made. Further, rather than returning "Life" here, or even the MRCA of the taxon hits, some sort of message should be returned like "unable to infer context".

Do we expect people to use this call with only one taxon? Problems should dissipate exponentially with more and more taxa included in the query.

mtholder commented 10 years ago

Sorry, I was not being clear. I see "context" : "Eudicots" in the response, which made me think that Eudicots was the context that was searched, even though the call specified no contextName.

josephwb commented 10 years ago

I don't believe that contextName is a possible argument for contextQueryForNames. It is for autocompleteBoxQuery for sure. But the entire point of contextQueryForNames is to infer the context, so specifying a contextName doesn't make sense (to me).

kcranston commented 10 years ago

contextName is indeed an argument:

public Representation contextQueryForNames(
        @Source GraphDatabaseService graphDb,

        @Description("A comma-delimited string of taxon names to be queried against the taxonomy db. This is an alternative to the use of the 'names' parameter")
        @Parameter(name = "queryString", optional = true) String queryString,
        @Description("The name of the taxonomic context to be searched")
        @Parameter(name = "contextName", optional = true) String contextName,

mtholder commented 10 years ago

I was thinking of "context" as the echoing of the "contextName" arg. The behavior makes sense if TNRS is inferring the context, though then there is still the issue of why it only returns the plant hit (or only the animal hit for Drosophila).

chinchliff commented 10 years ago

To clarify: yes you can specify a context for the context query, in fact treemachine does this when importing trees, and the curator app allows the user to do so as well. It is significantly faster and produces more accurate results. If you do not specify a context, then a context is inferred based on so-called "exact hits", which are exact string matches to non-homonym taxa.

Synonyms are not considered valid taxon names and so they are not used when inferring the context. This could be changed, but it would basically just result in more queries defaulting to the "life" context which would produce a lot more load on the server (damn arthropods!). The current behavior is basically strict, in which case we identify what we think is the most likely name based on the available information. In the case of a single genus name, this is not much information. As Joseph mentioned, providing more names will improve the accuracy of the context inference. This is also another case where if you know you are searching for an animal/plant/fungus/whatever, then providing the proper context should solve the problem.
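
A minimal sketch of supplying a context with the request, assuming contextName can be sent in the same JSON body as names. The curl calls in this thread only show names in the body, so that placement is an assumption. The helper build_context_query is hypothetical, "Plants" is only an example context value, and the request is built but not sent.

```python
import json
import urllib.request

# Endpoint taken from the curl calls in this thread.
URL = ("http://api.opentreeoflife.org/taxomachine/ext/TNRS/graphdb/"
       "contextQueryForNames")


def build_context_query(names, context_name=None):
    """Build (but do not send) a POST request for contextQueryForNames.

    Assumption: contextName travels in the JSON body next to names.
    """
    payload = {"names": list(names)}
    if context_name is not None:
        payload["contextName"] = context_name
    return urllib.request.Request(
        URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json",
                 "Accept": "application/json"},
        method="POST",
    )


req = build_context_query(["Nandina"], context_name="Plants")
print(req.get_method(), json.loads(req.data))
```

Sending is left to the caller, for example with urllib.request.urlopen(req).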

It's of course possible that there is just a bug. I will look into it today.

On Wednesday, June 18, 2014, Mark T. Holder notifications@github.com wrote:

I was thinking of "context" as the echoing of the "contextName" arg. The behavior makes sense if tnrs is inferring context. though then there is still the issue of why it only returns the plant hit (or only the animal for drosophila)

— Reply to this email directly or view it on GitHub https://github.com/OpenTreeOfLife/taxomachine/issues/51#issuecomment-46429541 .

mtholder commented 10 years ago

perhaps we just need a boolean argument "search_synonyms_when_inferring_context" (with a default False).
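
A toy sketch of how such a flag could interact with the inference rule described above (exact hits only, homonyms skipped, synonyms skipped by default). This is not taxomachine's implementation: the Hit class, the infer_context function, and the context lineages are all hypothetical, and the lineages are made up for illustration rather than taken from OTT.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    name: str                 # matched taxon name
    contexts: tuple           # hypothetical context lineage, broadest first
    is_synonym: bool = False
    is_homonym: bool = False


def infer_context(hits, search_synonyms_when_inferring_context=False):
    """Return the narrowest context shared by all usable exact hits."""
    usable = [
        h for h in hits
        if not h.is_homonym
        and (search_synonyms_when_inferring_context or not h.is_synonym)
    ]
    if not usable:
        return "All life"     # nothing to infer from
    common = list(usable[0].contexts)
    for hit in usable[1:]:
        shared = []
        for a, b in zip(common, hit.contexts):
            if a != b:
                break
            shared.append(a)
        common = shared
    return common[-1] if common else "All life"


# Made-up lineages for the two Nandina matches discussed in this thread.
plant = Hit("Nandina", ("All life", "Plants", "Eudicots"))
fish_synonym = Hit("Labeo", ("All life", "Animals"), is_synonym=True)

print(infer_context([plant, fish_synonym]))
print(infer_context([plant, fish_synonym],
                    search_synonyms_when_inferring_context=True))
```

With the flag off, only the plant hit is used and the narrow context wins; with it on, the synonym pulls the result back to "All life", which is the extra load on the broad context that was mentioned earlier.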

chinchliff commented 10 years ago

What is the use case for these single name searches? Just wondering if it wouldn't make more sense to provide a different service altogether. The context query is pretty heavy lifting and is not really intended to be used for one-name searches.

mtholder commented 10 years ago

OTU mapping. Obviously we often have a context, and we often have multiple tips. But they are not necessarily closely related. If we have a few names and no context, presumably we'd want to alert the curator to the fact that the label string matched multiple ott ids if there are homonyms+synonyms (like the Nandina example)
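
A minimal sketch of the curator alert described above, assuming the response shape shown in the Nandina response quoted later in this thread (results, id, matches, matched_ott_id). The function name labels_with_multiple_ott_ids is hypothetical, and the sample response is trimmed to the fields the function reads.

```python
def labels_with_multiple_ott_ids(response):
    """Map each queried label to its OTT ids when it matched more than one."""
    flagged = {}
    for result in response.get("results", []):
        ott_ids = sorted({m["matched_ott_id"]
                          for m in result.get("matches", [])})
        if len(ott_ids) > 1:
            flagged[result["id"]] = ott_ids
    return flagged


# Trimmed from the Nandina response quoted in this thread.
response = {
    "results": [{
        "id": "Nandina",
        "matches": [
            {"matched_ott_id": 681927, "matched_name": "Nandina",
             "is_synonym": False},
            {"matched_ott_id": 160620, "matched_name": "Labeo",
             "is_synonym": True},
        ],
    }]
}

print(labels_with_multiple_ott_ids(response))
# → {'Nandina': [160620, 681927]}
```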

chinchliff commented 10 years ago

It sounds like a query that does not attempt to infer the context (but could use one if provided), and provides an exhaustive set of matches to a single name would fit this case better than the context query. I can imagine other cases where people want to get all relevant matches for a single name. If we only enable fuzzy matching on these queries for contexts other than life, I think this would be faster and more efficient than the context query.

I'm also thinking it would make sense to require at least two names for the context query.

How does this sound?

mtholder commented 10 years ago

second part (requiring more than 1 name to infer a context) sounds OK, I guess. I'm happy to use a different query for this if you point one out to me.

but we do want to support fuzzy matching on diverse sets of taxa, right? If you have a tree that has a bacterium and a eukaryote, then "All life" is the only context that you can use, right?

chinchliff commented 10 years ago

Right, the context query will still do fuzzy matching for any name that it can't find an exact match for, no matter what the context. I was just suggesting limiting the fuzzy matching to more specific contexts in the proposed single-name query. I suppose we could just not impose that limit for now and if we need to we can later.

chinchliff commented 10 years ago

Regarding the original issue, there were multiple problems:

1. The synonyms were missing from the db.
2. Some results were being overwritten when there were synonym and non-synonym matches to a name.

I think both problems have been fixed. For now, let's continue using contextQueryForNames, in the hope that (now that it should be working properly) it will be sufficient. If it still isn't working out we can explore the new query option.

curl -X POST -H "Content-Type":"application/json" -H "Accept":"application/json" http://devapi.opentreeoflife.org/taxomachine/ext/TNRS/graphdb/contextQueryForNames --data '{"names": ["Nandina"]}'
{
  "governing_code" : "undefined",
  "unambiguous_name_ids" : [ "Nandina" ],
  "unmatched_name_ids" : [ ],
  "matched_name_ids" : [ "Nandina" ],
  "context" : "All life",
  "includes_deprecated_ids" : false,
  "includes_dubious_names" : false,
  "taxonomy" : {
    "author" : "open tree of life project",
    "weburl" : "https://github.com/OpenTreeOfLife/opentree/wiki/Open-Tree-Taxonomy",
    "source" : "ott2.8"
  },
  "results" : [ {
    "id" : "Nandina",
    "matches" : [ {
      "is_deprecated" : false,
      "dubious_name" : false,
      "is_synonym" : false,
      "flags" : [ ],
      "is_perfect_match" : true,
      "search_string" : "nandina",
      "score" : 1.0,
      "is_approximate_match" : false,
      "is_homonym" : false,
      "matched_ott_id" : 681927,
      "matched_node_id" : 4205837,
      "rank" : "",
      "matched_name" : "Nandina",
      "unique_name" : "Nandina (genus in subfamily Nandinoideae)",
      "nomenclature_code" : "ICN",
      "synonym_or_homonym_status" : "known"
    }, {
      "is_deprecated" : false,
      "dubious_name" : false,
      "is_synonym" : true,
      "flags" : [ "EDITED", "SIBLING_LOWER" ],
      "is_perfect_match" : false,
      "search_string" : "nandina",
      "score" : 1.0,
      "is_approximate_match" : false,
      "is_homonym" : false,
      "matched_ott_id" : 160620,
      "matched_node_id" : 3427360,
      "rank" : "",
      "matched_name" : "Labeo",
      "unique_name" : "Labeo (genus in superfamily Cyprinoidea)",
      "nomenclature_code" : "ICZN",
      "synonym_or_homonym_status" : "known"
    } ]
  } ]
}

And Joseph's Drosophila test case:

curl -X POST -H "Content-Type":"application/json" -H "Accept":"application/json" http://devapi.opentreeoflife.org/taxomachine/ext/TNRS/graphdb/contextQueryForNames --data '{"names": ["Drosophila"]}'
{
  "governing_code" : "undefined",
  "unambiguous_name_ids" : [ ],
  "unmatched_name_ids" : [ ],
  "matched_name_ids" : [ "Drosophila" ],
  "context" : "All life",
  "includes_deprecated_ids" : false,
  "includes_dubious_names" : false,
  "taxonomy" : {
    "author" : "open tree of life project",
    "weburl" : "https://github.com/OpenTreeOfLife/opentree/wiki/Open-Tree-Taxonomy",
    "source" : "ott2.8"
  },
  "results" : [ {
    "id" : "Drosophila",
    "matches" : [ {
      "is_deprecated" : false,
      "dubious_name" : false,
      "is_synonym" : false,
      "flags" : [ "EDITED", "SIBLING_LOWER", "SIBLING_HIGHER", "TATTERED" ],
      "is_perfect_match" : false,
      "search_string" : "drosophila",
      "score" : 1.0,
      "is_approximate_match" : false,
      "is_homonym" : true,
      "matched_ott_id" : 5554385,
      "matched_node_id" : 1842811,
      "rank" : "",
      "matched_name" : "Drosophila",
      "unique_name" : "Drosophila (genus in family Drosophilidae)",
      "nomenclature_code" : "ICZN",
      "synonym_or_homonym_status" : "known"
    }, {
      "is_deprecated" : false,
      "dubious_name" : false,
      "is_synonym" : false,
      "flags" : [ ],
      "is_perfect_match" : false,
      "search_string" : "drosophila",
      "score" : 1.0,
      "is_approximate_match" : false,
      "is_homonym" : true,
      "matched_ott_id" : 34907,
      "matched_node_id" : 1838729,
      "rank" : "",
      "matched_name" : "Drosophila",
      "unique_name" : "Drosophila (genus in Drosophiliti)",
      "nomenclature_code" : "ICZN",
      "synonym_or_homonym_status" : "known"
    }, {
      "is_deprecated" : false,
      "dubious_name" : false,
      "is_synonym" : true,
      "flags" : [ "SIBLING_LOWER" ],
      "is_perfect_match" : false,
      "search_string" : "drosophila",
      "score" : 1.0,
      "is_approximate_match" : false,
      "is_homonym" : false,
      "matched_ott_id" : 5344841,
      "matched_node_id" : 922873,
      "rank" : "",
      "matched_name" : "Psathyrella",
      "unique_name" : "Psathyrella",
      "nomenclature_code" : "ICN",
      "synonym_or_homonym_status" : "known"
    } ]
  } ]
}
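
For reading responses like the two above, here is a small sketch that lists each match with the kind of hit it is. It relies only on fields visible in the responses (unambiguous_name_ids, results, matches, is_synonym, is_homonym, unique_name, nomenclature_code); the classify and summarize helpers and the "valid name" label are hypothetical, and the sample is trimmed from the Drosophila response.

```python
def classify(match):
    """Label a match using the boolean flags in the response."""
    if match.get("is_synonym"):
        return "synonym"
    if match.get("is_homonym"):
        return "homonym"
    return "valid name"


def summarize(response):
    """Return (queried id, unique name, kind, code, unambiguous?) rows."""
    unambiguous = set(response.get("unambiguous_name_ids", []))
    rows = []
    for result in response.get("results", []):
        for m in result.get("matches", []):
            rows.append((result["id"], m["unique_name"], classify(m),
                         m["nomenclature_code"],
                         result["id"] in unambiguous))
    return rows


# Trimmed from the Drosophila response above.
drosophila = {
    "unambiguous_name_ids": [],
    "results": [{
        "id": "Drosophila",
        "matches": [
            {"is_synonym": False, "is_homonym": True,
             "unique_name": "Drosophila (genus in family Drosophilidae)",
             "nomenclature_code": "ICZN"},
            {"is_synonym": False, "is_homonym": True,
             "unique_name": "Drosophila (genus in Drosophiliti)",
             "nomenclature_code": "ICZN"},
            {"is_synonym": True, "is_homonym": False,
             "unique_name": "Psathyrella",
             "nomenclature_code": "ICN"},
        ],
    }],
}

for row in summarize(drosophila):
    print(row)
```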