Synthesis tree names do not match those returned by TNRS

josephwb commented 10 years ago

Raised by @lukejharmon. For example, for tnrs/match_names a search might be:

curl -X POST http://devapi.opentreeoflife.org/v2/tnrs/match_names -H "content-type:application/json" -d '{"names":["Stellula calliope"], "context_name":"birds"}'

with result:

{
  "governing_code" : "ICZN",
  "unambiguous_name_ids" : [ "Stellula calliope" ],
  "unmatched_name_ids" : [ ],
  "matched_name_ids" : [ "Stellula calliope" ],
  "context" : "Birds",
  "includes_deprecated_taxa" : false,
  "includes_dubious_names" : false,
  "includes_approximate_matches" : true,
  "taxonomy" : {
    "author" : "open tree of life project",
    "weburl" : "https://github.com/OpenTreeOfLife/opentree/wiki/Open-Tree-Taxonomy",
    "source" : "ott2.8"
  },
  "results" : [ {
    "id" : "Stellula calliope",
    "matches" : [ {
      "is_deprecated" : false,
      "is_synonym" : false,
      "flags" : [ ],
      "search_string" : "stellula calliope",
      "score" : 1.0,
      "synonyms" : [ "Stellula calliope" ],
      "is_approximate_match" : false,
      "ot:ottId" : 536234,
      "matched_node_id" : 3502581,
      "rank" : "",
      "matched_name" : "Stellula calliope",
      "unique_name" : "Stellula calliope",
      "is_dubious" : false,
      "nomenclature_code" : "ICZN",
      "ot:ottTaxonName" : "Stellula calliope"
    } ]
  } ]
}

whereas a query to tree_of_life/induced_subtree:

curl -X POST http://devapi.opentreeoflife.org/treemachine/ext/tree_of_life/graphdb/induced_subtree -H "content-type:application/json" -d '{"ott_ids":[536234, 267845, 666104]}'

gives the result:

{
  "subtree" : "(Stellula_calliope_ott536234,(Dendroica_ott666104,Cinclus_ott267845));",
  "ott_ids_not_in_tree" : [ ],
  "ott_ids_not_in_graph" : [ ],
  "node_ids_not_in_graph" : [ ],
  "node_ids_not_in_tree" : [ ]
}

That is, TNRS gives "Stellula calliope" (spaces included) while ToL gives Stellula_calliope_ott536234 (no spaces, and a trailing suffix).

The use case is:

1) have a character/trait matrix for some taxa
2) search for matched names of taxa, get ottIds
3) use ottIds to get a subtree from the graph DB on which to do analyses
4) problem: names on the returned tree do not match names returned from the TNRS query (possibly applied to the character/trait matrix)

There are a number of ways to deal with this, but because it might be a common (ubiquitous?) problem, I thought I'd log it here for rumination.

One way to deal with this is to work with ottIds alone during analyses, as names can always be retrieved afterwards.
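A minimal sketch of the ottId-only approach, assuming the match_names response has the shape shown above and that taking the first match per name is acceptable (a real client would likely check is_approximate_match and the flags first). The helper name ott_ids_by_query_name and the trait matrix are hypothetical, made up for illustration.

```python
import json

# Trimmed copy of the match_names response shown above.
tnrs_response = json.loads("""
{
  "results": [ {
    "id": "Stellula calliope",
    "matches": [ {
      "ot:ottId": 536234,
      "matched_name": "Stellula calliope",
      "is_approximate_match": false
    } ]
  } ]
}
""")


def ott_ids_by_query_name(response):
    """Map each queried name to the OTT id of its first match (an assumption)."""
    mapping = {}
    for result in response["results"]:
        if result["matches"]:
            mapping[result["id"]] = result["matches"][0]["ot:ottId"]
    return mapping


# Hypothetical trait matrix keyed by the names sent to TNRS; values are invented.
traits_by_name = {"Stellula calliope": {"trait_a": 1.0}}

name_to_ott = ott_ids_by_query_name(tnrs_response)

# Re-key the matrix by OTT id so later steps never compare name strings.
traits_by_ott = {
    name_to_ott[name]: row
    for name, row in traits_by_name.items()
    if name in name_to_ott
}
```

With the matrix keyed this way, the tip labels of the induced subtree only need their OTT ids extracted to be joined back to the traits.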

Alternatively, it would be easy to return trees from the synthesis graph DB without ottId suffixes (that is, with names that match those from the TNRS exactly). I feel this option is not optimal because 1) ottIds are useful for disambiguation of homonyms, and 2) being of a standard format ("_ott\d+"), they are easily stripped out if desired.
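For reference, stripping the suffix client-side might look like the sketch below. It assumes unquoted labels, as in the induced_subtree output above; the function name strip_ott_suffixes is hypothetical. Note that homonyms would collide once the ids are removed.

```python
import re

# Matches an "_ott<digits>" suffix only where a label ends, i.e. just before
# a Newick delimiter or the end of the string.
OTT_SUFFIX = re.compile(r"_ott\d+(?=[,():;]|$)")


def strip_ott_suffixes(newick):
    """Remove the _ott<digits> suffix from every label in a Newick string."""
    return OTT_SUFFIX.sub("", newick)


subtree = "(Stellula_calliope_ott536234,(Dendroica_ott666104,Cinclus_ott267845));"
print(strip_ott_suffixes(subtree))
# (Stellula_calliope,(Dendroica,Cinclus));
```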

There are probably a bunch more ways to deal with this effectively. I'm sure these approaches will emerge from the various projects being undertaken at the hackathon.

karolisr commented 10 years ago

Genus_specific_ottXXXXXX makes every name unique and lets a client decide how to handle name collisions. It may even be cleaner to just return ott ids.

mtholder commented 10 years ago

I suppose the most flexible way to handle this would be to have an argument that lets the client request a style of tip labeling. That seems like overkill, given that the _ott[0-9]+ suffix is pretty easy to strip. I suppose that we could add a node_labels option with a choice of name+ott_id | name | ott_id with the current (name+ott_id) as the default.

One comment on the underscore: the newick (and nexus) convention is that _ maps to a space if the label is not in single quotes. So I think we can expect the client to recognize that the string Stellula calliope could occur as 'Stellula calliope' or Stellula_calliope when it is in newick. That is a bit of a pain, but both forms are certainly common "in the wild", so a robust newick parser should deal with both.
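A small sketch of that convention for a single, already-tokenized label (the function name decode_newick_label is hypothetical). It also assumes the usual Newick/NEXUS rule that a doubled single quote inside a quoted label stands for one literal quote.

```python
def decode_newick_label(label):
    """Return the name that a single Newick label stands for."""
    if len(label) >= 2 and label[0] == "'" and label[-1] == "'":
        # Quoted label: text is taken verbatim, '' is an escaped single quote.
        return label[1:-1].replace("''", "'")
    # Unquoted label: underscores stand for spaces.
    return label.replace("_", " ")


# Both spellings decode to the name TNRS returns.
assert decode_newick_label("Stellula_calliope") == "Stellula calliope"
assert decode_newick_label("'Stellula calliope'") == "Stellula calliope"
```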

jar398 commented 10 years ago

The strings Genus_epithet_ott1234 are a kludge to stuff structured data into a Newick label. These should be parsed as early as possible in subsequent processing. Note that the entire labels, as opposed to OTT ids, are not stable across taxonomy revisions, as the primary name associated with an OTT id can change. So it's best to work with OTT ids, not names or labels.

The styling idea is one way to go. A set of flags (include ott id, include name, include rank, etc.) would be another way to go. Returning NexML or Nexson would be yet another way to deal with this. It's good to be aware of this, but it will probably take time to work out, as the current solution is good enough for now.

chinchliff commented 10 years ago

I think a robust medium-term solution that would not impact current workflows would be to provide an argument to designate the type of tip label. I don't think we need to be complex with this. I see three common use cases:

1) Ott Id only
2) Name + "_" + ott id
3) Name only

I agree with Mark that spaces and underscores should be left to the client. In the future, when we support formats other than newick, this will be less of an issue, but it's straightforward for others to deal with in the meantime.

If we use option 2 as the default, then things stay consistent with the current behavior. It's very easy to return tip labels in the other formats and I think people will want all three, so I see value in providing them, and certainly no drawbacks.
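To make the three styles concrete, here is a hypothetical sketch of the labeling logic. The function name format_tip_label, the style strings (borrowed from Mark's name+ott_id | name | ott_id suggestion), and the "ott" prefix on the id-only form are all assumptions, not an existing API.

```python
def format_tip_label(name, ott_id, style="name+ott_id"):
    """Build a tip label in one of three styles; the default mirrors current output."""
    # Unquoted Newick labels cannot contain spaces, so use underscores.
    safe_name = name.replace(" ", "_")
    if style == "name+ott_id":
        return f"{safe_name}_ott{ott_id}"
    if style == "name":
        return safe_name
    if style == "ott_id":
        return f"ott{ott_id}"
    raise ValueError(f"unknown label style: {style!r}")


print(format_tip_label("Stellula calliope", 536234))
# Stellula_calliope_ott536234
```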

mtholder commented 10 years ago

Mulling this issue over led to a "newick-edge-node" format suggestion that I tacked onto a Google doc here

josephwb commented 8 years ago

I believe this is fixed.

jimallman commented 8 years ago

I believe this is fixed.

I see that the original symptom remains (different names from the two original cURL calls). Are you saying (as others suggested above) that this difference is not really a problem, or that we're providing enough guidance that a careful API consumer can make sense of things?

jar398 commented 8 years ago

The names must be different, for species at least, because spaces aren't allowed in Newick labels.

And the OTT ids must be included because there may be homonyms and labels must be unique.

So I think the tree_of_life behavior is both correct and desirable. The application has to deal with it. It can parse out the OTT id and use it to get the taxon name via another service, or it can decode the Newick label quoting and underscores (hard, and runs the risk that there might have been an underscore in the original).
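A rough sketch of the first route, parsing the OTT ids out of the labels. It assumes unquoted labels ending in _ott<digits>, as in the induced_subtree output earlier in this thread; the function names are hypothetical, and the name lookup via another service is left out.

```python
import re

# A label ending in "_ott<digits>", captured as (name part, id digits).
# The lookahead requires a Newick delimiter or end of string after the id.
LABEL_WITH_ID = re.compile(r"([^,():;\s]*)_ott(\d+)(?=[,():;]|$)")


def ott_ids_in_newick(newick):
    """List the OTT ids found in the labels, in order of appearance."""
    return [int(m.group(2)) for m in LABEL_WITH_ID.finditer(newick)]


def relabel_with_ids(newick):
    """Replace each name_ott<digits> label with just ott<digits>."""
    return LABEL_WITH_ID.sub(lambda m: "ott" + m.group(2), newick)


subtree = "(Stellula_calliope_ott536234,(Dendroica_ott666104,Cinclus_ott267845));"
print(ott_ids_in_newick(subtree))
# [536234, 666104, 267845]
print(relabel_with_ids(subtree))
# (ott536234,(ott666104,ott267845));
```

Because only the digits after _ott are read, an underscore inside the original name does not affect the result, which avoids the decoding risk mentioned above.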

Alternatively, it can get arguson format. Newick is really not up to the job.

jimallman commented 8 years ago

Thanks, @jar398! Just wanted to be sure we aren't leaving any loose ends.

OpenTreeOfLife / opentree

Synthesis tree names do not match those returned by TNRS #443