OpenTreeOfLife / germinator

miscellaneous scripts and data for concerns that span more than one of the Open Tree code repositories: integration tests, system statistics, etc.
BSD 2-Clause "Simplified" License
21 stars 7 forks source link

arguson no longer contains information used to label nodes that are not taxa #67

Closed kcranston closed 8 years ago

kcranston commented 8 years ago

The current version of arguson contains a field called descendant_name_list that contains a list of named descendant nodes that the client uses to build display names for nodes that are not taxa. For example, this node has the display label "[Asteraceae + Goodeniaceae + ...]".

In the v3 spec, arguson does not return any such information for non-taxon nodes. The only information in a non-taxon node blob is the id, which will have the form mrcaott####ott####. Can we build a display name from that node_id? Get the taxon names from the two OTT ids in the node_id string and display "MRCA of Name 1 and Name 2"? (I actually like this better than the current "name + name + ..." string). Other suggestions?

See gitter conversation that started this.

josephwb commented 8 years ago

I don't like this idea. The mrcaottXXXottYYY are terminals. Do we want something like "MRCA of Ambystoma tigrinum and Chauna torquata" for a node in the middle of Metazoa?

I think we want the current behaviour: the closest taxa that are in the tree.

mtholder commented 8 years ago

I agree with @josephwb that generating a name from the node_id will result in cryptic names. I believe that the rule used is something like "grab the lowest OTT ID assigned to a terminal from the first child and another lowest terminal OTT ID from the second child to create a node ID." So there is no reason to think that they would be particularly recognizable taxa.

It seems like adding descendant_name_list to the return value of v3 arguson would be reasonable. Or we could just have the spec note that the listing of fields in an arguson node is not necessarily a complete listing - i.e. we could include this as bit of data that is not publicly documented.

jar398 commented 8 years ago

I'm with @josephwb on this one. I understand why using more rootward names could be considered ugly or unprincipled, but the species names aren't going to help anyone and we so far have not come up with any labeling scheme of greater utility to the user.

The omission of that field from the spec was an oversight. The examples and description I was working from didn't contain the field.

A general disclaimer about the result possibly containing fields not defined by the spec is a good one to make, since in all API methods we want to have the option of compatibly adding fields without advancing the API version. We just have to be aware that people always reverse engineer these things, so someone might try to hold us to implemented features even if they're not documented. Therefore, once added, a result probably shouldn't be changed or removed without advancing to a new API version. And if this is the case, we might as well document it.

josephwb commented 8 years ago

Is there a definite plan here? Seems like we need is:

  1. If a terminal arguson node is a taxon, use the name field from the taxon_blob within the node_blob to get node label
  2. If a terminal arguson node is NOT a taxon, an additional label (or list of names from which to build a label) must be supplied.

For 2, this would have to go within the node_blob. So, some questions:

  1. What should the property be called (previously descendant_name_list)?
  2. What datatype should be provided (previously a list)?

What about non-terminal arguson nodes (e.g. within lineage)? Do these need labels other than mrcaottXXXottYYY @jimallman?

jar398 commented 8 years ago

My opinions 1.​ I don't know. descendant_name_list is not great, but it will do in a pinch. (It suggests value is names of all descendants, which it's not.) Maybe inverse_mrca or inverse_mrca_names ?

  1. the value should be a list of strings, as in v2, with a maximum length of 2. (different clients, or different versions of our client, might want to do different things with the names in order to determine presentation.)
josephwb commented 8 years ago

Why a a maximum length of 2? The old version did not have a limit; Jim just picked 2 (first and last?) to display.

jar398 commented 8 years ago

Because clients only need two, and in some cases there are many thousands of such names. Seems a waste to transmit vast amount of information that's not going to be used.

jar398 commented 8 years ago

Well, correction, I don't that I can prove there are such cases. There are taxonomy nodes with thousands of children, but I don't know there are non-taxon synthesis nodes with them. But I still don't see any reason to return more than are needed.​

josephwb commented 8 years ago

So, first and last, just first 2, something else?

jimallman commented 8 years ago

FWIW, I'd prefer to return a label field on node-blob (vs. name to avoid confusion with taxon names), with a pre-constructed label (format TBD).

Because clients only need two, and in some cases there are many thousands of such names.

Note that the client does currently show a different node label with a final ellipsis if there are more than two descedant tips, e.g., [Apterygiformes + Rheiformes + ...].

What about non-terminal arguson nodes (e.g. within lineage)? Do these need labels other than mrcaottXXXottYYY @jimallman?

Yes, these nodes need some kind of label information, since they can be examined in the synth-tree viewer.

jar398 commented 8 years ago

Ah, what do I know.

Maybe return two or three names, or maybe return a label. I just thought the name list gave more flexibility in case we wanted to change the notation on the client side. We really don't want people parsing these things. But I don't feel strongly.

jar398 commented 8 years ago

I believe the order of the children list is random, so I don't think it matters how it is sampled. If you want to go way beyond the call of duty, I'd suggest names of large taxa, since those are the ones mostly likely to be recognized by the user.

jimallman commented 8 years ago

To be clear, what I'm asking for is a label field (or equivalent) on any node-blob that doesn't include a taxon. This way we'll always have a sensible way to represent otherwise "unnamed" nodes.

jimallman commented 8 years ago

@jar398 points out that the label format for unnamed nodes should left to the arguson client to decide, so the earlier descendant_name_list (with up to three names) makes sense to me. We'd still choose the names to include, so it might not suit every labeling preference, but it seems like a good place to start.

@kcranston would prefer that we add this only in arguson views, versus in any node-blob without a mapped taxon, since it's (currently) only useful in argus.

josephwb commented 8 years ago

Do we want names returned here, or ottIDs? The latter would require @jimallman to make a taxonomy query.

jimallman commented 8 years ago

Good point! Please return names if possible.

jar398 commented 8 years ago

The evidence is that this is now fixed. Re-open if you think not. https://devtree.opentreeoflife.org/opentree/opentree4.1@mrcaott786ott112387/Dermoptera--Primates

jar398 commented 8 years ago

oops, maybe too soon, double checking tests and spec...

jar398 commented 8 years ago

test fixed. I put a comment in the spec saying what needs to be done to it.