OpenTreeOfLife / treemachine

Source tree graph database
Other
16 stars 6 forks source link

Never expose volatile node ids in the API #183

Closed jar398 closed 8 years ago

jar398 commented 9 years ago

@hlapp remarked on this problem when he reviewed the API a few months ago. I know we've talked about this before but I don't think it ever became a treemachine issue.

Neo4j node ids are not stable, and if one is squirreled away (e.g. in a bookmark or email message) and then used after an intervening tree update, it will go to a completely random place without any indication that there may be a problem. Because they are transient, neo4j node ids must be kept completely secret from the user, i.e. should never be exposed in the API.

I suggest the API should use, for references to exposed nodes, use an OTT id when a node has one, and a pair of OTT ids when it doesn't. The node designated by an id pair is the MRCA of the nodes having those two OTT ids (in whatever tree one happens to be looking at). The choice of the two OTT ids for the MRCA is arbitrary among all such pairs that have the same result. It is the case that every node will be covered by one or the other of these two cases, since all tips have OTT ids.

The user could still be surprised when the tree is updated, in the sense of landing on a node that subtends an unexpected tip or fails to subtend an expected tip, e.g. when there is a polytomy refinement, but the node in the later tree will still be very similar to the one in the earlier tree, and will certainly be preferable as a target to a completely randomly chosen node. There is no way to eliminate this 'jitter' completely, since internal nodes do not have any concisely expressed identity that is both useful and persistent, and there is no way in general to capture or predict a user's expectations. If the MRCA notation is documented then we can set expectations.

The syntax could be something like 1234+23456 or any other connecting character, and could be both interpreted and generated by both sides of the API. A node would have many 'ids' of this form.

See https://github.com/OpenTreeOfLife/feedback/issues/63

josephwb commented 9 years ago

I'll add this to the branch with service updates.

kcranston commented 9 years ago

We need to make sure that we communicate this change well to users of the APIs. Will we keep backwards compatibility of the existing behaviour? As concrete examples, the folks writing the rotl library are just about to publish, and the peyotl tutorial that @snacktavish wrote is being hosted by CIPRES. I don't want to break those implementations.

jimallman commented 9 years ago

@jar398's solution certainly makes sense to me, but I'll point out that the existing system was intended to fail gracefully. Since we have both tree_id and node_id, we should be able to load a previous tree (if possible) or to say "Sorry, that tree is no longer available."

This doesn't work as expected if we keep using the same tree id over multiple versions, which I believe was the case for a long time.

jar398 commented 9 years ago

I don't see how the existing system can fail gracefully, unless you mean that someone is free to check the tree id - which they are not required to do now, and can still do under the proposal. Under the proposal, if they can choose to ignore it (and clearly some do), they will get the corresponding node in any tree, rather than a 'wrong tree' error; and presumably that is what they wanted, unlike now, when they get a randomly selected node.

And I don't know what you mean by "this doesn't work as expected". The expectation we will set, which will be pretty clear from the syntax, is that the new designators mean taxa or mrcas of two taxa, no matter what tree is being browsed, no matter what the tree is called, no matter whether there's versioning or not, etc.

This is a fairly small change as far as clients are concerned - they will switch from using opaque, unstable ids to opaque stable ids (non-opaque if they understand what the '+' means). But I guess it is incompatible in that the new node designators are not integers. Because (and only because) of the syntax change it might make sense to make up a new parameter/field name such as node_designator. We can keep node_id for a transitional period while clients adjust to using stable designators.

Or we could convert strings x+y to integers somehow, and reuse node_id. That would be fully compatible with no change required of any client. It would be a terrible abuse of integers, though. (even ids are 2 * ott id, odd ids are bit-interleaving of the two ott ids, times 2, plus 1.)

I haven't thought about how this works with the URLs used by the UI; that's a separate question. I would expect it to be completely transparent, just tweaking the URL former to understand strings instead of integers. But neo4j node ids certainly don't belong there either. ​

hlapp commented 9 years ago

We need to make sure that we communicate this change well to users of the APIs. Will we keep backwards compatibility of the existing behaviour? As concrete examples, the folks writing the rotl library are just about to publish, and the peyotl tutorial that @snacktavish wrote is being hosted by CIPRES. I don't want to break those implementations.

Backwards compatibility is the bane of progress, and the reason why APIs should be versioned, and have an expectation of being sunset at some point in the future.

This is common fare. APIs of packages that a particular published R package depends on change all the time, and do sooner or later break packages, including published ones, if they are not maintained. I'm not sure why Open Tree should be expected to play by stricter rules than everyone else, in particular in science.

kcranston commented 9 years ago

I mostly want to make sure that we don't silently push this change, rather than officially versioning the API and communicating to users.

jar398 commented 9 years ago

Right. If we choose not to use the integer-encoding kludge (and I think in the long run it's a bad idea), let's start by adding the new node_designator field to requests and responses; that's a compatible change. Then separately we can version the API and remove the node_id field, so as to address the complaint in the issue title.

(Anyone want to suggest something better than node_designator?)

mtholder commented 9 years ago

mrca_of perhaps? initialisms are often a bit cryptic, but MRCA is very common in this field.

jar398 commented 9 years ago

I had in mind a field that was the ott id if there was one, or id+id if not. That is, a designator for the node. So it would not be appropriate to call the field 'mrca_of'.

jimallman commented 9 years ago

maybe node_locator or node_location?

mtholder commented 9 years ago

The name does not matter too much to me. FWIW: I don't really see what is wrong with:

mrca_of=2+4

meaning the node that is the MRCA of ott ID 2 and ott ID 4. While

mrca_of=2

meaning the node with ott ID 2 (which is also the MRCA of taxon 2, in some sense).

node_designator is fine and consistent w/ phylocode lingo, but it requires the user to know "designator" means "either 'id of' or 'mrca of' ". So it is not the most transparent name.

kcranston commented 9 years ago

+1 to mrca_of

jar398 commented 9 years ago

What is the most recent common ancestor of 2? What's the difference between a cow?

There is no mrca of 2 because there's nothing for it to be common among.

Users don't need to know that the string has anything to do with ott ids or mrcas. They can just treat it as an opaque string. Not that there's any harm in them knowing, but there is no burden of requiring them to know this detail, and no transparency issue because keeping them opaque would be just fine.

In the future we might have other kinds of designators that, like 2, are also not mrca expressions. ​ What are 2 and 2+4? They are designators (or identifiers, or indicators, or expressions, or...). Their role (in the API) is not to be mrcas, it is to designate.

josephwb commented 9 years ago

I may be missing something, but is there a reason we cannot have two potential arguments: ott_id or mrca_of (list of ott_ids)?

jar398 commented 9 years ago

I think there should be a 'node id' abstraction that clients can use without having to be aware of their syntax or semantics. That is, clients that would be happy with opaque node ids, should be able to treat them as such. The natural representation for such an id would be a string. It would not make sense for a client that didn't care to be forced to understand that sometimes they're strings and sometimes they're lists, and sometimes they're used with one parameter name and sometimes with another. All that most applications want (including the tree browser, which puts these things in URLs) is a way to identify nodes.

There's no reason not to have more meaningful parameters for finding nodes by mrca and so on, but that is a different application.

On Wed, Aug 5, 2015 at 1:46 PM, Joseph W. Brown notifications@github.com wrote:

I may be missing something, but is there a reason we cannot have two potential arguments: ott_id or mrca_of (list of ott_ids)?

— Reply to this email directly or view it on GitHub https://github.com/OpenTreeOfLife/treemachine/issues/183#issuecomment-128088319 .

josephwb commented 9 years ago

I don't really understand; I guess we can discuss during next meeting.

I pushed treemachine services to devapi which get rid of node_id (replacing with mrca_of where appropriate). I will send out tests and documentation later to show how they work (documentation for the services can be looked at using curl calls at the moment).

snacktavish commented 9 years ago

Like @josephwb, I'm confused by this suggestion. How would the node id strings transfer across different synthetic trees? I support the "mrca_of" approach as the easiest to understand, and "mrca_of" a single ott_id doesn't bother me either.

jar398 commented 9 years ago

The idea is to support data abstraction, i.e. that clients that don't need to know the representation shouldn't have to. So the abstraction is that of a 'stable node id' and to those clients it is just an opaque string. But in fact it would encode one of two situations:

  1. a node that has an OTT id,
  2. a node that does not have an OTT id, but is instead located using a pair of OTT ids, and taking their MRCA. The string representation could be an mapping of the set ottid + (ottid x ottid) to strings. An obvious mapping would be an integer (string) for an OTT id, and a string of the form N+N (or N&N or N*N or N%N or ...) for the pair-of-ids case. But it could be any other representation from which the two kinds of information could be distinguished and recovered, e.g. odd/even with bit interleaving, making every 'stable node id' a sequence of digits. But there is no reason to go that far.

I really don't like this "mrca_of" business because it leaks representation information to clients that shouldn't be bothered with it and forces them to distinguish the two cases when they just don't care. To them, the stable node id is just an opaque token.

Since this is an architectural change I wish you had put this out for review before implementing it; treemachine is not the only system component affected by it.

As for how the strings transfer across synthetic trees - well the id 123 would mean the node with ott id 123 in whatever tree was being referenced, and 123+456 would mean the mrca of 123 and 456 in whatever tree was being referenced. This ought to work OK a lot of the time; and as there is no way to have perfect transfer between trees, since trees are incompatibly different, "a lot of the time" is the best you can do, and therefore adequate.

On Wed, Aug 5, 2015 at 4:39 PM, Emily Jane McTavish < notifications@github.com> wrote:

Like @josephwb https://github.com/josephwb, I'm confused by this suggestion. How would the node id strings transfer across different synthetic trees? I support the "mrca_of" approach as the easiest to understand, and "mrca_of" a single ott_id doesn't bother me either.

— Reply to this email directly or view it on GitHub https://github.com/OpenTreeOfLife/treemachine/issues/183#issuecomment-128141486 .

jimallman commented 9 years ago

This ought to work OK a lot of the time; and as there is no way to have perfect transfer between trees, since trees are incompatibly different, "a lot of the time" is the best you can do, and therefore adequate.

Since we can't guarantee the "same" node (by whatever criterion), I'd like node-id strings that are fairly transparently ottids or combinations, implying MRCA. These look good to me:

The parentheses seem like they could be cumbersome, so I lean toward the others.

snacktavish commented 9 years ago

That makes sense. Is the idea that nodes would be assigned a fixed id, e.g. the node that is the mrca of 123 and 456 may also be the mrca of 111 and 444. Would it just be arbitrarily assigned an id string using two of its descendants, or would both '123+456' and '111+444' map to that node? (This may be getting into phylocode territory) Edit: Ah, just re-read first comment ending: "A node would have many 'ids' of this form." So, yes, both would identify node.

josephwb commented 9 years ago

I still don't understand. :cold_sweat:

Forgive me for being naive, but:

  1. If a user wants a named node, they need to know the ottid
  2. If a user wants an unnamed node, they need to know at least 2 ottids of descendant taxa.

So, the user knows what situation they are in (they have to be). If that is the case (it might not; like I said, I do not understand the issue yet), then why would:

foo(node_identifier=[NNN or XXX+YYY]); // where the arg is parsed to determine meaning

be better than:

bar([ottid=NNN or mrca_of=XXX]); // where either arg can be passed (not both)

? Either way, the user has to provide the right number of identifiers. Right? As @mtholder mentioned, "mrca_of" is ubiquitous in our community, so I think it is appropriate for identifying an unnamed node. I feel like I am missing something important here...

In terms of implementation, the stuff on devapi was just something to play with and see what we like. It took nearly no time at all, so no problem if it gets pitched. The point was to get moving on this languishing issue.

Two more things that came up:

  1. If we get rid of node_ids, then it won't be possible to query unnamed nodes that do not appear in the synthetic tree. Imagine we have 50 trees supporting one rooting of mammals, and 50 others that support another. One resolution has the highest ranking tree, so it will be present in the synthetic tree. With the ability to query node_ids, it is possible to look at the conflicting resolution and see which trees support it. Maybe not important? But it came up and I thought I'd pass the idea along.
  2. We know that "mrca"s can differ when considering the synthetic tree vs. the taxonomy tree. Which would a user have in mind? Probably taxonomy, right? Do we need to consider both options?
jar398 commented 9 years ago

Remember we're just talking about replacing the current unstable neo4j node ids with something that has a chance of surviving a synthetic tree update. Perhaps they should be called 'locators' instead of 'ids' since there is no need and no purpose in trying to make them unique per node.

I would expect treemachine to be deterministic, for any invocation of the server, in its choice of locator for a node. But redoing synthesis, or perhaps even restarting the server, might lead to a different arbitrary choice.

If we wanted uniqueness and stability, then yes, we might implement phylocode, but that would be a major undertaking - we'd need a registry, and some way to match nodes in the synthetic tree to "definitions" in the registry. I'm just talking about a minimal enhancement that will allow us to be able to use node locators appearing in comments, URLs, etc. to locate nodes from one synthesis build to the next, which is something we definitely do not have now.

On Wed, Aug 5, 2015 at 6:02 PM, Emily Jane McTavish < notifications@github.com> wrote:

That makes sense. Is the idea that nodes would be assigned a fixed id, e.g. the node that is the mrca of 123 and 456 may also be the mrca of 111 and 444. Would it just be arbitrarily assigned an id string using two of its descendants, or would both '123+456' and '111+444' map to that node? (This may be getting into phylocode territory)

— Reply to this email directly or view it on GitHub https://github.com/OpenTreeOfLife/treemachine/issues/183#issuecomment-128162800 .

jar398 commented 9 years ago

On Wed, Aug 5, 2015 at 7:44 PM, Joseph W. Brown notifications@github.com wrote:

I still don't understand. [image: :cold_sweat:]

Forgive me for being naive, but:

  1. If a user wants a named node, they need to know the ottid
  2. If a user wants an unnamed node, they need to know at least 2 ottids of descendant taxa.

So, the user knows what situation they are in (they have to be). If that is the case (it might not; like I said, I do not understand the issue yet), then why would:

You are mistaken. The common cases are the same as the current use cases for neo4j node ids. When someone uses a neo4j node id, that is because they have been handed it by someone else, typically either an API call (e.g. arguson) or something that has received a node id indirectly (e.g. the feedback system). It is just an uninterpreted handle. So situation 1 is unrelated to the current issue; we already have a way to do #1 and would continue using it. Number 2 is untrue as well: currently if you want an unnamed node, you provide a node id - an opaque token that treemachine knows how to interpret. I'm talking about replacing one opaque token with another. Yes, treemachine will need to know the two OTT ids to locate the node (just as it now needs to know the mapping from neo4j node ids to node), but the user doesn't need to know that. In software engineering, this is known as data abstraction. https://en.wikipedia.org/wiki/Abstract_data_type

foo(node_identifier=[NNN or XXX+YYY]); // where the arg is parsed to determine meaning

be better than:

bar([ottid=NNN or mrca_of=XXX]); // where either arg can be passed (not both)

? Either way, the user has to provide the right number of identifiers. Right?

No, the user just has to provide an opaque string that can be interpreted by some service that knows how to interpret it. If they have two OTT ids (or more likely two node locators) and want their mrca, they would use a service that knows how to handle that case. That is a different case.

Syntactically, I would set a requirement that these locators must be useable within URIs (including CGI query strings) without escaping. That would rule out use of =, +, %, and various other characters.

It may be that the XXX+YYY idea is causing so much confusion that maybe these locators should be made to look syntactically opaque. This is easily done, as I said before. Let the single OTT id case be written with a terminal '1', and the double OTT id case be written with a terminal '2' and the digits interleaved. So OTT id 123 would become node locator 1231, and the mrca of OTT 123 and OTT 456 would be node locator 1425362. This would be completely plug-compatible with the current node ids - we change certain decimal strings with other decimal strings - so would require zero cooperation from clients. Only treemachine would have to change.

I'm not sure we should go this far.

As @mtholder https://github.com/mtholder mentioned, "mrca_of" is ubiquitous in our community, so I think it is appropriate for identifying an unnamed node. I feel like I am missing something important here...

Yes, I think you missing the idea of data abstraction

In terms of implementation, the stuff on devapi was just something to play with and see what we like. It took nearly no time at all, so no problem if it gets pitched. The point was to get moving on this languishing issue.

OK. It sounded as if you had made an incompatible change. And we have a history on this project of sticking with designs that have had no planning or review.

Two more things that came up:

  1. If we get rid of node_ids, then it won't be possible to query unnamed nodes that do not appear in the synthetic tree. Imagine we have 50 trees supporting one rooting of mammals, and 50 others that support another. One resolution has the highest ranking tree, so it will be present in the synthetic tree. With the ability to query node_ids, it is possible to look at the conflicting resolution and see which trees support it. Maybe not important? But it came up and I thought I'd pass the idea along.

And this facility is one we currently have? If so, then yes, we have a different locator stability problem, and I hadn't thought about it. Perhaps stable references to these things is a harder problem, and maybe it's even intractable - in which case they should be tied to a particular tree, to avoid accidental hits when these get saved across TAG generations. THe current situation where you can obtain a node id, save it, attempt to reuse it, and get something completely unrelated really has to be ended, one way or another.

  1. We know that "mrca"s can differ when considering the synthetic tree vs. the taxonomy tree. Which would a user have in mind? Probably taxonomy, right? Do we need to consider both options?

This issue is about treemachine, so all decisions are relative to whatever artifact(s) treemachine hosts. That would be the synthetic tree and/or TAG. The same locators could be used with taxomachine, and then they would be relative to one of the taxonomies that it hosts.

As I said, what the user has in mind is whatever they have in mind when they use neo4j node ids currently. We have had users get confused about OTT ids vs. neo4j ids, so perhaps we should have a syntactic way to distinguish OTT ids from node locators.

The alternative to "sloppy" mrca-style locators that might take you to a node near one someone might have intended, but not exactly to it, is "hard" locators that are totally specific to a particular synthetic tree (or TAG) instance, i.e. neo4j node id + synthesis id. Neither approach is "right"; they have different failure modes. "Hard" locators become unresolvable once the underlying tree goes away (as when we advance to a new TAG / synthesis). "Sloppy" locators will always be resolvable, but they may take you to a slightly different node than someone might have wanted, if the resolution or classification changes in that vicinity. Perhaps the "sloppy" locator idea is not a good one. But either of these alternatives is superior to the current system, where a single locator can have wildly different meaning in different trees, and there is no check to make sure that an id is only used with the tree it came from. The purpose of this issue is not to introduce a new way of talking about mrcas, it's about solving the current problem where node ids are not only unstable but can be wrong in a way that is not detected. All discussion on this issue should be directed at solving that problem.

Just exploring the design space, another option is to combine the two approaches. A locator would consist of a "sloppy" mrca-style locator plus a synthesis version id. If you try to use a locator with a synthesis other than the one it came from, you would get a warning that the synthesis you asked for isn't available any more, but that a corresponding node in the current synthesis is such-and-such.

— Reply to this email directly or view it on GitHub https://github.com/OpenTreeOfLife/treemachine/issues/183#issuecomment-128185049 .

josephwb commented 9 years ago

Should: "the mrca of OTT 123 and OTT 456 would be node locator 1425362" be: "the mrca of OTT 123 and OTT 456 would be node locator 1234562"?

josephwb commented 9 years ago

Regardless, happy to do whatever people want.

jimallman commented 9 years ago

Just exploring the design space, another option is to combine the two approaches. A locator would consist of a "sloppy" mrca-style locator plus a synthesis version id. If you try to use a locator with a synthesis other than the one it came from, you would get a warning that the synthesis you asked for isn't available any more, but that a corresponding node in the current synthesis is such-and-such.

:+1:

josephwb commented 9 years ago

I started playing with the idea:

"Let the single OTT id case be written with a terminal '1', and the double OTT id case be written with a terminal '2' and the digits interleaved. So OTT id 123 would become node locator 1231, and the mrca of OTT 123 and OTT 456 would be node locator 1425362." [although I think the last thing should be "1234562"]

But don't we need some sort of delimiter? Are ottids not all the same length? If we need a delimiter, then which one? We have traditionally used '_', but I know people dislike that one.

jar398 commented 9 years ago

you would prefill with zeroes as needed.

123 + 4 = 123 + 004 = 102034 456 + 78901 = 0456 + 78901 = 708495061

The encoding doesn't matter. That you can establish a 1-1 correspondence between integers and pairs of integers is well established (Georg Cantor, 1870s). We shouldn't be talking details of the encoding until we've settled on general requirements.

jar398 commented 8 years ago

On hangout we agreed the ids would be chosen by propinquity and included in the tree input to tm-lite.

josephwb commented 8 years ago

This seems done.