OpenTreeOfLife / opentree

Opentree browsing and curation web site. For overarching or cross-repo concerns, please see the 'germinator' repo.
http://tree.opentreeoflife.org/
BSD 2-Clause "Simplified" License
109 stars 26 forks source link

Ability for client code to request arbitrary-sized trees through a "big tree protocol" #439

Open jeetsukumaran opened 10 years ago

jeetsukumaran commented 10 years ago

A.k.a. the 2-year old macaque problem.

(1) requests that exceed or over-stress the load capacity of the server should fail, with an error message indicating that client code could repeat request using the "big tree transport protocol"

(2) the big tree transport protocol -- server-side chunking and client-side assembly of tree

rhr commented 10 years ago

Ok, I'll bite - 2-year old macaque?

Incidentally I'm working on an API of this sort for the tree illustrator app - basically a backend protocol to support zooming and panning a big tree visualization without storing it all client-side. The problem for tree viz is trickier than simply having a way of saying, e.g., 'give me at most N descendant nodes with a max depth of M from node X'. The core problem is more like, 'give me the info I need to render a thumbnail of clade X', where clade X might be very large.

chinchliff commented 10 years ago

I quote Jeet:

"I know some two year olds who can climb trees! ... Long-tailed macaques, for instance..."

On Tuesday, September 16, 2014, Rick Ree notifications@github.com wrote:

Ok, I'll bite - 2-year old macaque?

Incidentally I'm working on an API of this sort for the tree illustrator app - basically a backend protocol to support zooming and panning a big tree visualization without storing it all client-side. The problem for tree viz is trickier than simply having a way of saying, e.g., 'give me at most N descendant nodes with a max depth of M from node X'. The core problem is more like, 'give me the info I need to render a thumbnail of clade X', where clade X might be very large.

— Reply to this email directly or view it on GitHub https://github.com/OpenTreeOfLife/opentree/issues/439#issuecomment-55811361 .

jar398 commented 10 years ago

The big tree transport protocol should be simple FTP-like transfer of the whole tree. We should provide tools that people can run on their own machine to extract subtrees from that. If they want to put a caching layer on top of that, fine.

Of interest on this topic: http://www.gnu.org/philosophy/who-does-that-server-really-serve.html

On Tue, Sep 16, 2014 at 3:08 PM, Jeet Sukumaran notifications@github.com wrote:

A.k.a. the 2-year old macaque problem.

(1) requests that exceed or over-stress the load capacity of the server should fail, with an error message indicating that client code could repeat request using the "big tree transport protocol"

(2) the big tree transport protocol -- server-side chunking and client-side assembly of tree

  • to avoid computational load, either on first request or at start-up, large subtrees can be cached
  • some mechanism of "paging" a tree, e.g., server can return a list of lists of nodes to be retrieved

— Reply to this email directly or view it on GitHub https://github.com/OpenTreeOfLife/opentree/issues/439.

jeetsukumaran commented 10 years ago

"The big tree transport protocol should be simple FTP-like transfer of the whole tree"

I disagree.

Strongly.

It is true that, in principle, this "download-and-do-it-yourself" approach could fulfill all possible things people might want OpenTree for, and not just when things stress the server. But I think that there is something inherently unsatisfying about this solution, both in principle as well in practice.

I think that solving the big tree transport protocol is interesting in and of itself. And I think it has use cases that extend beyond someone pulling data from OpenTree, to the point that even this existence and documentation of and experience with this approach itself would be a useful contribution of OpenTree to the community as much as any data? I guess that I see OpenTree as doing more than just serving data. It is also platform to explore (and solve) issues that arise with big data and big trees. And communicating/transporting big trees is a prime example of a big data / big tree issue.

jeetsukumaran commented 10 years ago

Of course, it make absolute sense that this issue is simply not an immediate or even near-future priority, or even something that the core team wants to handle (i.e., maybe in another hackathon)? I just do not think it should be taken off the table, or the "ftp-and-DIY" approach is a spiritually-satisfying solution to this issue.

josephwb commented 10 years ago

Although we've talked about the alternatives (e.g. chunking up the tree), I think being able to send compressed (complete) trees will solve almost all of the problems here. The full newick tree compresses at ~80 MB. People are not likely to want the full tree, so we are probably dealing with much smaller file sizes.

I've raised the issue of returning gzipped data, but reliance on neo4j plugins has somewhat hamstrung us on this issue. I know there ae people at the hackathon that are more familiar with neo4j than some of us (i.e. me), so it would be great if they could chime in on this.

mtholder commented 10 years ago

hmm. I don't like disagreeing with @jeetsukumaran when he disagrees strongly with something... I don't like disagreeing with Richard Stallman either. What to do?...

If we are going to encourage downloading the whole tree and using client side operations, then we might need to post a version of the synthetic tree with node IDs in it.

I'm assuming the typical work flow, if you want a complete subtree is:

  1. pick some taxa that span the part of the tree you want.
  2. use tnrs to convert them to ott_ids (if you started with names)
  3. ask tree_of_life/mrca for the mrca node ID.
  4. request the subtree,
  5. sadly discover that you are in the zone of big tree transport protocol

Our services like /tree_of_life/mrca return an ott_id for an internal node, if there is one. But client's can't rely on it being there (lots of internal nodes do not map to a node in the taxonomy). The node "ids" will always be returned, so it seems like client most clients would rely on that.

So, the client-side processing code would need to:

  1. grab the full tree,
  2. on demand prune out a subtree starting at node 358610395 (or whatever the mrca_node_id returned from mrca was).

Obviously that is only possible if the dumped version of the full tree has the node "ids" in it.

Alternatively, we could augment the mrca response to include the ott_ids for two tip nodes chosen such that the mrca node can also be located as the mrca of those two tips. That would mean the client side processing would have to find the mrca, as it pieces together chunks. But that is not too hard...

Note: "ids" are in scare quotes because these numbers are ephemeral from version to version of the supertree. They are not stable, persistent identifiers

mjy commented 10 years ago

How about reindexing the whole tree in some other datastore, like elastic search, and going from there? I.e. don't kluge to get neo working, use a tool that is purposed for doing similar things?

I don't feel strongly about this suggestion ;).

M On Sep 17, 2014 5:49 AM, "Mark T. Holder" notifications@github.com wrote:

hmm. I don't like disagreeing with @jeetsukumaran https://github.com/jeetsukumaran when he disagrees strongly with something... I don't like disagreeing with Richard Stallman either. What to do?...

If we are going to encourage downloading the whole tree and using client side operations, then we might need to post a version of the synthetic tree with node IDs in it.

I'm assuming the typical work flow, if you want a complete subtree is:

  1. pick some taxa that span the part of the tree you want.
  2. use tnrs to convert them to ott_ids (if you started with names)
  3. ask tree_of_life/mrca for the mrca node ID.
  4. request the subtree,
  5. sadly discover that you are in the zone of big tree transport protocol

Our services like /tree_of_life/mrca https://github.com/OpenTreeOfLife/opentree/wiki/Open-Tree-of-Life-APIs#mrca return an ott_id for an internal node, if there is one. But client's can't rely on it being there (lots of internal nodes do not map to a node in the taxonomy). The node "ids" will always be returned, so it seems like client most clients would rely on that.

So, the client-side processing code would need to:

  1. grab the full tree,
  2. on demand prune out a subtree starting at node 358610395 (or whatever the mrca_node_id returned from mrca was).

Obviously that is only possible if the dumped version of the full tree has the node "ids" in it.

Alternatively, we could augment the mrca response to include the ott_ids for two tip nodes chosen such that the mrca node can also be located as the mrca of those two tips. That would mean the client side processing would have to find the mrca, as it pieces together chunks. But that is not too hard...

Note: "ids" are in scare quotes because these numbers are ephemeral from version to version of the supertree. They are not stable, persistent identifiers

— Reply to this email directly or view it on GitHub https://github.com/OpenTreeOfLife/opentree/issues/439#issuecomment-55871808 .

josephwb commented 10 years ago

I don't feel that the reconstruction of the tree (i.e. the graph traversal) is a bottleneck, but we could alleviate this a bit by baking subtrees into the graph. At the extreme, we could store the descendant newick string at each node in the synthetic tree. Of course, this helps only with complete subtrees...

jeetsukumaran commented 10 years ago

Hmmm. I am not yet at the "Emacs" level of disagreement, but definitely past "pico".

How's this ...

The BTTP returns a newick string with a maximum nesting level of 1 where all nodes are actually full json strings with the following keys:

E.g.

  ({"node_id": "null", "ott_id": "1", "requery": "false"},({"node_id": 1, "ott_id": "null", "requery": "true"},{"node_id": "2" "ott_id": "null", "requery": "true"},))

Exactly one of ott_id or node_id will be valid, while the other will be "null".

If "requery_required" is True, then the json represents a placeholder for a full clade rooted at the specified "ott_id" or "node_id". Requerying OTT using the appropriate "ott_id" or "node_id" will retrieve another BTTP newick string that describes that clade. The process continues recursively for every node on the newick string where the json "requery" key is True. If "False", then this is a terminal, and that clade has been fully retrieved:.

chinchliff commented 10 years ago

Regarding neo4j Ids: I do not think there is a problem returning neo4j Ids from the web services, but I'm not crazy about the idea of zipping up a tree decorated with them and sticking it up somewhere else. I share Jonathan's so-called lock-stepping concerns here--we would also need to indicate which version of the tree the node Ids corresponded to, which would require nontrivial process management and likely cascading requirements for treemachine APIS.

Regarding graph traversals and server load, there are two issues: (1) the size of the payload (gzipping mostly solves this problem) and (2) memory requirement for constructing the tree objects pulled from the graph. We haven't quantified this, but it could potentially be very large. It would be wise not to anticipate that we will be able to handle many concurrent requests for very large trees. "Chunking" requests should reduce this concern.

I think Jeet's proposal of retrieving node-id decorated trees from the API and parsing/grafting them on the client side is a workable solution with little overhead that requires few changes to existing services and code, so I like it. The proposed tree representation, though, is getting dangerously close to the "arguson zone" (note scare quotes), which we have been trying to avoid because it introduces yet another arbitrary representation of trees (to the already diverse ecosystem including nexson, nexml, newick, nexus, etc). I wonder if there is a more standard way of doing this that would work.

On Wednesday, September 17, 2014, Jeet Sukumaran notifications@github.com wrote:

Hmmm. I am yet at the "Emacs" level of disagreement, but definitely past "pico".

How's this ...

The BTTP returns a newick string with a maximum nesting level of 1 where all nodes are actually full json strings with the following keys:

  • ott_id : long or "null"
  • node_id : long or "null"
  • requery_required : boolean

E.g.

({"node_id": "null", "ott_id": "1", "requery": "false"},({"node_id": 1, "ott_id": "null", "requery": "true"},{"node_id": "2" "ott_id": "null", "requery": "true"},))

Exactly one of ott_id or node_id will be valid, while the other will be "null".

If "requery_required" is True, then the json represents a placeholder for a full clade rooted at the specified "ott_id" or "node_id". Requerying OTT using the appropriate "ott_id" or "node_id" will retrieve another BTTP newick string that describes that clade. The process continues recursively for every node on the newick string where the json "requery" key is True. If "False", then this is a terminal, and that clade has been fully retrieved:.

— Reply to this email directly or view it on GitHub https://github.com/OpenTreeOfLife/opentree/issues/439#issuecomment-55882973 .

jeetsukumaran commented 10 years ago

The proposed tree representation, though, is getting dangerously close to the "arguson zone"

Not to mention that it probably breaks every NEWICK parser in existence and would require bit of custom coding to handle. Yep, that's bad.

Maybe the return value can be NexML or nexon, on the same principle as above (i.e., clades are json to be "unpacked" and grafted)?

If we can stomach another ad-hoc schema, perhaps a pure JSON description?

E.g. instead of:

({"node_id": "null", "ott_id": "1", "requery": "false"},({"node_id": 1, "ott_id": "null", "requery": "true"},{"node_id": "2" "ott_id": "null", "requery": "true"},)) 

Maybe BTTP with max_depth==1:

{
"clade_root" : {"node_id": "null", "ott_id": "null"},
"children" :[
    {
     "node_id": "null", 
     "ott_id": "1", 
     "requery": "false", 
     "children": []
    },
    {
     "node_id": "1",
     "ott_id": "null",
     "requery": "true",
     "children": []
    },
    ]
}

BTTP with max_depth==2:

{
"clade_root" : {"node_id": "null", "ott_id": "null"},
"children" :[
    {
     "node_id": "null", 
     "ott_id": "1", 
     "requery": "false", 
     "children": []
    },
    {
     "node_id": "1",
     "ott_id": "null",
     "requery": "true",
     "children": [
        {
        "node_id": "null", 
        "ott_id": "2", 
        "requery": "false", 
        "children": []
        },
        {
        "node_id": "3",
        "ott_id": "null",
        "requery": "true",
        "children": [
        ]
        },
     ]
    },
    ]
}
jimallman commented 10 years ago

On Sep 17, 2014, at 10:54 AM, Jeet Sukumaran notifications@github.com wrote:

Maybe the return value can be NexML or nexon, on the same principle as above (i.e., clades are json to be "unpacked" and grafted)?

Not sure if this is relevant, but at one time we discussed some ideas for uniform decomposition of NexSON into fragments which can be re-assembled piecemeal:

https://github.com/OpenTreeOfLife/phylesystem-api/blob/master/docs/README.md#nexson-fragments-and-decomposition

This doesn’t deal very gracefully with the separation of nodes, edges, and OTUs in NexSON, but maybe it’s a start...

=jimA=

Jim Allman Interrobang Digital Media http://www.ibang.com/ (919) 649-5760

chinchliff commented 10 years ago

The question of tree representation overlaps with https://github.com/OpenTreeOfLife/opentree/issues/443. See Mark's proposal for one way to encode metadata with newick trees.