ga4gh / data-repository-service-schemas

A repository for the schemas used for the Data Repository Service.
Apache License 2.0

Improve support for containers that contain *lots* of Objects #286

Open dglazer opened 5 years ago

dglazer commented 5 years ago

To address DRS v1 PRC #4 - Depth Parameter

The GET /objects/{object_id} method, when used on a container, includes a ContentsObject array in its response. That will work well for most use cases, but won't scale to containers with lots of children (or nested children if using the optional expand=true query parameter).
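To make the concern concrete, here is a minimal client sketch (Python, against a hypothetical DRS server at drs.example.org) showing that with expand=true the entire subtree must arrive in one response body:

import requests

BASE_URL = "https://drs.example.org/ga4gh/drs/v1"  # hypothetical server

def count_descendants(object_id):
    # One GET, one response: for a huge container this single payload
    # carries every nested ContentsObject.
    resp = requests.get(f"{BASE_URL}/objects/{object_id}", params={"expand": "true"})
    resp.raise_for_status()

    def walk(contents):
        return sum(1 + walk(child.get("contents", [])) for child in contents)

    return walk(resp.json().get("contents", []))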

Potential problems:

  * response size -- a container with many children (and more so with expand=true) can produce an arbitrarily large ContentsObject array in a single response
  * response time -- assembling the full listing for a large container may not finish within normal request time limits

Potential solutions:

  * a max_depth query parameter to bound how far expand recurses
  * pagination of the contents listing (e.g. a next-page token with a sensible default page size)

dglazer commented 5 years ago

I don't think this is an urgent issue -- we can prioritize based on feedback from real-world usage of v1.

ddietterich commented 5 years ago

Responses that can't complete within time limits could return an ACCEPTED (HTTP 202) status and a retry hint. Since we have that mechanism, is time really an issue?
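For reference, a hedged sketch of that retry flow (Python; it assumes the server answers 202 with a Retry-After header while it assembles the listing, which is an assumption about server behavior rather than spec text):

import time
import requests

def get_with_retry(url, max_attempts=10):
    # Poll while the server reports ACCEPTED (202), honoring Retry-After.
    for _ in range(max_attempts):
        resp = requests.get(url)
        if resp.status_code == 202:
            time.sleep(int(resp.headers.get("Retry-After", "1")))
            continue
        resp.raise_for_status()
        return resp.json()
    raise TimeoutError(f"gave up on {url} after {max_attempts} attempts")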

The result size seems like the difficult problem. max-depth is an understandable way to limit result size, but it might not be sufficient: result size is driven more by the number of objects returned than by nesting depth.

delagoya commented 5 years ago

Bumping this topic to the top. Even at a single level of an Object hierarchy, there may be a response size limitation issue.

My vote is to add max_depth and also pagination parameters (perhaps nextToken?):

/objects/{object_id}?expand=true&max_depth=2&next_page_token=41C63372E8F9 with sensible defaults for result_size (e.g. 1000)
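A sketch of the client loop this would enable (Python; the parameter names are the ones proposed above and are not yet in the spec):

import requests

def iter_contents(base_url, object_id, max_depth=2, page_size=1000):
    # Yield ContentsObjects one page at a time until the server stops
    # returning a next_page_token.
    token = None
    while True:
        params = {"expand": "true", "max_depth": max_depth, "result_size": page_size}
        if token:
            params["next_page_token"] = token
        resp = requests.get(f"{base_url}/objects/{object_id}", params=params)
        resp.raise_for_status()
        page = resp.json()
        yield from page.get("contents", [])
        token = page.get("next_page_token")
        if not token:
            break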

This gets a bit weird for nested structures, though. I have no good suggestion for handling a tree result across a set of responses.

pgrosu commented 5 years ago

Angel, I think it's simpler than you think for a graph/tree query. As a user you would build a simple nested query with filters like this -- it would work the same for a bundle as well:

objects().first(10) {
  servers {
    include { query-service.terra.bio, query-service.verily.com }
  }
  id: {object_id}
  last_accessed_object: null
}

You would need an orchestrator query service that accepts this -- which can live inside the container, or launch reactive query containers (or sidecar ones) to perform this function -- and then dispatch, say, first(5) to each one based on the example above. The response would be managed accordingly, either as a success or via the short-circuit/retry-policy pattern. If you want to get the next ones, below is an example:

objects().next(10) {
  servers {
    include { query-service.terra.bio, query-service.verily.com }
  }
  id: {object_id}
  last_accessed_object {
    query-service.terra.bio : {last_object_id_retrieved_in_sequence_from_terra_server}
    query-service.verily.com : {last_object_id_retrieved_in_sequence_from_verily_server}
  }
}

David, this should scale even to containers that might want to manage their own dispatching.
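A rough sketch of the per-server cursor bookkeeping this implies (Python; the server names come from the example above, while the /objects listing endpoint and its limit/after parameters are purely illustrative):

import requests

SERVERS = ["query-service.terra.bio", "query-service.verily.com"]

def next_page(cursors, per_server=5):
    # Fan out to each query service, remember the last object id each
    # one returned, and hand the merged results plus cursors back.
    results, new_cursors = [], {}
    for host in SERVERS:
        params = {"limit": per_server}
        if cursors.get(host):
            params["after"] = cursors[host]  # last id seen from this server
        resp = requests.get(f"https://{host}/objects", params=params)
        resp.raise_for_status()
        page = resp.json().get("objects", [])
        results.extend(page)
        new_cursors[host] = page[-1]["id"] if page else cursors.get(host)
    return results, new_cursors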

ddietterich commented 5 years ago

I think we all understand the concept: start from P and get me the next N. As I see it, there are several difficulties with using that technique when walking a graph:

  1. The server has to guarantee the same traversal order across requests (and across shards), or the client may see objects twice or miss them entirely.

  2. The client has to maintain the graph structure as it is reassembled from flat pages.

  3. Someone -- the server or the client -- has to maintain the cursor state between requests.

pgrosu commented 5 years ago

I still see the answers to these as quite simple. Let's go through each one:

  1. The same graph traversal should be guaranteed by a spanning tree of the graph, which gets initialized when the server is first brought online; as data is loaded, the tree is extended. You can also have introspective queries that rebuild the equivalent hierarchical result-set, without requiring spanning trees. Regarding sharding, a global B-tree recording which compute nodes hold which ranges of files/objects/bundles would allow for distributed traversal. You can even augment it with a distributed hash table to jump quickly to the right server.

  2. The client does not need to maintain a graph structure. Objects are nodes, and edges are associations/relations among them which can be tagged. Bundles can be viewed as subtrees. Some nodes can be viewed as collapsed entities, if a higher-order traversal might be required.

  3. Again, the client does not need to maintain state. Since a token is used, the server knows which user/client accessed it last. This does not even need to be a guarantee, as the client can provide the last accessed id (or sub-path) as context for where to continue. A client can also provide context on what has already been accessed, so that loops and repeated elements are avoided (see the sketch after this list). Query servers can be as accommodating as one wants, which allows for richness in how queries are performed. If queries can become states that initiate more fine-grained inspection of interesting results, that would let one tailor data retrieval more naturally.
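As a sketch of point 3, the per-server last-accessed ids can be folded into an opaque continuation token, so neither side has to hold session state between requests (Python; the JSON-plus-base64 encoding and the ids are illustrative only):

import base64
import json

def encode_token(last_accessed):
    # Pack the per-server last-accessed ids into an opaque string.
    return base64.urlsafe_b64encode(json.dumps(last_accessed).encode()).decode()

def decode_token(token):
    return json.loads(base64.urlsafe_b64decode(token.encode()).decode())

token = encode_token({
    "query-service.terra.bio": "obj-0042",   # hypothetical ids
    "query-service.verily.com": "obj-0917",
})
assert decode_token(token)["query-service.terra.bio"] == "obj-0042"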

ddietterich commented 5 years ago

I continue to disagree.

  1. You assert that such an implementation can be made. I don't disagree. But we are not talking about greenfield implementations; we are talking about existing implementations that may not have the properties you wish for.

  2 and 3. Why do you assume the client does not need the graph structure? I think we are trying to convey the graph to the client. If that is not the goal, then we are talking at cross purposes.

pgrosu commented 5 years ago

I tried to keep the client simple, but that does not preclude extending it into a richer client with graph caching. Any existing implementation would by default have data-access channels to its datastores, and the data will, with high probability, grow over time. A control/orchestration layer that optimizes access to that data is something users would surely welcome, to expedite their access when interfacing with such large data silos. Why, for instance, did the BAM format gain the BAI index format? Otherwise folks would be downloading whole BAM files just to subsequently select their region of interest. My approach is simply to allow for some form of iteration. If we say that the data repositories we currently have cannot be extended with data-access enhancements -- to allow optimal access to larger and ever-growing structural complexities -- then why would folks be motivated to even reach for those repositories once they become cumbersome to use?

GA4GH is supposed to be at the forefront of developing such standards, and of encapsulating logic around current limitations on sharing data. It is even stated in the GA4GH Connect: A 5-Year Strategic Plan, in the Cloud Work Stream vision, that:

Its initial focus is on ‘bringing the algorithms to the data’, by creating standards for defining, sharing, and executing portable workflows. Standards under discussion include workflow definition languages, tool encapsulation, cloud-based task and workflow execution, and cloud-agnostic abstraction of data access.

All I was proposing were algorithms to encapsulate the data with. Please help me understand what limitations we are facing now, since we have been working on this since 2013, and I am sure researchers would relish richer data-exploration capabilities and data access.

briandoconnor commented 3 years ago

Related to #323, #325, and #337, correct?