ga4gh / data-repository-service-schemas

A repository for the schemas used for the Data Repository Service.
Apache License 2.0

Define Collection methods #221

Closed rishidev closed 5 years ago

rishidev commented 5 years ago

Some points to discuss here: How similar/different to object methods? Distinguished endpoints for each or different?

briandoconnor commented 5 years ago

@susheel and @delagoya are interested in working on this

@sarpera and SBG are interested in folders and how they relate to this

tetron commented 5 years ago

From the perspective of CWL / WES, one thing that needs to be clarified is how a DRS collection is intended to be materialized to a directory. When a DRS collection is used as input to a workflow:

delagoya commented 5 years ago

Thanks @tetron, this is a good list to start with. Before we discuss each point, I want to confirm some agreements that were proposed during the Boston meeting:

  1. That collections are immutable
  2. Collections are not versioned
  3. That you can have sub-collections
  4. That a GET operation on a collection returns the immediate children (e.g. sub-collections require additional API operations to unroll).

tetron commented 5 years ago

  1. yes
  2. yes

3 & 4. that's what we agreed to, but I don't think we fully explored the alternatives.

tetron commented 5 years ago

This is the representation of a bundle in the current spec:

object_ids: [
  "drs://object1",
  "drs://object2",
  "drs://bundle3",
]

In order to materialize these into a directory, you would use the name field of each object or bundle. In order for the collection to be immutable, the name field of each data object or bundle would also have to be immutable. Which means if you want to rename a data object, you have to create a copy.

An alternate representation, where the collection assigns names to objects:

contents: {
  "/foo.bam": "drs://object1",
  "/foo.bai": "drs://object2",
  "/subdir/bar.bam": "drs://object3",
  "/subdir/bar.bai": "drs://object4"
}

Instead of sub-collections, a collection represents the entire directory tree. This eliminates the possibility of circular dependencies by collections including themselves. Fetching the collection record gives you more information up front about the contents of the collection, without having to fetch each object record separately.
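
To make the materialization step concrete, here is a minimal Python sketch, assuming the path-keyed contents map shown above. The function name build_tree is hypothetical and is not part of the DRS spec; it only illustrates how a client could turn the flat map into a directory tree and reject conflicting entries.

```python
def build_tree(contents):
    """Turn a {"/path/name": drs_id} map into nested dicts (directories) and strings (objects)."""
    tree = {}
    for path, drs_id in contents.items():
        parts = [p for p in path.split("/") if p]
        if not parts:
            raise ValueError(f"empty path: {path!r}")
        node = tree
        # Walk (and create) every intermediate directory.
        for directory in parts[:-1]:
            child = node.setdefault(directory, {})
            if not isinstance(child, dict):
                raise ValueError(f"{directory!r} is both a file and a directory")
            node = child
        if parts[-1] in node:
            raise ValueError(f"duplicate or conflicting entry: {path!r}")
        node[parts[-1]] = drs_id
    return tree


contents = {
    "/foo.bam": "drs://object1",
    "/foo.bai": "drs://object2",
    "/subdir/bar.bam": "drs://object3",
    "/subdir/bar.bai": "drs://object4",
}
tree = build_tree(contents)
```

With this representation a single collection record is enough to lay out the whole directory; no further record fetches are needed to learn names.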

susheel commented 5 years ago

@delagoya - Yes I think that was what was discussed in Boston. 1 & 2 are corollaries of each other.

Thanks, @tetron. I agree with the new representation. Yes that is what I was going for with the example presented at the hackathon.

@tetron I agree about circular dependencies, but fully realising an entire collection, e.g. ENA Release 138, would result in over 1.5K files, which would bloat the object list and end up requiring pagination. Hence the proposal of defining sub-collections (3) and the possibility of unrolling the collection (4).

@tetron A query param, e.g. ?expand=true, would unroll the sub-collection into the fully realised version.

tetron commented 5 years ago

Here's a third option, halfway between 1 & 2.

contents: {
  "foo.bam": "drs://object1",
  "foo.bai": "drs://object2",
  "subdir": "drs://bundle3",
}

The collection assigns names but '/' is disallowed. This still has the risk of circular includes, but (a) allows collections to assign their own names and (b) the collection record includes the names (without having to fetch each object separately to find out what the names are.)
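
As a rough illustration of the two constraints in this option (no '/' in names, and the risk of circular includes), here is a Python sketch. The check_collection helper and the in-memory collections map are assumptions made for the example; a real server would look records up in its own store.

```python
def check_collection(root_id, collections):
    """Raise ValueError if a name contains '/' or a collection (transitively) includes itself.

    `collections` maps a collection id to its {name: child_id} contents;
    ids missing from the map are treated as plain data objects.
    """
    def visit(coll_id, ancestors):
        if coll_id in ancestors:
            raise ValueError(f"circular include at {coll_id}")
        for name, child_id in collections[coll_id].items():
            if not name or "/" in name:
                raise ValueError(f"invalid name {name!r} in {coll_id}")
            if child_id in collections:
                # Track only the current path, so a sub-collection that is
                # referenced twice without a loop is still accepted.
                visit(child_id, ancestors | {coll_id})

    visit(root_id, frozenset())


collections = {
    "drs://bundle1": {
        "foo.bam": "drs://object1",
        "foo.bai": "drs://object2",
        "subdir": "drs://bundle3",
    },
    "drs://bundle3": {"bar.bam": "drs://object3"},
}
check_collection("drs://bundle1", collections)  # passes silently
```

If collections are immutable and can only reference records that already exist, a cycle cannot be created in the first place, so a check like this would mainly be a safeguard.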

susheel commented 5 years ago

@tetron I'm okay with that. So if we add the ?expand=true query parameter will we end up with:

contents: {
  "foo.bam": "drs://object1",
  "foo.bai": "drs://object2",
  "subdir/bar.bam": "drs://object3",
  "subdir/bar.bai": "drs://object4"
}
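
A minimal Python sketch of what the server might do for ?expand=true, assuming sub-collections are stored as name-to-id maps and contain no cycles. The expand function and the collections map are hypothetical names used only for this example.

```python
def expand(coll_id, collections, prefix=""):
    """Flatten nested sub-collections into a single {"dir/name": object_id} map."""
    flat = {}
    for name, child_id in collections[coll_id].items():
        path = prefix + name
        if child_id in collections:
            # Sub-collection: recurse and join names with '/'.
            flat.update(expand(child_id, collections, path + "/"))
        else:
            flat[path] = child_id
    return flat


collections = {
    "drs://bundle1": {
        "foo.bam": "drs://object1",
        "foo.bai": "drs://object2",
        "subdir": "drs://bundle3",
    },
    "drs://bundle3": {
        "bar.bam": "drs://object3",
        "bar.bai": "drs://object4",
    },
}
flat = expand("drs://bundle1", collections)
```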

susheel commented 5 years ago

@tetron @delagoya If this is the only difference between data objects and collections, I'm keen to propose that objects have an empty or null contents object. That way both collections and objects can be managed with the same data schema. Examples below:

{
  id: 'ABC123'
  name: 'Object'
...
  contents: {}
...
}
{
  id: 'XYZ890'
  name: 'Collection'
...
  contents: {
    "foo.bam": "drs://object1",
    "foo.bai": "drs://object2",
    "subdir/bar.bam": "drs://object3",
    "subdir/bar.bai": "drs://object4"
  }
...
}

Do you also want an additional flag in the combined schema e.g. has_parts: true? I'm not too keen on this as checking the has_parts flag and checking if contents is not null is basically the same operation.

tetron commented 5 years ago

I don't think we've established whether a collection has access methods. For some protocols (e.g. a site-specific NFS mount point) it might make sense; for others you have to grab individual objects.

tetron commented 5 years ago

A 4th option, you could also have a query parameter like "include_objects=true" to request that the object records are embedded in the collection record response. Then you don't have to access object records separately, and you have the name, metadata (size, checksum etc) immediately available.

delagoya commented 5 years ago

Currently the GET on drs://object1 may contain a name field which is used to define the write target. I assume that, in the case of a collection mapping, the client ignores the individual object's name field in favour of the target defined by the collection. Correct?

If so, to @susheel's question about one data schema: there are possible schema clashes like the one above, and I would like to keep them separate until we have worked all the way through the collection and object schemas separately.

geoffjentry commented 5 years ago

I do not like the idea of embedding / in the middle of a name in order to create directory hierarchies.

Wouldn't sub-collections handle the case where one wants to map the objects in a bundle into a POSIX-style FS w/ a directory hierarchy?

susheel commented 5 years ago

@delagoya Yes, it enables a collection owner to define the naming scheme of the objects separate from the actual name of the object.

@geoffjentry Yes, sub-collections should handle this, as in the original example. The forward-slash embedding will only occur when ?expand=true is set. See the example below:

GET drs://server.com/XYZ890

{
  id: 'XYZ890'
  name: 'Collection'
  ...
  contents: {
    "foo.bam": "drs://object1",
    "foo.bai": "drs://object2",
    "subdir": "drs://collection1",
  }
  ...
}

GET drs://server.com/XYZ890?expand=true

{
  id: 'XYZ890'
  name: 'Collection'
  ...
  contents: {
    "foo.bam": "drs://object1",
    "foo.bai": "drs://object2",
    "subdir/bar.bam": "drs://object3",
    "subdir/bar.bai": "drs://object4"
  }
  ...
}

OR GET drs://server.com/XYZ890?include_objects=true via @tetron

{
  id: 'XYZ890'
  name: 'Collection'
  ...
  contents: {
    "foo.bam": "drs://object1",
    "foo.bai": "drs://object2",
    "subdir": {
      "id": "collection1",
      "name": "subdir sub-collection",
      ...
      "contents": {
        "bar.bam": "drs://object3",
        "bar.bai": "drs://object4"
      }
      ...
    }
  }
  ...
}

My preference would be GET drs://server.com/XYZ890?expand=true, purely to keep the JSON response less verbose. However, I could be convinced to use object embedding if the community wants to go in this direction.

mattions commented 5 years ago

I'm not sure yet what the decision was on which attributes get revealed for a single object, but I think we should be careful about embedding the name in the collection.

I know, for example, that some of the Driver projects do not want to reveal anything, not even the name. All the info can be revealed only to an authorised user.

It is still fuzzy how to do it, but I just want to make sure that this is known, and that a decision is made taking this into account.

susheel commented 5 years ago

@mattions I understand. If the user needs access privileges to view certain metadata fields, then that will be part of the access rights provided to the user out of band. The spec should not differentiate between a privileged and a non-privileged user; that should be handled outside of the DRS scope.

delagoya commented 5 years ago

@susheel but to @mattions' point, the key to the contents collection should be the DRS id, not the target name.

Also, I think that the whole "unroll" concept is just adding confusion to the server/client interaction. In the Hackathon we spent quite a bit of time discussing this issue and came down on the side of "server is simple, client expands sub-collections via additional API calls."
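
For reference, a rough Python sketch of what "client expands via additional API calls" could look like, assuming contents is a plain list of DRS ids, that only collections carry a contents field, and that names come from each child's own record. The materialize function and the in-memory RECORDS table standing in for the server are hypothetical.

```python
def materialize(root_id, fetch):
    """Return ({path: object_id}, request_count), fetching every record individually."""
    listing = {}
    requests = 1  # the initial GET on the root collection
    pending = [("", child_id) for child_id in fetch(root_id)["contents"]]
    while pending:
        prefix, drs_id = pending.pop()
        record = fetch(drs_id)  # one round trip per child, needed to learn its name
        requests += 1
        path = prefix + record["name"]
        if "contents" in record:
            # Sub-collection: queue its children under this path.
            pending.extend((path + "/", child_id) for child_id in record["contents"])
        else:
            listing[path] = drs_id
    return listing, requests


RECORDS = {
    "drs://XYZ890": {"name": "Collection",
                     "contents": ["drs://object1", "drs://object2", "drs://collection1"]},
    "drs://object1": {"name": "foo.bam"},
    "drs://object2": {"name": "foo.bai"},
    "drs://collection1": {"name": "subdir",
                          "contents": ["drs://object3", "drs://object4"]},
    "drs://object3": {"name": "bar.bam"},
    "drs://object4": {"name": "bar.bai"},
}
listing, requests = materialize("drs://XYZ890", RECORDS.__getitem__)
```

In this model the number of requests equals the number of records in the tree (here 6 requests for 4 files), which is the cost that has to be weighed against server simplicity.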

tetron commented 5 years ago

So my concern is latency, it is going to be very slow if a directory listing requires 10 or 100 or 1000 separate requests to fetch each object in the collection just to find out the name.

Object embedding seems like a clean solution:

{
  id: 'XYZ890'
  name: 'Collection'
...
  contents: [
    {"id": "drs://object1", "name": "foo.bam", ...},
    {"id": "drs://object2", "name": "foo.bai", ...},
    {"id": "drs://collection1", "name": "subdir", ...}
  ]
...
}

The "expand" or "embed" option could have three options: none, shallow, deep. Where "none" only has the "id" field, "shallow" has the embeds for the immediate children but not any of the subcollections, and "deep" expands every subcollection.
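
A small Python sketch of the three modes as described in the paragraph above, assuming an in-memory records table and a contents list of ids. The render and render_child names and the record layout are hypothetical; only id and name are embedded here, where a real response would carry more metadata.

```python
def render_child(drs_id, records, expand):
    """Build one entry of a contents list for the requested expand mode."""
    if expand == "none":
        return {"id": drs_id}
    record = records[drs_id]
    entry = {"id": drs_id, "name": record["name"]}
    # Only "deep" descends into sub-collections.
    if expand == "deep" and "contents" in record:
        entry["contents"] = [render_child(c, records, expand) for c in record["contents"]]
    return entry


def render(drs_id, records, expand="none"):
    """Build the response body for a GET on a collection."""
    if expand not in ("none", "shallow", "deep"):
        raise ValueError(f"unknown expand mode: {expand!r}")
    record = records[drs_id]
    return {
        "id": drs_id,
        "name": record["name"],
        "contents": [render_child(c, records, expand) for c in record["contents"]],
    }


RECORDS = {
    "drs://XYZ890": {"name": "Collection",
                     "contents": ["drs://object1", "drs://collection1"]},
    "drs://object1": {"name": "foo.bam"},
    "drs://collection1": {"name": "subdir", "contents": ["drs://object3"]},
    "drs://object3": {"name": "bar.bam"},
}
none_view = render("drs://XYZ890", RECORDS)
shallow_view = render("drs://XYZ890", RECORDS, expand="shallow")
deep_view = render("drs://XYZ890", RECORDS, expand="deep")
```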

@delagoya at the hackathon we discussed expanding subcollections, but not where the file names come from. Whether or not there is a "deep expand" option on subcollections, we haven't answered the more basic question of whether getting names/sizes/access info for files in a collection should take one request or many.

Regarding large lists and paging: I believe we need this for collections no matter what we do. Even with no expansion, someone could publish a collection that is a flat list of 100,000 objects.
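
As a sketch of what paging over a flat contents list might look like in Python: the page_size and page_token parameters and the next_page_token field are assumptions for illustration and are not defined anywhere in this thread; the token is simply an offset encoded as a string.

```python
def page_contents(contents, page_size, page_token=None):
    """Return one page of a contents list plus a token for the next page (or None)."""
    if page_size < 1:
        raise ValueError("page_size must be positive")
    start = int(page_token) if page_token else 0
    if start < 0:
        raise ValueError("invalid page_token")
    end = start + page_size
    next_token = str(end) if end < len(contents) else None
    return {"contents": contents[start:end], "next_page_token": next_token}


# A flat collection of 100,000 object ids, read in pages of 1,000.
all_ids = [f"drs://object{i}" for i in range(100_000)]
pages, token, seen = 0, None, []
while True:
    page = page_contents(all_ids, page_size=1000, page_token=token)
    seen.extend(page["contents"])
    pages += 1
    token = page["next_page_token"]
    if token is None:
        break
```

Because collections are immutable, an offset-style token stays valid between requests; a mutable listing would need a more careful cursor.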

susheel commented 5 years ago

@delagoya @mattions My main concern is latency and the need to make multiple requests to fully materialise a collection. A typical collection from EBI will have over 1000 file objects.

We could have both id and name as part of the contents list and have multiple options for the expand parameter, where at one extreme we would just show the id and at the other have a deep expansion of the files.

susheel commented 5 years ago

Oops. Yep, what @tetron said :)

So it would look like:

GET drs://server.com/XYZ890 (default: ?expand=none)

{
  id: 'XYZ890'
  name: 'Collection'
  ...
  contents: [
    {"id": "drs://object1"},
    {"id": "drs://object2"},
    {"id": "drs://collection1"}
  ]
  ...
}

GET drs://server.com/XYZ890?expand=shallow

{
  id: 'XYZ890'
  name: 'Collection'
  ...
  contents: [
    {"id": "drs://object1", "name": "foo.bam", ...},
    {"id": "drs://object2", "name": "foo.bai", ...},
    {"id": "drs://collection1", "name": "subdir", ...
      "contents": [
        {"id": "drs://object3", "name": "bar.bam", ...},
        {"id": "drs://object4", "name": "bar.bai", ...},
        {"id": "drs://collection2", "name": "subsubdir", ...}
      ]
    }
  ]
  ...
}

GET drs://server.com/XYZ890?expand=deep

{
  id: 'XYZ890'
  name: 'Collection'
  ...
  contents: [
    {"id": "drs://object1", "name": "foo.bam", ...},
    {"id": "drs://object2", "name": "foo.bai", ...},
    {"id": "drs://collection1", "name": "subdir", ...
      "contents": [
        {"id": "drs://object3", "name": "bar.bam", ...},
        {"id": "drs://object4", "name": "bar.bai", ...},
        {"id": "drs://collection2", "name": "subsubdir", ...
          "contents": [
            {"id": "drs://object5", "name": "baz.bam", ...},
            {"id": "drs://object6", "name": "baz.bai", ...}
          ]
        }
      ]
    }
  ]
  ...
}

Is that a reasonable compromise?

susheel commented 5 years ago

@tetron @delagoya Are you happy for me to create a pull request out of this last comment?

tetron commented 5 years ago

@susheel go for it

sarpera commented 5 years ago

@susheel @tetron do we assume DRS will provide information about a bundle consistently?

In other words, do we imagine DRS preserving the state of a bundle that represents a folder structure this way at any given time (by somehow storing that info)? Or will it dynamically re-create information about the nested files and folders in a directory, depending on its state at the time of the request for a bundle?

susheel commented 5 years ago

@sarpera For DRS v1, Collections will represent a static view of the nested file/folder hierarchy at the time of creation specified by the version and/or (created|updated) timestamp.

Dynamically updating collections could still be created by the service provider, by using a version=latest, which could either point to a static view or a dynamically updated view based on the service provider's preference.

This will be more in the realm of the version semantics of objects/collections.

geoffjentry commented 5 years ago

@sarpera @susheel I think we'd be doing ourselves a service if we stopped using concepts like directory/files, etc in our discussions - even if we know that some groups will use this to sit on top of directory/files. Framing the discussion in terms of POSIX-y filesystems will lead to artificial limitations of the API.

For instance, the question @sarpera is asking implies the potential for a standard directory concept. It's a pointer to some files, and those files can come, go, be edited, etc. Instead, the concept we've been discussing is really just "a pointer to an immutable set of bytes", albeit with some way of communicating the structure of those bytes (e.g. multiple independent subsets of bytes, perhaps some knowledge of the relationship between those subsets of bytes, etc). The latter is far more flexible.

susheel commented 5 years ago

@geoffjentry @sarpera I agree. I was just using the POSIX-y terminology to explain the static nature of the current spec.

sarpera commented 5 years ago

@geoffjentry you are totally right, thanks for pointing that out. Our general take on bundles had also been to see them as "a pointer to an immutable set of bytes". During the hackathon, thanks to everyone's contribution, we figured it "could" be used to solve a practical problem where a directory is an input or an output for a CWL workflow.

Going towards the idea of using DRS URLs in the WES spec, this would help the drivers adapt and map the required use cases for running CWL using GA4GH APIs.

Suggestions from @susheel and @tetron do a great job providing a possible solution for that case, while not betraying the intended use case for bundles.

ddietterich commented 5 years ago

I disagree with @geoffjentry. I think we do our users a disservice by re-inventing the wheel. The file system hierarchy is what is understood and processed by existing tools and applications. What is the motivation to invent something different? I would like a clear set of use cases from drivers that motivate diverging from the collection == directory model.

geoffjentry commented 5 years ago

@ddietterich Example - in the past we've talked about providing pointers to e.g. BigQuery data. How does that fit into the framework of a hierarchical filesystem?

ddietterich commented 5 years ago

In its current form, DRS does not cover references to datasets or databases. I don't think we should try to jam that into the file-based DRS APIs. I do hope we can think through how to add those types of objects moving forward.

susheel commented 5 years ago

@geoffjentry What would the id be in the BigQuery use-case? I don't think DRS v1 should attempt to model structured databases. It would become too unwieldy. I agree with the spirit of your suggestion though.

mattions commented 5 years ago

Also, we should remember that we want to use DRS as inputs for WES. Now, one of the languages that WES accepts is CWL, which, from version 1.0 onward, accepts folders as inputs.

So we need a way to say that this bundle, when de-serialized, goes into this type of folder structure; otherwise we are not building this to solve one of the major use cases for which we are building DRS.

geoffjentry commented 5 years ago

@susheel I think your id example is at the heart of what I'm getting at. We look at id as being isomorphic with "file name", but it's really just a pointer to something which can be resolved to bytes. I'll concede the point which @ddietterich makes that for now such matters can be considered out of scope.

I do feel strongly that efforts like GA4GH, WDL, CWL, etc should be pushing people to think beyond the unix command line model of computing but I'm willing to wait for a different day to die on that hill :)

geoffjentry commented 5 years ago

@mattions You are correct as both supported languages of WES (WDL and CWL) do indeed provide a Directory type

My statement wasn't that we should not allow for the description of hierarchical structure; rather, it was a suggestion to reframe the discussion so as not to lock in to just the traditional models. However, as I suggested above, I'm happy to tilt at this windmill another day.

dglazer commented 5 years ago

Closing, now that #244 is merged