Closed rishidev closed 5 years ago
@susheel and @delagoya are interested in working on this
@sarpera and SBG are interested in folders and how they relate to this
From the perspective of CWL / WES, one thing that needs to be clarified is how a DRS collection is intended to be materialized to a directory. When a DRS collection is used as input to a workflow:
Thanks @tetron, this is a good list to start with. Before we discuss each point, I want to confirm some agreements that were proposed during the Boston meeting:
GET operation on a collection returns the immediate children (e.g. sub-collections require additional API operations to unroll).
3 & 4. That's what we agreed to, but I don't think we fully explored the alternatives.
This is the representation of a bundle in the current spec:
object_ids: [
"drs://object1",
"drs://object2",
"drs://bundle3",
]
In order to materialize these into a directory, you would use the name field of each object or bundle. In order for the collection to be immutable, the name field of each data object or bundle would also have to be immutable, which means that if you want to rename a data object, you have to create a copy.
An alternate representation, where the collection assigns names to objects:
contents: {
"/foo.bam": "drs://object1",
"/foo.bai": "drs://object2",
"/subdir/bar.bam": "drs://object3",
"/subdir/bar.bai": "drs://object4"
}
Instead of sub-collections, a collection represents the entire directory tree. This eliminates the possibility of circular dependencies by collections including themselves. Fetching the collection record gives you more information up front about the contents of the collection, without having to fetch each object record separately.
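A minimal sketch of how a client might materialize this path-keyed representation into a directory layout. The function name directory_tree and the in-memory dict are assumptions for illustration only; nothing here is part of the DRS spec.

```python
from pathlib import PurePosixPath


def directory_tree(contents):
    """Turn a {path: drs_uri} map into a nested dict mirroring the directories.

    Hypothetical helper: the path-keyed "contents" map is the proposal from
    this thread, not an agreed schema.
    """
    tree = {}
    for path, uri in contents.items():
        # PurePosixPath("/a/b").parts is ("/", "a", "b"); drop the root marker.
        parts = [p for p in PurePosixPath(path).parts if p != "/"]
        node = tree
        for part in parts[:-1]:
            child = node.setdefault(part, {})
            if not isinstance(child, dict):
                raise ValueError(f"{part!r} is used as both a file and a directory")
            node = child
        node[parts[-1]] = uri
    return tree


contents = {
    "/foo.bam": "drs://object1",
    "/foo.bai": "drs://object2",
    "/subdir/bar.bam": "drs://object3",
    "/subdir/bar.bai": "drs://object4",
}
tree = directory_tree(contents)
```

Because the whole tree is in one record, the client needs no further requests to learn the layout; it only needs the object records when it fetches the bytes.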
@delagoya - Yes I think that was what was discussed in Boston. 1 & 2 are corollaries of each other.
Thanks, @tetron. I agree with the new representation. Yes that is what I was going for with the example presented at the hackathon.
@tetron I agree about circular dependencies, but fully realising an entire collection, e.g. ENA Release 138, would result in over 1.5K files, which would bloat the object list and end up requiring pagination. Hence the proposal of defining sub-collections (3) and the possibility of unrolling the collection (4).
@tetron A query param, e.g. ?expand=true, will unroll the subcollection into the fully realised version.
Here's a third option, halfway between 1 & 2.
contents: {
"foo.bam": "drs://object1",
"foo.bai": "drs://object2",
"subdir": "drs://bundle3",
}
The collection assigns names but '/' is disallowed. This still has the risk of circular includes, but (a) allows collections to assign their own names and (b) the collection record includes the names (without having to fetch each object separately to find out what the names are.)
@tetron I'm okay with that. So if we add the ?expand=true query parameter, will we end up with:
contents: {
"foo.bam": "drs://object1",
"foo.bai": "drs://object2",
"subdir/bar.bam": "drs://object3",
"subdir/bar.bai": "drs://object4"
}
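A rough sketch of what the ?expand=true unrolling could look like for the name-keyed representation, including a guard against the circular includes mentioned above. The in-memory store and the function name expand are hypothetical stand-ins for whatever the server actually uses.

```python
def expand(collection_id, collections, _seen=()):
    """Flatten a collection into {relative_path: object_uri}.

    `collections` is a hypothetical stand-in for the server's store:
    {collection_uri: {name: uri}}. Any uri that is not a key of it is
    treated as a plain data object.
    """
    if collection_id in _seen:
        raise ValueError(f"circular include: {collection_id}")
    flat = {}
    for name, uri in collections[collection_id].items():
        if "/" in name:
            raise ValueError(f"'/' is not allowed in names: {name!r}")
        if uri in collections:
            # Sub-collection: recurse and prefix each path with this name.
            nested = expand(uri, collections, _seen + (collection_id,))
            for sub_path, obj in nested.items():
                flat[f"{name}/{sub_path}"] = obj
        else:
            flat[name] = uri
    return flat


store = {
    "drs://XYZ890": {
        "foo.bam": "drs://object1",
        "foo.bai": "drs://object2",
        "subdir": "drs://bundle3",
    },
    "drs://bundle3": {"bar.bam": "drs://object3", "bar.bai": "drs://object4"},
}
flat = expand("drs://XYZ890", store)
```

With this approach '/' only ever appears in the expanded view; the stored names stay flat.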
@tetron @delagoya If this is the only difference between data objects and collections, I'm keen to propose that objects will have an empty or null contents object, so that both collections and objects can be managed with the same data schema. Examples below:
{
id: 'ABC123'
name: 'Object'
...
contents: {}
...
}
{
id: 'XYZ890'
name: 'Collection'
...
contents: {
"foo.bam": "drs://object1",
"foo.bai": "drs://object2",
"subdir/bar.bam": "drs://object3",
"subdir/bar.bai": "drs://object4"
}
...
}
Do you also want an additional flag in the combined schema, e.g. has_parts: true? I'm not too keen on this, as checking the has_parts flag and checking if contents is not null is basically the same operation.
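To illustrate why a separate flag would be redundant under the combined schema, here is a small sketch; is_collection is a hypothetical helper, and the records are plain dicts shaped like the examples above.

```python
def is_collection(record):
    """Treat a record as a collection when its contents is non-empty.

    Assumes the combined schema proposed above, where plain objects
    carry an empty or null contents.
    """
    return bool(record.get("contents"))


obj = {"id": "ABC123", "name": "Object", "contents": {}}
coll = {
    "id": "XYZ890",
    "name": "Collection",
    "contents": {"foo.bam": "drs://object1"},
}
```

One consequence worth noting: with this check an empty collection is indistinguishable from a plain object, which may or may not matter.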
I don't think we've established whether a collection has access methods. For some protocols (eg a site-specific nfs mount point) it might make sense, for others you have to grab individual objects.
A 4th option, you could also have a query parameter like "include_objects=true" to request that the object records are embedded in the collection record response. Then you don't have to access object records separately, and you have the name, metadata (size, checksum etc) immediately available.
Currently the response to a GET on drs://object1 may contain a name field which is used to define the write target. I assume that, in the case of a collection mapping, the client ignores the individual object's name field in favour of the target defined by the collection. Correct?
If so, to @susheel's question about one data schema - there are possible schema clashes like the one above, and I would like to keep them separate until we can work all the way through the collection and object schemas separately.
I do not like the idea of embedding / in the middle of a name in order to create directory hierarchies.
Wouldn't sub-collections handle the case where one wants to map the objects in a bundle into a POSIX-style FS w/ a directory hierarchy?
@delagoya Yes, it enables a collection owner to define the naming scheme of the objects separately from the actual name of the object.
@geoffjentry Yes, sub-collections should handle this, as in the original example. The forward-slash embedding will only occur when ?expand=true is set. See example below:
GET drs://server.com/XYZ890
{
id: 'XYZ890'
name: 'Collection'
...
contents: {
"foo.bam": "drs://object1",
"foo.bai": "drs://object2",
"subdir": "drs://collection1",
}
...
}
GET drs://server.com/XYZ890?expand=true
{
id: 'XYZ890'
name: 'Collection'
...
contents: {
"foo.bam": "drs://object1",
"foo.bai": "drs://object2",
"subdir/bar.bam": "drs://object3",
"subdir/bar.bai": "drs://object4"
}
...
}
OR GET drs://server.com/XYZ890?include_objects=true (via @tetron)
{
id: 'XYZ890'
name: 'Collection'
...
contents: {
"foo.bam": "drs://object1",
"foo.bai": "drs://object2",
"subdir": {
"id": "collection1",
"name": "subdir sub-collection",
...
"contents": {
"bar.bam": "drs://object3",
"bar.bai": "drs://object4"
}
...
}
}
...
}
My preference would be for GET drs://server.com/XYZ890?expand=true, purely because it keeps the JSON response less verbose. However, I could be convinced to use object embedding if the community wants to go in this direction.
I'm not sure yet what the decision was on which attributes get revealed for a single object, but I think we should be careful about embedding the name in the collection.
I know, for example, that some of the Driver projects do not want to reveal anything, not even the name. All the info can be revealed only to an authorised user.
It is still fuzzy how to do it, but I just want to make sure that this is known, and that a decision is made taking this into account.
@mattions I understand. If the user needs access privileges to view certain metadata fields, then that will be part of the access rights provided to the user out of band. The spec should not differentiate between a privileged and a non-privileged user - that should be handled outside of the DRS scope.
@susheel but to @mattions' point, the key to the contents collection should be the DRS id, not the target name.
Also I think that the whole "unroll" concept is just adding confusion to the server/client interaction. In the Hackathon we spent quite a bit of time discussing this issue and came down on the side of "server is simple, client expands sub-collections via additional API calls."
So my concern is latency: it is going to be very slow if a directory listing requires 10 or 100 or 1000 separate requests to fetch each object in the collection just to find out the name.
Object embedding seems like a clean solution:
{
id: 'XYZ890'
name: 'Collection'
...
contents: [
{"id": "drs://object1", "name": "foo.bam", ...},
{"id": "drs://object2", "name": "foo.bai", ...},
{"id": "drs://collection1", "name": "subdir", ...}
]
...
}
The "expand" or "embed" option could have three options: none, shallow, deep. Where "none" only has the "id" field, "shallow" has the embeds for the immediate children but not any of the subcollections, and "deep" expands every subcollection.
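A sketch of how a server might build the response for the three levels as defined just above. The records store, the render function, and the reduction of each embed to id and name are all assumptions made for illustration; a real record would also carry size, checksums, access info, and so on.

```python
def render(record_id, records, expand="none"):
    """Build a response body for the hypothetical expand=none|shallow|deep.

    `records` stands in for the server's store: {id: {"name": ..., "contents": [ids]}}.
    Records without a "contents" key are plain data objects.
    Cycle detection is omitted to keep the sketch short.
    """
    record = records[record_id]
    out = {"id": record_id, "name": record["name"]}
    children = record.get("contents")
    if children is None:
        return out  # plain data object, nothing to expand
    if expand == "none":
        # Only the ids of the immediate children.
        out["contents"] = [{"id": c} for c in children]
    elif expand == "shallow":
        # Embed immediate children, but do not descend into sub-collections.
        out["contents"] = [{"id": c, "name": records[c]["name"]} for c in children]
    elif expand == "deep":
        # Embed every child and recurse into every sub-collection.
        out["contents"] = [render(c, records, "deep") for c in children]
    else:
        raise ValueError(f"unknown expand value: {expand!r}")
    return out


records = {
    "drs://XYZ890": {"name": "Collection", "contents": ["drs://object1", "drs://collection1"]},
    "drs://object1": {"name": "foo.bam"},
    "drs://collection1": {"name": "subdir", "contents": ["drs://object3"]},
    "drs://object3": {"name": "bar.bam"},
}
```

In this sketch "shallow" answers a directory listing in one request, while "deep" trades response size for zero follow-up calls.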
@delagoya at the hackathon we discussed expanding subcollections, but not where the file names come from. Whether or not there is a "deep expand" option on subcollections, we haven't answered the more basic question of whether getting names/sizes/access info for files in a collection should take one request or many.
Regarding large lists and paging: I believe we need this for collections no matter what we do. Even with no expansion, someone could publish a collection that is a flat list of 100,000 objects.
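For the paging point, a minimal sketch of token-based paging over a flat contents list. The names page_size, page_token and next_page_token are assumptions borrowed from common API conventions, not something this thread or the spec has settled on.

```python
def list_page(child_ids, page_size=2, page_token=None):
    """Return one page of a collection's children (hypothetical server side)."""
    # The token here is just the start offset encoded as a string.
    start = int(page_token) if page_token else 0
    page = child_ids[start:start + page_size]
    next_start = start + page_size
    next_token = str(next_start) if next_start < len(child_ids) else None
    return {"contents": page, "next_page_token": next_token}


def iter_children(child_ids, page_size=2):
    """Walk every page until the server stops returning a token (client side)."""
    token = None
    while True:
        resp = list_page(child_ids, page_size, token)
        yield from resp["contents"]
        token = resp["next_page_token"]
        if token is None:
            break


ids = [f"drs://object{i}" for i in range(5)]
all_ids = list(iter_children(ids, page_size=2))
```

An offset token only stays meaningful if the collection is immutable, which matches the static-collection assumption discussed elsewhere in this thread.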
@delagoya @mattions My main concern is latency and the need to make multiple requests to fully materialise a collection. A typical collection from EBI will have over 1000 file objects.
We could have both id and name as part of the contents list and have multiple options for the expand parameter, where at one extreme we would show just the id and at the other we would have a deep expansion of the files.
Oops. Yep, what @tetron said :)
So it would look like:
GET drs://server.com/XYZ890 (default: ?expand=none)
{
id: 'XYZ890'
name: 'Collection'
...
contents: [
{"id": "drs://object1"},
{"id": "drs://object2"},
{"id": "drs://collection1"}
]
...
}
GET drs://server.com/XYZ890?expand=shallow
{
id: 'XYZ890'
name: 'Collection'
...
contents: [
{"id": "drs://object1", "name": "foo.bam", ...},
{"id": "drs://object2", "name": "foo.bai", ...},
{"id": "drs://collection1", "name": "subdir", ...
"contents": [
{"id": "drs://object3", "name": "bar.bam", ...},
{"id": "drs://object4", "name": "bar.bai", ...},
{"id": "drs://collection2", "name": "subsubdir", ...}
]
}
]
...
}
GET drs://server.com/XYZ890?expand=deep
{
id: 'XYZ890'
name: 'Collection'
...
contents: [
{"id": "drs://object1", "name": "foo.bam", ...},
{"id": "drs://object2", "name": "foo.bai", ...},
{"id": "drs://collection1", "name": "subdir", ...
"contents": [
{"id": "drs://object3", "name": "bar.bam", ...},
{"id": "drs://object4", "name": "bar.bai", ...},
{"id": "drs://collection2", "name": "subsubdir", ...
"contents": [
{"id": "drs://object5", "name": "baz.bam", ...},
{"id": "drs://object6", "name": "baz.bai", ...}
]
}
]
}
]
...
}
Is that a reasonable compromise?
@tetron @delagoya Are you happy for me to create a pull request out of this last comment?
@susheel go for it
@susheel @tetron do we assume DRS will provide information about a bundle consistently?
In other words, do we imagine DRS preserving the state of a bundle that represents a folder structure this way at any given time (by somehow storing that info)? Or will it dynamically re-create information about the nested files and folders in a directory, depending on its state at the time of the request for a bundle?
@sarpera For DRS v1, Collections will represent a static view of the nested file/folder hierarchy at the time of creation, specified by the version and/or (created|updated) timestamp.
Dynamically updating collections could still be offered by the service provider by using version=latest, which could point either to a static view or to a dynamically updated view, depending on the service provider's preference.
This will be more in the realm of the version semantics of objects/collections.
@sarpera @susheel I think we'd be doing ourselves a service if we stopped using concepts like directory/files, etc in our discussions - even if we know that some groups will use this to sit on top of directory/files. Framing the discussion in terms of POSIX-y filesystems will lead to artificial limitations of the API.
For instance, the question @sarpera is asking implies the potential for a standard directory concept. It's a pointer to some files, and those files can come, go, be edited, etc. Instead, the concept we've been discussing is really just "a pointer to an immutable set of bytes", albeit with some way of communicating the structure of those bytes (e.g. multiple independent subsets of bytes, perhaps some knowledge of the relationship between those subsets of bytes, etc). The latter is far more flexible.
@geoffjentry @sarpera I agree. I was just using the POSIX-y terminology to explain the static nature of the current spec.
@geoffjentry you are totally right, thanks for pointing that out. Our general take on bundles had also been to see them as "a pointer to an immutable set of bytes". During the hackathon, thanks to everyone's contribution, we figured it "could" be used to solve a practical problem where a directory is an input or an output for a CWL workflow.
Going towards the idea of using DRS urls in WES spec, this would help the drivers adapt and map the required use cases for running CWL using GA4GH APIs.
Suggestions from @susheel and @tetron do a great job providing a possible solution for that case, while not betraying the intended use case for bundles.
I disagree with @geoffjentry. I think we do our users a disservice by re-inventing the wheel. The file system hierarchy is what is understood and processed by existing tools and applications. What is the motivation to invent something different? I would like a clear set of use cases from drivers that motivate diverging from the collection == directory model.
@ddietterich Example - in the past we've talked about providing pointers to e.g. BigQuery data. How does that fit into the framework of a hierarchical filesystem?
In its current form, DRS does not cover references to datasets or databases. I don't think we should try to jam that in to the file-based DRS APIs. I do hope we can think through how to add those types of objects moving forward.
@geoffjentry What would the id be in the BigQuery use-case? I don't think DRS v1 should attempt to model structured databases. It would become too unwieldy. I agree with the spirit of your suggestion though.
Also we should remember that we want to use DRS as inputs for WES. Now, one of the languages that WES accepts is CWL, which, from version 1.0 onward, accepts folders as inputs.
So we need a way to say that this bundle, when de-serialized, goes into this type of folder structure; otherwise we are not solving one of the major use cases for which we are building DRS.
@susheel I think your id example is at the heart of what I'm getting at. We look at id as being isomorphic with "file name", but it's really just a pointer to something which can be resolved to bytes. I'll concede the point which @ddietterich makes that for now such matters can be considered out of scope.
I do feel strongly that efforts like GA4GH, WDL, CWL, etc should be pushing people to think beyond the unix command line model of computing but I'm willing to wait for a different day to die on that hill :)
@mattions You are correct, as both supported languages of WES (WDL and CWL) do indeed provide a Directory type.
My statement wasn't that we should not allow for the description of hierarchical structure; rather, it was a suggestion to reframe the discussion so that we do not lock in to just the traditional models. However, as I suggested above, I'm happy to tilt at this windmill another day.
Closing, now that #244 is merged
Some points to discuss here: How similar/different to object methods? Distinguished endpoints for each or different?