Closed sarpera closed 5 years ago
@bwalsh @dglazer we have written up, in this issue, the proposal we discussed yesterday at the hackathon.
@susheel it would be great to know more about the FTP side, because we are not encountering that in our case. It would also be good if @philloooo could double-check that this matches what we wrote on the whiteboard. (Got Phillis's GitHub handle right :)
@mattions - all of the above looks reasonable. Can you provide a link to the document where X-DRS-TOKEN: <TOKEN> originates?
Hi @bwalsh,
I was aiming to tag @briandoconnor, who was present at the hackathon. My mistake, but it's great that you like it; it means we actually wrote it in a clear way :).
The token will be handled by the DRS server, so the server may offer a way to log in, or login may be a third-party service.
As for links, my guess is that they will be available once the DRS is up.
@mattions Just to make sure it's clear - as per this proposal, the use of signed URLs will be codified as the manner in which one obtains the bytes for an object?
Curious why the plan is to name the header X-DRS-TOKEN rather than Authorization, as per the HTTP specification: https://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.8
If the token is a bearer token then the header would look like: Authorization: Bearer <token>.
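For illustration only, here is a minimal Python sketch of a metadata request that carries the token in the standard Authorization header as a bearer token. The base URL, object id and token value are placeholders, and the helper name build_metadata_request is hypothetical; nothing here is defined by the proposal.

```python
import urllib.request


def build_metadata_request(base_url: str, object_id: str, token: str) -> urllib.request.Request:
    """Build (but do not send) a GET /objects/<id> request with a bearer token."""
    req = urllib.request.Request(f"{base_url}/objects/{object_id}", method="GET")
    # Standard HTTP auth header instead of a custom X-DRS-TOKEN header.
    req.add_header("Authorization", f"Bearer {token}")
    return req


# Placeholder values, for illustration only.
req = build_metadata_request("https://example.com/ga4gh/drs/v1", "foo", "my-token")
print(req.get_method(), req.full_url)
print(req.get_header("Authorization"))
```

The request is only constructed, not sent, so the sketch makes no assumption about how a real DRS server would respond.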
Also, I wonder if it might be too limiting to say that the GET /objects/<id> API does not require authorization. Could there be cases in which the DRS provider only wants to share the existence of a resource with authorized parties?
@brucehoff My assumption was that it was to discern between "I have authZ to access the endpoint" (i.e. Authorization) and "Here is my authZ for the file", which might not be the same thing. I could be quite wrong, however.
That ties in w/ your second question, as I picture the use of Authorization as being ubiquitous across all endpoints.
@geoffjentry, yes.
@brucehoff, it could very well be named Authorization; we just wanted to point out that it should be an agreed name/convention we all follow. +1 for the bearer token case.
To clarify, we also pointed out at the hackathon that GET /objects/<id> could perfectly well require authz if the object in question has an auth layer in front of its basic metadata (name, size, urls etc). There are cases where an object's metadata is controlled-access or private, and for those cases users can pass the same HTTP request header (Authorization: bearer <token> if it's a bearer token). GET /objects/<id> could require that header in some cases.
@geoffjentry does the paragraph above align with your thoughts?
Just fixing my tagging: I want @philloooo to chime in as well :)
@mattions I think so. I can totally see the case where access to the API and access to the file aren't the same and wouldn't be against setting up a structure for that (but perhaps w/ simplifying defaults, e.g. if just Authorization is passed, just use that).
NB I'm not advocating for that structure, so if others disagree so be it. But at least it doesn't bug me :)
@mattions when you said "yes" was that about the presigned URL question or the authz part?
If the presigned URLs, I didn't feel like we had reached consensus that presigned URLs should be required. If I'm wrong, ignore me :)
Can I suggest for the response to:
GET /objects/<id>
That we do something like this:
{
"object": {
"id": "string",
"name": "string",
"size": "string",
...
"uris": [{
"uri": "s3://<foo>/<bar>.bam",
"metadata": {
"region": "us-east-1",
"provider": "s3.amazonaws.com"
}
},
{
"uri": "gs://<foo>/<bar>.bam",
"metadata": {
"region": "us-west1",
"provider": "storage.googleapis.com"
}
},
{
"uri": "s3://<foo>/<bar>.bam",
"metadata": {
"region": "us-east1",
"provider": "onprem.objectstorage.example.com"
}
},
{
"uri": "ftp://foo.example.com/bar.bam"
},
{
"uri": "drs://<id>",
"metadata": {
"provider": "drs.example.com"
}
}
]
}
}
Specifically, I would leave the aliases array out of this response and make that a different endpoint. I'm not sure there's a use case for it here and it adds a potentially expensive join.
@zflamig Part of the idea here was to specifically separate out the download URL from the metadata, as per discussions at the face to face earlier this week
That's only the metadata for the URI @geoffjentry, to help in resolving it. The original proposal includes this but it isn't clean or extensible. If a new URI appears we would have to amend the spec to add a new class for it...
@geoffjentry
Just to make sure it's clear - as per this proposal, the use of signed URLs will be codified as the manner in which one obtains the bytes for an object?
/objects/<id>/download?foo=bar should return "url-to-bytes" in the following use-case:
The main use case is to provide access to a single file in a private bucket in a cloud environment, by URL signing, without giving access to the whole bucket, so that the consumer can access the raw data file bytes by passing a token in the request header.
If the urls returned from /objects/<id> do not need to be signed, or you already have direct bucket access, or the urls returned are open/public access URIs, you wouldn't need to call /objects/<id>/download?foo=bar.
For the other cases, if any, where the returned urls are not ready to give access to bytes, let's add those use cases here to figure out if they can also be solved via /objects/<id>/download?foo=bar.
@zflamig the structure of urls is open for discussion. The reason why we wanted to key-value pair them was to have a separate model in swagger for a typed url, e.g. for all cloud URLs use CloudURLModel etc. Also, for future use cases where you would query /objects/?url_type=ftp, if that's relevant. Regardless, your suggestion also looks clean and good to me.
@geoffjentry I think @zflamig meant that the aliases array should not be a part of /objects/<id>, not the urls array. Our proposal suggests having urls as part of /objects/<id>.
@zflamig about the aliases part, do you support the use cases where aliases can be used for querying? E.g. /objects/?alias=doi://<id>. We should take that discussion to a separate issue though, please feel free to create one.
I second having a provider field (which is the server domain or root endpoint) for the uri type that's a drs uri, because I don't think we agreed on the DRS uri format being restrictively drs://<server>/...
Thanks for starting this discussion @sarpera . A few comments, some of which were partially raised by others, but I'm not sure I understand where they landed:
1) process question -- have you created a PR to go with this issue, or are you waiting for the discussion to settle down first? Either way can work, since prose can be easier for high-level design, but we won't be able to nail down the details until we get to code.
2) I agree we should delete aliases from this issue, and discuss it separately -- there are enough moving parts here already.
3) Re signed URLs -- we definitely don't want to require their use, since some DRS implementors said they won't be using them. That means that sometimes when callers want to fetch bytes, they'll have to pass an auth token when calling the URL_TO_BYTES. I think the spec for that is straightforward, and similar to language you use elsewhere -- something like:
Some access methods require an auth token to fetch bytes. It is the client's responsibility to obtain that token, using a procedure documented by the DRS implementor. (Note that a single token can often be used to fetch multiple objects from a single source.)
4) I think you're suggesting that callers only use the /download method when they need to generate a signed URL, and that they just fetch the bytes directly if not? I guess that works, but it feels confusing to me, requires special knowledge about different access methods, and uses uri-like-strings that aren't used to actually fetch data. Instead, I was picturing:
a) the first method (GET /object/<id>) takes an <id> as input and returns an array of access-methods, which include info about the type of access (e.g. "ftp", "GCP us-east", "AWS us-west"), but don't necessarily include a URI
b) the second method (GET /object/<id>/download) takes an <id> and an <access-method> as input, and returns an URL_TO_BYTES uri.
c) As an optional shortcut, if the DRS implementor chooses, they can include an URL_TO_BYTES uri in individual records returned by the first method, which lets callers skip a step. Implementors would be likely to do that for public content, and for content that can be accessed with a previously-obtained auth token (e.g. gs: and s3:), and unlikely to do that for signed URLs. But we don't have to bake that knowledge into the protocol or the calling code -- the rule for callers is "always use the URI to get bytes; if you don't get one from the first method, ask for one using the second method".
5) I agree that DRS implementors can choose to require an auth token before they respond to a GET request for metadata, and they can return an "access denied" if appropriate. We should document that somewhere, as we do for WES, but I think that auth policy and those tokens are completely separate from the policy and tokens used to call the /download method, or passed to the URL_TO_BYTES uri.
6) I agree with @brucehoff that the auth token passed to /download feels like standard HTTP auth, and should be a standard HTTP auth token. (Since the only question is whether you're allowed to fetch the bytes; there's no use case I can imagine for "it's okay to call the download method but it's not okay to download the object".)
Minor points -- these can wait until we get the big picture settled:
7) Why would a call to DRS return an access-method of type DRS? (As in your fourth example.) I'm not following the use case that addresses.
8) Are you proposing that id, name, and size are the only metadata fields we support? Or is that just a placeholder in this issue, and we can figure out the actual fields in a separate issue?
9) /download feels a little off to me, especially for cloud-to-cloud use cases. Maybe /bytes or /body instead? I'm open.
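To make the caller-side rule from point 4 concrete ("always use the URI to get bytes; if you don't get one from the first method, ask for one using the second method"), here is a rough Python sketch. The record keys "type" and "uri", the function name uri_to_bytes, and the fake second-method callback are all assumptions made for illustration; the thread has not settled on field names.

```python
from typing import Callable


def uri_to_bytes(object_id: str, method: dict,
                 ask_download: Callable[[str, str], str]) -> str:
    """Return a URI the caller can fetch bytes from.

    method is one record from the first call (GET /object/<id>).
    ask_download stands in for the second call (GET /object/<id>/download).
    """
    uri = method.get("uri")
    if uri is not None:
        # Shortcut: the implementor already included a usable URI.
        return uri
    # No URI in the record, so ask the second method for one.
    return ask_download(object_id, method["type"])


def fake_download(object_id: str, access_type: str) -> str:
    # Hypothetical stand-in for a server that signs URLs on request.
    return f"https://signed.example.com/{object_id}?via={access_type}"


public = {"type": "ftp", "uri": "ftp://foo.example.com/bar.bam"}
private = {"type": "s3"}
print(uri_to_bytes("foo", public, fake_download))
print(uri_to_bytes("foo", private, fake_download))
```

The calling code never needs to know which access types use signed URLs; it only checks whether a URI came back from the first call.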
@sarpera Ahhhh. I see what you were trying to accomplish now. I am okay with your method if we break it up and use the protocol as the high level container. So like
"urls": {
"s3": [
{
"uri": "s3://<foo>/<bar>.bam",
"region": "us-east-1",
"provider": "aws"
}
],
"gs": [
{
"uri": "gs://<foo>/<bar>.bam",
"region": "us-west1",
"provider": "google"
}
],
"ftp": [
{
"uri": "ftp://foo.com/bar.bam"
}
],
"drs": [
{
"uri": "drs://foo.com/objects/<id>"
}
]
}
This way we can make it easier for future additions... if you have a new protocol you are free to require whatever metadata you want.
@dglazer re: #7 on your list: the use case there is for data bundles, so you can have one DRS url that dereferences to a group of DRS urls. For example, during the cohort creation process a GUID/DRS entry may be minted to represent the data that the user selected.
@zflamig thanks, it gets better with every iteration. +1 for your suggestion.
@dglazer
- process question -- have you created a PR to go with this issue, or are you waiting for the discussion to settle down first? Either way can work, since prose can be easier for high-level design, but we won't be able to nail down the details until we get to code.
I haven't created a PR for exactly the same reason you pointed out. And you're totally right that the details won't be clear without the code changes. I just wanted to discuss with the group some more, as the suggestions are really helping so far.
- Re signed URLs -- we definitely don't want to require their use, since some DRS implementors said they won't be using them. That means that sometimes when callers want to fetch bytes, they'll have to pass an auth token when calling the URL_TO_BYTES. I think the spec for that is straightforward, and similar to language you use elsewhere -- something like:
Yes, the callers would pass an auth token when calling the URL_TO_BYTES. Nothing really changes for those who don't have this use-case. Ideally, the token would be the same across all the endpoints on a DRS server, even more ideally all DRS implementations would have the same means of obtaining it. There is a related issue.
- I think you're suggesting that callers only use the /download method when they need to generate a signed URL, and that they just fetch the bytes directly if not? I guess that works, but it feels confusing to me, requires special knowledge about different access methods, and uses uri-like-strings that aren't used to actually fetch data. Instead, I was picturing:
I guess what I was trying to say was along these lines:
GET /object/<id> returns a list of urls/access-points, regardless of whether or not those URIs are ready to be consumed. Here is the reasoning:
1- If some of the urls/access-points are ready to consume for getting bytes (public urls), there is no need to call another endpoint.
2- For cloud URIs, the URIs returned from GET /object/<id> are enough if the consumer has direct bucket access privileges for the private bucket in question. They also wouldn't need to sign the url. So it adds value to have those URIs there for this case.
3- Having GET /object/<id>/download?foo=bar primarily solves the case where a file belongs to a private bucket, and the consumer HAS TO sign a URL for that file in order to access it, without having direct access to the whole bucket. This is the majority of the use-cases for many datasets that have a controlled layer of access.
4- GET /object/<id>/download?foo=bar can be used for any other case where the urls returned from GET /object/<id> are not readily consumable.
To sum up the above, I do really agree with you that:
GET /object/<id> takes an <id> as input and returns an array of access-methods
and
the second method (GET /object/<id>/download) takes an <id> and an <access-method> as input, and returns an URL_TO_BYTES uri.
sounds more reasonable. That way, the implementors MAY choose not to include the URIs in GET /object/<id> if there is no added benefit. But having the URIs there has a point, as I tried to explain above.
- I agree that DRS implementors can choose to require an auth token before they respond to a GET request for metadata, and they can return an "access denied" if appropriate. We should document that somewhere, as we do for WES, but I think that auth policy and those tokens are completely separate from the policy and tokens used to call the /download method, or passed to the URL_TO_BYTES uri.
+1 for this. There could be separate auth policies for both cases, and implementors MAY choose to have the same policy for both if it fits them. I guess this is not against your point.
- I agree with @brucehoff that the auth token passed to /download feels like standard HTTP auth, and should be a standard HTTP auth token. (Since the only question is whether you're allowed to fetch the bytes; there's no use case I can imagine for "it's okay to call the download method but it's not okay to download the object".)
+1 for standard HTTP auth token or a bearer token. One of the possible cases would be that your authz privileges might have been revoked or expired, but you already obtained a token. In that case, the flow is solid anyway, you would get a 403 on GET /object/<id>/download
- Why would a call to DRS return an access-method of type DRS? (As in your fourth example.) I'm not following the use case that addresses.
During the hackathon we were made aware that there are some implementors who will be using DRS as a data-registry service, without necessarily providing direct access to bytes, but instead pointing to another DRS server (via a DRS url) where the data can be accessed. Sort of like "linked DRS"es or a DRS of DRSes. @susheel could you please perhaps provide those use cases?
- Are you proposing that id, name, and size the only metadata fields we support? Or is that just a placeholder in this issue, and we can figure out the actual fields in a separate issue?
Oh no, it was just a placeholder since I didn't want to type every other metadata field, hence the "...". Sorry if it was misleading. Related to this, I'm all for bringing this to another issue where we define the required fields in GET /object/<id>, based on the break-out session on the 1st day of the hackathon. The current model on swagger seems out of date.
- /download feels a little off to me, especially for cloud-to-cloud use cases. Maybe /bytes or /body instead? I'm open.
Same here, open to any ideas. The most difficult part of building anything is naming it =) /bytes, /access?
From the GA4GH call today, @sarpera and @susheel discussed what happens with a DRS entry for an object when you call GET id/download... OK to not implement seems to be the consensus.
Seems like we need to clarify how "/download" works for the various URI types
@dglazer proposed get bytes URI, fetch bytes ID... for passing to the download method. so the ID -> URI
@sarpera is going to take this ticket and make a PR that explores what he and David talked about today... sort out the URL and the download in a single PR. @dglazer @rishidev and I will work out a process to bring this and other PRs up to vote via the active drivers
Need to clarify /download for non-cloud (legacy) data endpoints, e.g.:
ftp
gsiftp
globus
aspera
Wrapping up so far
Good to see that there is a general consensus on the main idea that accessing "object metadata" and "bytes to object" may be separate calls to DRS for the cases where an "action" is required to be performed to get access to bytes e.g passing an auth token to: sign a URL, generate url-to-bytes with credentials etc.
By doing so, I guess we all agree that the schema should remain generic, flexible and understandable, while still providing programmatically parsable responses for clients with different needs and use-cases. With that in mind, I tried to combine our ideas together and here's the outcome:
Object metadata:
GET https://example.com/ga4gh/drs/v1/objects/<id>
Returns metadata of an object, with the set of access-methods.
Object bytes:
GET https://example.com/ga4gh/drs/v1/objects/<id>/download/<access-method-id>
Returns a "uri-to-bytes" for a given access-method-id, if it exists.
Examples
GET https://example.com/ga4gh/drs/v1/objects/<id>
Response:
{
"object": {
"id": "foo",
"name": "bar.bam",
"size": "1234",
"urls": {
"s3": [
{
"uri": "s3://foo/bar.bam",
"region": "us-east-1",
"<access-method-id>": "s3-1"
}
],
"gs": [
{
"uri": "gs://<foo>/<bar>.bam",
"region": "us-west1",
"<access-method-id>": "gs-1"
}
],
"ftp": [
{
"uri": "ftp://foo.com/bar.bam"
}
],
"drs": [
{
"uri": "drs://foo.com/objects/<id>"
}
]
}
}
}
GET https://example.com/ga4gh/drs/v1/objects/<id>/download/<access-method-id>
with Request Header
Authorization: <string>
Response:
{ "uri": "<uri-to-bytes>" }
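As a rough illustration of the two-call flow described above, the Python sketch below builds the second URL from an object id and an access-method-id, and pulls the "uri" field out of a response body shaped like the example. The helper names and the sample signed URL are made up for this sketch, and the /download path segment is only the working name used in this comment.

```python
import json
from urllib.parse import quote


def download_url(base_url: str, object_id: str, access_method_id: str) -> str:
    """URL for the second call; both path segments are percent-encoded."""
    return (f"{base_url}/objects/{quote(object_id, safe='')}"
            f"/download/{quote(access_method_id, safe='')}")


def parse_download_response(body: str) -> str:
    """Extract the uri-to-bytes from a response like { "uri": "..." }."""
    return json.loads(body)["uri"]


url = download_url("https://example.com/ga4gh/drs/v1", "foo", "s3-1")
print(url)

# Hypothetical response body; a real server decides what the uri looks like.
sample_body = '{ "uri": "https://signed.example.com/bar.bam" }'
print(parse_download_response(sample_body))
```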
Let's break apart the suggested urls property of an object:
"urls": {
"<access-method>": [
{
"uri": "<string>",
"<access-method-specific-attr>*": "<value>"
}
]
}
where
<access-method> is an enumeration, e.g. s3 | gs | ftp | http | drs | gsiftp | globus | aspera
<access-method-specific-attr> is an attribute that belongs to a specific access-method, described in the schema model (can be multiple attributes per access-method).
Questions
Why have key-value paired access methods?
So that a specific <access-method> can have its own properties in the schema model.
E.g.: in the cloud scenarios, there is a huge added value in having the region information for a URI, whereas for other <access-method>s that property may be meaningless. The Swagger schema model should set the expectations for clients to consume this information programmatically.
Why is the value of <access-method> an array?
One access method can have multiple URIs with different attributes. E.g.: data duplicated in different regions on the same cloud provider.
In the example below, access-method s3 has two entries.
"urls": {
"s3": [
{
"uri": "s3://foo/bar.bam",
"region": "us-east-1",
"<access-method-id>": "s3-us"
},
{
"uri": "s3://baz/bar.bam",
"region": "eu-central-1",
"<access-method-id>": "s3-eu"
}
]
}
What if all the <access-method>s are public or readily consumable?
Then the <access-method> wouldn't have an <access-method-id> property to begin with. Any calls to a non-existing /download/<access-method-id> would return 400.
Example:
"urls": {
"ftp": [
{
"uri": "ftp://foo/bar.bam"
}
]
}
How to mint an <access-method-id>?
As long as it's url-encoded and unique per object, it can be any string value, up to the implementor.
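One possible way to mint such ids, sketched in Python purely as an example: number the entries within each access method and percent-encode the result, which yields ids like the "s3-1" and "gs-1" used above. The function name, the protocols_needing_id parameter and the numbering scheme are assumptions of this sketch; an implementor is free to use any scheme that is url-encoded and unique per object.

```python
from urllib.parse import quote


def mint_access_method_ids(urls: dict, protocols_needing_id: set) -> dict:
    """Return a copy of urls with an id added where a second call is needed.

    Ids look like "<access-method>-<n>", so they are unique within one object.
    """
    minted = {}
    for protocol, entries in urls.items():
        minted[protocol] = []
        for n, entry in enumerate(entries, start=1):
            new_entry = dict(entry)
            if protocol in protocols_needing_id:
                # Percent-encode so the id is safe to use as a URL path segment.
                new_entry["access-method-id"] = quote(f"{protocol}-{n}", safe="")
            minted[protocol].append(new_entry)
    return minted


urls = {
    "s3": [{"uri": "s3://foo/bar.bam"}, {"uri": "s3://baz/bar.bam"}],
    "ftp": [{"uri": "ftp://foo/bar.bam"}],
}
minted = mint_access_method_ids(urls, {"s3"})
print(minted)
```

Readily consumable entries (ftp here) get no id at all, matching the answer to the previous question.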
Help needed with naming things!
How to name the <access-method-id> property?
id? fetch-id? access-id?
Naming the suggested new path, currently "download"
Some people raised concerns about calling it download. Any ideas? bytes? fetch?
TODOs
Will make a PR with the suggested changes reflected on the swagger schema.
On the naming side I propose:
1. access-method-id: keep it as it is
2. swap download for access
so the URL will look like:
https://example.com/ga4gh/drs/v1/objects/<id>/access/<access-method-id>
So something like this:
{
  "object": {
    "id": "foo",
    "name": "bar.bam",
    "size": "1234",
    "urls": {
      "s3": [
        {
          "uri": "s3://foo/bar.bam",
          "region": "us-east-1",
          "access-method-id": "s3-1"
        }
      ],
      "gs": [
        {
          "uri": "gs://<foo>/<bar>.bam",
          "region": "us-west1",
          "access-method-id": "gs-1"
        }
      ]
    }
  }
}
will have the following allowed calls:
# for s3
GET https://example.com/ga4gh/drs/v1/objects/foo/access/s3-1
# for gs
GET https://example.com/ga4gh/drs/v1/objects/foo/access/gs-1
with Request Header
Authorization: <string>
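A small Python sketch of how a client might derive those allowed calls from the response, assuming the "urls" layout and "access-method-id" key shown in this comment. The helper name allowed_access_calls is hypothetical, and the gs bucket name below is a stand-in for the placeholder in the example.

```python
def allowed_access_calls(base_url: str, obj: dict) -> list:
    """List the /access URLs a client may call for one object."""
    calls = []
    for entries in obj["urls"].values():
        for entry in entries:
            access_id = entry.get("access-method-id")
            if access_id is not None:
                calls.append(f"{base_url}/objects/{obj['id']}/access/{access_id}")
    return calls


obj = {
    "id": "foo",
    "name": "bar.bam",
    "size": "1234",
    "urls": {
        "s3": [{"uri": "s3://foo/bar.bam", "region": "us-east-1",
                "access-method-id": "s3-1"}],
        "gs": [{"uri": "gs://foo/bar.bam", "region": "us-west1",
                "access-method-id": "gs-1"}],
    },
}
calls = allowed_access_calls("https://example.com/ga4gh/drs/v1", obj)
for call in calls:
    print("GET", call)
```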
If we have consensus, we can move forward with the PR.
_[updated to rename uri to access_uri, and to use underscores]_
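Thank you @sarpera for incorporating everyone's input -- I personally think this is very close, and ready to move to a PR for final discussion. I'm fine with /access as the URL for the second method. I have a few suggestions on the response to the first method, GET /objects/<id>:
- I suggest not returning a uri unless you can actually use it to fetch bytes. That means the DRS implementor can choose what kinds of access are supported -- if they return a uri the caller uses it to fetch bytes directly (and is responsible for knowing what if any auth tokens they need to pass), if they return an access_method_id the caller uses it to call the /access method, and if they return both (which I think will be rare) the caller can choose.
- I suggest renaming urls to access_methods, which better matches the actual array elements.
- To support access methods that have multiple entries (as in your S3 example), I slightly prefer flattening things by making the top-level an array, and allowing multiple entries for any given type of access. But that's just taste; I'll go with whatever most people think feels natural.
- I slightly prefer access_id to access_method_id, just because it's shorter. Again, I'll go with the sense of the crowd.
- I suggest renaming uri to access_uri, which makes it very clear what it's for. (And is nicely parallel to access_id.)
Incorporating my proposals, your example would look like: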
"access_methods": [
"s3": { # there's no uri, meaning the caller has to call /access before fetching bytes
"region": "us-east-1",
"access_id": "s3-us"
},
"s3": {
"region": "eu-central-1",
"access_id": "s3-eu"
},
"gs": { # callers can either fetch bytes directly from the access_uri or use the access_id to get a direct uri
"region": "us-west1",
"access_uri": "gs://foo/bar.bam",
"access_id": "gs-1"
},
"ftp": { # there's no access_id, meaning the caller has to fetch the bytes directly
"access_uri": "ftp://foo.com/bar.bam"
}
]
Thanks @dglazer, I'm also happy to see that the PR I'm setting up was actually pretty close to your input.
@mattions: 2. swap download for access
@dglazer: I'm fine with /access as the URL for the second method.
I agree. I already used /access as the new path in my local changes for the PR.
- I suggest renaming urls to access-methods, which better matches the actual array elements.
I agree. In my local git changes I already set it to be access_methods, following the naming convention in the yaml.
- To support access methods that have multiple entries (as in your S3 example), I slightly prefer flattening things by making the top-level an array, and allowing multiple entries for any given type of access. But that's just taste; I'll go with whatever most people think feels natural.
After diving into the yaml code, I also figured that having an array will make things a bit easier to describe via swagger v2.0. It also opens future possibilities to make the items in the array more searchable in a uniform way. I'm swaying away from the s3: {foo: bar} idea; it seems like it won't be a general-enough solution to encompass future cases, e.g. Azure, Ali cloud, and whatever other protocol/pseudo-protocol people might use to point to their data. @zflamig also pointed out the issues with the extensibility aspect of it, and I agree with that. See the notes below.
- I slightly prefer access-id to access-method-id, just because it's shorter. Again, I'll go with the sense of the crowd.
I agree. I already used access_id in my local changes for the PR.
- I suggest not returning a uri unless you can actually use it to fetch bytes. That means the DRS implementor can choose what kinds of access are supported -- if they return a uri the caller uses it to fetch bytes directly (and is responsible for knowing what if any auth tokens they need to pass), if they return an access-method-id the caller uses it to call the /access method, and if they return both (which I think will be rare) the caller can choose.
As you mentioned, there are use cases (perhaps rare) where you might have both uri and access_id at the same time. Example case: a controlled-access file stored in a cloud bucket will have a URI s3://foo/bar.bam, which might be enough for a user who has direct bucket access, but those who don't have direct bucket access will rely on signing a URL via access_id. But the schema would require the access_id property, to be used in the /access/<access_id> path.
GET /objects/{object_id}/access/{access_id} via Authorization request header
Retrieve a URL to access the bytes of a (controlled-access) Object
Response:
{
"url": "string"
}
GET /objects/{object_id}
Response:
{
"object": {
"id*": "string",
"name*": "string",
# ... rest of the properties
"access_level*": "open | controlled",
"access_methods*": [
{
"uri*": "string",
"access_id*": "string",
"cloud_metadata": {
"region*": "string",
"provider*": "string"
},
"protocol*": "string"
}
]
}
}
Some notes:
access_level
It would help to be programmatically explicit about the access level for a data object, to set expectations on accessing bytes. Since all of the access_methods would conform to this, it could be a property of a data object.
Values could be enumerated as either open or controlled.
@dglazer, I guess it also relates to this issue: https://github.com/ga4gh/data-repository-service-schemas/issues/25
uri
Does anyone have a strong opinion not to have this field as required?
access_id
URL-encoded identifier for an access point, to be used in the /access/<access_id> path.
cloud_metadata
Optional field for cloud access methods. The motivation for moving this into an object is to have a more modular model for an access_method, as well as to allow extending child properties for future needs, e.g. zone, egress cost details etc.
region
It won't make sense to enumerate values since they are cloud-provider dependent. Though the format should somehow follow the naming convention of a particular provider, e.g. us-west-1 in aws vs us-west1 in gcp. Will investigate to see if there is a formal spec for formatting this.
provider
This would help findability for the cases where an object has a lot of mirrors in the cloud, and also set expectations for the region format. E.g. ali, aws, gcp, azure etc.
protocol
Wasn't sure how to name this right. The idea is to enumerate URI "types" of access methods in order to set expectations programmatically. E.g. ftp, s3, gs, http, https, drs, globus etc.
Values won't be likely to conform to rfc7595, since we have cases for using pseudo-protocols.
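To show what "setting expectations programmatically" could look like on the client side, here is a minimal Python check of the draft model above. It assumes the asterisks in the sketch mark required fields and that cloud_metadata (unstarred) is optional; the function name validate_object and the error strings are invented for this example, and the function takes the inner "object" value rather than the whole response.

```python
ACCESS_LEVELS = {"open", "controlled"}


def validate_object(obj: dict) -> list:
    """Return a list of problems found; an empty list means the draft rules pass."""
    errors = []
    for field in ("id", "name", "access_level", "access_methods"):
        if field not in obj:
            errors.append(f"missing {field}")
    if "access_level" in obj and obj["access_level"] not in ACCESS_LEVELS:
        errors.append(f"bad access_level: {obj['access_level']}")
    for i, method in enumerate(obj.get("access_methods", [])):
        for field in ("uri", "access_id", "protocol"):
            if field not in method:
                errors.append(f"access_methods[{i}]: missing {field}")
        cloud = method.get("cloud_metadata")
        if cloud is not None:
            # cloud_metadata is optional, but region and provider are starred inside it.
            for field in ("region", "provider"):
                if field not in cloud:
                    errors.append(f"access_methods[{i}].cloud_metadata: missing {field}")
    return errors


good = {
    "id": "foo", "name": "bar.bam", "access_level": "controlled",
    "access_methods": [{
        "uri": "s3://foo/bar.bam", "access_id": "s3-1", "protocol": "s3",
        "cloud_metadata": {"region": "us-east-1", "provider": "aws"},
    }],
}
bad = {
    "id": "foo", "name": "bar.bam", "access_level": "secret",
    "access_methods": [{"uri": "ftp://foo.com/bar.bam", "access_id": "ftp-1"}],
}
print(validate_object(good))
print(validate_object(bad))
```

Whether uri should really be required is still an open question in this thread, so that rule in particular is only a reflection of the draft as written.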
Great discussion. It is rewarding to see this work move forward.
Consumers who need to answer 'what data is closest to me?' or 'where should I execute this pipeline?' can leverage the provider/region properties to answer these and other auction use cases. Long term, I'm convinced these use cases will lower cost.
How can we encourage implementors to populate these fields? As I understand the schema, an implementor could conform to the spec and never populate them. i.e. Should 'cloud_metadata' be mandatory for certain access methods [s3, gs,...]?
Also, I'm assuming the checksums object is part of '# ... rest of the properties' ?
Forgive me if I've missed it, but are there formal dependencies to a (probably separate) Search Service to query this data?
BTW, I always thought that urls was misnamed, nice to see it morph to access_methods.
@sarpera I'd like to repeat a point that Zac and I mentioned in previous comments: for the drs access_method, it needs a provider field so we are not limiting the drs uri to put the hostname in the identifier.
{
"uri": "drs://<someid>",
"provider": "drs.example.org"
}
I think we need to put a DNS name in the DRS URI. Otherwise, we have to get into the business of service resolution. I don't have much appetite for boiling that ocean.
@sarpera -- glad we're converging. Happy to hash out the remaining details in the PR, but in case it's helpful here are a few thoughts on your latest comment:
- [...] access_type for protocol? Then every method would have an access_type and at least one of an access_uri and/or access_id.
- I don't understand why we need a provider field in addition to access_type -- when would those be different? (e.g. I think provider = "AWS" if and only if access_type="s3".)
- uri -- I feel pretty strongly it shouldn't be required for methods where the caller can't use it (e.g. for unsigned s3 uris in an implementation where only signed urls are allowed).
- access_level -- I'd prefer to leave that out of the spec, at least for now -- I expect different implementations will have a wide range of needs here, so it's a good area for pre-standard experimentation. (But I'd be interested to hear which of our implementors expect they would use it.)
- [...] access_type. (For example, we expect a region if the type is gs or s3, and not if it's ftp or aspera.)
- [...] cloud_metadata properties in a sub-object. Is the idea that we have strong opinions about a set of properties that will be shared across several access types (e.g. s3, gs, Ali, Azure), but won't be shared with other access types?
- [...] region property; types that don't need it ignore it.) At the other extreme we have a different fully-typed object for each access type, with different types often using the same name for some of their properties. (E.g. s3 and gs types would both be defined to include a region.)
- [...] x-property naming convention if we wanted.) That means that we don't have to get v1 perfect, and we can incorporate widely-used new properties in v2.
- [...] cloud_metadata.
Extending @dglazer's example, why can't we be explicit as below?
GET /objects/{object_id}
"access_methods": [
"s3": {
"access_id": "drs://server.com/access/s3-us",
"region": "us-east-1",
},
"s3": {
"access_id": "http://server.com/get_object/s3-eu",
"region": "eu-central-1",
},
"gs": { # callers can either fetch bytes directly from the access_uri or use the access_id to get a direct uri
"region": "us-west1",
"access_uri": "gs://foo/bar.bam",
"access_id": "drs://server.com/access/gs-1"
},
"ftp": { # there's no access_id, meaning the caller has to fetch the bytes directly
"access_uri": "ftp://foo.com/bar.bam"
}
]
This way implementors can also point the user to external services that implement the /access method.
I'm personally against baking the /access method into the DRS specification, but I may be in the minority and am happy to commit if the community goes this way. If we do, we need to agree on #214, as there will be many mechanisms to get signed URLs.
I'm still unclear how this will work for non-cloud private data URLs (FTP, GSIFTP, etc.). Could "contact": "John Doe <john@doe.com>" be added to each access_method class? Or can we have a pseudo access_id that returns the contact info for private FTP, GSIFTP, etc.?
@philloooo -- I suggest we split discussion of the drs access_method into a separate issue/PR; I don't think the details of that discussion will affect the outcome of this discussion.
@bwalsh , re a separate Search Service -- my mental model is that there aren't any formal dependencies, but there's an expectation that, once a search/discovery API is defined, many callers of DRS will use it to get the object ids they pass in to DRS.
@dglazer @sarpera
> I don't understand why we need a `provider` field in addition to `access_type` -- when would those be different? (e.g. I think provider = "AWS" if and only if access_type="s3".)
This is explicitly being driven by a GDC/DCF use case where we run on-premise object storage systems that use an S3-compatible API, so we need to know if the file is actually on Amazon or on our local object storage. We could potentially overload the region to capture it too, but just having a provider field seems cleaner.
Thanks @zflamig for the explanation -- my quick reaction is that's cleaner to represent as either a pseudo `region` (if from the caller's point of view it behaves exactly like S3 except the bytes are in a physically different place), or a different `access_type` (if the caller needs to be aware of more method-specific differences). Basically I'd rather keep the API itself simpler for the majority of callers, and have implementation-specific needs fit into implementation-specific extensions. Thoughts?
@dglazer In general I agree, but in practice, thinking about how clients might actually interact with this information, I feel like keeping it separate is easier. For example, a client that only knew how to support AWS S3 would have a very simple check on the `provider` to see if it's the AWS hostname. When using `region`s, they would have to know a list of all the current AWS regions to know if the listed `region` is real or not.
Ultimately, I'm happy either way so long as we agree to support this in some fashion. I just have a strong preference towards clients being able to write stable code that is easy to test and doesn't need to be updated when AWS adds new regions.
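To make the comparison concrete, here is a minimal sketch of the two client-side checks. It assumes a hypothetical access method represented as a dict with `provider` and `region` keys, and it uses `s3.amazonaws.com` as the AWS provider value (as in the examples elsewhere in this thread). The region set is deliberately incomplete; it only illustrates why a region-based check goes stale.

```python
# Hypothetical sketch: two ways a client that only speaks "real" AWS S3 could
# decide whether an access method is usable. Field names follow this thread.

AWS_PROVIDER = "s3.amazonaws.com"  # assumed provider value for AWS S3

# Illustrative and intentionally incomplete: this set would need updating
# every time AWS adds a region.
KNOWN_AWS_REGIONS = {"us-east-1", "us-west-1", "eu-central-1"}


def is_aws_by_provider(access_method: dict) -> bool:
    """Stable check: compare the provider against one well-known hostname."""
    return access_method.get("provider") == AWS_PROVIDER


def is_aws_by_region(access_method: dict) -> bool:
    """Fragile check: relies on the client knowing every AWS region."""
    return access_method.get("region") in KNOWN_AWS_REGIONS


# An on-premise S3-compatible store and a made-up, newly added AWS region.
on_prem = {"provider": "objstore.example.org", "region": "on-prem-1"}
new_aws = {"provider": AWS_PROVIDER, "region": "xx-new-9"}
```

With these inputs, the provider check still recognises `new_aws`, while the region check rejects it until the client's region list is updated.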
My $.02 the method, provider, and region should be separate. For example, it might be cheaper to transfer between two Google cloud regions than to transfer between AWS and Google cloud regions that are physically closer. (On the other hand, physically closer might be faster).
As I mentioned at the F2F, a client library that implements a preference matrix for deciding how to fetch data would probably help shed some light on the best way to represent this.
Thanks for all the feedback!
@bwalsh
> Consumers who need to answer 'what data is closest to me?' or 'where should I execute this pipeline?' can leverage the provider/region properties to answer these and other auction use cases. Long term, I'm convinced these use cases will lower cost.
Exactly, this is what drove us initially to use a defined language to describe the access methods.
> How can we encourage implementors to populate these fields? As I understand the schema, an implementor could conform to the spec and never populate them. i.e. Should 'cloud_metadata' be mandatory for certain access methods [s3, gs,...]?
One way to go about it is to have a strongly typed schema model per access method and enforce its required params. The schema models for those individual access methods should evolve organically as we iterate over more use cases in time.
> Also, I'm assuming the checksums object is part of '# ... rest of the properties' ?
Yes. Wanted to skip details since there is an issue for that already.
@zflamig @philloooo
SevenBridges also has the same use case as Zac defined for cloud URIs. I'd suggest adding `provider` only for cloud-related access methods. As for the DRS URLs, I agree with @dglazer that the resolution of a DRS URL might get tricky if it is coupled with `provider` info. For the cases Zac defined, any strong opinions against having a `provider` string in s3, gs etc? See the updated model below.
@susheel Yes, being explicit seems like where most of us align. I also agree that #214 needs to be agreed upon.
The strong case, at least for us, to push for `/access` is that our DRS server will be the same service that provides signed URLs on demand for private cloud resources in DRS. Avoiding this would make DRS URLs pointing to controlled-access data quite useless on their own. With the proper authN/Z in place, DRS would provide access to private/protected/public data in our case, making DRS URLs programmatically actionable, especially for WES scenarios.
> I'm still unclear how this will work for non-cloud private data URLs (FTP, GSIFTP, etc.). Could `"contact": "John Doe <john@doe.com>"` be added to each `access_method` class? Or can we have a pseudo `access_id` that returns the `contact info` for private FTP, GSIFTP, etc.?
It would help greatly if you could provide a complete use case for non-cloud private data URLs.
How is the data actually accessed after a user is given privileges? What happens when I contact the author and I'm given access somehow? Do I add my credentials to the URI of the ftp file? If we add contact info per access method, how would we utilise this programmatically?
Overall, it would be awesome to cover as many cases as possible with the idea behind `/access`, or to shape it differently based on the use cases.
@tetron
> My $.02 the method, provider, and region should be separate. For example, it might be cheaper to transfer between two Google cloud regions than to transfer between AWS and Google cloud regions that are physically closer. (On the other hand, physically closer might be faster).
Agreed. Since we seem to be going for strongly typed access methods, we could make the case for cloud-related methods to provide this information. Seven Bridges and @zflamig also have use cases for an explicit `provider` property. Note that it doesn't affect the original proposal, i.e. `/access/<access-id>`.
Schema
`Object` has an `access_methods` property, which is an array of `AccessMethod`s:
"access_methods": <AccessMethod>[]
where an `AccessMethod` is:
{
"<x>": <xAccessMethod>
}
DRS defines the values for `x` and their corresponding schema models, i.e. `xAccessMethod`, in the specification.
Example `AccessMethod`s:
{
"s3": {
"uri*": "string",
"access_id": "string",
"region*": "string",
"provider": "string",
"allowed_regions": [
"string"
]
}
}
{
"drs": {
"uri*": "string"
}
}
{
"ftp": {
"uri*": "string"
}
}
Example response of an object:
{
"object": {
"id": "1234",
"name": "bar.bam",
# ... rest of the properties
"access_methods": [
{
"s3": {
"uri": "s3://foo/bar.bam",
"access_id": "s3-1",
"region": "us-west-1",
"provider": "s3.amazonaws.com",
"allowed_regions": [
"us-west-1", "us-east-1"
]
}
},
{
"gs": {
"uri": "gs://foobaz/bar.bam",
"access_id": "gs-1",
"region": "us-central1",
"allowed_regions": [
"us-central1"
]
}
},
{
"ftp": {
"uri": "ftp://foo.org/baz/bar.bam"
}
},
{
"drs": {
"uri": "drs://some-other-drs.org/9876"
}
}
]
}
}
The initial idea of having strongly typed access methods seems to be favoured by most of us. Please note that with this approach, in order to add a new access method we'd need to define it and update the schema. It is of course expected that said `AccessMethod`s will gain more properties. Now that we seem to agree on being explicit about individual access methods, there is room for that. IMHO it adds value in the long run when it comes to setting expectations for the DRS consumer/client parsing this information.
Thoughts?
What is `allowed_regions`?
Why `{"s3": { ... } }` instead of `{"method": "s3", ...}`?
What is the difference between `uri` and `access_id`? I see they are different here but is there a reason it can't just provide `uri` when using the `/access/` endpoint?
For the `ftp` case, perhaps the `provider` should be the ftp host? I think it is okay if `region` is sometimes `null` but I think `provider` should always be filled in, even if it is just a hostname.
I am thinking about the client's decision matrix. I think we want a tuple of `(method, provider, region)` and the client assigns a preference or weight to each tuple based on (a) availability of credentials and (b) function of cost and expected transfer speed.
For the case where the DRS server can hand out a signed URL, it should indicate that (by filling in access_id?)
For the private access case, the client can have a table of credentials that correspond to various combinations of `(method, provider, region)` (could include wildcards.)
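One way such a decision matrix could look on the client side is sketched below. Everything here is hypothetical: the weights, the credential table, and the rule that a filled-in `access_id` means the server can hand out a signed URL (so no local credentials are needed) are assumptions made for illustration, not part of any agreed schema.

```python
from fnmatch import fnmatchcase

# Hypothetical preference table: (method, provider, region) patterns mapped to
# a weight. Higher is better; "*" is a wildcard. Values are illustrative only.
PREFERENCES = [
    (("s3", "s3.amazonaws.com", "us-east-1"), 10),
    (("s3", "s3.amazonaws.com", "*"), 5),
    (("gs", "*", "*"), 3),
    (("ftp", "*", "*"), 1),
]

# Hypothetical credential table: tuples for which this client holds credentials.
CREDENTIALS = [("s3", "s3.amazonaws.com", "*"), ("ftp", "*", "*")]


def _matches(pattern, key):
    """True if every element of key matches the corresponding pattern."""
    return all(fnmatchcase(value or "", pat) for pat, value in zip(pattern, key))


def score(access_method):
    """Weight of an access method; 0 means the client cannot or will not use it."""
    key = (
        access_method.get("method"),
        access_method.get("provider"),
        access_method.get("region"),
    )
    # Assumption: an access_id signals that the server can issue a signed URL.
    can_access = bool(access_method.get("access_id")) or any(
        _matches(cred, key) for cred in CREDENTIALS
    )
    if not can_access:
        return 0
    for pattern, weight in PREFERENCES:  # first matching row wins
        if _matches(pattern, key):
            return weight
    return 0


def pick(access_methods):
    """Return the highest-scoring usable access method, or None."""
    best = max(access_methods, key=score, default=None)
    return best if best is not None and score(best) > 0 else None
```

A real client would presumably derive the weights from cost and expected transfer speed rather than hard-coding them.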
@sarpera For the `ftp` AccessMethod, see example below
{
"method*": "ftp",
"provider*": "string",
"uri*": "string",
"region": "string",
"contact": "string"
}
Fully realised example:
{
"method": "ftp",
"provider": "ftp.ebi.ac.uk",
"uri": "ftp://anonymous:anonymous@ftp.ebi.ac.uk/dataset/path/file",
"region": null,
"contact": "Contact John Doe <john.doe@example.com>"
},
{
"method": "ftp",
"provider": "ftp-private.ebi.ac.uk",
"uri": "ftp://ftp-private.ebi.ac.uk/dataset/path/file",
"region": "ebi-hh",
"contact": "Contact Jane Doe <jane.doe@example.com>"
}
I'm guessing it would be the same for `gridftp` and `sftp`. Globus and Aspera will be a little complicated; I would need to think about this a little more.
@sarpera Do you see the possibility of having a `local` AccessMethod too? Example below:
{
"method*": "local",
"provider*": "string",
"uri*": "string",
"region": "string",
"contact": "string"
}
Fully realised example:
{
"method": "local",
"provider": "ebi-cluster.ebi.ac.uk",
"uri": "file://public/path/file",
"region": "ebi-hx",
"contact": "Contact John Doe <john.doe@example.com>"
},
{
"method": "local",
"provider": "ebi-yoda.ebi.ac.uk",
"uri": "file://private/path/file",
"region": "ebi-hh",
"contact": "Contact Jane Doe <jane.doe@example.com>"
}
@tetron
> What is `allowed_regions`?
Buckets can be set to incur outbound (egress) costs outside of their region within the same cloud provider. This provides more information in the decision-making process to pick the most appropriate mirror of the file. Perhaps not the best name for the attribute though.
> Why `{"s3": { ... } }` instead of `{"method": "s3", ...}`?
The former allows us to define a schema model per access method so that method-specific attributes can be defined and enforced for consistency. Happy to discuss if the same goal can be achieved in a different way.
> What is the difference between `uri` and `access_id`? I see they are different here but is there a reason it can't just provide `uri` when using the `/access/` endpoint?
@dglazer also made some points about it. There may be cases where, for a specific access method, the URI may not give any means of access, e.g. a file residing in a VPC where the only means of providing access to third parties is signing a URL via the `/access` method. We keep both attributes for cases like having both direct bucket access and the option to sign a URL on demand, depending on the consumer of the object. In those cases both attributes can provide access, so it is meaningful to have both.
Please also note that the cloud data owners may not want to (or be allowed to) expose their bucket names in the URIs, but may provide access via `/access/<access_id>` with the proper authZ.
We could pursue an additional capability: while keeping `/access/<access_id>`, also provide `/access?uri="<uri_string>"`. But IMVHO having a dedicated path like `/access/<access_id>` is much cleaner and more approachable considering the cases above.
> I am thinking about the client's decision matrix. I think we want a tuple of `(method, provider, region)` and the client assigns a preference or weight to each tuple based on (a) availability of credentials and (b) function of cost and expected transfer speed.
This is a very important point, and designing the individual access method attributes with that goal in mind would help us achieve it. I hope this aligns with your second question and the answer I tried to provide.
> For the case where the DRS server can hand out a signed URL, it should indicate that (by filling in access_id?)
Yes, exactly. Having that dedicated path `/access/<access_id>` will make this clear in the schema, since the attribute required to craft this path will be enforced. This also partially answers your third question. Happy to explore alternative approaches if this isn't intuitive.
> For the private access case, the client can have a table of credentials that correspond to various combinations of `(method, provider, region)` (could include wildcards.)
Could you please explain this a bit more with examples? Are you talking about discoverability of the available access methods based on existing client conditions?
@susheel thanks for the examples.
> For the `ftp` case, perhaps the `provider` should be the ftp host? I think it is okay if `region` is sometimes `null` but I think `provider` should always be filled in, even if it is just a hostname.
Based on the previous schema definitions I provided, each defined access method would have its own attributes based on its needs. So region would be null for ftp cases. Similarly, if the `provider` doesn't add any information, we don't need to add it for the ftp case. So we could do something like this:
"access_methods": [
{
"ftp": {
"uri": "ftp://anonymous:anonymous@ftp.ebi.ac.uk/dataset/path/file",
"contact": "Contact John Doe <john.doe@example.com>"
}
},
{
"ftp": {
"uri": "ftp://ftp-private.ebi.ac.uk/dataset/path/file",
"contact": "Contact John Doe <john.doe@example.com>"
}
}
]
> @sarpera Do you see the possibility of having a `local` AccessMethod too. Example below:
`local` could be a new access method then, I presume. Let's gather more use cases to define its attributes.
@sarpera For `ftp` I would agree with @tetron's previous comment:
> For the `ftp` case, perhaps the `provider` should be the ftp host? I think it is okay if `region` is sometimes `null` but I think `provider` should always be filled in, even if it is just a hostname.
Having a `provider` set for all `access_methods`, even if it is just a hostname, would make filtering easier, so I would add this to your example above.
`region` should be available to the `ftp` access method, e.g. `"region": "ebi-hx"`, or may optionally be set to null, e.g. when behind a load balancer.
@susheel making filtering easier by using `provider` is a solid point. I guess the only argument against it was that it would couple the provider and URI resolution for the DRS case. I'm all for having a `provider` field as long as it's not coupled with the means of access, which should be the job of the `uri` field or the `/access/<access_id>` method.
So updated example would be:
"access_methods": [
{
"ftp": {
"uri": "ftp://anonymous:anonymous@ftp.ebi.ac.uk/dataset/path/file",
"contact": "Contact John Doe <john.doe@example.com>",
"provider": "ftp.ebi.ac.uk"
}
},
{
"ftp": {
"uri": "ftp://ftp-private.ebi.ac.uk/dataset/path/file",
"contact": "Contact John Doe <john.doe@example.com>",
"provider": "ftp-private.ebi.ac.uk"
}
}
]
I feel like the `contact` property could be defined more strongly. Maybe more structured? We could be more explicit about the value to set expectations right. Perhaps a field for just an email as a value, or a field for ORCIDs? Just to avoid open-ended, vague string values.
> `region` should be available to the `ftp` access method, e.g. `"region": "ebi-hx"` or may be optionally set to null, e.g. when behind a loadbalancer.
Is `region` established terminology for ftp cases? In the cloud cases, the meaning of the property is quite well-established. Naive question: how would `region` affect decision making on picking the right access method for FTP-like URIs?
It seems the current version of OpenAPI (3.0.2) doesn't allow patternProperties. Unless we want to hardcode all available access methods (ftp, http, s3 etc) in the schema and pair them with a schema model (array of objects), the above approach won't work in practice.
@tetron
> Why `{"s3": { ... } }` instead of `{"method": "s3", ...}`?
Going back to this approach, it seems that with OpenAPI v3 this can be achieved while still enforcing certain properties per access method (region for cloud methods etc) by making use of `anyOf` when defining the `access_methods` array.
I'd like to recap our requirements so far about the access methods before adjusting them to v3.0.
access_methods
And so far our use cases for an access method are:
Required / Optional bare minimum properties are different:
| Use Case | URI | method | region | access_id | provider | contact |
|---|---|---|---|---|---|---|
| Cloud - open | R | R | R | O | O | O |
| Cloud - controlled/private | O | R | R | R | O | O |
| FTP, HTTP, globus, aspera etc | R | R | - | O | O | O |
| DRS of DRSes | R | R | - | O | O | O |
| Local | R | R | - | ? | O | O |
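As a rough illustration of how a client or server could check these rules outside the schema, here is a minimal sketch. The use-case keys (`cloud-open`, `static`, and so on) are names invented for this example; only the required ("R") columns of the table are transcribed.

```python
# Required ("R") properties per use case, transcribed from the table above.
# The dictionary keys are hypothetical labels, not part of any proposal.
REQUIRED = {
    "cloud-open": {"uri", "method", "region"},
    "cloud-controlled": {"method", "region", "access_id"},
    "static": {"uri", "method"},  # FTP, HTTP, globus, aspera etc
    "drs": {"uri", "method"},     # DRS of DRSes
    "local": {"uri", "method"},
}


def missing_properties(use_case, access_method):
    """Return the sorted list of required properties absent from access_method."""
    present = {key for key, value in access_method.items() if value is not None}
    return sorted(REQUIRED[use_case] - present)


controlled = {"access_id": "s3-1", "method": "s3", "region": "us-east-1"}
incomplete = {"uri": "s3://foo-open-bucket/file.name", "method": "s3"}

print(missing_properties("cloud-controlled", controlled))  # nothing missing
print(missing_properties("cloud-open", incomplete))        # region is missing
```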
Based on that, examples would look like:
Open access data:
"access_methods": [
{
"uri": "ftp://foo.example.com/file.name",
"method": "ftp",
"provider": "foo.example.com"
},
{
"uri": "s3://foo-open-bucket/file.name",
"method": "s3",
"provider": "s3.amazonaws.com",
"region": "us-east-1"
}
],
Controlled/private access data:
"access_methods": [
{
"uri": "ftp://bar.example.com/file.name",
"method": "ftp",
"provider": "bar.example.com",
"contact": "foo@example.com"
},
{
"access_id": "s3-1",
"method": "s3",
"region": "us-east-1"
},
{
"uri": "drs://foo.example.com/123",
"method": "drs"
}
],
Enforcing required/optional properties can be done via `anyOf` and enumerated values for the `method` property in the schema. (Please note that inheritance in OpenAPI does not allow overriding required properties, hence some redundancies below.)
AccessMethods:
  type: array
  description: The list of access methods that can be used to access the Data Object.
  minItems: 1
  items:
    anyOf:
      - $ref: '#/components/schemas/StaticAccessMethod'
      - $ref: '#/components/schemas/CloudAccessMethod'
      - $ref: '#/components/schemas/ActionableStaticAccessMethod'
      - $ref: '#/components/schemas/ActionableCloudAccessMethod'
    discriminator:
      propertyName: method
ActionableAccessMethod:
  type: object
  required:
    - access_id
  properties:
    access_id:
      type: string
ActionableCloudAccessMethod:
  type: object
  allOf:
    - $ref: "#/components/schemas/ActionableAccessMethod"
    - type: object
      required:
        - region
        - method
        - access_id
      properties:
        uri:
          type: string
        method:
          type: string
          enum:
            - s3
            - gs
        region:
          type: string
          description: >-
            Name of the region in the cloud service provider that the object belongs to.
          example:
            us-east-1
        provider:
          type: string
CloudAccessMethod:
  type: object
  required:
    - uri
    - region
    - method
  properties:
    uri:
      type: string
    provider:
      type: string
    method:
      type: string
      enum:
        - s3
        - gs
    region:
      type: string
      description: >-
        Name of the region in the cloud service provider that the object belongs to.
      example:
        us-east-1
ActionableStaticAccessMethod:
  type: object
  allOf:
    - $ref: "#/components/schemas/ActionableAccessMethod"
    - $ref: "#/components/schemas/StaticAccessMethod"
StaticAccessMethod:
  type: object
  required:
    - uri
    - method
  properties:
    method:
      type: string
      enum:
        - ftp
        - sftp
        - http
        - https
        - nfs
        - globus
        - aspera
        - gsiftp
        - local
    uri:
      type: string
    provider:
      type: string
    contact:
      type: string
@sarpera Thanks for investigating the OpenAPI spec compatibility. I agree with @tetron it would have been cleaner, but I guess we will have to live within our means! :)
I thought we'd discussed (maybe not agreed) that we would be more explicit and make the `access_id` a URI. E.g.
Controlled/private access data:
"access_methods": [
{
"access_id": "drs://server.com/access/s3-1",
"method": "s3",
"region": "us-east-1"
},
{
"access_id": "http://server.com/get-object/s3-1",
"method": "s3",
"region": "us-east-1"
}
],
Which I hope will work for your use case when it is provided by the DRS service, and when it may be provided by a third-party service.
P.S. If this is acceptable, why have it called `access_id`? We could just call it `uri`.
@susheel `access_id` was made explicitly to be used in this path: `/objects/<id>/<access_id>`, which will generate or return a ready-to-use URL to the bytes (signed URL, URL with encoded credentials, etc.), authorised via the `Authorization` request header. It should not be used interchangeably with `uri`; they serve different purposes.
In your example,
{
"access_id": "drs://server.com/access/s3-1",
"method": "s3",
"region": "us-east-1"
}
Please note that `access_id` is unique per object, not per DRS server, unlike the DRS URL. So it would have to be `drs://server.com/objects/<object_id>/access/s3-1`.
Keeping that in mind, the `method` and `region` info is redundant, since the same info would be available at `drs://server.com/objects/<object_id>/`. And if the file is moved to another region, you'd need a mechanism to notify the DRS server that links to it so it can update the redundant info.
Also, it's ambiguous which token value should be used for the `Authorization` request header on the linked DRS server URI, since it could be a different value.
If with DRS of DRSes we are aiming to redirect the client to another, this is indirect but not ambiguous:
{
"uri": "drs://server.com/<object_id>",
"method": "drs"
}
Alternatively, we could utilise the alias property of an object to link/mirror other DRS URLs.
GET /objects/<id>
{
"id": 123,
"name": "foo",
"checksums": ["# list here"],
"access_methods": ["# list here"],
# rest of the props
"alias": ["drs://server.com/<object_id>"]
}
1) @sarpera, thanks for continuing to deep dive into OpenAPI syntax. Given what you found, I agree with the general direction of your latest proposal. Notes:
- Do we need to separate the `Actionable...` methods from the others? IIUC (which I may not), the idea is to distinguish one-step from two-step access. But I believe all the properties are the same in the two cases, except for whether `access_url` (one-step) or `access_id` (two-step) are expected. So if our OpenAPI just always allowed both, and our documentation said "you must provide at least one", we'd be fine, and the spec would be easier to maintain.
- I don't think `region` should be required. If servers don't want to specify a region (either because they don't offer choice, or because they're using multi-region storage), and callers are okay with that, we should allow it.
- I don't see the need for `provider`, but could be missing something. Every use case I can think of where the caller actually cares about `provider` feels better modeled by introducing a new `method`. Does someone have a good counterexample? @susheel, I know you said that _"Having a `provider` set for all `access_methods` even if it's just a hostname would make filtering easier"_, but I don't understand why -- do you have an example in mind, where filtering by `method` wouldn't be just as good?
- I'm not sure about `contact`, since I don't know what clients are supposed to do with it. I can picture some possible value to the developers building a client, but that seems like an odd thing to put into the mainstream API, vs. (e.g.) the `/service-info` endpoint. I suggest we leave it out for now, and then if someone feels strongly we can open a separate issue/PR to discuss.
- I suggest `access_url` instead of `uri`, to make it clear that you use it to fetch the actual object bytes, as opposed to fetching some intermediate thing.
- I suggest putting `method` first in your examples. Syntactically it's the same thing, but it makes it more readable, since it's probably the first thing the caller cares about, and it tells the caller what other parameters to expect.

2) @susheel, re the format of `access_id` -- it did come up earlier, but I don't think we reached consensus. I think we all agree that the end goal is for callers to get an `access_url`, which they can use to directly fetch object bytes; the question is how they get that `access_url`.
The simple case is when only a single step is needed (e.g. for public content); in that case the server can return an `access_url` directly. The trickier case is when two steps are needed (e.g. for signed URLs).
The two-step pattern I prefer, mostly because the behavior feels more explicit, is:
- servers can return an opaque `access_id` string, in any format they choose
- callers pass that `access_id` to a well-defined method on the same server (e.g. `/objects/<id>/access/<access_id>`), which returns the `access_url`
I believe the pattern you're suggesting is:
- servers can return an `access_url_url` (_name TBD_) string, which must be a fully resolvable HTTP GET'table path, and can be on any server
- callers do an HTTP GET on the `access_url_url`, which returns the actual `access_url`
**Is that right?** If so, I'm open to discussing the tradeoffs. But I agree with @sarpera that we shouldn't mix the patterns, and call the string an `access_id` if it's actually a fetchable URL.
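For what it's worth, the one-step and two-step `access_id` patterns can be sketched from the caller's side as below. This is only an illustration under assumptions: the `/objects/<id>/access/<access_id>` path is the one suggested in this thread, the `{"access_url": ...}` response shape is hypothetical, and `fetch_json` is a caller-supplied function (not a real library call) that performs an authenticated GET against the same DRS server and returns parsed JSON.

```python
from urllib.parse import quote


def access_path(object_id, access_id):
    """Build the path of the (proposed) access method on the same server."""
    return f"/objects/{quote(object_id, safe='')}/access/{quote(access_id, safe='')}"


def resolve_access_url(object_id, access_method, fetch_json):
    """Return a URL that can be used to fetch the object bytes.

    One-step: the server already returned an access_url.
    Two-step: exchange the opaque access_id for an access_url.
    """
    if access_method.get("access_url"):
        return access_method["access_url"]
    access_id = access_method.get("access_id")
    if not access_id:
        raise ValueError("access method has neither access_url nor access_id")
    # Assumed response shape: {"access_url": "<signed or direct URL>"}
    response = fetch_json(access_path(object_id, access_id))
    return response["access_url"]
```

Because the `access_id` is opaque and always resolved against the server that issued it, the caller never has to guess which auth to present to a third-party host.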
3) I suggest we split discussion of the `drs` access_method into a separate issue/PR. I think whatever we end up with here will be able to support that use case, and it will be cleaner to discuss it separately. (I for one am still not clear on the requirements, but don't want to side-track this thread to dive in.)
As discussed in #230, updating to OpenAPI 3.0 may take longer than we'd like, and I'm eager to get the changes discussed here into a PR. So it may make sense to decouple the issues, open up a PR for this issue now doing the best we can using v2, and then revisit whenever #230 is resolved.
I think that will be fine -- we can still use the `access_methods` syntax you propose above, but with weaker typing in the OpenAPI definition, so some of the rules for what parameters are valid where would be enforced by policy, not by schema. Not as elegant, but perfectly functional, and we can upgrade to stronger typing later.
I picture something like (without having tested it):
AccessMethods:
  type: array
  description: The list of access methods that can be used to access the Data Object.
  minItems: 1
  items:
    $ref: '#/components/schemas/AccessMethod'
AccessMethod:
  type: object
  required:
    - method
  properties:
    method:
      type: string
      enum:
        - s3
        - gs
        - ftp
        - sftp
        - http
        - https
        - nfs
        - globus
        - aspera
        - gsiftp
        - local
    access_url:
      type: string
      description: >-
        A fully resolvable HTTP address that can be used to GET the actual object bytes.
        Note that at least one of access_url and access_id must be provided.
    access_id:
      type: string
      description: >-
        An arbitrary string to be passed to the /access method to fetch an access_url
    region:
      type: string
      description: >-
        Name of the region in the cloud service provider that the object belongs to.
      example:
        us-east-1
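To show what "enforced by policy, not by schema" might mean in practice, here is an untested-in-production sketch of a check that a server or client could run on top of the weakly typed definition. The function name and error strings are made up for this example; the method list and the "at least one of `access_url` and `access_id`" rule come from the definition above.

```python
# Method names copied from the enum in the weakly typed schema above.
ALLOWED_METHODS = {
    "s3", "gs", "ftp", "sftp", "http", "https",
    "nfs", "globus", "aspera", "gsiftp", "local",
}


def policy_errors(access_method):
    """Return a list of policy violations for one access method (empty if OK)."""
    errors = []
    method = access_method.get("method")
    if method not in ALLOWED_METHODS:
        errors.append(f"unknown or missing method: {method!r}")
    # This rule cannot be expressed with the weaker typing, so check it here.
    if not (access_method.get("access_url") or access_method.get("access_id")):
        errors.append("at least one of access_url and access_id must be provided")
    return errors
```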
@sarpera -- wdyt? Are you up for creating a PR using OpenAPI v2 now, and confirming it's not too ugly?
> I believe the pattern you're suggesting is:
> - servers can return an `access_url_url` (name TBD) string, which must be a fully resolvable HTTP GET'table path, and can be on any server
> - callers do an HTTP GET on the `access_url_url`, which returns the actual `access_url`
>
> Is that right? If so, I'm open to discussing the tradeoffs. But I agree with @sarpera that we shouldn't mix the patterns, and call the string an `access_id` if it's actually a fetchable URL.
@dglazer Yes, almost. I do agree that DRS must be able to support the two-phase access mechanism.
If the DRS server only provides an `access_id`, it is implicit in the service description that the client performs a subsequent `GET /objects/<object-id>/access/<access-id>` to get the `access_url`. Is there a use case where a user will only need the `access_id`? Making the `access_url_url` (your name :) more configurable or explicit in the service description would allow service providers to support external mechanisms that provide other ways to obtain `access_url`s.
Either way, I agree with @sarpera that we also need to iron out how AUTH tokens are specified and passed to `/access` or an `access_url_url` in #47 or #229.
@susheel , it sounds like we largely agree on the two options; good. @sarpera , I suggest you pick one (you know my vote), put it into the PR, and then we can discuss and finalize there.
A few comments on the details:
- `access_id` and the `/access` method
- _"Is there a use case where a user will only need the `access_id`?"_ -- no, I don't see one. You have to turn the id into a URL somehow in order to fetch object content.
- On auth: in the `/access` model, I think the answer matches how we handle auth for all other calls to the DRS server, including to the `/object` call that was used to get the `access_id` in the first place. It feels more complicated in the `access_url_url` model, since that's a whole different server, so presumably has a different set of auth needs.
Background
Following the discussion we had at the GA4GH hackathon in January we would like to propose to have a method to get the metadata of an object, and then have an additional method which will provide the download of the object.
The rationale for having two methods instead of one is the need to sign the object using the authorisation token provider (right now this is based on the OIDC specs), which is computationally expensive. Moreover, given the regions and providers, a DRS client will be able to decide which provider and which region would be best to obtain the file from, among all the possible URIs.
The formats we propose are:
- `objects/<id>/` for getting the object metadata
- `objects/<id>/download` for getting the object bytes

and we propose to pass the authorisation token in the Request Header to get access to the object.
This is the flow, from a DRS client point of view:
1) GET `/objects/<id>`
2) GET `/objects/<id>/download` with Request Header `X-DRS-TOKEN: <TOKEN>`
The token is obtained by the client from the DRS server, and it is up to the DRS Server implementer to decide how a user will obtain that.
Object metadata Request
This will return the object metadata:
HTTP Request
HTTP Response
The client will be able to pick one of the `cloud` URIs and request the download URI, passing the token.
Object download Request
HTTP REQUEST
HTTP Response
The return value is a URI where a GET request will give you the bytes: a `GET <URL_TO_BYTES>` will start the download of the file.
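From the client side, the two calls in this flow could be prepared roughly as follows. This is a sketch under assumptions: the base URL `https://drs.example.org` is a placeholder, nothing is actually sent, and the header name `X-DRS-TOKEN` is the one proposed here (the discussion above suggests the standard `Authorization` header may be preferable).

```python
from urllib.request import Request


def build_requests(base_url, object_id, token):
    """Prepare (but do not send) the metadata and download requests."""
    # Step 1: object metadata; no token attached in this proposal.
    metadata = Request(f"{base_url}/objects/{object_id}", method="GET")
    # Step 2: ask for the URL to the bytes, passing the token in a header.
    download = Request(
        f"{base_url}/objects/{object_id}/download",
        method="GET",
        headers={"X-DRS-TOKEN": token},
    )
    return metadata, download


metadata_req, download_req = build_requests("https://drs.example.org", "1234", "TOKEN")
print(metadata_req.full_url)
print(download_req.full_url)
```

Passing either request to `urllib.request.urlopen` would perform the call; the response to the second one is expected to contain the URL to the bytes.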