ga4gh / data-repository-service-schemas

A repository for the schemas used for the Data Repository Service.

Apache License 2.0


Object metadata and download methods #213

Closed sarpera closed 5 years ago

sarpera commented 5 years ago

Background

Following the discussion we had at the GA4GH hackathon in January, we would like to propose one method to get the metadata of an object and an additional method to download the object.

The rationale for having two methods instead of one is that the object must be signed using the authorisation token provider (right now this is based on the OIDC specs), which is computationally expensive. Moreover, with regions and providers present in the metadata, a DRS client will be able to decide which provider and which region would be best for obtaining the file, among all the possible URIs.

The formats we propose are:

objects/<id>/ for getting the object metadata
objects/<id>/download for getting the object bytes

and we propose to pass the authorisation token in the Request Header to get access to the object.

This is the flow, from a DRS client point of view:

1) GET /objects/<id>

2) GET /objects/<id>/download with Request Header X-DRS-TOKEN: <TOKEN>

The token is obtained by the client from the DRS server, and it is up to the DRS Server implementer to decide how a user will obtain that.

Object metadata Request

This will return the object metadata:

HTTP Request

GET /objects/<id>

HTTP Response

{
  "object": {
    "id": "string",
    "name": "string",
    "size": "string",
    ...
    "urls": {
      "cloud": [
        {
          "uri": "s3://<foo>/<bar>.bam",
          "region": "us-east-1",
          "provider": "aws"
        },
        {
          "uri": "gs://<foo>/<bar>.bam",
          "region": "us-west1",
          "provider": "google"
        }
      ],
      "ftp": [
        {
          "uri": "ftp://foo.com/bar.bam"
        }
      ],
      "drs": [
        {
          "uri": "drs://foo.com/objects/<id>"
        }
      ]
    },
    "aliases": [
      "doi://123/abcd"
    ]
  }
}

The client will be able to pick one of the cloud URIs and request the download URI, passing the token.
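A minimal sketch of how a client might pick one of the cloud URIs from the metadata response above. The function name `pick_cloud_uri`, the bucket names, and the fallback order (exact provider and region first, then provider only) are assumptions made for illustration; the proposal does not prescribe a selection strategy.

```python
from typing import Optional


def pick_cloud_uri(metadata: dict, provider: str, region: str) -> Optional[str]:
    """Return the cloud URI that best matches the client's provider and region."""
    entries = metadata["object"]["urls"].get("cloud", [])
    # Prefer an exact provider + region match (assumed preference order).
    for entry in entries:
        if entry.get("provider") == provider and entry.get("region") == region:
            return entry["uri"]
    # Otherwise fall back to any URI hosted by the same provider.
    for entry in entries:
        if entry.get("provider") == provider:
            return entry["uri"]
    return None


# Hypothetical metadata shaped like the proposed response.
example = {
    "object": {
        "id": "foo",
        "urls": {
            "cloud": [
                {"uri": "s3://bucket-a/bar.bam", "region": "us-east-1", "provider": "aws"},
                {"uri": "gs://bucket-b/bar.bam", "region": "us-west1", "provider": "google"},
            ],
            "ftp": [{"uri": "ftp://foo.com/bar.bam"}],
        },
    }
}

chosen = pick_cloud_uri(example, provider="google", region="us-west1")
```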

Object download Request

HTTP REQUEST

GET /objects/<id>/download?type="cloud"&uri="gs://<foo>/<bar>.bam" 

    Request Header: 
    X-DRS-TOKEN: <TOKEN>

HTTP Response

The return value is a URI where a GET request will give you the bytes:

{
  "uri": "<URL_TO_BYTES>"
}

a GET <URL_TO_BYTES> will start the download of the file.
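The download request above could be assembled roughly as follows. This is only a sketch: `build_download_request` and the host `drs.example.com` are hypothetical, nothing is sent over the network, and the query values are percent-encoded rather than wrapped in quotes as in the example request.

```python
from urllib.parse import quote, urlencode


def build_download_request(base_url: str, object_id: str, uri: str, token: str):
    """Return (url, headers) for the proposed /download call; nothing is sent."""
    # Percent-encode the chosen URI so it survives as a query parameter.
    query = urlencode({"type": "cloud", "uri": uri})
    url = f"{base_url}/objects/{quote(object_id, safe='')}/download?{query}"
    # Header name as proposed in this issue; later comments discuss Authorization instead.
    headers = {"X-DRS-TOKEN": token}
    return url, headers


url, headers = build_download_request(
    "https://drs.example.com", "foo", "gs://bucket-b/bar.bam", "my-token"
)
```

A real client would then issue a GET on `url` with `headers`, read the `uri` field of the JSON response, and issue a second GET on that value to fetch the bytes.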

mattions commented 5 years ago

@bwalsh @dglazer we have written up the proposal we had discussed yesterday at the hackathon.

In this issue.

@susheel it would be great to know more about the FTP side, because we are not encountering that in our case. It would also be good if @philloooo could double-check that this captures what we had written on the whiteboard. Got Phillis's GitHub handle right :)

bwalsh commented 5 years ago

@mattions - all above looks reasonable. Can you provide a link to document where X-DRS-TOKEN: <TOKEN> originates?

mattions commented 5 years ago

Hi @bwalsh,

I was aiming to tag @briandoconnor, who was present at the hackathon. My mistake, but it's great that you like it; it means we actually wrote it in a clear way :).

The token will be handled by the DRS server, so the implementer may offer a way to log in, possibly through a third-party service.

As for links, my guess is that they will be available once the DRS is up.

geoffjentry commented 5 years ago

@mattions Just to make sure it's clear - as per this proposal, the use of signed URLs will be codified as the manner in which one obtains the bytes for an object?

brucehoff commented 5 years ago

Curious why the plan is to name the header X-DRS-TOKEN rather than Authorization as per the HTTP specification: https://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.8 If the token is a bearer token then the header would look like: Authorization: Bearer <token>.

Also, I wonder if it might be too limiting to say that the GET /objects/<id> API does not require authorization. Could there be cases in which the DRS provider only wants to share the existence of a resource with authorized parties?

geoffjentry commented 5 years ago

@brucehoff My assumption was that it was to discern between "I have authZ to access the endpoint" (i.e. Authorization) and "Here is my authZ for the file", which might not be the same thing. I could be quite wrong, however

That ties in w/ your second question as I picture the use of the Authorization as being ubiquitous across all endpoints.

sarpera commented 5 years ago

@geoffjentry, yes.

@brucehoff, it could very well be named Authorization, we just wanted to point out that it should be an agreed name/convention we all follow. +1 for the bearer token case.

To clarify, we also pointed out in the hackathon that GET /objects/<id> could perfectly well require authz if the object in question has an auth layer in front of its basic metadata (name, size, urls etc). There are cases where an "object metadata" is controlled-access or private, and for those cases the users can pass the same http request header (Authorization: ) in the GET request to see the metadata about the object. Our suggestion does not limit that in any way; in fact, we strongly suggest that GET /objects/<id> could require that header in some cases.

@geoffjentry does the paragraph above align with your thoughts?

mattions commented 5 years ago

Just fixing my tagging; I want @philloooo to chime in as well :)

philloooo commented 5 years ago

+1 on just Authorization: bearer <token> if it's a bearer token.

geoffjentry commented 5 years ago

@mattions I think so. I can totally see the case where access to the API and access to the file aren't the same and wouldn't be against setting up a structure for that (but perhaps w/ simplifying defaults, e.g. if just Authorization is passed, just use that).

NB I'm not advocating for that structure, so if others disagree so be it. But at least it doesn't bug me :)

geoffjentry commented 5 years ago

@mattions when you said "yes" was that about the presigned URL question or the authz part?

If the presigned URLs, I didn't feel like we had reached consensus that presigned URLs should be required. If I'm wrong, ignore me :)

zflamig commented 5 years ago

Can I suggest for the response to:

GET /objects/<id>

That we do something like this:

{
    "object": {
        "id": "string",
        "name": "string",
        "size": "string",
                ...
        "uris": [{
                "uri": "s3://<foo>/<bar>.bam",
                "metadata": {
                    "region": "us-east-1",
                    "provider": "s3.amazonaws.com"
                }
            },
            {
                "uri": "gs://<foo>/<bar>.bam",
                "metadata": {
                    "region": "us-west1",
                    "provider": "storage.googleapis.com"
                }
            },
            {
                "uri": "s3://<foo>/<bar>.bam",
                "metadata": {
                    "region": "us-east1",
                    "provider": "onprem.objectstorage.example.com"
                }
            },
            {
                "uri": "ftp://foo.example.com/bar.bam"
            },
            {
                "uri": "drs://<id>",
                "metadata": {
                    "provider": "drs.example.com"
                }
            }
        ]
    }
}

Specifically, I would leave the aliases array out of this response and make that a different endpoint. I'm not sure there's a use case for it here, and it adds a potentially expensive join.

geoffjentry commented 5 years ago

@zflamig Part of the idea here was to specifically separate out the download URL from the metadata, as per discussions at the face to face earlier this week

zflamig commented 5 years ago

That's only the metadata for the URI, @geoffjentry, to help in resolving it. The original proposal includes this, but it isn't clean or extensible. If a new URI type appears we would have to amend the spec to add a new class for it...
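To illustrate the extensibility point, here is a sketch of how a client could consume a flat `uris` list without a per-type class in the spec: the URI scheme is read from each entry, so a new scheme needs no schema change. The helper `group_by_scheme` and the bucket names are assumptions for illustration only.

```python
from collections import defaultdict
from urllib.parse import urlparse


def group_by_scheme(uris: list) -> dict:
    """Group flat URI entries by their URI scheme (s3, gs, ftp, drs, ...)."""
    groups = defaultdict(list)
    for entry in uris:
        # urlparse extracts the scheme even for non-HTTP schemes such as s3 or drs.
        groups[urlparse(entry["uri"]).scheme].append(entry)
    return dict(groups)


# Hypothetical entries shaped like the flat list suggested above.
uris = [
    {"uri": "s3://bucket-a/bar.bam",
     "metadata": {"region": "us-east-1", "provider": "s3.amazonaws.com"}},
    {"uri": "gs://bucket-b/bar.bam",
     "metadata": {"region": "us-west1", "provider": "storage.googleapis.com"}},
    {"uri": "ftp://foo.example.com/bar.bam"},
    {"uri": "drs://some-id", "metadata": {"provider": "drs.example.com"}},
]
grouped = group_by_scheme(uris)
```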

sarpera commented 5 years ago

@geoffjentry

Just to make sure it's clear - as per this proposal, the use of signed URLs will be codified as the manner in which one obtains the bytes for an object?

/objects/<id>/download?foo=bar should return "url-to-bytes" in this following use-case: The main use case is to provide access to a single file in a private bucket in a cloud environment, by URL signing, without giving access to the whole bucket, so that the consumer can access the raw data file bytes by passing a token in the request header. If the urls returned from /objects/<id> do not need to be signed, or you already have direct bucket access, or the urls returned are open/public access URIs, you wouldn't need to call the /objects/<id>/download?foo=bar.

For the other cases, if any, where the returned urls are not ready to give access to bytes, let's add those use cases here to figure out if they can also be solved via /objects/<id>/download?foo=bar.

@zflamig the structure of urls is open for discussion. The reason why we wanted to key-value pair them was to have a separate model in swagger for a typed url, e.g. for all cloud URLs use CloudURLModel etc. Also, for the future use cases where you would query: /objects/?url_type=ftp, if that's relevant. Regardless, your suggestion also looks clean and good to me.

@geoffjentry I think @zflamig meant that the aliases array should not be part of /objects/<id>, not the urls array. Our proposal suggests having urls as part of /objects/<id>.

@zflamig about the aliases part, do you support the use cases where aliases can be used for querying? E.g /objects/?alias=doi://<id>. We should take that discussion to a separate issue though, please feel free to create one.

philloooo commented 5 years ago

I second having a provider field (the server domain or root endpoint) for URIs of the drs type, because I don't think we agreed on the DRS uri format being restricted to drs://<server>/...

dglazer commented 5 years ago

Thanks for starting this discussion @sarpera . A few comments, some of which were partially raised by others, but I'm not sure I understand where they landed:

1) process question -- have you created a PR to go with this issue, or are you waiting for the discussion to settle down first? Either way can work, since prose can be easier for high-level design, but we won't be able to nail down the details until we get to code.

2) I agree we should delete aliases from this issue, and discuss it separately -- there are enough moving parts here already.

3) Re signed URLs -- we definitely don't want to require their use, since some DRS implementors said they won't be using them. That means that sometimes when callers want to fetch bytes, they'll have to pass an auth token when calling the URL_TO_BYTES. I think the spec for that is straightforward, and similar to language you use elsewhere -- something like:

Some access methods require an auth token to fetch bytes. It is the client's responsibility to obtain that token, using a procedure documented by the DRS implementor. (Note that a single token can often be used to fetch multiple objects from a single source.)

4) I think you're suggesting that callers only use the /download method when they need to generate a signed URL, and that they just fetch the bytes directly if not? I guess that works, but it feels confusing to me, requires special knowledge about different access methods, and uses uri-like-strings that aren't used to actually fetch data. Instead, I was picturing:

a) the first method (GET /object/<id>) takes an <id> as input and returns an array of access-methods, which include info about the type of access (e.g. "ftp", "GCP us-east", "AWS us-west"), but don't necessarily include a URI

b) the second method (GET /object/<id>/download) takes an <id> and an <access-method> as input, and returns an URL_TO_BYTES uri.

c) As an optional shortcut, if the DRS implementor chooses, they can include an URL_TO_BYTES uri in individual records returned by the first method, which lets callers skip a step. Implementors would be likely to do that for public content, and for content that can be accessed with a previously-obtained auth token (e.g. gs: and s3:), and unlikely to do that for signed URLs. But we don't have to bake that knowledge into the protocol or the calling code -- the rule for callers is "always use the URI to get bytes; if you don't get one from the first method, ask for one using the second method".

5) I agree that DRS implementors can choose to require an auth token before they respond to a GET request for metadata, and they can return an "access denied" if appropriate. We should document that somewhere, as we do for WES, but I think that auth policy and those tokens are completely separate from the policy and tokens used to call the /download method, or passed to the URL_TO_BYTES uri.

6) I agree with @brucehoff that the auth token passed to /download feels like standard HTTP auth, and should be a standard HTTP auth token. (Since the only question is whether you're allowed to fetch the bytes; there's no use case I can imagine for "it's okay to call the download method but it's not okay to download the object".)

Minor points -- these can wait until we get the big picture settled:

7) Why would a call to DRS return an access-method of type DRS? (As in your fourth example.) I'm not following the use case that addresses.

8) Are you proposing that id, name, and size the only metadata fields we support? Or is that just a placeholder in this issue, and we can figure out the actual fields in a separate issue?

9) /download feels a little off to me, especially for cloud-to-cloud use cases. Maybe /bytes or /body instead? I'm open.
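The caller rule from point 4 above ("always use the URI to get bytes; if you don't get one from the first method, ask for one using the second method") can be sketched as follows. The record keys `type` and `uri`, the function names, and the placeholder signed URL are assumptions for illustration; field names were still under discussion at this point in the thread.

```python
from typing import Callable


def resolve_url_to_bytes(access_method: dict, request_access: Callable[[str], str]) -> str:
    """Use the URI from the first method if present, else ask the second method."""
    uri = access_method.get("uri")
    if uri:
        return uri
    # No URI in the metadata record: call the second method for this access method.
    return request_access(access_method["type"])


def fake_request_access(access_type: str) -> str:
    # Stand-in for GET /object/<id>/download; a real client would make an HTTP call.
    return f"https://signed.example.com/{access_type}/bar.bam?sig=placeholder"


public = {"type": "ftp", "uri": "ftp://foo.com/bar.bam"}
signed_only = {"type": "aws-us-west"}

direct = resolve_url_to_bytes(public, fake_request_access)
via_second_method = resolve_url_to_bytes(signed_only, fake_request_access)
```

With this shape the calling code needs no special knowledge about which access methods use signed URLs; it only checks whether a URI was returned.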

zflamig commented 5 years ago

@sarpera Ahhhh. I see what you were trying to accomplish now. I am okay with your method if we break it up and use the protocol as the high level container. So like

    "urls": {
      "s3": [
        {
          "uri": "s3://<foo>/<bar>.bam",
          "region": "us-east-1",
          "provider": "aws"
        }
      ],
      "gs": [
        {
          "uri": "gs://<foo>/<bar>.bam",
          "region": "us-west1",
          "provider": "google"
        }
      ],
      "ftp": [
        {
          "uri": "ftp://foo.com/bar.bam"
        }
      ],
      "drs": [
        {
          "uri": "drs://foo.com/objects/<id>"
        }
      ]
    }

This way we can make it easier for future additions... if you have a new protocol you are free to require whatever metadata you want.

@dglazer re: #7 on your list: the use case there is for data bundles, so you can have one DRS url that dereferences to a group of DRS urls. For example, during the cohort creation process a GUID/DRS entry may be minted to represent the data that the user selected.

sarpera commented 5 years ago

@zflamig thanks, it gets better with every iteration. +1 for your suggestion.

@dglazer

process question -- have you created a PR to go with this issue, or are you waiting for the discussion to settle down first? Either way can work, since prose can be easier for high-level design, but we won't be able to nail down the details until we get to code.

I haven't created a PR for exactly the reason you pointed out. And you're totally right that the details won't be clear without the code changes. I just wanted to discuss with the group some more, as the suggestions are really helping so far.

Re signed URLs -- we definitely don't want to require their use, since some DRS implementors said they won't be using them. That means that sometimes when callers want to fetch bytes, they'll have to pass an auth token when calling the URL_TO_BYTES. I think the spec for that is straightforward, and similar to language you use elsewhere -- something like:

Yes, the callers would pass an auth token when calling the URL_TO_BYTES. Nothing really changes for those who don't have this use-case. Ideally, the token would be the same across all the endpoints on a DRS server, even more ideally all DRS implementations would have the same means of obtaining it. There is a related issue.

I think you're suggesting that callers only use the /download method when they need to generate a signed URL, and that they just fetch the bytes directly if not? I guess that works, but it feels confusing to me, requires special knowledge about different access methods, and uses uri-like-strings that aren't used to actually fetch data. Instead, I was picturing:

I guess what I was trying to say was along these lines:

GET /object/<id> returns a list of urls/access-points, regardless of whether or not those URIs are ready to be consumed. Here is the reasoning:

1- If some of the urls/access-points are ready to consume for getting bytes (public urls), there is no need to call another endpoint.

2- For cloud URIs, the URIs returned from GET /object/<id> are enough if the consumer has direct bucket access privileges for the private bucket in question. They also wouldn't need to sign the url. So it adds value to have those URIs there for this case.

3- Having GET /object/<id>/download?foo=bar primarily solves the case where a file belongs to a private bucket, and the consumer HAS TO sign a URL for that file in order to access it, without having direct access to the whole bucket. This is the majority of the use-cases for many datasets that have a controlled layer of access.

4- GET /object/<id>/download?foo=bar can be used for any other case where the urls returned from GET /object/<id> are not readily consumable.

To sum up above, I do really agree with you that:

GET /object/<id> takes an <id> as input and returns an array of access-methods

and

the second method (GET /object/<id>/download) takes an <id> and an <access-method> as input, and returns an URL_TO_BYTES uri.

sounds more reasonable. That way, the implementors MAY choose not to include the URIs in GET /object/<id> if there is no added benefit. But having the URIs there has a point, as I tried to explain above.

I agree that DRS implementors can choose to require an auth token before they respond to a GET request for metadata, and they can return an "access denied" if appropriate. We should document that somewhere, as we do for WES, but I think that auth policy and those tokens are completely separate from the policy and tokens used to call the /download method, or passed to the URL_TO_BYTES uri.

+1 for this. There could be separate auth policies for both cases, and implementors MAY choose to have the same policy for both if it fits them. I guess this is not against your point.

I agree with @brucehoff that the auth token passed to /download feels like standard HTTP auth, and should be a standard HTTP auth token. (Since the only question is whether you're allowed to fetch the bytes; there's no use case I can imagine for "it's okay to call the download method but it's not okay to download the object".)

+1 for standard HTTP auth token or a bearer token. One of the possible cases would be that your authz privileges might have been revoked or expired, but you already obtained a token. In that case, the flow is solid anyway, you would get a 403 on GET /object/<id>/download

Why would a call to DRS return an access-method of type DRS? (As in your fourth example.) I'm not following the use case that addresses.

During the hackathon we were made aware that there are some implementors who will be using DRS as a data-registry service, without necessarily providing direct access to bytes, but instead pointing to another DRS server (via a DRS url) where the data can be accessed. Sort of like "linked DRS"es or DRS of DRSes. @susheel could you please provide those use cases?

Are you proposing that id, name, and size the only metadata fields we support? Or is that just a placeholder in this issue, and we can figure out the actual fields in a separate issue?

Oh no, it was just a placeholder since I didn't want to type every other metadata field, hence the "...". Sorry if it was misleading. Related to this, I'm all for bringing this to another issue where we define the required fields in GET /object/<id>, based on the break-out session on the 1st day of the hackathon. The current model on swagger seems out of date.

/download feels a little off to me, especially for cloud-to-cloud use cases. Maybe /bytes or /body instead? I'm open.

Same here, open for any ideas. The most difficult part of building anything is to name it =) /bytes, /access?

briandoconnor commented 5 years ago

From the GA4GH call today, @sarpera and @susheel discussed what happens with a DRS entry for an object when you call GET id/download... OK to not implement seems to be the consensus.

Seems like we need to clarify how "/download" works for the various URI types

@dglazer proposed get bytes URI, fetch bytes ID... for passing to the download method. so the ID -> URI

@sarpera is going to take this ticket and make a PR that explores what he and David talked about today... sort out the URL and the download in a single PR. @dglazer @rishidev and I will work out a process to bring this and other PRs up to vote via the active drivers

susheel commented 5 years ago

Need to clarify /download for non-cloud (legacy) data endpoints, e.g:

ftp
gsiftp
globus
aspera

sarpera commented 5 years ago

Wrapping up so far

Good to see that there is a general consensus on the main idea that accessing "object metadata" and "bytes to object" may be separate calls to DRS for the cases where an "action" is required to be performed to get access to bytes e.g passing an auth token to: sign a URL, generate url-to-bytes with credentials etc.

By doing so, I guess we all agree that the schema should remain generic, flexible and understandable, while providing programmatically parsable responses for clients with different needs and use-cases. With that in mind, I tried to combine our ideas, and here's the outcome:

Object metadata: GET https://example.com/ga4gh/drs/v1/objects/<id> Returns metadata of an object, with the set of access-methods.

Object bytes: GET https://example.com/ga4gh/drs/v1/objects/<id>/download/<access-method-id> Returns a "uri-to-bytes" for a given access-method-id, if it exists.

Examples

GET https://example.com/ga4gh/drs/v1/objects/<id>

Response:

{
  "object": {
    "id": "foo",
    "name": "bar.bam",
    "size": "1234",
    "urls": {
      "s3": [
        {
          "uri": "s3://foo/bar.bam",
          "region": "us-east-1",
          "<access-method-id>": "s3-1"
        }
      ],
      "gs": [
        {
          "uri": "gs://<foo>/<bar>.bam",
          "region": "us-west1",
          "<access-method-id>": "gs-1"
        }
      ],
      "ftp": [
        {
          "uri": "ftp://foo.com/bar.bam"
        }
      ],
      "drs": [
        {
          "uri": "drs://foo.com/objects/<id>"
        }
      ]
    }
  }
}

GET https://example.com/ga4gh/drs/v1/objects/<id>/download/<access-method-id> with Request Header Authorization: <string>

Response:

{ "uri": "<uri-to-bytes>" }

Let's break apart the suggested urls property of an object:

"urls": {
  "<access-method>": [
    {
      "uri": "<string>",
      "<access-method-specific-attr>*": "<value>"
    }  
  ]
}

where

Questions

Why have key-value paired access methods? So that a specific <access-method> can have its own properties in the schema model.

E.g: in the cloud scenarios, there is a huge added value in having the region information for a URI, whereas for other <access-method>s that property may be meaningless. The Swagger schema model should set the expectations for clients to consume this information programmatically.

Why is the value of <access-method> an array? One access method can have multiple URIs with different attributes. E.g: data duplicated in different regions on the same cloud provider:

In the example below, access-method s3 has two entries.

"urls": {
  "s3": [
    {
      "uri": "s3://foo/bar.bam",
      "region": "us-east-1",
      "<access-method-id>": "s3-us"
    },
    {
      "uri": "s3://baz/bar.bam",
      "region": "eu-central-1",
      "<access-method-id>": "s3-eu"
    }
  ]
}

What if all the <access-method>s are public or readily consumable? Then the <access-method> wouldn't have an <access-method-id> property to begin with. Any call to a non-existent /download/<access-method-id> would return 400.

Example:

"urls": {
  "ftp": [
    {
      "uri": "ftp://foo/bar.bam"
    }
  ]
}

How to mint an <access-method-id>? As long as it's url-encoded and unique per object, it can be any string value, up to the implementor.
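One possible way to mint ids and to answer a /download/<access-method-id> lookup is sketched below. The `<method>-<n>` naming mirrors the `s3-1` and `gs-1` examples, but the helper names are hypothetical, and for simplicity the sketch mints an id for every entry, whereas public entries would simply omit it.

```python
from urllib.parse import quote


def mint_access_ids(urls: dict) -> dict:
    """Attach an id such as "s3-1" to every entry, unique within the object."""
    minted = {}
    for method, entries in urls.items():
        minted[method] = [
            # quote() keeps the id safe to use as a URL path segment.
            {**entry, "access-method-id": quote(f"{method}-{index}", safe="")}
            for index, entry in enumerate(entries, start=1)
        ]
    return minted


def lookup_access_id(urls: dict, access_id: str):
    """Return (status, entry); status is 400 when the id does not exist."""
    for entries in urls.values():
        for entry in entries:
            if entry.get("access-method-id") == access_id:
                return 200, entry
    return 400, None


urls = mint_access_ids({
    "s3": [
        {"uri": "s3://foo/bar.bam", "region": "us-east-1"},
        {"uri": "s3://baz/bar.bam", "region": "eu-central-1"},
    ],
    "gs": [{"uri": "gs://foo/bar.bam", "region": "us-west1"}],
})
```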

Help needed with naming things!

How to name <access-method-id> property? id? fetch-id? access-id?

Naming the suggested new path, currently "download". Some people raised concerns about calling it download. Any ideas? bytes? fetch?

TODOs

Will make a PR with the suggested changes reflected on the swagger schema.

mattions commented 5 years ago

On the naming side I propose:

access-method-id to keep it like it is
swap download for access

so the URL will look like: https://example.com/ga4gh/drs/v1/objects/<id>/access/<access-method-id>

So something like this:

{
  "object": {
    "id": "foo",
    "name": "bar.bam",
    "size": "1234",
    "urls": {
      "s3": [
        {
          "uri": "s3://foo/bar.bam",
          "region": "us-east-1",
          "access-method-id": "s3-1"
        }
      ],
      "gs": [
        {
          "uri": "gs://<foo>/<bar>.bam",
          "region": "us-west1",
          "access-method-id": "gs-1"
        }
      ]
    }
  }
}

will have the following allowed calls:

# for s3
GET https://example.com/ga4gh/drs/v1/objects/foo/access/s3-1
# for gs
GET https://example.com/ga4gh/drs/v1/objects/foo/access/gs-1

with Request Header Authorization: <string>
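As a sketch of how a client could derive the allowed calls from the object above, the helper below walks the `urls` map and builds one /access URL per `access-method-id`. The function name `access_urls` is an assumption for illustration; entries without an id are skipped, since they would be fetched directly.

```python
from urllib.parse import quote


def access_urls(base_url: str, obj: dict) -> list:
    """List the /access URLs a client may call for one object."""
    object_id = quote(obj["id"], safe="")
    urls = []
    for entries in obj["urls"].values():
        for entry in entries:
            access_id = entry.get("access-method-id")
            # Entries without an id have no /access call.
            if access_id is not None:
                urls.append(
                    f"{base_url}/objects/{object_id}/access/{quote(access_id, safe='')}"
                )
    return urls


obj = {
    "id": "foo",
    "name": "bar.bam",
    "size": "1234",
    "urls": {
        "s3": [{"uri": "s3://foo/bar.bam", "region": "us-east-1", "access-method-id": "s3-1"}],
        "gs": [{"uri": "gs://foo/bar.bam", "region": "us-west1", "access-method-id": "gs-1"}],
        "ftp": [{"uri": "ftp://foo.com/bar.bam"}],
    },
}
calls = access_urls("https://example.com/ga4gh/drs/v1", obj)
```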

If we have consensus, we can move this next with the PR

dglazer commented 5 years ago

_[updated to rename uri to access_uri, and to use underscores]_ Thank you @sarpera for incorporating everyone's input -- I personally think this is very close, and ready to move to a PR for final discussion. I'm fine with /access as the URL for the second method. I have a few suggestions on the response to the first method, GET /objects/<id>:

I suggest not returning a URI unless you can actually use it to fetch bytes. That means the DRS implementor can choose what kinds of access are supported -- if they return a uri the caller uses it to fetch bytes directly (and is responsible for knowing what if any auth tokens they need to pass), if they return an access_method_id the caller uses it to call the /access method, and if they return both (which I think will be rare) the caller can choose.
I suggest renaming urls to access_methods, which better matches the actual array elements.
To support access methods that have multiple entries (as in your S3 example), I slightly prefer flattening things by making the top-level an array, and allowing multiple entries for any given type of access. But that's just taste; I'll go with whatever most people think feels natural.
I slightly prefer access_id to access_method_id, just because it's shorter. Again, I'll go with the sense of the crowd.
I suggest renaming uri to access_uri, which makes it very clear what it's for. (And is nicely parallel to access_id.)

Incorporating my proposals, your example would look like:

"access_methods": [
   "s3": {  # there's no uri, meaning the caller has to call /access before fetching bytes
      "region": "us-east-1",
      "access_id": "s3-us"  
   },
   "s3": {
      "region": "eu-central-1",
      "access_id": "s3-eu"
   },
   "gs": {  # callers can either fetch bytes directly from the access_uri or use the access_id to get a direct uri
      "region": "us-west1", 
      "access_uri": "gs://foo/bar.bam", 
      "access_id": "gs-1" 
      },
   "ftp": {  # there's no access_id, meaning the caller has to fetch the bytes directly
      "access_uri": "ftp://foo.com/bar.bam"
   }
]

sarpera commented 5 years ago

Thanks @dglazer, I'm also happy to see that the PR I'm setting up is actually pretty close to your input.

@mattions 2. swap download for access @dglazer I'm fine with /access as the URL for the second method.

I agree. Already used /access to be a new path in my local changes for the PR.

I suggest renaming urls to access-methods, which better matches the actual array elements.

I agree. In my local git changes I already set it to be access_methods, following the naming convention in the yaml.

To support access methods that have multiple entries (as in your S3 example), I slightly prefer flattening things by making the top-level an array, and allowing multiple entries for any given type of access. But that's just taste; I'll go with whatever most people think feels natural.

After diving into the yaml code, I also figured that having an array will make things a bit easier to describe via swagger v2.0. It also opens future possibilities to make the items in the array searchable in a uniform way. I'm moving away from the s3: {foo: bar} idea; it seems like it won't be a general enough solution to encompass future cases, e.g. Azure, Ali cloud and whatever other protocol/pseudo-protocol people might use to point to their data. @zflamig also pointed out the issues with its extensibility, and I agree with that. See the notes below.

I slightly prefer access-id to access-method-id, just because it's shorter. Again, I'll go with the sense of the crowd.

I agree. Already used access_id in my local changes for the PR.

I suggest not returning a uri unless you can actually use it to fetch bytes. That means the DRS implementor can choose what kinds of access are supported -- if they return a uri the caller uses it to fetch bytes directly (and is responsible for knowing what if any auth tokens they need to pass), if they return an access-method-id the caller uses it to call the /access method, and if they return both (which I think will be rare) the caller can choose.

As you mentioned, there are use cases (perhaps rare) that you might have both uri and access_id at the same time. Example case: a controlled-access file stored in a cloud bucket will have a URI s3://foo/bar.bam which might be enough for a user who has a direct bucket access, but for the ones who don't have a direct bucket access, they will rely on signing a URL via access_id. But the schema would require having the access_id property to be used in /access/<access_id> path.

GET /objects/{object_id}/access/{access_id} via Authorization Request Header

Retrieve a URL to access bytes of a (controlled-access) Object

Response:

{
  "url": "string"
}

GET /objects/{object_id}

Response:

{
   "object": {
      "id*": "string",
      "name*": "string",
      # ... rest of the properties
      "access_level*":  "open | controlled",
      "access_methods*": [
         {
            "uri*": "string",
            "access_id*": "string",
            "cloud_metadata": {
               "region*": "string",
               "provider*": "string"
            },
            "protocol*": "string"
         }
      ]
   }
}

Some notes:

access_level It would help to be programatically explicit about the access level for a data object to set expectations on accessing bytes. Since all of the access_methods would conform to this, could be a property of a data object. Values could be enumerated as either open or controlled. @dglazer, I guess it also relates to this issue: https://github.com/ga4gh/data-repository-service-schemas/issues/25
uri Does anyone have a strong opinion not to have this field as required?
access_id URL-encoded identifier for an access point, to be used in /access/<access_id> path.
cloud_metadata Optional field for cloud access methods. Motivation for moving this as an object to have a more modular model for an access_method, as well as extending child properties for future needs e.g zone, egress cost details etc.
region It won't make sense to enumerate values since they are cloud-provider dependent. The format should, however, follow the naming convention of the particular provider, e.g. us-west-1 in aws vs us-west1 in gcp. Will investigate to see if there is a formal spec for formatting this.
provider This would help findability for the cases where an object has a lot of mirrors in the cloud, and also to set expectations of region format. E.g ali, aws, gcp, azure etc
protocol Wasn't sure how to name this right. The idea is to enumerate URI "types" of access methods in order to set expectations programmatically, e.g. ftp, s3, gs, http, https, drs, globus etc. Values are unlikely to conform to rfc7595, since we have cases that use pseudo-protocols.
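To make the access_id note concrete, here is a minimal Python sketch of how a client might build the /objects/{object_id}/access/{access_id} path proposed above. The helper name access_path is hypothetical, and the sketch assumes the server hands out a raw identifier and leaves percent-encoding to the client. If the server already returns a URL-encoded access_id, the quoting step should be skipped to avoid double encoding.

```python
from urllib.parse import quote


def access_path(object_id: str, access_id: str) -> str:
    """Build the proposed access path for one access method of an object.

    Hypothetical helper: both identifiers are percent-encoded so that
    characters such as "/" or spaces cannot change the shape of the path.
    """
    return f"/objects/{quote(object_id, safe='')}/access/{quote(access_id, safe='')}"
```

Passing safe='' matters here: by default quote leaves "/" untouched, which would let an access_id containing a slash add an extra path segment.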

bwalsh commented 5 years ago

Great discussion. It is rewarding to see this work move forward.

Consumers who need to answer 'what data is closest to me?' or 'where should I execute this pipeline?' can leverage the provider/region properties to answer these and other auction use cases. Long term, I'm convinced these use cases will lower cost.

How can we encourage implementors to populate these fields? As I understand the schema, an implementor could conform to the spec and never populate them. i.e. Should 'cloud_metadata' be mandatory for certain access methods [s3, gs,...]?

Also, I'm assuming the checksums object is part of '# ... rest of the properties' ?

Forgive me if I've missed it, but are there formal dependencies to a (probably separate) Search Service to query this data?

BTW, I always thought that urls was misnamed, nice to see it morph to access_methods.

philloooo commented 5 years ago

@sarpera I'd like to bring up again a point that Zac and I mentioned in previous comments: the drs access_method needs a provider field, so that we are not forced to put the hostname in the identifier of the drs uri.

        {
          "uri": "drs://<someid>",
          "provider": "drs.example.org"
        }

ddietterich commented 5 years ago

I think we need to put a DNS name in the DRS URI. Otherwise, we have to get into the business of service resolution. I don't have much appetite for boiling that ocean.

dglazer commented 5 years ago

@sarpera -- glad we're converging. Happy to hash out the remaining details in the PR, but in case it's helpful here are a few thoughts on your latest comment:

What do you think about using access_type for protocol ? Then every method would have an access_type and at least one of an access_uri and/or access_id.
I don't understand why we need a provider field in addition to access_type -- when would those be different? (e.g. I think provider = "AWS" if and only if access_type="s3".)
Re uri -- I feel pretty strongly it shouldn't be required for methods where the caller can't use it (e.g. for unsigned s3 uris in an implementation where only signed urls are allowed).
Re access_level -- I'd prefer to leave that out of the spec, at least for now -- I expect different implementations will have a wide range of needs here, so it's a good area for pre-standard experimentation. (But I'd be interested to hear which of our implementors expect they would use it.)
A few related thoughts about how strongly- or weakly- typed we want to be:
- I have mixed feelings about the "pseudo-typing" we end up with by having the expected sub-properties of an access_method object depend on the value of the access_type. (For example, we expect a region if the type is gs or s3, and not if it's ftp or aspera.)
- I have mixed feelings about nesting the cloud_metadata properties in a sub-object. Is the idea that we have strong opinions about a set of properties that will be shared across several access types (e.g. s3, gs, Ali, Azure), but won't be shared with other access types?
- At one extreme we'd end up with a single generic access_method object containing the superset of all the properties we think we'll see. (E.g any type can include a region property; types that don't need it ignore it.) At the other extreme we have a different fully-typed object for each access type, with different types often using the same name for some of their properties. (E.g. s3 and gs types would both be defined to include a region.)
- Note that in all cases, I expect individual implementors of v1 to experiment with new pre-standard properties that make sense in their environment, but aren't yet widely agreed on. (We could even encourage an x-property naming convention if we wanted.) That means that we don't have to get v1 perfect, and we can incorporate widely-used new properties in v2.
- I don't love any of the options I've thought of; at the moment I lean toward the single object with the superset of all properties. I'm least comfortable with having partially-shared properties used by some but not all types, as proposed with cloud_metadata.
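As an illustration of the "single object with the superset of all properties" option, here is a small Python sketch. It is only one possible reading of the idea: the class name and the exact field set are assumptions, with field names borrowed from this comment (access_type, access_uri, access_id, region). The only rule it enforces is the one stated above, that every method carries an access_type and at least one of access_uri or access_id.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AccessMethod:
    """One generic access method; properties a type does not need stay None."""

    access_type: str                   # e.g. "s3", "gs", "ftp"
    access_uri: Optional[str] = None   # usable directly to fetch bytes
    access_id: Optional[str] = None    # passed to the /access method instead
    region: Optional[str] = None       # ignored by types that have no regions

    def __post_init__(self) -> None:
        # The one cross-field rule from the discussion: at least one way in.
        if self.access_uri is None and self.access_id is None:
            raise ValueError("access method needs an access_uri, an access_id, or both")
```

Under this shape an ftp method simply leaves region unset, while a fully typed alternative would need a separate class per access type.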

susheel commented 5 years ago

Extending @dglazer's example, why can't we be explicit as below?

GET /objects/{object_id}

"access_methods": [
   {
      "s3": {
         "access_id": "drs://server.com/access/s3-us",
         "region": "us-east-1"
      }
   },
   {
      "s3": {
         "access_id": "http://server.com/get_object/s3-eu",
         "region": "eu-central-1"
      }
   },
   {
      "gs": {  # callers can either fetch bytes directly from the access_uri or use the access_id to get a direct uri
         "region": "us-west1",
         "access_uri": "gs://foo/bar.bam",
         "access_id": "drs://server.com/access/gs-1"
      }
   },
   {
      "ftp": {  # there's no access_id, meaning the caller has to fetch the bytes directly
         "access_uri": "ftp://foo.com/bar.bam"
      }
   }
]

This way implementors can also point the user to use external services to implement the /access method.

I'm personally against baking the /access method into the DRS specification, but I may be in the minority and am happy to commit if the community goes this way. If we do, we need to agree on #214, as there will be many mechanisms to get signed URLs.

I'm still unclear how this will work for non-cloud private data URLs (FTP, GSIFTP, etc.). Could "contact": "John Doe <john@doe.com>" be added to each access_method class? Or can we have a pseudo access_id that returns the contact info for private FTP, GSIFTP, etc.?

dglazer commented 5 years ago

@philloooo -- I suggest we split discussion of the drs access_method into a separate issue/PR; I don't think the details of that discussion will affect the outcome of this discussion.

dglazer commented 5 years ago

@bwalsh , re a separate Search Service -- my mental model is that there aren't any formal dependencies, but there's an expectation that, once a search/discovery API is defined, many callers of DRS will use it to get the object ids they pass in to DRS.

zflamig commented 5 years ago

@dglazer @sarpera

I don't understand why we need a provider field in addition to access_type -- when would those be different? (e.g. I think provider = "AWS" if and only if access_type="s3".)

This is explicitly being driven by a GDC/DCF use case where we run on-premise object storage systems that use an S3 compatible API so we need to know if the file is actually on Amazon or our local object storage. We could potentially overload the region to capture it too, but just having a provider field seems more clean.

dglazer commented 5 years ago

Thanks @zflamig for the explanation -- my quick reaction is that's cleaner to represent as either a pseudo region (if from the caller's point of view it behaves exactly like S3 except the bytes are in a physically different place), or a different access_type (if the caller needs to be aware of more method-specific differences). Basically I'd rather keep the API itself simpler for the majority of callers, and have implementation-specific needs fit into implementation-specific extensions. Thoughts?

zflamig commented 5 years ago

@dglazer In general I agree, but in practice, thinking about how clients might actually interact with this information, I feel like keeping it separate is easier. For example, a client that only knew how to support AWS S3 would have a very simple check on the provider to see if it's the AWS hostname. If they had to check regions instead, they would need to know a list of all the current AWS regions to tell whether the listed region is real or not.

Ultimately, I'm happy either way so long as we agree to support this in some fashion. I just have a strong preference towards clients being able to write stable code that is easy to test and doesn't need to be updated when AWS adds new regions.
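A rough Python sketch of the kind of check I have in mind is below. The key names access_type and provider and the hostname constant are assumptions for illustration only, since none of them are agreed on yet.

```python
# Assumed value for the provider field of methods served by Amazon itself.
AWS_S3_PROVIDER = "s3.amazonaws.com"


def is_aws_s3(access_method: dict) -> bool:
    """True only for S3 access methods that are actually hosted on AWS.

    An on-premise store with an S3-compatible API would use the same
    access_type but a different provider, so it is rejected here.
    """
    return (
        access_method.get("access_type") == "s3"
        and access_method.get("provider") == AWS_S3_PROVIDER
    )
```

The point is that this check stays the same when AWS adds regions, whereas a region-based check needs an up-to-date region list.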

tetron commented 5 years ago

My $.02 the method, provider, and region should be separate. For example, it might be cheaper to transfer between two Google cloud regions than to transfer between AWS and Google cloud regions that are physically closer. (On the other hand, physically closer might be faster).

As I mentioned at the F2F, a client library that implements a preference matrix for deciding how to fetch data would probably help shed some light on the best way to represent this.

sarpera commented 5 years ago

Thanks for all the feedback!

@bwalsh

Consumers who need to answer 'what data is closest to me?' or 'where should I execute this pipeline?' can leverage the provider/region properties to answer these and other auction use cases. Long term, I'm convinced these use cases will lower cost.

Exactly, this is what drove us initially to use a defined language to describe the access methods.

How can we encourage implementors to populate these fields? As I understand the schema, an implementor could conform to the spec and never populate them. i.e. Should 'cloud_metadata' be mandatory for certain access methods [s3, gs,...]?

One way to go about it is to have a strongly typed schema model per access method and enforce the required params for each. The schema models for those individual access methods should evolve organically as we iterate over more use cases in time.

Also, I'm assuming the checksums object is part of '# ... rest of the properties' ?

Yes. Wanted to skip details since there is an issue for that already.

@zflamig @philloooo SevenBridges also has the same use case as Zac defined for cloud URIs. I'll suggest adding provider only for cloud-related access methods. As for the DRS URLs, I agree with @dglazer that the resolution of a DRS URL might get tricky if it is coupled with provider info. For the cases Zac defined, any strong opinions against having a provider string in s3, gs etc? See the updated model below.

@susheel Yes, being explicit seems like where most of us align. I also agree that #214 needs to be agreed upon.

The strong case, at least for us, to push for /access is that our DRS server will be the same service that provides signed URLs on demand for private cloud resources in DRS. Avoiding this would make DRS URLs pointing to controlled-access data quite useless on their own. With the proper authN/Z in place, DRS would provide access to private/protected/public data in our case, making DRS URLs programmatically actionable, especially for WES scenarios.

I'm still unclear how this will work for non-cloud private data URLs (FTP, GSIFTP, etc.). Could "contact": "John Doe <john@doe.com>" be added to each access_method class? Or can we have a pseudo access_id that returns the contact info for private FTP, GSIFTP, etc.?

It would help greatly if you could provide a complete use case for non-cloud private data URLs.
How is the data actually accessed after a user is given privileges? What happens when I contact the author and I'm given access somehow? Do I add my credentials to the URI of the ftp file? If we add contact info per access method, how would we utilise this programmatically? Overall, it would be awesome to cover as many cases as possible with the idea behind /access, or shape it differently based on the use cases.

@tetron

My $.02 the method, provider, and region should be separate. For example, it might be cheaper to transfer between two Google cloud regions than to transfer between AWS and Google cloud regions that are physically closer. (On the other hand, physically closer might be faster).

Agreed. Since we seem to go for strongly typed access methods, we could make cases for cloud-related methods to provide this information. Seven Bridges and @zflamig also have use-cases for an explicit provider property. Note that it doesn't affect the original proposal i.e /access/<access-id>

Schema

Object has access_methods property which is an array of AccessMethods:

 "access_methods": <AccessMethod>[]

where an AccessMethod is:

{
    "<x>": <xAccessMethod>
}

DRS defines the values for x and their corresponding schema models i.e xAccessMethod in the specification.

Example AccessMethods:

{
    "s3": {
        "uri*": "string",
        "access_id": "string",
        "region*": "string",
        "provider": "string",
        "allowed_regions": [
            "string"
        ]
    }
}

{
    "drs": {
        "uri*": "string"
    }
}

{
    "ftp": {
        "uri*": "string"
    }
}

Example response of an object:

{
    "object": {
        "id": "1234",
        "name": "bar.bam",
        # ... rest of the properties
        "access_methods": [
            {
                "s3": {
                    "uri": "s3://foo/bar.bam",
                    "access_id": "s3-1",
                    "region": "us-west-1",
                    "provider": "s3.amazonaws.com",
                    "allowed_regions": [
                        "us-west-1", "us-east-1"
                    ]
                }
            },
            {
                "gs": {
                    "uri": "gs://foobaz/bar.bam",
                    "access_id": "gs-1",
                    "region": "us-central1",
                    "allowed_regions": [
                        "us-central1"
                    ]
                }
            },
            {
                "ftp": {
                    "uri": "ftp://foo.org/baz/bar.bam"
                }
            },
            {
                "drs": {
                    "uri": "drs://some-other-drs.org/9876"
                }
            }
        ]
    }
}

The initial idea of having strongly typed access methods seems to be favoured by most of us. Please note that with this approach, in order to add a new access method we'd need to define it and update the schema. It is of course expected that said AccessMethods will gain more properties. Now that we seem to agree on being explicit about individual access methods, there is room for that. IMHO it adds value in the long run when it comes to setting expectations for the DRS consumer/client parsing this information.

Thoughts?

tetron commented 5 years ago

What is allowed_regions ?

Why {"s3": { ... } } instead of {"method": "s3", ...} ?

What is the difference between uri and access_id ? I see they are different here but is there a reason it can't just provide uri when using the /access/ endpoint?

For the ftp case, perhaps the provider should be the ftp host? I think it is okay if region is sometimes null but I think provider should always be filled in, even if it is just a hostname.

I am thinking about the client's decision matrix. I think we want a tuple of (method, provider, region) and the client assigns a preference or weight to each tuple based on (a) availability of credentials and (b) function of cost and expected transfer speed.

For the case where the DRS server can hand out a signed URL, it should indicate that (by filling in access_id?)

For the private access case, the client can have a table of credentials that correspond to various combinations of (method, provider, region) (could include wildcards.)
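Here is a rough Python sketch of that decision matrix, just to make the idea concrete. Everything in it is hypothetical: the function names, the use of shell-style wildcards for the credentials table, and the rule that a method is usable when the server can sign a URL (access_id present) or the client holds matching credentials. The cost function stands in for whatever mix of price and expected transfer speed a client cares about.

```python
from fnmatch import fnmatchcase
from typing import Callable, Dict, List, Optional, Tuple

Key = Tuple[str, str, str]  # (method, provider, region)


def key_of(access_method: Dict) -> Key:
    """Reduce an access method to its (method, provider, region) tuple."""
    return (
        access_method.get("method") or "",
        access_method.get("provider") or "",
        access_method.get("region") or "",  # a null region becomes ""
    )


def matches(key: Key, pattern: Key) -> bool:
    """Compare a key with a credentials entry; "*" acts as a wildcard."""
    return all(fnmatchcase(value, glob) for value, glob in zip(key, pattern))


def choose(
    methods: List[Dict],
    credentials: List[Key],
    cost: Callable[[Key], float],
) -> Optional[Dict]:
    """Pick the cheapest access method the client is able to use."""
    usable = [
        m for m in methods
        if "access_id" in m or any(matches(key_of(m), p) for p in credentials)
    ]
    return min(usable, key=lambda m: cost(key_of(m)), default=None)
```

In this sketch anonymous access would be expressed as a credentials entry such as ("ftp", "*", "*"), so the client still controls which unauthenticated sources it is willing to use.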

susheel commented 5 years ago

@sarpera For the ftp AccessMethod, see example below

{
  "method*": "ftp",
  "provider*": "string",
  "uri*": "string",
  "region": "string",
  "contact": "string"
}

Fully realised example:

{
  "method": "ftp",
  "provider": "ftp.ebi.ac.uk",
  "uri": "ftp://anonymous:anonymous@ftp.ebi.ac.uk/dataset/path/file",
  "region": "null",
  "contact": "Contact John Doe <john.doe@example.com>"
},
{
  "method": "ftp",
  "provider": "ftp-private.ebi.ac.uk",
  "uri": "ftp://ftp-private.ebi.ac.uk/dataset/path/file",
  "region": "ebi-hh",
  "contact": "Contact Jane Doe <jane.doe@example.com>"
}

I'm guessing it would be the same for gridftp and sftp. Globus and Aspera will be a little more complicated - I would need to think about this a little more.

susheel commented 5 years ago

@sarpera Do you see the possibility of having a local AccessMethod too. Example below:

{
  "method*": "local",
  "provider*": "string",
  "uri*": "string",
  "region": "string",
  "contact": "string"
}

Fully realised example:

{
  "method": "local",
  "provider": "ebi-cluster.ebi.ac.uk",
  "uri": "file://public/path/file",
  "region": "ebi-hx",
  "contact": "Contact John Doe <john.doe@example.com>"
},
{
  "method": "local",
  "provider": "ebi-yoda.ebi.ac.uk",
  "uri": "file://private/path/file",
  "region": "ebi-hh",
  "contact": "Contact Jane Doe <jane.doe@example.com>"
}

sarpera commented 5 years ago

@tetron

What is allowed_regions ?

Buckets can be set to incur outbound (egress) costs outside of their region, even within the same cloud provider. This provides more information in the decision-making process for picking the most appropriate mirror of the file. Perhaps not the best name for the attribute though.

Why {"s3": { ... } } instead of {"method": "s3", ...} ?

The former allows defining a schema model per access method, so that method-specific attributes can be defined and enforced for consistency. Happy to discuss if the same goal can be achieved in a different way.

What is the difference between uri and access_id ? I see they are different here but is there a reason it can't just provide uri when using the /access/ endpoint?

@dglazer also made some points about it. There may be cases where, for a specific access method, the URI does not give any means of access, e.g. a file residing in a VPC where the only means of providing access to third parties is signing a URL via the /access method. We keep both attributes for cases such as having both direct bucket access and the option to sign a URL on demand, depending on the consumer of the object. In those cases both attributes can in fact provide access, so both are meaningful to have.

Please also note that the cloud data owners may not want to (or be allowed to) expose their bucket names in the URIs, but may provide access via /access/<access_id> with the proper authZ.

We could pursue an additional capability where, while keeping /access/<access_id>, we also provide /access?uri="<uri_string>". But IMVHO having a dedicated path like /access/<access_id> is much cleaner and more approachable considering the above cases.

I am thinking about the client's decision matrix. I think we want a tuple of (method, provider, region) and the client assigns a preference or weight to each tuple based on (a) availability of credentials and (b) function of cost and expected transfer speed.

This is a very important point, and defining the individual access method attributes with that goal in mind would help us achieve it. I hope this aligns with your second question and the answer I tried to provide.

For the case where the DRS server can hand out a signed URL, it should indicate that (by filling in access_id?)

Yes, exactly. Having that dedicated path /access/<access_id> will make this clear in the schema since the attribute required to craft this path will be enforced. This also partially answers your third question. Happy to explore alternative approaches if this isn't intuitive.

For the private access case, the client can have a table of credentials that correspond to various combinations of (method, provider, region) (could include wildcards.)

Could you please explain this a bit more with examples? Are you talking about discoverability of the available access methods based on existing client conditions?

@susheel thanks for the examples.

For the ftp case, perhaps the provider should be the ftp host? I think it is okay if region is sometimes null but I think provider should always be filled in, even if it is just a hostname.

Based on the previous schema definitions I provided, each defined access method would have its own attributes based on its needs. So region would be null for ftp cases. Similarly if the provider doesn't add any information, we don't need to add that for ftp case. So we could do something like this:

access_methods: [
    { 
        ftp:    {
            "uri": "ftp://anonymous:anonymous@ftp.ebi.ac.uk/dataset/path/file",
            "contact": "Contact John Doe <john.doe@example.com>"
        }
    },
    {
        ftp:    {
            "uri": "ftp://ftp-private.ebi.ac.uk/dataset/path/file",
            "contact": "Contact John Doe <john.doe@example.com>"
        }
    }
]

@sarpera Do you see the possibility of having a local AccessMethod too. Example below:

local could be a new access method then I presume. Let's gather more use cases to define its attributes.

susheel commented 5 years ago

@sarpera For ftp I would agree with @tetron's previous comment:

For the ftp case, perhaps the provider should be the ftp host? I think it is okay if region is sometimes null but I think provider should always be filled in, even if it is just a hostname.

Having a provider set for all access_methods, even if it is just a hostname, would make filtering easier, so I would add this into your example above.

region should be available to the ftp access method, e.g. "region": "ebi-hx" or may be optionally set to null, e.g. when behind a loadbalancer.

sarpera commented 5 years ago

@susheel making filtering easier by using provider is a solid point. I guess the only argument against it was to couple the provider and URI resolution for the DRS case. I'm all up for having a provider field as long as it's not coupled with means of accessing, which should be a job of uri field or /access/<access_id> method.

So updated example would be:

access_methods: [
    { 
        ftp:    {
            "uri": "ftp://anonymous:anonymous@ftp.ebi.ac.uk/dataset/path/file",
            "contact": "Contact John Doe <john.doe@example.com>",
            "provider": "ftp.ebi.ac.uk"
        }
    },
    {
        ftp:    {
            "uri": "ftp://ftp-private.ebi.ac.uk/dataset/path/file",
            "contact": "Contact John Doe <john.doe@example.com>",
            "provider": "ftp-private.ebi.ac.uk"
        }
    }
]

I feel like the contact property could be defined more strongly. Maybe more structured? We could be more explicit about the value to set expectations right. Perhaps a field for just an email as a value, or a field for ORCIDs? Just to avoid open-ended, vague string values.

region should be available to the ftp access method, e.g. "region": "ebi-hx" or may be optionally set to null, e.g. when behind a loadbalancer.

Is region an established term for ftp cases? In the cloud cases, the meaning of the property is quite well-established. Naive question: how would region affect decision making on picking the right access method for FTP-like URIs?

sarpera commented 5 years ago

Seems like the current version of OpenAPI doesn't allow patternProperties as of 3.0.2. Unless we want to hardcode all available access methods (ftp, http, s3 etc) in the schema and pair them with a schema model (array of objects), the above approach won't work in practice.

@tetron

Why {"s3": { ... } } instead of {"method": "s3", ...} ?

Going back to this approach, it seems that with OpenAPI v3 this can be achieved while still enforcing certain properties per access method (region for cloud methods etc) by making use of anyOf when defining the access_methods array.

I'd like to recap our requirements so far about the access methods before adjusting them to v3.0.

access_methods

required prop. of an Object
lists available access methods for an object
array with minimum 1 item

And so far our use cases for an access method are:

Open access data in the cloud (public access buckets)
Private/controlled access data in the cloud (VPC, private buckets etc)
Open/controlled access files on ftp, gsiftp, globus, aspera etc
DRS of DRSes (URI is another DRS URI)
Data in a local file system (URI is a local path)

Required / Optional bare minimum properties are different:

Use Case	`URI`	`method`	`region`	`access_id`	`provider`	`contact`
Cloud - open	R	R	R	O	O	O
Cloud - controlled/private	O	R	R	R	O	O
FTP, HTTP, globus, aspera etc	R	R	-	O	O	O
DRS of DRSes	R	R	-	O	O	O
Local	R	R	-	?	O	O

Based on that, examples would look like:

Open access data:

"access_methods": [
      {
        "uri": "ftp://foo.example.com/file.name",
        "method": "ftp",
        "provider": "foo.example.com"
      },
      {
        "uri": "s3://foo-open-bucket/file.name",
        "method": "s3",
        "provider": "s3.amazonaws.com",
        "region": "us-east-1"
      }
],

Controlled/private access data:

"access_methods": [
      {
        "uri": "ftp://bar.example.com/file.name",
        "method": "ftp",
        "provider": "foo.example.com",
        "contact": "foo@example.com"
      },
      {
        "access_id": "s3-1",
        "method": "s3",
        "region": "us-east-1"
      },
      {
        "uri": "drs://foo.example.com/123",
        "method": "drs"
      }
],

Enforcing required/optional properties can be done via anyOf and enumerated values for the method property in the schema. (Please note that inheritance in OpenAPI does not allow overriding required properties, hence some redundancies below.)

    AccessMethods:
      type: array
      description: The list of access methods that can be used to access the Data Object.
      minItems: 1
      items:
        anyOf:
        - $ref: '#/components/schemas/StaticAccessMethod'
        - $ref: '#/components/schemas/CloudAccessMethod'
        - $ref: '#/components/schemas/ActionableStaticAccessMethod'
        - $ref: '#/components/schemas/ActionableCloudAccessMethod'
        discriminator:
          propertyName: method

    ActionableAccessMethod:
      type: object
      required:
        - access_id
      properties:
        access_id:
          type: string

    ActionableCloudAccessMethod:
      type: object
      allOf:
        - $ref: "#/components/schemas/ActionableAccessMethod"
        - type: object
          required:
            - region
            - method
            - access_id
          properties:
            uri:
              type: string
            method:
              type: string
              enum:
                - s3
                - gs
            region:
              type: string
              description: >-
                Name of the region in the cloud service provider that the object belongs to.
              example:
                us-east-1
            provider:
              type: string

    CloudAccessMethod:
      type: object
      required:
        - uri
        - region
        - method
      properties:
        uri:
          type: string
        provider:
          type: string
        method:
          type: string
          enum:
            - s3
            - gs
        region:
          type: string
          description: >-
            Name of the region in the cloud service provider that the object belongs to.
          example:
            us-east-1

    ActionableStaticAccessMethod:
      type: object
      allOf:
        - $ref: "#/components/schemas/ActionableAccessMethod"
        - $ref: "#/components/schemas/StaticAccessMethod"

    StaticAccessMethod:
      type: object
      required:
        - uri
        - method
      properties:
        method:
          type: string
          enum:
            - ftp
            - sftp
            - http
            - https
            - nfs
            - globus
            - aspera
            - gsiftp
            - local
        uri:
          type: string
        provider:
          type: string
        contact:
          type: string
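
For anyone who wants to try the rules from the table without an OpenAPI validator, a rough Python equivalent is sketched below. It is an approximation under assumptions, not a replacement for the schema: the function name is hypothetical, cloud methods are taken to be s3 and gs as in the enums above, and the open and controlled cloud rows are merged into "region plus at least one of uri or access_id".

```python
CLOUD_METHODS = {"s3", "gs"}  # assumed to match the cloud enums in the schema


def missing_fields(access_method: dict) -> list:
    """List the required properties that an access method is missing."""
    missing = []
    if "method" not in access_method:
        missing.append("method")
    if access_method.get("method") in CLOUD_METHODS:
        if "region" not in access_method:
            missing.append("region")
        # Open data needs a uri, controlled data an access_id; either will do.
        if "uri" not in access_method and "access_id" not in access_method:
            missing.append("uri or access_id")
    elif "uri" not in access_method:
        # FTP, HTTP, DRS of DRSes and local methods always need a uri.
        missing.append("uri")
    return missing
```

An empty list means the access method satisfies the bare minimum for its use case.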

susheel commented 5 years ago

@sarpera Thanks for investigating the OpenAPI spec compatibility. I agree with @tetron it would have been cleaner, but I guess we will have to live within our means! :)

I thought we'd discussed (maybe not agreed) that we will be more explicit with the access_id to be a uri. E.g.

Controlled/private access data:

"access_methods": [
      {
        "access_id": "drs://server.com/access/s3-1",
        "method": "s3",
        "region": "us-east-1"
      },
      {
        "access_id": "http://server.com/get-object/s3-1",
        "method": "s3",
        "region": "us-east-1"
      }
],

Which I hope will work for your use case when it is provided by the DRS service, and when it may be provided by a third-party service.

P.S. If this is acceptable, why have it called access_id, we could just call it uri

sarpera commented 5 years ago

@susheel access_id was made explicitly to be used in this path /objects/<id>/access/<access_id>, which will generate or return a ready-to-use URL to bytes (signed url, url with encoded credentials etc), via the Authorization request header. It should not be used interchangeably with uri; they serve different purposes.

In your example,

{
   "access_id": "drs://server.com/access/s3-1",
   "method": "s3",
   "region": "us-east-1"
}

Please note that access_id is unique for an object, not per DRS server unlike the DRS URL. So it would have to be drs://server.com/objects/<object_id>/access/s3-1.

Keeping that in mind, method and region info is redundant, since the same info would be available on drs://server.com/objects/<object_id>/. And if the file is moved to another region, you'd need to have a mechanism to ping the DRS who links it and update the redundant info.

Also it's ambiguous what token value for Authorization request header should be used in the linked DRS server URI, since it could be a different value.

If with DRS of DRSes we are aiming to redirect the client to another, this is indirect but not ambiguous:

{
   "uri": "drs://server.com/<object_id>",
   "method": "drs"
}

Alternatively, we could utilise the alias property of an object, to link/mirror other DRS URLs.

GET /objects/<id>

{
   "id": 123,
   "name": "foo",
   "checksums": ["# list here"],
   "access_methods": ["# list here"],
   # rest of the props
   "alias": ["drs://server.com/<object_id>"]
}

dglazer commented 5 years ago

1) @sarpera, thanks for continuing to deep dive into OpenAPI syntax. Given what you found, I agree with the general direction of your latest proposal. Notes:

- Do we need to separate the Actionable... methods from the others? IIUC (which I may not), the idea is to distinguish one-step from two-step access. But I believe all the properties are the same in the two cases, except for whether access_url (one-step) or access_id (two-step) are expected. So if our OpenAPI just always allowed both, and our documentation said "you must provide at least one", we'd be fine, and the spec would be easier to maintain.
- I don't think region should be required. If servers don't want to specify a region (either because they don't offer choice, or because they're using multi-region storage), and callers are okay with that, we should allow it.
- I still don't understand why we need provider, but could be missing something. Every use case I can think of where the caller actually cares about provider feels better modeled by introducing a new method. Does someone have a good counterexample? @susheel, I know you said that _"Having a provider set for all access_methods even if it's just a hostname would make filtering easier"_, but I don't understand why -- do you have an example in mind, where filtering by method wouldn't be just as good?
- I have mixed feelings about contact, since I don't know what clients are supposed to do with it. I can picture some possible value to the developers building a client, but that seems like an odd thing to put into the mainstream API, vs. (e.g.) the /service-info endpoint. I suggest we leave it out for now, and then if someone feels strongly we can open a separate issue/PR to discuss.
- I suggest using access_url instead of uri, to make it clear that you use it to fetch the actual object bytes, as opposed to fetching some intermediate thing
- I suggest always listing the method first in your examples. Syntactically it's the same thing, but it makes it more readable, since it's probably the first thing the caller cares about, and it tells the caller what other parameters to expect.

2) @susheel, re the format of access_id -- it did come up earlier, but I don't think we reached consensus. I think we all agree that the end goal is for callers to get an access_url, which they can use to directly fetch object bytes; the question is how they get that access_url.

The simple case is when only a single step is needed (e.g. for public content); in that case the server can return an `access_url` directly. The trickier case is when two steps are needed (e.g. for signed URLs).

The two-step pattern I prefer, mostly because the behavior feels more explicit, is:
  - servers can return an opaque `access_id` string, in any format they choose
  - callers pass that `access_id` to a well-defined method on the same server (e.g. `/objects/<id>/access/<access_id>`), which returns the `access_url`
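
From the caller's side, the one-step and two-step cases could be handled by a single helper. The following is only a rough sketch: the `fetch_json` callable stands in for an HTTP GET against the DRS server, and the `{"access_url": ...}` response shape of the `/access` call is an assumption, not something settled in this thread.

```python
from typing import Callable, Dict
from urllib.parse import quote


def resolve_access_url(
    object_id: str,
    access_method: Dict[str, str],
    fetch_json: Callable[[str], Dict[str, str]],
) -> str:
    """Return a URL the caller can use to fetch the object bytes."""
    # One-step case: the server already returned a usable URL.
    if access_method.get("access_url"):
        return access_method["access_url"]

    # Two-step case: exchange the opaque access_id for an access_url.
    access_id = access_method.get("access_id")
    if not access_id:
        raise ValueError("access method has neither access_url nor access_id")

    # The access_id is opaque, so percent-encode it before putting it in a path.
    path = (
        f"/objects/{quote(object_id, safe='')}"
        f"/access/{quote(access_id, safe='')}"
    )
    # Assumed response shape: {"access_url": "<url>"}.
    return fetch_json(path)["access_url"]


def fake_fetch(path: str) -> Dict[str, str]:
    """Hypothetical stand-in for a GET on the DRS server, for illustration."""
    return {"access_url": "https://example.org/signed" + path}
```

A real client would replace `fake_fetch` with an authenticated HTTP call to the same DRS server that returned the `access_id`.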

I believe the pattern you're suggesting is:
  - servers can return an `access_url_url` (_name TBD_) string, which must be a fully resolvable HTTP GET'table path, and can be on any server  
  - callers do an HTTP GET on the `access_url_url`, which returns the actual `access_url`

**Is that right?** If so, I'm open to discussing the tradeoffs. But I agree with @sarpera that we shouldn't mix the patterns, and call the string an `access_id` if it's actually a fetchable URL.

3) I suggest we split discussion of the drs access_method into a separate issue/PR. I think whatever we end up with here will be able to support that use case, and it will be cleaner to discuss it separately. (I for one am still not clear on the requirements, but don't want to side-track this thread to dive in.)

dglazer commented 5 years ago

As discussed in #230, updating to OpenAPI 3.0 may take longer than we'd like, and I'm eager to get the changes discussed here into a PR. So it may make sense to decouple the issues, open up a PR for this issue now doing the best we can using v2, and then revisit whenever #230 is resolved.

I think that will be fine -- we can still use the access_methods syntax you propose above, but with weaker typing in the OpenAPI definition, so some of the rules for what parameters are valid where would be enforced by policy, not by schema. Not as elegant, but perfectly functional, and we can upgrade to stronger typing later.

I picture something like (without having tested it):

    AccessMethods:
      type: array
      description: The list of access methods that can be used to access the Data Object.
      minItems: 1
      items:
        $ref: '#/definitions/AccessMethod'

    AccessMethod:
      type: object
      required:
        - method
      properties:
        method:
          type: string
          enum:
            - s3
            - gs
            - ftp
            - sftp
            - http
            - https
            - nfs
            - globus
            - aspera
            - gsiftp
            - local
        access_url:
          type: string
          description: >-
            A fully resolvable HTTP address that can be used to GET the actual object bytes.
            Note that at least one of access_url and access_id must be provided.
        access_id:
          type: string
          description: >-
            An arbitrary string to be passed to the /access method to fetch an access_url
        region:
          type: string
          description: >-
            Name of the region in the cloud service provider that the object belongs to.
          example:
            us-east-1
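
Since the "at least one of access_url and access_id" rule would be enforced by policy rather than by the schema, a server or client could check it in code. The following is a minimal sketch of such a check, not part of the proposal: the function name and error strings are made up for illustration, and the method list is copied from the enum above.

```python
from typing import Any, Dict, List

# Method names copied from the enum in the schema sketch above.
ALLOWED_METHODS = {
    "s3", "gs", "ftp", "sftp", "http", "https",
    "nfs", "globus", "aspera", "gsiftp", "local",
}


def access_method_errors(access_method: Dict[str, Any]) -> List[str]:
    """Return the policy violations for one AccessMethod (empty if valid)."""
    errors: List[str] = []

    # `method` is the only property the schema itself marks as required.
    method = access_method.get("method")
    if method is None:
        errors.append("method is required")
    elif method not in ALLOWED_METHODS:
        errors.append(f"unknown method: {method!r}")

    # Rule the schema cannot express: at least one way to reach the bytes.
    if not access_method.get("access_url") and not access_method.get("access_id"):
        errors.append("at least one of access_url and access_id must be provided")

    return errors
```

Note that `region` is deliberately not checked, in line with the suggestion that it stay optional.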

@sarpera -- wdyt? Are you up for creating a PR using OpenAPI v2 now, and confirming it's not too ugly?

susheel commented 5 years ago

> I believe the pattern you're suggesting is:
>
> - servers can return an access_url_url (name TBD) string, which must be a fully resolvable HTTP GET'table path, and can be on any server
> - callers do an HTTP GET on the access_url_url, which returns the actual access_url
>
> Is that right? If so, I'm open to discussing the tradeoffs. But I agree with @sarpera that we shouldn't mix the patterns, and call the string an access_id if it's actually a fetchable URL.

@dglazer Yes, almost. I do agree that DRS must be able to support the two-phase access mechanism.

If the DRS server only provides an access_id, it is implicit in the service description that the client must perform a subsequent GET /objects/<object-id>/access/<access-id> to get the access_url. Is there a use case where a user will only need the access_id? Making the access_url_url (your name :) more configurable or explicit in the service description would allow service providers to support external mechanisms that offer other ways of obtaining access_urls.

Either way, I agree with @sarpera that we also need to iron out how AUTH tokens are specified and passed to /access or an access_url_url, in #47 or #229.

dglazer commented 5 years ago

@susheel, it sounds like we largely agree on the two options; good. @sarpera, I suggest you pick one (you know my vote), put it into the PR, and then we can discuss and finalize there.

A few comments on the details:

- re "implicit in the service description" -- I think it's the job of the spec to make that explicit, including in the comments describing access_id and the /access method
- re "_Is there a use case where a user will only need the access_id_" -- no, I don't see one. You have to turn the id into a URL somehow in order to fetch object content
- re AUTH tokens -- yes, we need to iron that out. In the /access model, I think the answer matches how we handle auth for all other calls to the DRS server, including to the /object call that was used to get the access_id in the first place. It feels more complicated in the access_url_url model, since that's a whole different server, so presumably has a different set of auth needs