ga4gh / data-repository-service-schemas

A repository for the schemas used for the Data Repository Service.
Apache License 2.0

Object metadata and download methods #213

Closed sarpera closed 5 years ago

sarpera commented 5 years ago

Background

Following the discussion we had at the GA4GH hackathon in January we would like to propose to have a method to get the metadata of an object, and then have an additional method which will provide the download of the object.

The rationale for having two methods instead of one is that signing the object through the authorisation token provider (right now this is based on the OIDC specs) is computationally expensive. Moreover, with regions and providers exposed, a DRS client will be able to decide which provider and which region would be best for obtaining the file, among all the possible URIs.

The format we propose are:

and we propose to pass the authorisation token in the Request Header to get access to the object.

This is the flow, from a DRS client point of view:

1) GET /objects/<id>

2) GET /objects/<id>/download with Request Header X-DRS-TOKEN: <TOKEN>

The token is obtained by the client from the DRS server, and it is up to the DRS Server implementer to decide how a user will obtain that.

Object metadata Request

This will return the object metadata:

HTTP Request

GET /objects/<id>

HTTP Response

{
  "object": {
    "id": "string",
    "name": "string",
    "size": "string",
    ...
    "urls": {
      "cloud": [
        {
          "uri": "s3://<foo>/<bar>.bam",
          "region": "us-east-1",
          "provider": "aws"
        },
        {
          "uri": "gs://<foo>/<bar>.bam",
          "region": "us-west1",
          "provider": "google"
        }
      ],
      "ftp": [
        {
          "uri": "ftp://foo.com/bar.bam"
        }
      ],
      "drs": [
        {
          "uri": "drs://foo.com/objects/<id>"
        }
      ]
    },
    "aliases": [
      "doi://123/abcd"
    ]
  }
}

The client will be able to pick one of the cloud URIs and request the download URI, passing the token.

Object download Request

HTTP REQUEST

GET /objects/<id>/download?type="cloud"&uri="gs://<foo>/<bar>.bam" 

    Request Header: 
    X-DRS-TOKEN: <TOKEN>

HTTP Response

The return value is a URI where a GET request will give you the bytes:

{
  "uri": "<URL_TO_BYTES>"
}

a GET <URL_TO_BYTES> will start the download of the file.
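
(Illustrative sketch, not part of the proposal.) From the client side, the two-step flow above could look roughly like the following Python. The base URL and helper names are assumptions made up for this example; the `X-DRS-TOKEN` header and the `type`/`uri` query parameters are taken from the request shown above, sent without the literal quotes and URL-encoded.

```python
from urllib.parse import quote, urlencode

# Hypothetical server root; not part of the proposal.
BASE_URL = "https://drs.example.org"


def metadata_url(object_id: str) -> str:
    """Step 1: URL for GET /objects/<id>."""
    return f"{BASE_URL}/objects/{quote(object_id, safe='')}"


def download_request(object_id: str, uri: str, token: str, url_type: str = "cloud"):
    """Step 2: URL and headers for GET /objects/<id>/download."""
    query = urlencode({"type": url_type, "uri": uri})
    url = f"{metadata_url(object_id)}/download?{query}"
    headers = {"X-DRS-TOKEN": token}
    return url, headers


url, headers = download_request("abc123", "gs://bucket/sample.bam", "my-token")
print(url)
print(headers)
```

The response to the second request would carry the `uri` field, and a plain GET on that value would fetch the bytes.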

mattions commented 5 years ago

@bwalsh @dglazer we have written up the proposal we had discussed yesterday at the hackathon.

In this issue.

@susheel, it would be great to know more about the FTP side, because we are not encountering that in our case. It would also be good if @philloooo could double-check that this captures what we had written on the whiteboard. (Got Phillis's GitHub handle right this time :)

bwalsh commented 5 years ago

@mattions - all above looks reasonable. Can you provide a link to document where X-DRS-TOKEN: <TOKEN> originates?

mattions commented 5 years ago

Hi @bwalsh,

I was aiming to tag @briandoconnor, who was present at the hackathon. My mistake, but it's great that you like it; it means we actually wrote it in a clear way :).

The token will be handled by the DRS server, so the server may offer a way to log in, or it may rely on a third-party service.

As for links, my guess is that they will be available once the DRS is up.

geoffjentry commented 5 years ago

@mattions Just to make sure it's clear - as per this proposal, the use of signed URLs will be codified as the manner in which one obtains the bytes for an object?

brucehoff commented 5 years ago

Curious why the plan is to name the header X-DRS-TOKEN rather than Authorization as per the HTTP specification: https://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.8 If the token is a bearer token then the header would look like: Authorization: Bearer <token>.
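
(Illustrative sketch.) The difference between the two header conventions is small on the client side. The function names below are made up for this example; the `Bearer` form is the one quoted above.

```python
def custom_token_header(token: str) -> dict:
    """Header as written in the original proposal."""
    return {"X-DRS-TOKEN": token}


def bearer_header(token: str) -> dict:
    """Standard HTTP Authorization header carrying a bearer token."""
    return {"Authorization": f"Bearer {token}"}


print(custom_token_header("abc"))
print(bearer_header("abc"))
```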

Also, I wonder if it might be too limiting to say that the GET /objects/<id> API does not require authorization. Could there be cases in which the DRS provider only wants to share the existence of a resource with authorized parties?

geoffjentry commented 5 years ago

@brucehoff My assumption was that it was to discern between "I have authZ to access the endpoint" (i.e. Authorization) and "Here is my authZ for the file", which might not be the same thing. I could be quite wrong, however

That ties in w/ your second question as I picture the use of the Authorization as being ubiquitous across all endpoints.

sarpera commented 5 years ago

@geoffjentry, yes.

@brucehoff, it could very well be named Authorization, we just wanted to point out that it should be an agreed name/convention we all follow. +1 for the bearer token case.

To clarify, we also pointed out at the hackathon that GET /objects/<id> could perfectly well require authz if the object in question has an auth layer in front of its basic metadata (name, size, urls etc). There are cases where an "object metadata" is controlled-access or private, and for those cases the users can pass the same http request header (Authorization: ) in the GET request to see the metadata about the object. Our suggestion does not limit that in any way; in fact we strongly suggest that GET /objects/<id> could require that header in some cases.

@geoffjentry does the paragraph above align with your thoughts?

mattions commented 5 years ago

Just fixing my tagging; I wanted @philloooo to chime in as well :)

philloooo commented 5 years ago

geoffjentry commented 5 years ago

@mattions I think so. I can totally see the case where access to the API and access to the file aren't the same and wouldn't be against setting up a structure for that (but perhaps w/ simplifying defaults, e.g. if just Authorization is passed, just use that).

NB I'm not advocating for that structure, so if others disagree so be it. But at least it doesn't bug me :)

geoffjentry commented 5 years ago

@mattions when you said "yes" was that about the presigned URL question or the authz part?

If the presigned URLs, I didn't feel like we had reached consensus that presigned URLs should be required. If I'm wrong, ignore me :)

zflamig commented 5 years ago

Can I suggest for the response to:

GET /objects/<id>

That we do something like this:

{
    "object": {
        "id": "string",
        "name": "string",
        "size": "string",
                ...
        "uris": [{
                "uri": "s3://<foo>/<bar>.bam",
                "metadata": {
                    "region": "us-east-1",
                    "provider": "s3.amazonaws.com"
                }
            },
            {
                "uri": "gs://<foo>/<bar>.bam",
                "metadata": {
                    "region": "us-west1",
                    "provider": "storage.googleapis.com"
                }
            },
            {
                "uri": "s3://<foo>/<bar>.bam",
                "metadata": {
                    "region": "us-east1",
                    "provider": "onprem.objectstorage.example.com"
                }
            },
            {
                "uri": "ftp://foo.example.com/bar.bam"
            },
            {
                "uri": "drs://<id>",
                "metadata": {
                    "provider": "drs.example.com"
                }
            }
        ]
    }
}

Specifically, I would leave the aliases array out of this response and make that a different endpoint. I'm not sure there's a use case for it here and it adds a potentially expensive join.

geoffjentry commented 5 years ago

@zflamig Part of the idea here was to specifically separate out the download URL from the metadata, as per discussions at the face to face earlier this week

zflamig commented 5 years ago

That's only the metadata for the URI @geoffjentry to help in resolving it. The original proposal includes this but it isn't clean or extensible. If a new URI appears we would have to amend the spec to add a new class for it...

sarpera commented 5 years ago

@geoffjentry

Just to make sure it's clear - as per this proposal, the use of signed URLs will be codified as the manner in which one obtains the bytes for an object?

/objects/<id>/download?foo=bar should return "url-to-bytes" in the following use case: the main use case is to provide access to a single file in a private bucket in a cloud environment, by URL signing, without giving access to the whole bucket, so that the consumer can access the raw data file bytes by passing a token in the request header. If the urls returned from /objects/<id> do not need to be signed, or you already have direct bucket access, or the urls returned are open/public access URIs, you wouldn't need to call /objects/<id>/download?foo=bar.

For the other cases, if any, where the returned urls are not ready to give access to bytes, let's add those use cases here to figure out whether they can also be solved via /objects/<id>/download?foo=bar.

@zflamig the structure of urls is open for discussion. The reason why we wanted to key-value pair them was to have a separate model in swagger for a typed url, e.g. for all cloud URLs use CloudURLModel etc. Also, for the future use cases where you would query: /objects/?url_type=ftp, if that's relevant. Regardless, your suggestion also looks clean and good to me.

@geoffjentry I think @zflamig was referring to the aliases array not being part of /objects/<id>, not the urls array. Our proposal suggests having urls as part of /objects/<id>.

@zflamig about the aliases part, do you support the use cases where aliases can be used for querying? E.g /objects/?alias=doi://<id>. We should take that discussion to a separate issue though, please feel free to create one.

philloooo commented 5 years ago

I second having a provider field (which is the server domain or root endpoint) for the uri type that's a drs uri, because I don't think we agreed on the DRS uri format being restricted to drs://<server>/...

dglazer commented 5 years ago

Thanks for starting this discussion @sarpera . A few comments, some of which were partially raised by others, but I'm not sure I understand where they landed:

1) process question -- have you created a PR to go with this issue, or are you waiting for the discussion to settle down first? Either way can work, since prose can be easier for high-level design, but we won't be able to nail down the details until we get to code.

2) I agree we should delete aliases from this issue, and discuss it separately -- there are enough moving parts here already.

3) Re signed URLs -- we definitely don't want to require their use, since some DRS implementors said they won't be using them. That means that sometimes when callers want to fetch bytes, they'll have to pass an auth token when calling the URL_TO_BYTES. I think the spec for that is straightforward, and similar to language you use elsewhere -- something like:

Some access methods require an auth token to fetch bytes. It is the client's responsibility to obtain that token, using a procedure documented by the DRS implementor. (Note that a single token can often be used to fetch multiple objects from a single source.)

4) I think you're suggesting that callers only use the /download method when they need to generate a signed URL, and that they just fetch the bytes directly if not? I guess that works, but it feels confusing to me, requires special knowledge about different access methods, and uses uri-like-strings that aren't used to actually fetch data. Instead, I was picturing:

a) the first method (GET /object/<id>) takes an <id> as input and returns an array of access-methods, which include info about the type of access (e.g. "ftp", "GCP us-east", "AWS us-west"), but don't necessarily include a URI

b) the second method (GET /object/<id>/download) takes an <id> and an <access-method> as input, and returns an URL_TO_BYTES uri.

c) As an optional shortcut, if the DRS implementor chooses, they can include an URL_TO_BYTES uri in individual records returned by the first method, which lets callers skip a step. Implementors would be likely to do that for public content, and for content that can be accessed with a previously-obtained auth token (e.g. gs: and s3:), and unlikely to do that for signed URLs. But we don't have to bake that knowledge into the protocol or the calling code -- the rule for callers is "always use the URI to get bytes; if you don't get one from the first method, ask for one using the second method".

5) I agree that DRS implementors can choose to require an auth token before they respond to a GET request for metadata, and they can return an "access denied" if appropriate. We should document that somewhere, as we do for WES, but I think that auth policy and those tokens are completely separate from the policy and tokens used to call the /download method, or passed to the URL_TO_BYTES uri.

6) I agree with @brucehoff that the auth token passed to /download feels like standard HTTP auth, and should be a standard HTTP auth token. (Since the only question is whether you're allowed to fetch the bytes; there's no use case I can imagine for "it's okay to call the download method but it's not okay to download the object".)

Minor points -- these can wait until we get the big picture settled:

7) Why would a call to DRS return an access-method of type DRS? (As in your fourth example.) I'm not following the use case that addresses.

8) Are you proposing that id, name, and size the only metadata fields we support? Or is that just a placeholder in this issue, and we can figure out the actual fields in a separate issue?

9) /download feels a little off to me, especially for cloud-to-cloud use cases. Maybe /bytes or /body instead? I'm open.
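
(Illustrative sketch.) The caller rule from point 4c, "always use the URI to get bytes; if you don't get one from the first method, ask for one using the second method", could be written roughly as below. The key names `type` and `uri` and the `fetch_access_uri` callback are assumptions for this example; the callback stands in for the real HTTP call to the second method.

```python
from typing import Callable, Optional


def resolve_uri(
    object_id: str,
    access_method: dict,
    fetch_access_uri: Callable[[str, str], str],
) -> str:
    """Return a URI that can be used to fetch bytes for one access method."""
    uri: Optional[str] = access_method.get("uri")
    if uri:
        # Shortcut from 4c: the first method already returned a usable URI.
        return uri
    # Otherwise ask the second method for a URL_TO_BYTES.
    return fetch_access_uri(object_id, access_method["type"])


def fake_download(object_id: str, method_type: str) -> str:
    # Stand-in for the real HTTP call; returns a made-up signed URL.
    return f"https://signed.example.org/{object_id}?via={method_type}"


print(resolve_uri("obj1", {"type": "ftp", "uri": "ftp://foo.com/bar.bam"}, fake_download))
print(resolve_uri("obj1", {"type": "s3"}, fake_download))
```

With this shape the calling code needs no knowledge of which access types use signed URLs.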

zflamig commented 5 years ago

@sarpera Ahhhh. I see what you were trying to accomplish now. I am okay with your method if we break it up and use the protocol as the high level container. So like

    "urls": {
      "s3": [
        {
          "uri": "s3://<foo>/<bar>.bam",
          "region": "us-east-1",
          "provider": "aws"
        }
     ],
     "gs": [
        {
          "uri": "gs://<foo>/<bar>.bam",
          "region": "us-west1",
          "provider": "google"
        }
      ],
      "ftp": [
        {
          "uri": "ftp://foo.com/bar.bam"
        }
      ],
      "drs": [
        {
          "uri": "drs://foo.com/objects/<id>"
        }
      ]
    }

This way we can make it easier for future additions... if you have a new protocol you are free to require whatever metadata you want.
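
(Illustrative sketch.) The protocol-keyed layout above and the flat `uris` list from my earlier comment carry the same information; a client could convert one into the other along these lines. The function name and the added `protocol` field are assumptions for this example.

```python
def flatten_urls(urls: dict) -> list:
    """Turn {"s3": [{...}], "ftp": [{...}]} into a flat list of entries,
    recording the protocol key on each entry."""
    flat = []
    for protocol, entries in urls.items():
        for entry in entries:
            flat.append({"protocol": protocol, **entry})
    return flat


urls = {
    "s3": [{"uri": "s3://foo/bar.bam", "region": "us-east-1", "provider": "aws"}],
    "ftp": [{"uri": "ftp://foo.com/bar.bam"}],
}
for entry in flatten_urls(urls):
    print(entry)
```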

@dglazer re: #7 on your list: the use case there is for data bundles, so you can have one DRS url that dereferences to a group of DRS urls. For example, during the cohort creation process a GUID/DRS entry may be minted to represent the data that the user selected.

sarpera commented 5 years ago

@zflamig thanks, it gets better with every iteration. +1 for your suggestion.


@dglazer

  1. process question -- have you created a PR to go with this issue, or are you waiting for the discussion to settle down first? Either way can work, since prose can be easier for high-level design, but we won't be able to nail down the details until we get to code.

I haven't created a PR for exactly the reason you pointed out. And you're totally right that the details won't be clear without the code changes. I just wanted to discuss with the group some more, as the suggestions are really helping so far.


  3. Re signed URLs -- we definitely don't want to require their use, since some DRS implementors said they won't be using them. That means that sometimes when callers want to fetch bytes, they'll have to pass an auth token when calling the URL_TO_BYTES. I think the spec for that is straightforward, and similar to language you use elsewhere -- something like:

Yes, the callers would pass an auth token when calling the URL_TO_BYTES. Nothing really changes for those who don't have this use-case. Ideally, the token would be the same across all the endpoints on a DRS server, even more ideally all DRS implementations would have the same means of obtaining it. There is a related issue.


  4. I think you're suggesting that callers only use the /download method when they need to generate a signed URL, and that they just fetch the bytes directly if not? I guess that works, but it feels confusing to me, requires special knowledge about different access methods, and uses uri-like-strings that aren't used to actually fetch data. Instead, I was picturing:

I guess what I was trying to say was along these lines:

GET /object/<id> returns a list of urls/access-points, regardless of whether or not those URIs are ready to be consumed. Here is the reasoning:

1- If some of the urls/access-points are ready to consume for getting bytes (public urls), there is no need to call another endpoint.

2- For cloud URIs, the URIs returned from GET /object/<id> are enough if the consumer has direct bucket access privileges for the private bucket in question. They also wouldn't need to sign the url. So there is value in having those URIs there for this case.

3- Having GET /object/<id>/download?foo=bar primarily solves the case where a file belongs to a private bucket, and the consumer HAS TO sign a URL for that file in order to access it, without having direct access to the whole bucket. This covers the majority of the use cases for many datasets that have a controlled layer of access.

4- GET /object/<id>/download?foo=bar can be used for any other case where the urls returned from GET /object/<id> are not readily consumable.

To sum up above, I do really agree with you that:

GET /object/<id> takes an <id> as input and returns an array of access-methods

and

the second method (GET /object/<id>/download) takes an <id> and an <access-method> as input, and returns an URL_TO_BYTES uri.

sounds more reasonable. That way, the implementors MAY choose not to include the URIs in GET /object/<id> if there is no added benefit. But having the URIs there has a point, as I tried to explain above.


  5. I agree that DRS implementors can choose to require an auth token before they respond to a GET request for metadata, and they can return an "access denied" if appropriate. We should document that somewhere, as we do for WES, but I think that auth policy and those tokens are completely separate from the policy and tokens used to call the /download method, or passed to the URL_TO_BYTES uri.

+1 for this. There could be separate auth policies for both cases, and implementors MAY choose to have the same policy for both if it fits them. I guess this is not against your point.


  6. I agree with @brucehoff that the auth token passed to /download feels like standard HTTP auth, and should be a standard HTTP auth token. (Since the only question is whether you're allowed to fetch the bytes; there's no use case I can imagine for "it's okay to call the download method but it's not okay to download the object".)

+1 for standard HTTP auth token or a bearer token. One of the possible cases would be that your authz privileges might have been revoked or expired, but you already obtained a token. In that case, the flow is solid anyway, you would get a 403 on GET /object/<id>/download


  7. Why would a call to DRS return an access-method of type DRS? (As in your fourth example.) I'm not following the use case that addresses.

During the hackathon we were made aware that there are some implementors who will be using DRS as a data-registry service, without necessarily providing direct access to bytes, but instead pointing to another DRS server (via a DRS url) where the data can be accessed. Sort of like "linked DRS"es or DRS of DRSes. @susheel could you perhaps provide those use cases?


  8. Are you proposing that id, name, and size the only metadata fields we support? Or is that just a placeholder in this issue, and we can figure out the actual fields in a separate issue?

Oh no, it was just a placeholder since I didn't want to type every other metadata field, hence the .... Sorry if it was misleading. Related to this, I'm all for bringing this to another issue where we define the required fields in GET /object/<id>, based on the break-out session on the 1st day of the hackathon. The current model on swagger seems out of date.


  9. /download feels a little off to me, especially for cloud-to-cloud use cases. Maybe /bytes or /body instead? I'm open.

Same here, open for any ideas. The most difficult part of building anything is to name it =) /bytes, /access?

briandoconnor commented 5 years ago

From the GA4GH call today, @sarpera and @susheel discussed what happens with a DRS entry for an object when you call GET id/download... OK to not implement seems to be the consensus.

Seems like we need to clarify how "/download" works for the various URI types

@dglazer proposed get bytes URI, fetch bytes ID... for passing to the download method. so the ID -> URI

@sarpera is going to take this ticket and make a PR that explores what he and David talked about today... sort out the URL and the download in a single PR. @dglazer @rishidev and I will work out a process to bring this and other PRs up to vote via the active drivers

susheel commented 5 years ago

Need to clarify /download for non-cloud (legacy) data endpoints, e.g:

sarpera commented 5 years ago

Wrapping up so far

Good to see that there is a general consensus on the main idea that accessing "object metadata" and "bytes to object" may be separate calls to DRS for the cases where an "action" is required to be performed to get access to bytes e.g passing an auth token to: sign a URL, generate url-to-bytes with credentials etc.

By doing so, I guess we all agree that the schema should remain generic, flexible and understandable, while providing programmatically parsable responses for clients with different needs and use-cases. With that in mind, I tried to combine our ideas together and here's the outcome:

Object metadata: GET https://example.com/ga4gh/drs/v1/objects/<id> Returns metadata of an object, with the set of access-methods.

Object bytes: GET https://example.com/ga4gh/drs/v1/objects/<id>/download/<access-method-id> Returns a "uri-to-bytes" for a given access-method-id, if it exists.


Examples

GET https://example.com/ga4gh/drs/v1/objects/<id>

Response:

{
  "object": {
    "id": "foo",
    "name": "bar.bam",
    "size": "1234",
    "urls": {
      "s3": [
        {
          "uri": "s3://foo/bar.bam",
          "region": "us-east-1",
          "<access-method-id>": "s3-1"
        }
     ],
     "gs": [
        {
          "uri": "gs://<foo>/<bar>.bam",
          "region": "us-west1",
          "<access-method-id>": "gs-1"
        }
      ],
      "ftp": [
        {
          "uri": "ftp://foo.com/bar.bam"
        }
      ],
      "drs": [
        {
          "uri": "drs://foo.com/objects/<id>"
        }
      ]
    }
  }
}

GET https://example.com/ga4gh/drs/v1/objects/<id>/download/<access-method-id> with Request Header Authorization: <string>

Response:

{ "uri": "<uri-to-bytes>" }

Let's break apart the suggested urls property of an object:

"urls": {
  "<access-method>": [
    {
      "uri": "<string>",
      "<access-method-specific-attr>*": "<value>"
    }  
  ]
}

where

<access-method> is an enumeration, e.g. s3 | gs | ftp | http | drs | gsiftp | globus | aspera

<access-method-specific-attr> is an attribute that belongs to a specific access-method, described in the schema model (there can be multiple attributes per access-method).


Questions

Why have key-value paired access methods? So that a specific <access-method> can have its own properties in the schema model.

E.g. in cloud scenarios, there is huge added value in having the region information for a URI, whereas for other <access-method>s that property may be meaningless. The Swagger schema model should set the expectations for clients to consume this information programmatically.

Why is the value of <access-method> an array? One access method can have multiple URIs with different attributes, e.g. data duplicated in different regions on the same cloud provider.

In the example below, access-method s3 has two entries.

"urls": {
  "s3": [
    {
      "uri": "s3://foo/bar.bam",
      "region": "us-east-1",
      "<access-method-id>": "s3-us"
    },
    {
      "uri": "s3://baz/bar.bam",
      "region": "eu-central-1",
      "<access-method-id>": "s3-eu"
    }
  ]
}

What if all the <access-method>s are public or readily consumable? Then the <access-method> wouldn't have a <access-method-id> property to begin with. Any calls to non-existing /download/<access-method-id> would return 400.

Example:

"urls": {
  "ftp": [
    {
      "uri": "ftp://foo/bar.bam"
    }
  ]
}

How to mint an <access-method-id>? As long as it's url-encoded and unique per object, it can be any string value, up to the implementor.
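
(Illustrative sketch.) Putting the pieces above together, a client could derive which /download calls are available for an object roughly as follows. The property name `access-method-id` is only a stand-in, since the name is still undecided, and the function name is made up for this example.

```python
from urllib.parse import quote


def download_paths(obj: dict) -> list:
    """List the /download paths a client may call for one object."""
    paths = []
    for entries in obj["urls"].values():
        for entry in entries:
            access_method_id = entry.get("access-method-id")
            if access_method_id is None:
                # Public or readily consumable entry: use its uri directly.
                continue
            paths.append(
                f"/objects/{quote(obj['id'], safe='')}"
                f"/download/{quote(access_method_id, safe='')}"
            )
    return paths


obj = {
    "id": "foo",
    "urls": {
        "s3": [{"uri": "s3://foo/bar.bam", "region": "us-east-1", "access-method-id": "s3-1"}],
        "ftp": [{"uri": "ftp://foo.com/bar.bam"}],
    },
}
print(download_paths(obj))
```

The ftp entry yields no path because it has no id, which matches the "public or readily consumable" case above.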


Help needed with naming things!

How to name <access-method-id> property? id? fetch-id? access-id?

Naming the suggested new path, currently "download": some people raised concerns about calling it download. Any ideas? bytes? fetch?


TODOs

Will make a PR with the suggested changes reflected on the swagger schema.

mattions commented 5 years ago

On the naming side I propose:

  1. access-method-id to keep it like it is
  2. swap download for access

so the URL will look like: https://example.com/ga4gh/drs/v1/objects/<id>/access/<access-method-id>

So something like this:

{
  "object": {
    "id": "foo",
    "name": "bar.bam",
    "size": "1234",
    "urls": {
      "s3": [
        {
          "uri": "s3://foo/bar.bam",
          "region": "us-east-1",
          "access-method-id": "s3-1"
        }
      ],
      "gs": [
        {
          "uri": "gs://<foo>/<bar>.bam",
          "region": "us-west1",
          "access-method-id": "gs-1"
        }
      ]
    }
  }
}

will have the following allowed calls:

# for s3
GET https://example.com/ga4gh/drs/v1/objects/foo/access/s3-1
# for gs
GET https://example.com/ga4gh/drs/v1/objects/foo/access/gs-1

with Request Header Authorization: <string>

If we have consensus, we can move this next with the PR

dglazer commented 5 years ago

_[updated to rename uri to access_uri, and to use underscores]_ Thank you @sarpera for incorporating everyone's input -- I personally think this is very close, and ready to move to a PR for final discussion. I'm fine with /access as the URL for the second method. I have a few suggestions on the response to the first method, GET /objects/<id>:

Incorporating my proposals, your example would look like:

"access_methods": [
   "s3": {  # there's no uri, meaning the caller has to call /access before fetching bytes
      "region": "us-east-1",
      "access_id": "s3-us"  
   },
   "s3": {
      "region": "eu-central-1",
      "access_id": "s3-eu"
   },
   "gs": {  # callers can either fetch bytes directly from the access_uri or use the access_id to get a direct uri
      "region": "us-west1", 
      "access_uri": "gs://foo/bar.bam", 
      "access_id": "gs-1" 
      },
   "ftp": {  # there's no access_id, meaning the caller has to fetch the bytes directly
      "access_uri": "ftp://foo.com/bar.bam"
   }
]

sarpera commented 5 years ago

Thanks @dglazer, I'm also happy to see that the PR I'm setting up is actually pretty close to your input.

@mattions: 2. swap download for access

@dglazer: I'm fine with /access as the URL for the second method.

I agree. Already used /access to be a new path in my local changes for the PR.


  • I suggest renaming urls to access-methods, which better matches the actual array elements.

I agree. In my local git changes I already set it to be access_methods, following the naming convention in the yaml.


  • To support access methods that have multiple entries (as in your S3 example), I slightly prefer flattening things by making the top-level an array, and allowing multiple entries for any given type of access. But that's just taste; I'll go with whatever most people think feels natural.

After diving into the yaml code, I also figured that having an array will make things a bit easier to describe via swagger v2.0. It also opens up future possibilities to make the items in the array searchable in a uniform way. I'm moving away from the s3: {foo: bar} idea; it seems it won't be a general enough solution to encompass future cases, e.g. Azure, Ali cloud and whatever other protocol/pseudo-protocol people might use to point to their data. @zflamig also pointed out the issues with its extensibility, and I agree with that. See the notes below.


  • I slightly prefer access-id to access-method-id, just because it's shorter. Again, I'll go with the sense of the crowd.

I agree. Already used access_id in my local changes for the PR.


  • I suggest not returning a uri unless you can actually use it to fetch bytes. That means the DRS implementor can choose what kinds of access are supported -- if they return a uri the caller uses it to fetch bytes directly (and is responsible for knowing what if any auth tokens they need to pass), if they return an access-method-id the caller uses it to call the /access method, and if they return both (which I think will be rare) the caller can choose.

As you mentioned, there are use cases (perhaps rare) that you might have both uri and access_id at the same time. Example case: a controlled-access file stored in a cloud bucket will have a URI s3://foo/bar.bam which might be enough for a user who has a direct bucket access, but for the ones who don't have a direct bucket access, they will rely on signing a URL via access_id. But the schema would require having the access_id property to be used in /access/<access_id> path.


GET /objects/{object_id}/access/{access_id} via Authorization Request Header

Retrieve a URL to access the bytes of a (controlled-access) Object

Response:

{
  "url": "string"
}

GET /objects/{object_id}

Response:

{
   "object": {
      "id*": "string",
      "name*": "string",
      # ... rest of the properties
      "access_level*":  "open | controlled",
      "access_methods*": [
         {
            "uri*": "string",
            "access_id*": "string",
            "cloud_metadata": {
               "region*": "string",
               "provider*": "string"
            },
            "protocol*": "string"
         }
      ]
   }
}
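
(Illustrative sketch.) One way a client might model an entry of `access_methods` from the response above, reading the starred names as required and `cloud_metadata` as optional. That reading, and the class and function names, are assumptions for this example rather than anything settled in the schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CloudMetadata:
    region: str
    provider: str


@dataclass
class AccessMethod:
    uri: str
    access_id: str
    protocol: str
    cloud_metadata: Optional[CloudMetadata] = None


def parse_access_method(raw: dict) -> AccessMethod:
    """Build an AccessMethod, raising KeyError if a starred field is missing."""
    cloud = raw.get("cloud_metadata")
    return AccessMethod(
        uri=raw["uri"],
        access_id=raw["access_id"],
        protocol=raw["protocol"],
        cloud_metadata=CloudMetadata(cloud["region"], cloud["provider"]) if cloud else None,
    )


method = parse_access_method({
    "uri": "s3://foo/bar.bam",
    "access_id": "s3-us",
    "protocol": "s3",
    "cloud_metadata": {"region": "us-east-1", "provider": "aws"},
})
print(method)
```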

Some notes:

bwalsh commented 5 years ago

Great discussion. It is rewarding to see this work move forward.

Consumers who need to answer 'what data is closest to me?' or 'where should I execute this pipeline?' can leverage the provider/region properties to answer these and other auction use cases. Long term, I'm convinced these use cases will lower cost.
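
(Illustrative sketch.) A consumer answering 'what data is closest to me?' might rank access methods on the provider/region properties along these lines. The ranking rule and function name are assumptions for this example; the field names follow the `cloud_metadata` shape in the previous comment.

```python
from typing import Optional


def pick_access_method(access_methods: list, provider: str, region: str) -> Optional[dict]:
    """Prefer same provider and region, then same provider, then anything."""
    def score(method: dict) -> int:
        cloud = method.get("cloud_metadata") or {}
        if cloud.get("provider") == provider and cloud.get("region") == region:
            return 0
        if cloud.get("provider") == provider:
            return 1
        return 2

    # min() keeps the first entry among ties; default covers an empty list.
    return min(access_methods, key=score, default=None)


methods = [
    {"access_id": "gs-1", "cloud_metadata": {"provider": "google", "region": "us-west1"}},
    {"access_id": "s3-eu", "cloud_metadata": {"provider": "aws", "region": "eu-central-1"}},
    {"access_id": "s3-us", "cloud_metadata": {"provider": "aws", "region": "us-east-1"}},
]
print(pick_access_method(methods, "aws", "us-east-1"))
```

Entries without `cloud_metadata` simply rank last, which is one reason populating those fields matters.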

How can we encourage implementors to populate these fields? As I understand the schema, an implementor could conform to the spec and never populate them. i.e. Should 'cloud_metadata' be mandatory for certain access methods [s3, gs,...]?

Also, I'm assuming the checksums object is part of '# ... rest of the properties' ?

Forgive me if I've missed it, but are there formal dependencies to a (probably separate) Search Service to query this data?

BTW, I always thought that urls was misnamed, nice to see it morph to access_methods.

philloooo commented 5 years ago

@sarpera I'd like to bring back a point that Zac and I mentioned in previous comments: the drs access_method needs a provider field, so that we are not forcing the drs uri to carry the hostname in the identifier.

        {
          "uri": "drs://<someid>",
          "provider": "drs.example.org"
        }

ddietterich commented 5 years ago

I think we need to put a DNS name in the DRS URI. Otherwise, we have to get into the business of service resolution. I don't have much appetite for boiling that ocean.

dglazer commented 5 years ago

@sarpera -- glad we're converging. Happy to hash out the remaining details in the PR, but in case it's helpful here are a few thoughts on your latest comment:

susheel commented 5 years ago

Extending @dglazer's example, why can't we be explicit as below?

GET /objects/{object_id}

"access_methods": [
   {
      "s3": {
         "access_id": "drs://server.com/access/s3-us",
         "region": "us-east-1"
      }
   },
   {
      "s3": {
         "access_id": "http://server.com/get_object/s3-eu",
         "region": "eu-central-1"
      }
   },
   {
      "gs": {  # callers can either fetch bytes directly from the access_uri or use the access_id to get a direct uri
         "region": "us-west1",
         "access_uri": "gs://foo/bar.bam",
         "access_id": "drs://server.com/access/gs-1"
      }
   },
   {
      "ftp": {  # there's no access_id, meaning the caller has to fetch the bytes directly
         "access_uri": "ftp://foo.com/bar.bam"
      }
   }
]

This way implementors can also point the user to use external services to implement the /access method.

I'm personally against baking the /access method into the DRS specification, but I may be in the minority and am happy to commit if the community goes this way. If we do, we need to agree on #214, as there will be many mechanisms to get signed URLs.

I'm still unclear how this will work for non-cloud private data URLs (FTP, GSIFTP, etc.). Could "contact": "John Doe <john@doe.com>" be added to each access_method class? Or can we have a pseudo access_id that returns the contact info for private FTP, GSIFTP, etc.?

dglazer commented 5 years ago

@philloooo -- I suggest we split discussion of the drs access_method into a separate issue/PR; I don't think the details of that discussion will affect the outcome of this discussion.

dglazer commented 5 years ago

@bwalsh , re a separate Search Service -- my mental model is that there aren't any formal dependencies, but there's an expectation that, once a search/discovery API is defined, many callers of DRS will use it to get the object ids they pass in to DRS.

zflamig commented 5 years ago

@dglazer @sarpera

I don't understand why we need a provider field in addition to access_type -- when would those be different? (e.g. I think provider = "AWS" if and only if access_type="s3".)

This is explicitly being driven by a GDC/DCF use case where we run on-premise object storage systems that use an S3 compatible API so we need to know if the file is actually on Amazon or our local object storage. We could potentially overload the region to capture it too, but just having a provider field seems more clean.

dglazer commented 5 years ago

Thanks @zflamig for the explanation -- my quick reaction is that it's cleaner to represent this as either a pseudo region (if from the caller's point of view it behaves exactly like S3 except the bytes are in a physically different place), or a different access_type (if the caller needs to be aware of more method-specific differences). Basically I'd rather keep the API itself simpler for the majority of callers, and have implementation-specific needs fit into implementation-specific extensions. Thoughts?

zflamig commented 5 years ago

@dglazer In general I agree, but in practice, thinking about how clients might actually interact with this information, I feel like keeping it separate is easier. For example, a client that only knew how to support AWS S3 would have a very simple check on the provider to see if it's the AWS hostname. With regions, it would have to know a list of all the current AWS regions to know if the listed region is real or not.

Ultimately, I'm happy either way so long as we agree to support this in some fashion. I just have a strong preference towards clients being able to write stable code that is easy to test and doesn't need to be updated when AWS adds new regions.
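
As a rough illustration of the client-side check described above, here is a minimal sketch. It assumes each access method carries `method`, `provider`, and `region` keys and that an AWS-hosted entry reports the provider `s3.amazonaws.com`; both the field names and the provider value are assumptions, since they were still under discussion in this thread.

```python
# Hypothetical sketch: field names and the provider value are assumptions,
# not part of any agreed DRS schema.
AWS_PROVIDER = "s3.amazonaws.com"


def aws_only(access_methods):
    """Keep S3 entries whose provider is AWS itself, skipping S3-compatible stores."""
    return [
        m for m in access_methods
        if m.get("method") == "s3" and m.get("provider") == AWS_PROVIDER
    ]


methods = [
    {"method": "s3", "provider": "s3.amazonaws.com", "region": "us-east-1"},
    {"method": "s3", "provider": "objstore.example.org", "region": "on-prem"},
]
print(aws_only(methods))
```

The check does not depend on a list of regions, so it would not need updating when a provider adds new ones.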

tetron commented 5 years ago

My $.02 the method, provider, and region should be separate. For example, it might be cheaper to transfer between two Google cloud regions than to transfer between AWS and Google cloud regions that are physically closer. (On the other hand, physically closer might be faster).

As I mentioned at the F2F, a client library that implements a preference matrix for deciding how to fetch data would probably help shed some light on the best way to represent this.

sarpera commented 5 years ago

Thanks for all the feedback!

@bwalsh

Consumers who need to answer 'what data is closest to me?' or 'where should I execute this pipeline?' can leverage the provider/region properties to answer these and other auction use cases. Long term, I'm convinced these use cases will lower cost.

Exactly, this is what drove us initially to use a defined language to describe the access methods.

How can we encourage implementors to populate these fields? As I understand the schema, an implementor could conform to the spec and never populate them. i.e. Should 'cloud_metadata' be mandatory for certain access methods [s3, gs,...]?

One way to go about it is to have a strongly typed schema model per access method and enforce its required params. The schema models for those individual access methods should evolve organically as we iterate over more use cases over time.

Also, I'm assuming the checksums object is part of '# ... rest of the properties' ?

Yes. Wanted to skip details since there is an issue for that already.


@zflamig @philloooo SevenBridges also has the same use case as Zac defined for cloud URIs. I'd suggest adding provider only for cloud-related access methods. As for the DRS URLs, I agree with @dglazer that the resolution of a DRS URL might get tricky if it is coupled with provider info. For the cases Zac defined, any strong opinions against having a provider string in s3, gs etc? See the updated model below.


@susheel Yes, being explicit seems like where most of us align. I also agree that #214 needs to be agreed upon.

The strong case, at least for us, to push for /access is that our DRS server will be the same service that provides signed URLs on demand for private cloud resources in DRS. Avoiding this would make DRS URLs pointing to controlled-access data quite useless on their own. With the proper authN/Z in place, DRS would provide access to private/protected/public data in our case, making DRS URLs programmatically actionable, especially for WES scenarios.

I'm still unclear how this will work for non-cloud private data URLs (FTP, GSIFTP, etc.). Could "contact": "John Doe <john@doe.com>" be added to each access_method class? Or can we have a pseudo access_id that returns the contact info for private FTP, GSIFTP, etc.?

It would help greatly if you could provide a complete use case for non-cloud private data URLs.
How is the data actually accessed after a user is given privileges? What happens when I contact the author and I'm given access somehow? Do I add my credentials to the URI of the ftp file? If we add contact info per access method, how would we utilise this programmatically? Overall, it would be awesome to cover as many cases as possible with the idea behind /access or shape it differently based on the use cases.


@tetron

My $.02 the method, provider, and region should be separate. For example, it might be cheaper to transfer between two Google cloud regions than to transfer between AWS and Google cloud regions that are physically closer. (On the other hand, physically closer might be faster).

Agreed. Since we seem to be going for strongly typed access methods, we could make a case for cloud-related methods to provide this information. Seven Bridges and @zflamig also have use cases for an explicit provider property. Note that it doesn't affect the original proposal, i.e. /access/<access-id>.


Schema

Object has access_methods property which is an array of AccessMethods:

 "access_methods": <AccessMethod>[]

where an AccessMethod is:

{
    "<x>": <xAccessMethod>
}

DRS defines the values for x and their corresponding schema models i.e xAccessMethod in the specification.

Example AccessMethods:

{
    "s3": {
        "uri*": "string",
        "access_id": "string",
        "region*": "string",
        "provider": "string",
        "allowed_regions": [
            "string"
        ]
    }
}
{
    "drs": {
        "uri*": "string"
    }
}
{
    "ftp": {
        "uri*": "string"
    }
}

Example response of an object:

{
    "object": {
        "id": "1234",
        "name": "bar.bam",
        # ... rest of the properties
        "access_methods": [
            {
                "s3": {
                    "uri": "s3://foo/bar.bam",
                    "access_id": "s3-1",
                    "region": "us-west-1",
                    "provider": "s3.amazonaws.com",
                    "allowed_regions": [
                        "us-west-1", "us-east-1"
                    ]
                }
            },
            {
                "gs": {
                    "uri": "gs://foobaz/bar.bam",
                    "access_id": "gs-1",
                    "region": "us-central1",
                    "allowed_regions": [
                        "us-central1"
                    ]
                }
            },
            {
                "ftp": {
                    "uri": "ftp://foo.org/baz/bar.bam"
                }
            },
            {
                "drs": {
                    "uri": "drs://some-other-drs.org/9876"
                }
            }
        ]
    }
}

The initial idea of having strongly typed access methods seems to be favoured by most of us. Please note that with this approach, in order to add a new access method we'd need to define it and update the schema. It is of course expected that said AccessMethods will gain more properties. Now that we seem to agree on being explicit about individual access methods, there is room for that. IMHO it adds value in the long run when it comes to setting expectations for the DRS consumer/client parsing this information.

Thoughts?

tetron commented 5 years ago

What is allowed_regions ?

Why {"s3": { ... } } instead of {"method": "s3", ...} ?

What is the difference between uri and access_id ? I see they are different here but is there a reason it can't just provide uri when using the /access/ endpoint?

For the ftp case, perhaps the provider should be the ftp host? I think it is okay if region is sometimes null but I think provider should always be filled in, even if it is just a hostname.

I am thinking about the client's decision matrix. I think we want a tuple of (method, provider, region) and the client assigns a preference or weight to each tuple based on (a) availability of credentials and (b) function of cost and expected transfer speed.

For the case where the DRS server can hand out a signed URL, it should indicate that (by filling in access_id?)

For the private access case, the client can have a table of credentials that correspond to various combinations of (method, provider, region) (could include wildcards.)
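
To make the decision-matrix idea above a little more concrete, here is a minimal sketch of a client-side preference table keyed on (method, provider, region) with wildcards. The weights, the `*` wildcard convention, and the field names are all illustrative assumptions, not anything agreed in this thread.

```python
# Hypothetical sketch of a client-side preference matrix. The weights, the
# wildcard convention ("*"), and the field names are illustrative assumptions.
PREFERENCES = [
    # ((method, provider, region) pattern, weight); higher weight is preferred
    (("gs", "*", "us-central1"), 100),
    (("gs", "*", "*"), 50),
    (("s3", "s3.amazonaws.com", "*"), 20),
    (("ftp", "*", "*"), 1),
]


def matches(pattern, candidate):
    """True when every pattern field is a wildcard or equals the candidate field."""
    return all(p == "*" or p == c for p, c in zip(pattern, candidate))


def weight(access_method):
    """Return the weight of the first matching pattern, or 0 if none match."""
    key = (
        access_method.get("method"),
        access_method.get("provider"),
        access_method.get("region"),
    )
    for pattern, score in PREFERENCES:
        if matches(pattern, key):
            return score
    return 0


def best(access_methods):
    """Pick the highest-weighted access method, or None for an empty list."""
    return max(access_methods, key=weight, default=None)
```

A table of credentials for the private access case could reuse the same `matches` helper, mapping each pattern to a credential instead of a weight.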

susheel commented 5 years ago

@sarpera For the ftp AccessMethod, see example below

{
  "method*": "ftp",
  "provider*": "string",
  "uri*": "string",
  "region": "string",
  "contact": "string"
}

Fully realised example:

{
  "method": "ftp",
  "provider": "ftp.ebi.ac.uk",
  "uri": "ftp://anonymous:anonymous@ftp.ebi.ac.uk/dataset/path/file",
  "region": "null",
  "contact": "Contact John Doe <john.doe@example.com>"
},
{
  "method": "ftp",
  "provider": "ftp-private.ebi.ac.uk",
  "uri": "ftp://ftp-private.ebi.ac.uk/dataset/path/file",
  "region": "ebi-hh",
  "contact": "Contact Jane Doe <jane.doe@example.com>"
}

I'm guessing it would be the same for gridftp, sftp, Globus and Aspera will be a little complicated - I would need to think about this a little more.

susheel commented 5 years ago

@sarpera Do you see the possibility of having a local AccessMethod too. Example below:

{
  "method*": "local",
  "provider*": "string",
  "uri*": "string",
  "region": "string",
  "contact": "string"
}

Fully realised example:

{
  "method": "local",
  "provider": "ebi-cluster.ebi.ac.uk",
  "uri": "file://public/path/file",
  "region": "ebi-hx",
  "contact": "Contact John Doe <john.doe@example.com>"
},
{
  "method": "local",
  "provider": "ebi-yoda.ebi.ac.uk",
  "uri": "file://private/path/file",
  "region": "ebi-hh",
  "contact": "Contact Jane Doe <jane.doe@example.com>"
}
sarpera commented 5 years ago

@tetron

What is allowed_regions ?

Buckets can be set to incur outbound (egress) costs outside of their region in the same cloud provider. This provides more information in the decision-making process to pick the most appropriate mirror of the file. Perhaps not the best name for the attribute though.

Why {"s3": { ... } } instead of {"method": "s3", ...} ?

The former allows defining a schema model per access method so that method-specific attributes can be defined and enforced for consistency. Happy to discuss if the same goal can be achieved in a different way.

What is the difference between uri and access_id ? I see they are different here but is there a reason it can't just provide uri when using the /access/ endpoint?

@dglazer also made some points about it. There may be cases where, for a specific access method, the URI alone gives no means of access, e.g. a file residing in a VPC where the only way to provide access to third parties is signing a URL via the /access method. We keep both attributes for cases such as offering both direct bucket access and the option to sign a URL on demand, depending on the consumer of the object. In those cases both attributes can provide access, so both are meaningful to have.

Please also note that the cloud data owners may not want to (or be allowed to) expose their bucket names in the URIs, but may provide access via /access/<access_id> with the proper authZ.

We could pursue an additional capability: while keeping /access/<access_id>, also provide /access?uri="<uri_string>". But IMVHO having a dedicated path like /access/<access_id> is much cleaner and more approachable considering the above cases.

I am thinking about the client's decision matrix. I think we want a tuple of (method, provider, region) and the client assigns a preference or weight to each tuple based on (a) availability of credentials and (b) function of cost and expected transfer speed.

This is a very important point, and defining the individual access method attributes with that goal in mind would help us achieve it. I hope this aligns with your second question and the answer I tried to provide.

For the case where the DRS server can hand out a signed URL, it should indicate that (by filling in access_id?)

Yes, exactly. Having that dedicated path /access/<access_id> will make this clear in the schema since the attribute required to craft this path will be enforced. This also partially answers your 3rd question. Happy to explore alternative approaches if this isn't intuitive.

For the private access case, the client can have a table of credentials that correspond to various combinations of (method, provider, region) (could include wildcards.)

Could you please explain this a bit more with examples? Are you talking about discoverability of the available access methods based on existing client conditions?


@susheel thanks for the examples.

For the ftp case, perhaps the provider should be the ftp host? I think it is okay if region is sometimes null but I think provider should always be filled in, even if it is just a hostname.

Based on the previous schema definitions I provided, each defined access method would have its own attributes based on its needs. So region would be null for ftp cases. Similarly if the provider doesn't add any information, we don't need to add that for ftp case. So we could do something like this:

"access_methods": [
    {
        "ftp": {
            "uri": "ftp://anonymous:anonymous@ftp.ebi.ac.uk/dataset/path/file",
            "contact": "Contact John Doe <john.doe@example.com>"
        }
    },
    {
        "ftp": {
            "uri": "ftp://ftp-private.ebi.ac.uk/dataset/path/file",
            "contact": "Contact John Doe <john.doe@example.com>"
        }
    }
]

@sarpera Do you see the possibility of having a local AccessMethod too. Example below:

local could be a new access method then I presume. Let's gather more use cases to define its attributes.

susheel commented 5 years ago

@sarpera For ftp I would agree with @tetron's previous comment:

For the ftp case, perhaps the provider should be the ftp host? I think it is okay if region is sometimes null but I think provider should always be filled in, even if it is just a hostname.

Having a provider set for all access_methods even if it just a hostname would make filtering easier, so I would add this into your example above.

region should be available to the ftp access method, e.g. "region": "ebi-hx" or may be optionally set to null, e.g. when behind a loadbalancer.

sarpera commented 5 years ago

@susheel making filtering easier by using provider is a solid point. I guess the only argument against it was to couple the provider and URI resolution for the DRS case. I'm all up for having a provider field as long as it's not coupled with means of accessing, which should be a job of uri field or /access/<access_id> method.

So updated example would be:

"access_methods": [
    {
        "ftp": {
            "uri": "ftp://anonymous:anonymous@ftp.ebi.ac.uk/dataset/path/file",
            "contact": "Contact John Doe <john.doe@example.com>",
            "provider": "ftp.ebi.ac.uk"
        }
    },
    {
        "ftp": {
            "uri": "ftp://ftp-private.ebi.ac.uk/dataset/path/file",
            "contact": "Contact John Doe <john.doe@example.com>",
            "provider": "ftp-private.ebi.ac.uk"
        }
    }
]

I feel like the contact property could be defined more strongly. Maybe more structured? We could be more explicit about the value to set expectations right. Perhaps a field for just an email as a value, or a field for ORCIDs? Just to avoid open-ended, vague string values.

region should be available to the ftp access method, e.g. "region": "ebi-hx" or may be optionally set to null, e.g. when behind a loadbalancer.

Is region a known term for ftp cases? In the cloud cases, the meaning of the property is quite well-established. Naive question: how would region affect decision making when picking the right access method for FTP-like URIs?

sarpera commented 5 years ago

It seems the current version of OpenAPI (3.0.2) doesn't allow patternProperties. Unless we want to hardcode all available access methods (ftp, http, s3, etc.) in the schema and pair them with a schema model (array of objects), the above approach won't work in practice.

@tetron

Why {"s3": { ... } } instead of {"method": "s3", ...} ?

Going back to this approach, it seems that with OpenAPI v3 this can be achieved while still enforcing certain properties per access method (region for cloud methods etc) by making use of anyOf when defining the access_methods array.

I'd like to recap our requirements so far about the access methods before adjusting them to v3.0.

access_methods

And so far our use cases for an access method are:

Required / Optional bare minimum properties are different:

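| Use Case | URI | method | region | access_id | provider | contact |
| --- | --- | --- | --- | --- | --- | --- |
| Cloud - open | R | R | R | O | O | O |
| Cloud - controlled/private | O | R | R | R | O | O |
| FTP, HTTP, globus, aspera etc | R | R | - | O | O | O |
| DRS of DRSes | R | R | - | O | O | O |
| Local | R | R | - | ? | O | O |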

Based on that, examples would look like:

Open access data:

"access_methods": [
      {
        "uri": "ftp://foo.example.com/file.name",
        "method": "ftp",
        "provider": "foo.example.com"
      },
      {
        "uri": "s3://foo-open-bucket/file.name",
        "method": "s3",
        "provider": "s3.amazonaws.com",
        "region": "us-east-1"
      }
],

Controlled/private access data:

"access_methods": [
      {
        "uri": "ftp://bar.example.com/file.name",
        "method": "ftp",
        "provider": "bar.example.com",
        "contact": "foo@example.com"
      },
      {
        "access_id": "s3-1",
        "method": "s3",
        "region": "us-east-1"
      },
      {
        "uri": "drs://foo.example.com/123",
        "method": "drs"
      }
],
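
The required/optional table above can be read as a small set of rules. The following is a minimal sketch of a checker for those rules, assuming the flat `{"method": ..., "uri": ...}` entry shape used in the examples; the function name and the grouping of s3 and gs as cloud methods are assumptions made for illustration.

```python
# Hypothetical sketch: the method grouping and the function name are
# assumptions; the rules mirror the required (R) columns of the table.
CLOUD_METHODS = {"s3", "gs"}


def missing_required(entry):
    """Return the required properties that an access method entry lacks."""
    method = entry.get("method")
    if method in CLOUD_METHODS:
        if "uri" in entry:
            # Cloud, open: URI, method, and region are required.
            required = ["uri", "method", "region"]
        else:
            # Cloud, controlled/private: access_id is required instead of the URI.
            required = ["method", "region", "access_id"]
    else:
        # FTP, HTTP, DRS, local, and similar: URI and method are required.
        required = ["uri", "method"]
    return [name for name in required if name not in entry]
```

An empty list means the entry satisfies the minimum for its use case; optional properties such as provider and contact are never reported.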

Enforcing required/optional properties can be done via anyOf and enumerated values for the method property in the schema. (Please note that inheritance in OpenAPI does not allow overriding required properties, hence some redundancies below.)

    AccessMethods:
      type: array
      description: The list of access methods that can be used to access the Data Object.
      minItems: 1
      items:
        anyOf:
        - $ref: '#/components/schemas/StaticAccessMethod'
        - $ref: '#/components/schemas/CloudAccessMethod'
        - $ref: '#/components/schemas/ActionableStaticAccessMethod'
        - $ref: '#/components/schemas/ActionableCloudAccessMethod'
        discriminator:
          propertyName: method

    ActionableAccessMethod:
      type: object
      required:
        - access_id
      properties:
        access_id:
          type: string

    ActionableCloudAccessMethod:
      type: object
      allOf:
        - $ref: "#/components/schemas/ActionableAccessMethod"
        - type: object
          required:
            - region
            - method
            - access_id
          properties:
            uri:
              type: string
            method:
              type: string
              enum:
                - s3
                - gs
            region:
              type: string
              description: >-
                Name of the region in the cloud service provider that the object belongs to.
              example:
                us-east-1
            provider:
              type: string

    CloudAccessMethod:
      type: object
      required:
        - uri
        - region
        - method
      properties:
        uri:
          type: string
        provider:
          type: string
        method:
          type: string
          enum:
            - s3
            - gs
        region:
          type: string
          description: >-
            Name of the region in the cloud service provider that the object belongs to.
          example:
            us-east-1

    ActionableStaticAccessMethod:
      type: object
      allOf:
        - $ref: "#/components/schemas/ActionableAccessMethod"
        - $ref: "#/components/schemas/StaticAccessMethod"

    StaticAccessMethod:
      type: object
      required:
        - uri
        - method
      properties:
        method:
          type: string
          enum:
            - ftp
            - sftp
            - http
            - https
            - nfs
            - globus
            - aspera
            - gsiftp
            - local
        uri:
          type: string
        provider:
          type: string
        contact:
          type: string
susheel commented 5 years ago

@sarpera Thanks for investigating the OpenAPI spec compatibility. I agree with @tetron it would have been cleaner, but I guess we will have to live within our means! :)

I thought we'd discussed (maybe not agreed) that we will be more explicit with the access_id to be a uri. E.g.

Controlled/private access data:

"access_methods": [
      {
        "access_id": "drs://server.com/access/s3-1",
        "method": "s3",
        "region": "us-east-1"
      },
      {
        "access_id": "http://server.com/get-object/s3-1",
        "method": "s3",
        "region": "us-east-1"
      }
],

Which I hope will work for your use case when it is provided by the DRS service, and when it may be provided by a third-party service.

P.S. If this is acceptable, why have it called access_id, we could just call it uri

sarpera commented 5 years ago

@susheel access_id was made explicitly to be used in the path /objects/<id>/access/<access_id>, which will generate or return a ready-to-use URL to the bytes (signed URL, URL with encoded credentials etc), authorised via the Authorization request header. It should not be used interchangeably with uri; they serve different purposes.

In your example,

{
   "access_id": "drs://server.com/access/s3-1",
   "method": "s3",
   "region": "us-east-1"
}

Please note that access_id is unique for an object, not per DRS server unlike the DRS URL. So it would have to be drs://server.com/objects/<object_id>/access/s3-1.

Keeping that in mind, method and region info is redundant, since the same info would be available on drs://server.com/objects/<object_id>/. And if the file is moved to another region, you'd need to have a mechanism to ping the DRS who links it and update the redundant info.

Also it's ambiguous what token value for Authorization request header should be used in the linked DRS server URI, since it could be a different value.

If with DRS of DRSes we are aiming to redirect the client to another, this is indirect but not ambiguous:

{
   "uri": "drs://server.com/<object_id>",
   "method": "drs"
}

Alternatively, we could utilise the alias property of an object to link/mirror other DRS URLs.

GET /objects/<id>

{
   "id": 123,
   "name": "foo",
   "checksums": ["# list here"],
   "access_methods": ["# list here"],
   # rest of the props
   "alias": ["drs://server.com/<object_id>"]
}
dglazer commented 5 years ago

1) @sarpera, thanks for continuing to deep dive into OpenAPI syntax. Given what you found, I agree with the general direction of your latest proposal. Notes:

2) @susheel, re the format of access_id -- it did come up earlier, but I don't think we reached consensus. I think we all agree that the end goal is for callers to get an access_url, which they can use to directly fetch object bytes; the question is how they get that access_url.

The simple case is when only a single step is needed (e.g. for public content); in that case the server can return an `access_url` directly. The trickier case is when two steps are needed (e.g. for signed URLs).

The two-step pattern I prefer, mostly because the behavior feels more explicit, is:
  - servers can return an opaque `access_id` string, in any format they choose
  - callers pass that `access_id` to a well-defined method on the same server (e.g. `/objects/<id>/access/<access_id>`), which returns the `access_url`

I believe the pattern you're suggesting is:
  - servers can return an `access_url_url` (_name TBD_) string, which must be a fully resolvable HTTP GET'table path, and can be on any server  
  - callers do an HTTP GET on the `access_url_url`, which returns the actual `access_url`

**Is that right?** If so, I'm open to discussing the tradeoffs. But I agree with @sarpera that we shouldn't mix the patterns, and call the string an `access_id` if it's actually a fetchable URL.

3) I suggest we split discussion of the drs access_method into a separate issue/PR. I think whatever we end up with here will be able to support that use case, and it will be cleaner to discuss it separately. (I for one am still not clear on the requirements, but don't want to side-track this thread to dive in.)
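
From a caller's side, the one-step and two-step patterns in point 2 might look roughly like the sketch below. The `http_get_json` callable is injected so the example needs no network access, and the `{"access_url": ...}` response shape of the /access method is an assumption, since the thread had not settled it.

```python
# Hypothetical sketch: `http_get_json` is an injected callable, and the
# response shape {"access_url": ...} is an assumption for illustration.
def resolve_access_url(object_id, access_method, http_get_json):
    """Return a URL to the object bytes using the one- or two-step pattern."""
    # One step: the server already returned a directly usable URL.
    if access_method.get("access_url"):
        return access_method["access_url"]
    # Two steps: exchange the opaque access_id on the same server.
    access_id = access_method.get("access_id")
    if access_id is None:
        raise ValueError("access method has neither access_url nor access_id")
    path = f"/objects/{object_id}/access/{access_id}"
    return http_get_json(path)["access_url"]


def fake_get(path):
    """Stand-in for an authenticated HTTP GET against the DRS server."""
    return {"access_url": "https://signed.example.com" + path}


print(resolve_access_url("1234", {"method": "s3", "access_id": "s3-1"}, fake_get))
```

Because the `access_id` is treated as opaque, the caller never inspects its format; it only substitutes it into the well-defined path.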

dglazer commented 5 years ago

As discussed in #230, updating to OpenAPI 3.0 may take longer than we'd like, and I'm eager to get the changes discussed here into a PR. So it may make sense to decouple the issues, open up a PR for this issue now doing the best we can using v2, and then revisit whenever #230 is resolved.

I think that will be fine -- we can still use the access_methods syntax you propose above, but with weaker typing in the OpenAPI definition, so some of the rules for what parameters are valid where would be enforced by policy, not by schema. Not as elegant, but perfectly functional, and we can upgrade to stronger typing later.

I picture something like (without having tested it):

    AccessMethods:
      type: array
      description: The list of access methods that can be used to access the Data Object.
      minItems: 1
      items:
        $ref: '#/definitions/AccessMethod'

    AccessMethod:
      type: object
      required:
        - method
      properties:
        method:
          type: string
          enum:
            - s3
            - gs
            - ftp
            - sftp
            - http
            - https
            - nfs
            - globus
            - aspera
            - gsiftp
            - local
        access_url:
          type: string
          description: >-
            A fully resolvable HTTP address that can be used to GET the actual object bytes.
            Note that at least one of access_url and access_id must be provided.
        access_id:
          type: string
          description: >-
            An arbitrary string to be passed to the /access method to fetch an access_url
        region:
          type: string
          description: >-
            Name of the region in the cloud service provider that the object belongs to.
          example:
            us-east-1

@sarpera -- wdyt? Are you up for creating a PR using OpenAPI v2 now, and confirming it's not too ugly?
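
As an example of a rule that would be enforced by policy rather than by schema under the weaker typing, here is a minimal sketch of the "at least one of access_url and access_id" check from the description above. The function name is hypothetical, and where such a check would run (server, client, or a conformance test) is left open.

```python
# Hypothetical sketch: a policy check for a rule the weaker schema states
# only in prose. The function name is illustrative.
def check_access_policy(access_method):
    """Raise ValueError unless at least one of access_url / access_id is set."""
    if not (access_method.get("access_url") or access_method.get("access_id")):
        raise ValueError("at least one of access_url and access_id must be provided")
    return access_method
```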

susheel commented 5 years ago

I believe the pattern you're suggesting is:

  • servers can return an access_url_url (name TBD) string, which must be a fully resolvable HTTP GET'table path, and can be on any server
  • callers do an HTTP GET on the access_url_url, which returns the actual access_url

    Is that right? If so, I'm open to discussing the tradeoffs. But I agree with @sarpera that we shouldn't mix the patterns, and call the string an access_id if it's actually a fetchable URL.

@dglazer Yes, almost. I do agree that DRS must be able to support the two-phase access mechanism.

If the DRS server only provides an access_id, it is implicit in the service description that the caller performs a subsequent GET /objects/<object-id>/access/<access-id> to get the access_url. Is there a use case where a user will only need the access_id? Making the access_url_url (your name :) more configurable or explicit in the service description would allow service providers to support external mechanisms that provide other ways to obtain access_urls.

Either way, I agree with @sarpera we need to also iron out how AUTH tokens are specified and passed to /access or an access_url_url in #47 or #229

dglazer commented 5 years ago

@susheel , it sounds like we largely agree on the two options; good. @sarpera , I suggest you pick one (you know my vote), put it into the PR, and then we can discuss and finalize there.

A few comments on the details