HumanCellAtlas / data-store

Design specs and prototypes for the HCA Data Storage System (DSS, "blue box")
https://dss.staging.data.humancellatlas.org/

Create presigned URLs suitable for workflow execution #2100

Open mikebaumann opened 5 years ago

mikebaumann commented 5 years ago

As an external workflow service provider (or researcher using an external workflow service), I want to access HCA data using suitable presigned URLs, so that I can run my workflows/analysis on the data.

Presigned URLs for data access have many benefits, yet they also have some important limitations with respect to workflow execution, some general and some specific to the URLs currently produced by the DSS. The issues presented here primarily stem from collaboration with the Broad on the "HCA handoff to Terra", yet are broadly applicable as well.

Terra/Cromwell does not currently support presigned URLs, yet according to Alex Baumann (Broad), the Broad is open to considering such support (at some point in the future) if the key limitations of presigned URLs can be addressed/resolved. Notes from a meeting with Alex summarizing these limitations are available here: https://docs.google.com/document/d/18WDYr3uS82Eu1GhHmeH6PX7tFY6Zk8piKd3ixOoE8Bg/edit#heading=h.i4ag7sau4uy

The relative priority of this issue depends on the needs of end-users. The "HCA handoff to Terra" itself will use cloud-native URLs, for which other issues have already been created.

In short, these limitations are:

Expiration Duration

Currently the presigned URLs produced by the DSS use an (implicit) expiration time of 1 hour for AWS and 1 day for GCP, yet analysis/workflows may require more time (Terra caps workflow execution at 7 days). This could be addressed by enhancing the DSS GET /files and GET /bundles endpoints (with presignedurls=true) with an optional query parameter that specifies the presigned URL duration (up to 7 days, the maximum duration supported by AWS). The longer the expiration of the presigned URL, the greater the security risk, so it is appropriate to let the user request only the expiration duration they need. Also, I have heard and read (but not verified) that the actual expiration of the DSS presigned URLs for AWS may be (substantially) less than one hour because they are signed by the assumed role of the Lambda. See also #2105
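To make the proposal a bit more concrete, here is a minimal sketch of how a requested duration could be validated and capped. The function name, the default of one hour, and the `credential_seconds_remaining` argument are all hypothetical and not part of the DSS API. The cap by remaining credential lifetime reflects my understanding that a URL signed with temporary (assumed-role) credentials stops working once those credentials expire.

```python
from typing import Optional

# 7 days, the maximum presigned URL duration mentioned above for AWS.
MAX_PRESIGNED_URL_SECONDS = 7 * 24 * 60 * 60
# Hypothetical default matching the current implicit 1 hour for AWS.
DEFAULT_PRESIGNED_URL_SECONDS = 60 * 60


def effective_url_lifetime(
    requested_seconds: Optional[int] = None,
    credential_seconds_remaining: Optional[int] = None,
) -> int:
    """Return the lifetime (in seconds) a presigned URL would actually have.

    requested_seconds: value of a hypothetical optional query parameter.
    credential_seconds_remaining: time left on the signing credentials, if
        they are temporary (assumption: the URL cannot outlive them).
    """
    requested = (
        DEFAULT_PRESIGNED_URL_SECONDS
        if requested_seconds is None
        else requested_seconds
    )
    if not 1 <= requested <= MAX_PRESIGNED_URL_SECONDS:
        raise ValueError(
            f"duration must be between 1 and {MAX_PRESIGNED_URL_SECONDS} seconds"
        )
    if credential_seconds_remaining is None:
        return requested
    # A URL signed by short-lived credentials is limited by those credentials.
    return min(requested, max(0, credential_seconds_remaining))
```

With this sketch, a request for 7 days signed by credentials that have 15 minutes left would yield an effective lifetime of 900 seconds, which is the kind of gap described above for Lambda-assumed roles.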

Signed by the Requester

For security/auditability, it is important that the presigned URLs be signed by the user's account, and not by a DSS user/role, to provide an audit trail of who is accessing the data. This is not a requirement per se for workflow execution, yet it is an important consideration when using presigned URLs for secure, controlled data access in general. This will matter more once the DSS provides controlled-access data, yet may already be important for security compliance. For controlled data access, it is also important to be able to revoke user access promptly (e.g. within 24 hours), and the mechanisms for doing that with presigned URLs are generally lacking.

Requester Pays

Support for requester pays for presigned URLs may not be an important factor for the HCA program, yet it is a general consideration for the Broad in adding support for presigned URLs. If the HCA data-store were to support requester pays, perhaps for use in other projects, this would need to be factored into the signing process (at least for AWS; I don't know about Google).
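To illustrate what "factored into the signing process" could mean on AWS, here is a rough, standard-library-only sketch of SigV4 query-string presigning for an S3 GET in which the `x-amz-request-payer=requester` parameter is part of the signed query. This is not how the DSS signs URLs; the function name, bucket, key, and credentials are placeholders, and in practice an SDK such as boto3 would do the signing.

```python
import datetime
import hashlib
import hmac
from urllib.parse import quote


def _hmac_sha256(key: bytes, msg: str) -> bytes:
    return hmac.new(key, msg.encode("utf-8"), hashlib.sha256).digest()


def presign_get(bucket, key, region, access_key, secret_key,
                expires_in, requester_pays=False, now=None):
    """Build a SigV4 presigned GET URL (illustrative sketch only)."""
    now = now or datetime.datetime.now(datetime.timezone.utc)
    amz_date = now.strftime("%Y%m%dT%H%M%SZ")
    datestamp = now.strftime("%Y%m%d")
    host = f"{bucket}.s3.{region}.amazonaws.com"
    scope = f"{datestamp}/{region}/s3/aws4_request"

    params = {
        "X-Amz-Algorithm": "AWS4-HMAC-SHA256",
        "X-Amz-Credential": f"{access_key}/{scope}",
        "X-Amz-Date": amz_date,
        "X-Amz-Expires": str(expires_in),
        "X-Amz-SignedHeaders": "host",
    }
    if requester_pays:
        # Included before signing, so the signature covers it.
        params["x-amz-request-payer"] = "requester"

    # Canonical query string: sorted by name, values fully URL-encoded.
    canonical_query = "&".join(
        f"{quote(k, safe='')}={quote(v, safe='')}"
        for k, v in sorted(params.items())
    )
    canonical_uri = "/" + quote(key, safe="/")
    canonical_request = "\n".join([
        "GET", canonical_uri, canonical_query,
        f"host:{host}\n", "host", "UNSIGNED-PAYLOAD",
    ])
    string_to_sign = "\n".join([
        "AWS4-HMAC-SHA256", amz_date, scope,
        hashlib.sha256(canonical_request.encode("utf-8")).hexdigest(),
    ])

    # Derive the signing key: date -> region -> service -> "aws4_request".
    signing_key = _hmac_sha256(("AWS4" + secret_key).encode("utf-8"), datestamp)
    for part in (region, "s3", "aws4_request"):
        signing_key = _hmac_sha256(signing_key, part)
    signature = hmac.new(
        signing_key, string_to_sign.encode("utf-8"), hashlib.sha256
    ).hexdigest()

    return f"https://{host}{canonical_uri}?{canonical_query}&X-Amz-Signature={signature}"
```

Because the requester-pays parameter is part of the canonical query string, a URL signed without it cannot simply have it appended afterwards; the signature would no longer match. So the decision has to be made at signing time.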

xbrianh commented 5 years ago

See discussion here for correctly configuring S3 presigned url expiration.