broadinstitute / cromwell

Scientific workflow engine designed for simplicity & scalability. Trivially transition from one-off use cases to massive-scale production environments.
http://cromwell.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

Support Presigned URLs as resolved URL #3817

Open danbills opened 6 years ago

danbills commented 6 years ago

Introduction

The essence of a presigned URL is that it gives you privileged access to data (via HTTP verbs, usually GET) for a finite amount of time. Some metadata can be obtained via the HEAD verb.
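Concretely, a minimal sketch of both verbs against a placeholder presigned URL (the URL and the `requests` library are assumptions here, not anything Cromwell ships):

```python
import requests

# Placeholder presigned URL; the signature and expiry are baked into the query string.
url = "https://storage.example.com/bucket/object?Expires=1530287802&Signature=..."

resp = requests.get(url)    # fetches the data itself; no extra credentials needed until expiry
print(resp.status_code, len(resp.content))

meta = requests.head(url)   # headers only: size, content type, provider-specific hash headers
print(meta.headers.get("Content-Length"), meta.headers.get("Content-Type"))
```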

DOS URIs can be resolved to presigned URLs, and it's not immediately obvious how to provide the information Cromwell needs to do its job. Hence this document.

The essence of this question is how we leverage HTTP.
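For framing only, a sketch of the two-step flow as I understand it; the resolver endpoint and the response shape below are assumptions, not a documented Martha contract:

```python
import requests

# Hypothetical resolver endpoint and DOS URI, shown only to frame the question.
resolver = "https://martha.example.org/martha_v2"   # placeholder URL
dos_uri = "dos://dataguids.org/some-guid"           # placeholder DOS URI

resp = requests.post(resolver, json={"url": dos_uri})
# Assumed response shape: somewhere in the JSON there is a time-limited HTTPS URL.
presigned_url = resp.json()["dos"]["data_object"]["urls"][0]["url"]
```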

Information Needed for Cromwell to work

  1. The data itself, i.e. the file to which the URL refers.
  2. Size Metadata
  3. Hash Metadata
  4. Byte-level access (needed for things like WDL's read_lines; see the sketch after this list)
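A minimal sketch of how items 1-4 might map onto plain HTTP against a presigned URL (placeholder URL; the header names are the common ones, not guaranteed by every cloud):

```python
import requests

url = "https://example.com/object?signature=..."  # placeholder presigned URL

# 1. The data itself
body = requests.get(url).content

# 2. Size metadata, from a HEAD response
size = int(requests.head(url).headers["Content-Length"])

# 3. Hash metadata -- header names vary by provider (ETag, x-goog-hash, ...)
hashes = requests.head(url).headers.get("x-goog-hash")

# 4. Byte-level access, via a Range request (206 Partial Content if honored)
first_kb = requests.get(url, headers={"Range": "bytes=0-1023"})
print(size, hashes, first_kb.status_code, len(first_kb.content))
```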

Information Provided by OpenDJ/Martha as of 6/25/18

Information provided by HTTP (in theory)

Information Not Provided by OpenDJ/Martha as of 6/25/18

Outstanding questions (please comment if you have info)

  1. What metadata can be obtained via HEAD?
  2. Is the HEAD metadata a standard, and do all clouds implement that standard? (I think ETag is the common name for this info.)
  3. How does call-caching work with an expiration date on the URL?
  4. Byte-level access: an HTTP request for the data can be limited to a range via the Range header. Do clouds support this feature? Are there other ways of achieving this requirement? (See the probe sketched after this list.)
  5. Write access: WDL supports write_lines, which AFAIK is only possible via PATCH.
  6. Can Cromwell use any hash besides CRC32? If not, how do we obtain CRC32 reliably?
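On question 4, a quick probe one could run against any resolved psURL (placeholder URL; `requests` assumed) to see whether ranged reads are honored:

```python
import requests

url = "https://example.com/object?signature=..."  # placeholder presigned URL

# Advertised support shows up as "Accept-Ranges: bytes" on a HEAD response...
print(requests.head(url).headers.get("Accept-Ranges"))

# ...but the real test is whether a ranged GET returns 206 Partial Content with
# only the requested bytes, rather than 200 with the whole object.
resp = requests.get(url, headers={"Range": "bytes=0-99"})
print(resp.status_code, len(resp.content))
```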
cjllanwarne commented 6 years ago

re 5: Write access: we have to stage stuff in CWL too - maybe even more so than we do in WDL

Regardless, my comment would be that we don't necessarily need to write to the same FS that inputs are coming from - e.g. if we're running on PAPI we could "write" to gs://... even if most inputs are coming from https://...

cjllanwarne commented 6 years ago

re 6: hashes besides CRC32 - yes, we can use anything. Only if we want to call-cache between tasks from different FSs do we need to standardize.

That's not been a problem for now between local (md5) and GCS (CRC32C) because we'd never call cache between local and PAPI anyway
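For comparison purposes, a sketch of computing both digests locally, base64-encoded the way GCS reports them in x-goog-hash (md5 is standard library; CRC32C here assumes the third-party google-crc32c package):

```python
import base64
import hashlib

import google_crc32c  # third-party package; an assumption, not something Cromwell ships


def local_hashes(path):
    """Return (md5_b64, crc32c_b64) in the same base64 form GCS uses in x-goog-hash."""
    md5 = hashlib.md5()
    crc = google_crc32c.Checksum()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            md5.update(chunk)
            crc.update(chunk)
    return (
        base64.b64encode(md5.digest()).decode(),
        base64.b64encode(crc.digest()).decode(),
    )
```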

kcibul commented 6 years ago

3: call caching: in order to get a cache hit, we only need to obtain the MD5 from the input and match it to something run before. If we can get this from the supplied psURL, then we have the md5 and can match internally.

As long as we are not using psURLs as destinations (e.g. we are still writing task outputs to the cromwell execution bucket) performing the "hit" (e.g. doing the copy/reference) shouldn't be affected by psURLs.
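A toy sketch of that matching step (placeholder URL and an in-memory stand-in for the call cache; not Cromwell's actual call-caching code):

```python
import base64
import requests


def md5_from_presigned(url):
    """Pull the md5 (as hex) out of a GCS-style x-goog-hash header on a HEAD response."""
    headers = requests.head(url).headers
    # requests joins repeated x-goog-hash headers into one comma-separated value.
    for part in headers.get("x-goog-hash", "").split(","):
        part = part.strip()
        if part.startswith("md5="):
            return base64.b64decode(part[len("md5="):]).hex()
    return None


# Toy stand-in for the call cache: input md5 -> location of a previous call's outputs.
previous_results = {"09f7e02f1290be211da707a266f153b3": "gs://cromwell-executions/workflow/call-x/"}

hit = previous_results.get(md5_from_presigned("https://example.com/object?signature=..."))
```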

kcibul commented 6 years ago

1: on google, generating a psURL and calling HEAD on it (which you can also do with a GET and only ask for the 1st byte) gives:

```
HTTP/2 200
x-guploader-uploadid: AEnB2Uo10d8ECr7tR5601R8roi8MIXlzvg1rjyMui9wavFC7KO2Pv2QBk94Qv22mgAz5Ih0nnayc2kXj5XBFgRUqkNTJNtAo7Q
expires: Fri, 29 Jun 2018 15:56:42 GMT
date: Fri, 29 Jun 2018 15:56:42 GMT
cache-control: private, max-age=0
last-modified: Fri, 29 Jun 2018 15:53:49 GMT
etag: "09f7e02f1290be211da707a266f153b3"
x-goog-generation: 1530287629024005
x-goog-metageneration: 1
x-goog-stored-content-encoding: identity
x-goog-stored-content-length: 6
content-type: text/plain
content-language: en
x-goog-hash: crc32c=sMnOMw==
x-goog-hash: md5=CffgLxKQviEdpweiZvFTsw==
x-goog-storage-class: STANDARD
accept-ranges: bytes
content-length: 6
server: UploadServer
alt-svc: quic=":443"; ma=2592000; v="43,42,41,39,35"
```
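As a sanity check on the response above: the base64 value in x-goog-hash: md5=... is just the binary form of the digest shown in the etag header (true for non-composite GCS objects), which a couple of standard-library lines make visible:

```python
import base64

# Values copied from the HEAD response above.
etag = "09f7e02f1290be211da707a266f153b3"
goog_md5_b64 = "CffgLxKQviEdpweiZvFTsw=="

# base64 -> raw bytes -> hex reproduces the ETag, so the md5 is recoverable from either header.
assert base64.b64decode(goog_md5_b64).hex() == etag
```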