chanzuckerberg / miniwdl

Workflow Description Language developer tools & local runner

task runtime: download URI input Files #260

Closed: mlin closed this issue 4 years ago

mlin commented 5 years ago
kislyuk commented 4 years ago

Hi @mlin -

@morsecodist on our team has done some very good work that could plug into this nicely. As you know, we have been using https://github.com/chanzuckerberg/s3mi, which is able to saturate 10GE but not necessarily 25GE. In my testing, unoptimized awscli and aria2c pull data from S3 at 200MB/s, and optimized aria2c can get to 500-700MB/s; s3mi gets 700-1000MB/s. However, s3mi is not as "off-the-shelf" as we would like.

@morsecodist did some in-depth profiling and wrote a tool based on the AWS Go SDK (https://github.com/chanzuckerberg/s3parcp). The tool is beautifully small (<100 SLOC) and leverages the parallelized S3 download functionality in the Go SDK to easily hit 2GB/s on a ramdisk and in excess of 1500MB/s on NVMe (confirmed in my own testing).

@morsecodist is now working on adding integrity control to this tool (my suggestion was to use crc32c, which is one of the few integrity control schemes that can keep up with this kind of speed).

This will form an important part of our strategy for staging large reference inputs at IDseq. My suggestion is that you look into using s3parcp in miniwdl. If we end up finding a way to use it for both uploads and downloads, we can make it write the crc32c checksum into S3 object metadata to do roundtrip integrity control with no extra fuss.

mlin commented 4 years ago

Nice - I think there's a cute way we can provide flexibility for exotic tools like this. In miniwdl there will be an extensible mapping from URI schemes to WDL tasks capable of localizing them. So, for example, the built-in mapping for https might be associated with:

aria2c.wdl

```WDL
task aria2c {
    input {
        String url
        Int connections = 10
    }
    command <<<
        set -euxo pipefail
        mkdir __out
        cd __out
        aria2c -x ~{connections} -s ~{connections} \
            --file-allocation=none --retry-wait=2 --stderr=true \
            "~{url}"
    >>>
    output {
        File file = glob("__out/*")[0]
    }
    runtime {
        docker: "hobbsau/aria2"
    }
}
```

When run_local_task starts, it scans its inputs for Files with URI values. For each one, it recurses on itself to run the appropriate localizer task and rewrites the input to point at the downloaded file, before proceeding with the normal task runtime procedure. Thus, by bootstrapping from our own 'run this command in this docker image' capabilities, we could over time enable s3parcp for s3://, gsutil for gs://, and so on, without having to add all of these as miniwdl installation dependencies. I hope it's clear that aria2c for https:// is the natural starting point for the first iteration.
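For illustration only, an s3:// localizer built on s3parcp could follow the same pattern as aria2c.wdl above. This is a rough sketch, not a committed design: the docker image name is a placeholder, and credential handling, concurrency tuning, and the planned crc32c check are all left out.

```WDL
task s3parcp {
    input {
        String uri
    }
    command <<<
        set -euxo pipefail
        mkdir __out
        cd __out
        # plain source/destination copy; tuning and checksum flags omitted
        s3parcp "~{uri}" "$(basename "~{uri}")"
    >>>
    output {
        File file = glob("__out/*")[0]
    }
    runtime {
        # placeholder image; assumes s3parcp is on PATH and AWS credentials are available in the container
        docker: "example/s3parcp"
    }
}
```

A gsutil task for gs:// would look much the same.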

Need to think through what the extensibility mechanism looks like, how the localizer tool gets any credentials it needs, how other parameters (like connections, above) get customized, and where the downloaded file gets stored.

kislyuk commented 4 years ago

That makes a lot of sense and works for us, as long as the extensibility is there and we can drop an s3parcp.wdl into the config and run our workflows.

To be clear, I wouldn't call this exotic. It's a wrapper around the AWS Go SDK for downloading from S3. That's a fairly vanilla application.

mlin commented 4 years ago

Proof of concept (WIP): https://github.com/chanzuckerberg/miniwdl/pull/279