broadinstitute / cromwell

Scientific workflow engine designed for simplicity & scalability. Trivially transition from one-off use cases to massive-scale production environments
http://cromwell.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

[EPIC] Support for file-of-file-names input type #1058

Closed kcibul closed 6 years ago

kcibul commented 8 years ago

As a pipeline author, I sometimes have pipelines that require lots of input files. For example, the joint calling pipeline has a step where I need to combine genotypes from all the samples. The number of samples can be in the tens of thousands and is growing.

Currently, this causes lots of problems. In Cromwell, having that many inputs causes memory and database problems because parameters are first-class citizens and there are so many of them. In addition, they are file paths, which can be very long. Together this can lead to a footprint of several GB.

Similarly, this causes problems for the underlying backend (e.g. JES) because of the volume. Recently our requests to JES were truncated because a load balancer in front of the service had a maximum request size.

I would like to be able to specify instead a single file that contains a list of file names. My task will know what to do with this. I need a way to indicate this in WDL (perhaps a new type, like FOFN instead of File?). With this information, the Cromwell backend can do the correct thing during localization. For example, in JES, we would tell the JES API that this is a FOFN.

Each backend would then need to handle this type. When receiving a FOFN input type the backend would first localize the FOFN and then iterate through the contents to localize each file. A new FOFN would then be rewritten to reference the local paths, and that FOFN would be used in place of the original FOFN as parameters to the tasks.
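The localization steps above can be sketched roughly as follows. This is only an illustration in Python, not Cromwell's implementation: the function name `localize_fofn`, the `fetch` parameter, and the output file names are all hypothetical, and `shutil.copy` stands in for whatever a real backend would use to copy from its storage (e.g. a bucket).

```python
import shutil
from pathlib import Path


def localize_fofn(fofn, work_dir, fetch=shutil.copy):
    """Localize every file listed in `fofn`; return the path of a rewritten FOFN.

    `fetch(src, dst)` is a hypothetical hook: any callable that copies `src`
    to the local path `dst`. The default only handles local file paths.
    """
    work_dir = Path(work_dir)
    inputs_dir = work_dir / "inputs"
    inputs_dir.mkdir(parents=True, exist_ok=True)

    # Step 1: localize the FOFN itself.
    local_fofn = work_dir / "original.fofn"
    fetch(str(fofn), str(local_fofn))

    # Step 2: iterate through its contents and localize each listed file.
    local_paths = []
    for index, line in enumerate(local_fofn.read_text().splitlines()):
        remote = line.strip()
        if not remote:
            continue  # ignore blank lines
        # Prefix with the line index so files sharing a basename do not collide.
        target = inputs_dir / f"{index:06d}_{Path(remote).name}"
        fetch(remote, str(target))
        local_paths.append(str(target))

    # Step 3: write a new FOFN that references the local paths.
    rewritten = work_dir / "localized.fofn"
    rewritten.write_text("".join(path + "\n" for path in local_paths))
    return rewritten
```

The task would then receive the path returned by `localize_fofn` in place of the original FOFN, so only one parameter crosses the Cromwell/backend boundary regardless of how many files are listed.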

First, we should assess feasibility with a thought experiment on the joint calling workflow, to see whether FOFNs would solve the parameter space problem (#1059).

pgrosu commented 8 years ago

@kcibul I believe GATK can perform incremental joint calling, so you should be able to use a collection of Cromwell submissions to build it up. Would that work?

kcibul commented 8 years ago

It's really a question of time and cost efficiency, but since we've got the actual GATK team and joint calling authors working on the pipeline we'll definitely take advantage of all the features there are (and they'll write the ones we need!)


Kristian Cibulskis Chief Architect, Data Sciences & Data Engineering Broad Institute of MIT and Harvard kcibul@broadinstitute.org

pgrosu commented 8 years ago

Hi Kristian,

I understand, but what you're asking is very possible - see my previous discussion here about creating 1 billion simultaneous connections, and anything that is not accessible can be pre-cached via buckets during idle periods (i.e. nightly):

https://github.com/googlegenomics/utils-java/issues/62#issuecomment-220444203

So you should be able to create your own Pipeline implementation very easily via gcloud compute instances create, VM metadata startup scripts and/or Dataflow Pipelines, and mimic JES:

https://cloud.google.com/sdk/gcloud/reference/compute/instances/create

https://cloud.google.com/deployment-manager/step-by-step-guide/setting-metadata-and-startup-scripts

https://cloud.google.com/dataflow/pipelines/constructing-your-pipeline#applying-transforms-to-process-pipeline-data

If you look at the JES API, you'll notice most of it mirrors the gcloud commands and parameters:

https://www.googleapis.com/discovery/v1/apis/genomics/v1alpha2/rest

Again, the concepts for speeding up searches on dynamically streaming (processed) analysis results have a foundation in inverted indices, which search engines use all the time. I posted a couple of these here:

https://github.com/ga4gh/schemas/pull/253#issuecomment-97525342

https://github.com/ga4gh/schemas/issues/142#issuecomment-55518571

This way your searches are always fresh and would operate without any delay.

Hope it helps, Paul

katevoss commented 7 years ago

@kcibul As part of the 11k Joint Genotyping effort, this need for FOFN was solved. Is there anything related that is still missing as we scale to 20k and beyond?