DataBiosphere / toil

A scalable, efficient, cross-platform (Linux/macOS) and easy-to-use workflow engine in pure Python.
http://toil.ucsc-cgl.org/.
Apache License 2.0

CWL: Deriving resource requirements from inputs fails #1647

Closed mr-c closed 7 years ago

mr-c commented 7 years ago

as noted in #1638

This is a follow-on to #1540 and #1621.

mr-c commented 7 years ago

A braindump from @brainstorm and me :-)

We've narrowed down the problem: the `inputs` object made available to CWL expressions while evaluating any ResourceRequirement for a step in a CWL Workflow is in fact the input object (a.k.a. the job object) of the CWL Workflow itself, not the inputs for that step.
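A minimal sketch of the mismatch described above, in plain Python rather than real cwltool or Toil code. The expression is modelled as a Python callable, and the names `ram_min_expression`, `evaluate_ram_min`, `sample` and `reads` are hypothetical, chosen only for illustration. It assumes the step renames a workflow input, which is one way the wrong input object can make a size-based expression impossible to evaluate.

```python
# Hypothetical stand-in for a CWL expression that derives ramMin from the
# size of an input File. Real CWL expressions are JavaScript; a Python
# callable is used here only to keep the sketch self-contained.
def ram_min_expression(inputs):
    size_mib = inputs["reads"]["size"] // (1024 * 1024)
    # Ask for 2 MiB of RAM per MiB of input, with a floor of 1024 MiB.
    return max(1024, 2 * size_mib)


# Input object of the whole workflow: the File is called "sample" here.
workflow_inputs = {
    "sample": {
        "class": "File",
        "location": "file:///data/sample.fastq",
        "size": 8 * 1024 ** 3,
    }
}

# Input object of the step: the same File, bound to the step input "reads".
step_inputs = {"reads": workflow_inputs["sample"]}


def evaluate_ram_min(inputs):
    """Evaluate the expression, returning None if the inputs do not fit."""
    try:
        return ram_min_expression(inputs)
    except KeyError:
        # The expression refers to a step input that this object lacks.
        return None


ram_from_step = evaluate_ram_min(step_inputs)
ram_from_workflow = evaluate_ram_min(workflow_inputs)
```

With the step's own inputs the expression yields a number; with the workflow-level object it cannot be evaluated at all in this sketch, because `reads` only exists at the step level.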

What we don't know how to do is to create/find the correct job object -- if we had it, we would make sure the builder on https://github.com/BD2KGenomics/toil/blob/7b1f22ecd3f8aea6d0a86ab56344d7d80bade4ab/src/toil/cwl/cwltoil.py#L247 was created using it

@tetron Any suggestions?

mr-c commented 7 years ago

Our plan:

  1. Trace what cwltool does with respect to parsing a ResourceRequirement and compare to the current Toil codepath(s).
  2. Later, to support non-FileJobStores like AWS: in cwltoil, generate a valid CWL input job object, including the size attribute for File objects (obtained via AbstractJobStore.getSize()), without retrieving or copying files from Toil's job store.
  3. @mr-c will add a conformance test for this situation: https://github.com/common-workflow-language/common-workflow-language/issues/346
  4. @mr-c will develop a CWL extension to indicate that the otherwise optional size attribute on File objects is required for a particular CWL description (for example cwltool:FileSizeRequiredRequirement). This extension will be implemented as a namespaced, flag-protected feature in cwltool so that Toil can know about it as well.
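Step 2 of the plan could look roughly like the sketch below. It is not Toil code: `FakeJobStore`, `make_file_object` and the `jobstore-file:` location scheme are hypothetical, and the `getSize(file_id)` signature is an assumption based only on the method name mentioned in the plan.

```python
class FakeJobStore:
    """Hypothetical stand-in for a Toil job store; it only tracks sizes."""

    def __init__(self, sizes):
        self._sizes = dict(sizes)

    def getSize(self, file_id):
        # Assumed signature: return the size in bytes of a stored file.
        return self._sizes[file_id]


def make_file_object(job_store, file_id, basename):
    """Build a CWL File object with 'size' set, without fetching the file."""
    return {
        "class": "File",
        # Placeholder scheme; the real location format is not specified here.
        "location": "jobstore-file:" + file_id,
        "basename": basename,
        "size": job_store.getSize(file_id),
    }


store = FakeJobStore({"abc123": 2048})
reads = make_file_object(store, "abc123", "reads.fastq")
```

The point of the design is that only metadata is queried, so an expression that reads `inputs.<name>.size` could be evaluated even when the file lives in a remote job store.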

mr-c commented 7 years ago

cwltoil needs to know the resource requirements when building Toil's Job graph, but no jobs have run yet at that point, so we have no information about the outputs of the preceding steps.

See:

https://github.com/BD2KGenomics/toil/blob/7b1f22ecd3f8aea6d0a86ab56344d7d80bade4ab/src/toil/cwl/cwltoil.py#L264
https://github.com/BD2KGenomics/toil/blob/7b1f22ecd3f8aea6d0a86ab56344d7d80bade4ab/src/toil/job.py#L272
https://github.com/BD2KGenomics/toil/blob/7b1f22ecd3f8aea6d0a86ab56344d7d80bade4ab/src/toil/job.py#L56

Now our question is: for any given job, can we inject "fresher" resource requirements into the Toil Job object after the time the ancestor jobs are finished but before those resource requirements are used to schedule/reserve compute?
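To make the question concrete, here is a toy model of "late-bound" requirements. It is a sketch in plain Python, not Toil's scheduler and not necessarily what the eventual fix does: `SketchJob` and `run_chain` are hypothetical names, and the model assumes a simple linear chain in which each job's memory request is a callable that is evaluated only once the upstream output exists.

```python
class SketchJob:
    """Hypothetical job: 'requirements' is resolved lazily, not at build time."""

    def __init__(self, name, run, requirements):
        self.name = name
        self.run = run                    # callable: upstream output -> output
        self.requirements = requirements  # callable: upstream output -> memory


def run_chain(jobs):
    """Run jobs in order, resolving each request just before reservation."""
    reservations = []
    upstream_output = None
    for job in jobs:
        # Late binding: the ancestor has finished, so its real output
        # (for example a file size) is available to compute the request.
        memory = job.requirements(upstream_output)
        reservations.append((job.name, memory))
        upstream_output = job.run(upstream_output)
    return reservations


jobs = [
    SketchJob("produce", run=lambda _: {"size": 3000}, requirements=lambda _: 100),
    SketchJob(
        "consume",
        run=lambda out: out["size"],
        requirements=lambda out: 2 * out["size"],
    ),
]
reservations = run_chain(jobs)
```

In this model the `consume` job's request depends on the size that `produce` actually emitted, which could not have been known when the graph was first built.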

mr-c commented 7 years ago

See https://github.com/BD2KGenomics/toil/pull/1810 for a first pass of this (!!!)

tetron commented 7 years ago

#1810 is merged, closing this issue.