fhdsl / FH_WDL102_Workflows

Info about designing, optimizing and deploying your own WDL workflows
https://hutchdatascience.org/FH_WDL102_Workflows/
Creative Commons Attribution 4.0 International

dry run? #15

Open jayoung opened 1 year ago

jayoung commented 1 year ago

here (the 'validate a workflow' section) it says "It does not perform a “dry run” or check to see if any of your inputs are actually available, only that it can interpret what you told it."

to me, an obvious question here is - is there such a thing as a 'dry run' with Cromwell/WDL? I have not (yet?) tackled learning snakemake but I think I know enough to have an idea of what 'dry run' means, and it seems like a very appealing concept.

If there is a way to do a dry run, would be great to tell us how. If not, also perhaps helpful to say it's just not a thing for WDLs.

same goes for the "check to see if any of your inputs are actually available" part - is there a capability for that?

vortexing commented 1 year ago

Good point!
Cromwell doesn't really have a dry run mode per se, but this validate process IS basically what those other tools consider to be a dry run. I edited the text to read as follows. Does this work?

This checks the format of your workflow files to make sure you have a valid file in a known format that Cromwell can interpret. This is called a "dry run" to ensure that your tasks are wired up correctly, but Cromwell does not try to see if any of your inputs are actually available, only that it can interpret what you told it. One reason for this is that Cromwell can pull files from local filesystems, AWS S3, Google buckets and Azure blobs, so the test of its ability to actually get your inputs happens when you run the workflow the first time. Luckily, Cromwell will only get the file inputs it needs at that moment, and if it can't get them, it won't run that specific task (but it can continue with the other tasks it can do!).

jayoung commented 1 year ago

yes, that makes sense, and that's useful. Can I suggest a rewrite for clarity?

This checks your workflow files (wdl / jsons) to test:

This is called a "dry run".

Note that this does NOT test whether your input files are actually available, partly because Cromwell can pull files from local filesystems, AWS S3, Google buckets and Azure blobs. The process of testing input availability will only happen when you run the workflow for the first time. If some input files are missing, Cromwell will run tasks for the input files that ARE available, skipping tasks where inputs can't be found.

jayoung commented 1 year ago

I can see myself WANTING to check all the inputs before I start when a workflow is long, so that I can troubleshoot immediately rather than a day later. Is that something you sometimes do? e.g. I could imagine writing an additional task at the start of the workflow that checks for existence of ALL inputs from the workflow, and exits if one or more are missing.

Example: let's say in diy-cromwell-server/testWorkflows/tg-wdl-VariantCaller you want to check for the annovar inputs like known_indels_sites_VCFs upfront, so that you know you have all your ducks in a row before running the whole thing.
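For illustration, here is a minimal sketch of that kind of up-front check, done outside the workflow in Python rather than as a WDL task. It is not a Cromwell feature. It assumes the inputs JSON is a flat mapping of input names to values, and it guesses that any string holding an absolute path is a local file input, since the JSON itself carries no type information. Remote locations (`s3://`, `gs://`, `http(s)://`) are only reported as unchecked, because testing them would need the matching cloud credentials. The function names and the `REMOTE_SCHEMES` set are hypothetical.

```python
import os
from urllib.parse import urlparse

# Assumption: these URL schemes mark inputs that live in remote storage.
REMOTE_SCHEMES = {"s3", "gs", "http", "https"}


def iter_strings(value):
    """Yield every string nested inside a JSON value (lists and dicts included)."""
    if isinstance(value, str):
        yield value
    elif isinstance(value, list):
        for item in value:
            yield from iter_strings(item)
    elif isinstance(value, dict):
        for item in value.values():
            yield from iter_strings(item)


def check_inputs(inputs):
    """Sort path-like input values into found, missing, and unchecked (remote).

    `inputs` is the parsed inputs JSON. Strings that are neither remote URLs
    nor absolute paths (sample names, for instance) are ignored.
    """
    report = {"found": [], "missing": [], "unchecked": []}
    for key, value in inputs.items():
        for text in iter_strings(value):
            if urlparse(text).scheme in REMOTE_SCHEMES:
                # Remote file: existence is not tested in this sketch.
                report["unchecked"].append((key, text))
            elif os.path.isabs(text):
                # Heuristic: an absolute path is assumed to be a file input.
                status = "found" if os.path.exists(text) else "missing"
                report[status].append((key, text))
    return report
```

To use it, load the inputs file with `json.load` and pass the resulting dict to `check_inputs`. Any entries under `"missing"` (for example a path listed under a key such as known_indels_sites_VCFs) can then be fixed before the workflow is submitted.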

vortexing commented 1 year ago

I have now put your text into WDL 101 with PR fhdsl/FH_WDL101_Cromwell#35. However, the issue of actually testing for workflow inputs should live in WDL 102, I think. I'm not aware of a function in Cromwell that will do what you want, so I need to explore a bit to see if one exists, and then document it in WDL 102 if it does. Or, like you say, I could make a hacky "input tester" task (I actually have one for another reason) that you could copy and use at the beginning of your workflow to force localization of all inputs before running anything.