Closed mr-c closed 2 years ago
Thank you for your suggestion! Cromwell correctly co-localizes files in the same directory. We use cromwell almost exclusively and therefore there are some cromwellisms in our code. Thanks for pointing these out. Our aim is that tasks should run on any execution engine.
I checked the spec again and it also mentions:
Two input files that originated in the same storage directory must also be localized into the same directory for task execution (see the special case handling for Versioning Filesystems below).
So "WDL engines may place files wherever they want" is not the case. This is the same in the 1.0 spec that we follow (note the 1.0 version in the first line of the file). As long as the file and the index are in the same directory the localization should be correct.
@mr-c Can you describe what exact problem this PR solves in your case?
Two input files that originated in the same storage directory
Sure, but a user might have those files in separate directories, especially if they came from different steps.
Can you describe what exact problem this PR solves in your case?
We are using some of these tasks as tests cases for https://github.com/common-workflow-lab/wdl-cwl-translator
Sure, but a user might have those files in separate directories, especially if they came from different steps.
That is why we make sure the indexing step also returns the indexed file so these end up in the same directory: https://github.com/biowdl/tasks/blob/68c56c9f7cceab1658dfae39f8d130d8ee4fe71a/samtools.wdl#L266
But I can see your point. This will make the gatk.wdl more robust. Especially if indexing is dependent on other tasks. Review coming up.
Further reflection:
In all BioWDL tasks we made sure that either
The GVCFFiles workaround for instance seem to be very extensive in these PR. But the biowdl/tasks generate these gvcfs and index at the same time, so this problem does not occur. Therefore it is not needed to create a workaround for a problem that does not exist when only biowdl/tasks are used.
So is this a task input problem, or is this a problem with other tasks' output? In BioWDL we have ensured that the output of all tasks is correct, so we don't have to deal with these sorts of problems in the input. It is usually easier to deal with these problems at the output stage, rather than the input stage,. This keeps the code simple and therefore easier to maintain.
@rhpvorderman Thank you for your review!
From my amended commit message:
WDL engines may place files wherever they want as the rules at https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#task-input-localization only apply if the input files come from the same storage directory (which may not be the case due to the use of object stores like S3 or inputs from disparate steps)
This also helps with conversion from WDL to CWL as CWL requires explicit co-localization** if the CWL equivalent to WDL's command
doesn't include its own co-localizations.
** in the form of secondaryFiles
or via InitialWorkDirRequirement
, for which there are no WDL analogues
@mr-c
We've had an internal discussion with the BioWDL team about this PR. We are conflicted. We can see the value of this, but ultimately this makes the tasks more complicated.
These are the considerations for not merging this PR:
Writing mechanisms into every tasks that ensure colocalization of the inputs in every task is cumbersome and error-prone. These tasks are used in production everyday and it is best to keep the maintenance burden and the potential for errors as low as possible. As stated earlier, ensuring all tasks have a correct output provides a more straightforward and maintenable way to prevent errors with colocalization.
Thank you @rhpvorderman for your reviews. I guess I'll maintain a long-term fork over at https://github.com/mr-c/biowdl_tasks_cwlcompat#readme
If you could point me to your test data, I would appreciate it.
It is hard to test tasks itself (or was, I believe miniwdl now allows running of single tasks, so we could test them individually now.)
Our tasks are tested as part of workflows. For our up to date production workflows and their tests I suggest looking at: https://github.com/biowdl/germline-DNA and https://github.com/biowdl/rna-seq.
The tests use pytest-workflow, which uses YAML files to define the tests. This might be of interest to you for testing CWL workflows.
Our tasks are tested as part of workflows. For our up to date production workflows and their tests I suggest looking at: https://github.com/biowdl/germline-DNA and https://github.com/biowdl/rna-seq.
The tests use pytest-workflow, which uses YAML files to define the tests. This might be of interest to you for testing CWL workflows.
Cool! Reminds me of cwltest
which is used to run the CWL Conformance tests and my unfinished PR from 2018 to transform it into a pytest
plugin: https://github.com/common-workflow-language/cwltest/pull/74
Given the amount of overlap, we should consider merging the two projects . If you'd like to have a chat about that and related topics, I'd enjoy that!
Given the amount of overlap, we should consider merging the two projects .
I think the scope of the projects diverge too much for that. Pytest-workflow allows you to swap out your entire workflow written in WDL and rewrite it in snakemake, bash or whatever and still run and verify with minimal changes to the test files. Commands have to be put in the test files themselves. I.e. cromwell run -i inputs.json myworkflow.wdl
. Therefore choosing different engines dynamically is not supported. This is something that cwltest does support. But it is impossible to implement this in pytest-workflow without losing its workflow-agnostic scope.
The biggest virtue of pytest-workflow is that it is extremely simple. It loads a YAML, it executes a shell command using subprocess in a temporary folder. After the run it checks if the desired files are present. Due to this setup there is not much code, and the code that is there is bug-free. We have run thousands of tests with pytest-workflow already. Having something that reliable is a real asset.
I took a look at cwltest and it also looks very straightforward in its design. I like it. I think it is very valuable to keep these as two separate projects which both have very robust codebases and not merge them and create something more complex.
If you'd like to have a chat about that and related topics, I'd enjoy that!
I think I met you before (in passing) at the GCC2017. I was just starting my career in bioinformatics back then. We had a conversation about Debian if I remember correctly. I am looking forward to the next time we meet!
WDL engines may place files wherever they want (see https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md#task-input-localization)
Checklist
parameter_meta
was added/updated (if required).