biowdl / tasks

A collection of reusable WDL tasks.
https://biowdl.github.io/tasks/
MIT License

force co-localization of indexes with their targets #291

Closed mr-c closed 2 years ago

mr-c commented 3 years ago

WDL engines may place files wherever they want (see https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md#task-input-localization)

rhpvorderman commented 2 years ago

Thank you for your suggestion! Cromwell correctly co-localizes files into the same directory. We use Cromwell almost exclusively, and therefore there are some Cromwell-isms in our code. Thanks for pointing these out. Our aim is that tasks should run on any execution engine.

rhpvorderman commented 2 years ago

I checked the spec again and it also mentions:

Two input files that originated in the same storage directory must also be localized into the same directory for task execution (see the special case handling for Versioning Filesystems below).

So "WDL engines may place files wherever they want" is not the case. The same rule appears in the 1.0 spec that we follow (note the 1.0 version in the first line of the file). As long as the file and the index are in the same directory, localization should be correct.
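To make the guarantee concrete, here is a minimal sketch (hypothetical task, not from biowdl/tasks; assumes samtools on PATH) of what the 1.0 localization rule lets a task rely on:

```wdl
version 1.0

# Hypothetical example. Under the 1.0 spec, inputs that originate in
# the same storage directory must be localized into the same execution
# directory, so samtools idxstats can find "sample.bam.bai" sitting
# next to "sample.bam" without the task doing anything special.
task Idxstats {
    input {
        File bam       # e.g. /data/sample.bam
        File bamIndex  # e.g. /data/sample.bam.bai, same source directory
    }

    command {
        samtools idxstats ~{bam} > idxstats.txt
    }

    output {
        File stats = "idxstats.txt"
    }
}
```

If `bam` and `bamIndex` come from *different* source directories, this guarantee no longer applies, which is the scenario the PR is concerned with.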

@mr-c Can you describe what exact problem this PR solves in your case?

mr-c commented 2 years ago

Two input files that originated in the same storage directory

Sure, but a user might have those files in separate directories, especially if they came from different steps.

Can you describe what exact problem this PR solves in your case?

We are using some of these tasks as tests cases for https://github.com/common-workflow-lab/wdl-cwl-translator

rhpvorderman commented 2 years ago

Sure, but a user might have those files in separate directories, especially if they came from different steps.

That is why we make sure the indexing step also returns the indexed file so these end up in the same directory: https://github.com/biowdl/tasks/blob/68c56c9f7cceab1658dfae39f8d130d8ee4fe71a/samtools.wdl#L266

But I can see your point. This would make gatk.wdl more robust, especially if indexing depends on other tasks. Review coming up.
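The output-side pattern referenced above can be sketched roughly like this (simplified from the linked samtools.wdl Index task; names and defaults here are illustrative, not the actual biowdl code):

```wdl
version 1.0

# Sketch of the "fix it at the output" pattern: the indexing task
# copies the input next to where the index is written and returns
# *both* files, so downstream tasks receive them from one directory
# and the spec's co-localization rule applies.
task Index {
    input {
        File bamFile
        String outputPath = "indexed.bam"
    }

    command {
        set -e
        mkdir -p "$(dirname ~{outputPath})"
        cp ~{bamFile} ~{outputPath}
        samtools index ~{outputPath} ~{outputPath}.bai
    }

    output {
        File indexedBam = outputPath      # returned together with the
        File index = outputPath + ".bai"  # index, from the same directory
    }
}
```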

rhpvorderman commented 2 years ago

Further reflection:

In all BioWDL tasks we made sure that either

The GVCFFiles workaround, for instance, seems to be very extensive in this PR. But the biowdl/tasks workflows generate these GVCFs and their indexes at the same time, so this problem does not occur. Therefore there is no need to create a workaround for a problem that does not exist when only biowdl/tasks are used.

So is this a task input problem, or a problem with other tasks' output? In BioWDL we have ensured that the output of all tasks is correct, so we do not have to deal with these sorts of problems on the input side. It is usually easier to deal with these problems at the output stage rather than the input stage. This keeps the code simple and therefore easier to maintain.

mr-c commented 2 years ago

@rhpvorderman Thank you for your review!

From my amended commit message:

WDL engines may place files wherever they want as the rules at https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#task-input-localization only apply if the input files come from the same storage directory (which may not be the case due to the use of object stores like S3 or inputs from disparate steps)

This also helps with conversion from WDL to CWL as CWL requires explicit co-localization** if the CWL equivalent to WDL's command doesn't include its own co-localizations.

** in the form of secondaryFiles or via InitialWorkDirRequirement, for which there are no WDL analogues

rhpvorderman commented 2 years ago

@mr-c

We've had an internal discussion with the BioWDL team about this PR. We are conflicted. We can see the value of this, but ultimately this makes the tasks more complicated.

These are the considerations for not merging this PR:

Writing mechanisms into every task that ensure co-localization of the inputs is cumbersome and error-prone. These tasks are used in production every day, and it is best to keep the maintenance burden and the potential for errors as low as possible. As stated earlier, ensuring all tasks have correct output provides a more straightforward and maintainable way to prevent co-localization errors.
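For contrast, here is a hedged sketch of the per-task forced co-localization the PR proposes (the actual PR targets gatk.wdl; this example uses bcftools and illustrative input names, not the real diff):

```wdl
version 1.0

# Illustrative only: mirrors the *approach* under discussion, not the
# actual PR diff. Each index is symlinked next to its target inside
# the working directory, so the layout no longer depends on where the
# engine localized the inputs.
task RegionSubset {
    input {
        File vcf
        File vcfIndex
    }

    command {
        set -e
        ln -s ~{vcf} input.vcf.gz
        ln -s ~{vcfIndex} input.vcf.gz.tbi
        # Region queries require the .tbi to sit next to the VCF.
        bcftools view -r chr1 input.vcf.gz > subset.vcf
    }

    output {
        File subset = "subset.vcf"
    }
}
```

This is robust against arbitrary localization, but the boilerplate has to be repeated in every task, which is the maintenance cost the BioWDL team weighs against it.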

mr-c commented 2 years ago

Thank you @rhpvorderman for your reviews. I guess I'll maintain a long-term fork over at https://github.com/mr-c/biowdl_tasks_cwlcompat#readme

If you could point me to your test data, I would appreciate it.

rhpvorderman commented 2 years ago

It is hard to test tasks themselves (or was; I believe miniwdl now allows running single tasks, so we could test them individually now).

Our tasks are tested as part of workflows. For our up-to-date production workflows and their tests I suggest looking at https://github.com/biowdl/germline-DNA and https://github.com/biowdl/rna-seq.

The tests use pytest-workflow, which uses YAML files to define the tests. This might be of interest to you for testing CWL workflows.

mr-c commented 2 years ago

Our tasks are tested as part of workflows. For our up-to-date production workflows and their tests I suggest looking at https://github.com/biowdl/germline-DNA and https://github.com/biowdl/rna-seq.

The tests use pytest-workflow, which uses YAML files to define the tests. This might be of interest to you for testing CWL workflows.

Cool! Reminds me of cwltest, which is used to run the CWL conformance tests, and my unfinished PR from 2018 to transform it into a pytest plugin: https://github.com/common-workflow-language/cwltest/pull/74

Given the amount of overlap, we should consider merging the two projects. If you'd like to have a chat about that and related topics, I'd enjoy that!

rhpvorderman commented 2 years ago

Given the amount of overlap, we should consider merging the two projects.

I think the scopes of the projects diverge too much for that. Pytest-workflow lets you swap out your entire workflow written in WDL, rewrite it in snakemake, bash, or whatever, and still run and verify it with minimal changes to the test files. Commands have to be put in the test files themselves, e.g. `cromwell run -i inputs.json myworkflow.wdl`. Therefore choosing different engines dynamically is not supported. This is something that cwltest does support, but it is impossible to implement in pytest-workflow without losing its workflow-agnostic scope.

The biggest virtue of pytest-workflow is that it is extremely simple: it loads a YAML file, executes a shell command using subprocess in a temporary folder, and after the run checks whether the desired files are present. Due to this setup there is not much code, and the code that is there is bug-free. We have already run thousands of tests with pytest-workflow. Having something that reliable is a real asset.

I took a look at cwltest and it also looks very straightforward in its design. I like it. I think it is more valuable to keep these as two separate projects, each with a very robust codebase, than to merge them into something more complex.

If you'd like to have a chat about that and related topics, I'd enjoy that!

I think I met you before (in passing) at the GCC2017. I was just starting my career in bioinformatics back then. We had a conversation about Debian if I remember correctly. I am looking forward to the next time we meet!