Open · ffinfo opened this issue 6 years ago
I believe this might be the intended behavior. Since we treat a subworkflow call the same as a task call, the outputs of a call can't be used until the call completes, and only then can downstream calls continue. I can see the advantages of doing things differently for a subworkflow call -- @geoffjentry @danbills thoughts?
Yeah, this was a design choice. When we added subworkflows we considered whether a subworkflow call should be implemented as a self-contained unit or by just adding its nodes to the workflow graph. In the end, the pros/cons wound up leading to the current implementation.
@geoffjentry @Horneth are there cons to changing the current implementation?
The pros / cons sort of mirror each other.
One pro of changing it is what @ffinfo suggested: you don't have to wait for the whole subworkflow to finish before starting downstream tasks.
However, if your subworkflow is a coherent unit, in the sense that it is only really successful if all of its calls complete successfully, that might not be the desired behavior.
For example in the WDLs above, if task Cat is an expensive operation and the Sleep task ends up failing, you could potentially have wasted time running Cat unnecessarily. Of course this can be mitigated by having Cat depend on Sleep, but it's some sort of "fake" dependency.
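A minimal sketch of that workaround (hypothetical task and variable names, not taken from the WDLs attached to this issue): Cat takes an extra gate input that its command never reads, purely so Cromwell schedules it after Sleep.

```wdl
task Sleep {
  command {
    sleep 60
    echo "done" > sleep_done
  }
  output {
    File done = "sleep_done"
  }
}

task Cat {
  File in_file
  # 'gate' exists only to force Cat to run after Sleep; the command never reads it
  File gate
  command {
    cat ${in_file} > out
  }
  output {
    File out = "out"
  }
}

workflow fake_dependency {
  File input_file
  call Sleep
  call Cat { input: in_file = input_file, gate = Sleep.done }
}
```

The downside is exactly what's described above: the dependency is artificial and has to be threaded through the task's inputs by hand.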
To be fair this behavior already exists with scatters so it might not be that much of a deal, but I remember it was brought up at the time. I think people could be surprised either way.
Another not-quite-similar-but-related example is streaming of files from one task to the other, which CWL lets you specify explicitly (see the streamable field).
You could imagine a scenario like this (if WDL had a similar streamable concept):
```wdl
task A {
  command {
    ./my_script_generating_data.sh > streamable_out
    echo "hello" > out
  }
  output {
    File streamable_out = "streamable_out"
    File out = "out"
  }
  parameter_meta {
    streamable_out: {
      "streamable": true
    }
  }
}

task B {
  File in_file
  command {
    cat ${in_file} | ./my_script_reading_data.sh
  }
}

workflow w {
  call A
  call B { input: in_file = A.streamable_out }
  call B as B2 { input: in_file = A.out }
}
```
Where A and B would actually run simultaneously but B2 would have to wait for A to complete.
That's also a nice option indeed, but streams only work if the tasks can really run next to each other. If there are multiple tasks depending on the same file, this becomes more difficult I think.
Maybe a config value where the user can define whether a subworkflow needs to be completed or not before continuing?
Maybe even better, a parameter_meta field inside the subworkflow itself.
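For illustration only, a sketch of what that could look like; the available_before_completion key is made up here and does not exist in WDL or Cromwell:

```wdl
task Cat {
  command { echo "expensive work" > out }
  output { File out = "out" }
}

task Sleep {
  command { sleep 300 }
}

workflow sub {
  call Cat
  call Sleep

  output {
    File cat_out = Cat.out
  }

  parameter_meta {
    cat_out: {
      # hypothetical flag: let the parent workflow consume this output
      # as soon as Cat finishes, without waiting for Sleep to complete
      "available_before_completion": true
    }
  }
}
```

Putting the flag in the WDL rather than in a Cromwell config would keep the decision with the workflow author rather than the operator.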
When working with subworkflows, Cromwell will only use the outputs of a subworkflow once the complete subworkflow has finished.
In the example below, the sleep command blocks the cat command even though their files are not connected.
Backend: SGE or local
test1.wdl:
test2.wdl:
Command:
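The contents of test1.wdl, test2.wdl and the command did not survive in this copy of the issue. As a rough illustration only (hypothetical contents, not the original files), the setup being described looks something like the following, run with something along the lines of `java -jar cromwell.jar run test1.wdl`:

```wdl
# test2.wdl -- the subworkflow (hypothetical reconstruction)
task EchoFile {
  command {
    echo "hello" > out
  }
  output {
    File out = "out"
  }
}

task Sleep {
  command {
    sleep 300
  }
}

workflow sub {
  call EchoFile
  call Sleep

  output {
    File out = EchoFile.out
  }
}
```

```wdl
# test1.wdl -- the parent workflow (hypothetical reconstruction)
import "test2.wdl" as sub

task Cat {
  File in_file
  command {
    cat ${in_file} > copied
  }
  output {
    File copied = "copied"
  }
}

workflow test1 {
  call sub.sub as subwf
  # Cat only needs EchoFile's output, but Cromwell will not start it
  # until the whole subworkflow (including Sleep) has finished
  call Cat { input: in_file = subwf.out }
}
```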