broadinstitute / cromwell

Scientific workflow engine designed for simplicity & scalability. Trivially transition from one-off use cases to massive-scale production environments
http://cromwell.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

Cromwell waits on all jobs in a subworkflow even if they are not used #3814

Open ffinfo opened 6 years ago

ffinfo commented 6 years ago

When working with subworkflows, Cromwell will only use the outputs of a subworkflow once the complete subworkflow has finished.

In the example below, the Sleep call blocks the Cat call even though no file connects them.

Backend: SGE or local

test1.wdl:

workflow Test1 {
    call Echo
    call Sleep

    output {
        File echoOut = Echo.out
        File sleepOut = Sleep.out
    }
}

task Echo {
    command {
        echo bla > bla.txt
    }
    output {
        File out = "bla.txt"
    }
}

task Sleep {
    command {
        sleep 30 > bla.txt
    }
    output {
        File out = "bla.txt"
    }
}

test2.wdl:

import "test1.wdl" as Test1

workflow Test2 {
    call Test1.Test1 as Test1
    call Cat {
        input:
            inFile = Test1.echoOut
    }
}

task Cat {
    File inFile

    command {
        cat ${inFile} > "bla.txt"
    }
    output {
        File out = "bla.txt"
    }
}

Command:

# no config is given
java -jar <cromwell_32.jar> run test2.wdl
ruchim commented 6 years ago

I believe this might be the intended behavior, since we treat a subworkflow call the same as a task call: the outputs of a call can't be used until the call completes, and only then can downstream calls continue. I can see the advantages of doing things differently for a subworkflow call -- @geoffjentry @danbills thoughts?

geoffjentry commented 6 years ago

Yeah, this was a design choice. When we added subworkflows we considered whether a subworkflow should be implemented as a self-contained unit or by just adding its nodes to the parent workflow graph. In the end, the pros/cons wound up leading to the current implementation.

ruchim commented 6 years ago

@geoffjentry @Horneth are there cons to changing the current implementation?

Horneth commented 6 years ago

The pros/cons sort of mirror each other. One pro of changing it is what @ffinfo suggested: you don't have to wait for the whole subworkflow before starting downstream tasks. However, if your subworkflow is a coherent unit, in the sense that it is only really successful when all of its calls complete successfully, that might not be the desired behavior. For example, in the WDLs above, if task Cat is an expensive operation and the Sleep task ends up failing, you could have wasted time running Cat unnecessarily. Of course this can be mitigated by having Cat depend on Sleep, but it's a sort of "fake" dependency (sketched below).
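For concreteness, that "fake" dependency could be written in the WDLs above by threading the unused Sleep output into Cat (waitFor is just an illustrative name, not part of the original report):

call Cat {
    input:
        inFile = Test1.echoOut,
        waitFor = Test1.sleepOut  # never read by the command; only forces ordering
}

task Cat {
    File inFile
    File waitFor  # exists purely so Cromwell schedules Cat after Sleep

    command {
        cat ${inFile} > bla.txt
    }
    output {
        File out = "bla.txt"
    }
}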

To be fair, this behavior already exists with scatters (see the sketch below), so it might not be that big of a deal, but I remember it being brought up at the time. I think people could be surprised either way.
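To illustrate the scatter parallel, a minimal sketch in draft-2 syntax, reusing Echo and Sleep from above (Gather is an illustrative task, not from the original report):

workflow ScatterExample {
    Array[Int] xs = [1, 2, 3]

    scatter (x in xs) {
        call Echo
        call Sleep
    }

    # Gather only consumes the collected Echo outputs, so it can start once
    # every Echo shard is done, even while Sleep shards are still running;
    # the scatter body is not treated as one opaque unit.
    call Gather { input: inFiles = Echo.out }
}

task Gather {
    Array[File] inFiles

    command {
        cat ${sep=" " inFiles} > merged.txt
    }
    output {
        File out = "merged.txt"
    }
}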

Another not-quite-the-same-but-related example is streaming files from one task to another, which CWL lets you specify explicitly (see its streamable field). You could imagine a scenario like this (if WDL had a similar streamable concept):


task A {
  command {
    ./my_script_generating_data.sh > streamable_out
    echo "hello" > out
  }
  output {
    File streamable_out = "streamable_out"
    File out = "out"
  }
  parameter_meta {
    streamable_out: {
      "streamable": true
    }
  }
}

task B {
  File in

  command {
    cat ${in} | ./my_script_reading_data.sh
  }
}

workflow w {
  call A
  call B { input: in = A.streamable_out }
  call B as B2 { input: in = A.out }
}

Here, A and B would actually run simultaneously, but B2 would have to wait for A to complete.
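For intuition only (this is an illustration of the mechanics, not anything Cromwell does today), such streaming on a local backend could be as simple as a POSIX named pipe:

# purely illustrative: stream "task A" output into "task B" via a FIFO
mkfifo streamable_out
./my_script_generating_data.sh > streamable_out &   # "task A" writes
./my_script_reading_data.sh < streamable_out        # "task B" reads concurrently
wait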

ffinfo commented 6 years ago

That's also a nice option indeed, but streams only work if the tasks can really run side by side. If multiple tasks depend on the same file, this becomes more difficult, I think.

Maybe a config value where the user can define whether a subworkflow needs to be complete before downstream calls continue?
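No such knob exists today; as a purely hypothetical sketch, it might look something like this in Cromwell's HOCON configuration:

# hypothetical setting, not an existing Cromwell option
subworkflows {
  # if false, downstream calls may consume a subworkflow output as soon
  # as the call producing it has finished
  wait-for-completion = true
}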

ffinfo commented 6 years ago

Maybe even better: a parameter_meta field inside the subworkflow itself.
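Mirroring the streamable example above, purely hypothetical syntax for that could be:

workflow Test1 {
    call Echo
    call Sleep

    output {
        File echoOut = Echo.out
        File sleepOut = Sleep.out
    }
    parameter_meta {
        echoOut: {
            # hypothetical flag: callers may consume this output as soon as
            # Echo finishes, without waiting for the rest of the subworkflow
            "availableEarly": true
        }
    }
}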