broadinstitute / cromwell

Scientific workflow engine designed for simplicity & scalability. Trivially transition from one-off use cases to massive-scale production environments
http://cromwell.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

Running with Slurm backend doubles stdout() printout #5932

Open OgnjenMilicevic opened 3 years ago

OgnjenMilicevic commented 3 years ago

Unless I configured something improperly, all output from stdout() is doubled when running with SLURM. Example pipeline:

version 1.0

# WORKFLOW DEFINITION
workflow WholeGenomeGermlineSingleSample {
  call SumFloats
  output {
    Float out = SumFloats.total_size
  }
}

task SumFloats {
  input {
    Array[Float] sizes = [1,2,3,4,5.0]
    Int preemptible_tries=3
  }

  command <<<
  python -c "print ~{sep="+" sizes}"
  >>>
  output {
    Float total_size = read_float(stdout())
  }
  runtime {
    docker: "us.gcr.io/broad-gotc-prod/python:2.7"
    preemptible: preemptible_tries
  }
}
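
For context, the ~{sep="+" sizes} placeholder should render the command to roughly the following (exact float formatting may differ):

python -c "print 1.0+2.0+3.0+4.0+5.0"

so the task's stdout is expected to contain a single 15.0.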

The error raised with cromwell-53 is:

Failed to read_float("/data/og/ted/cromwell-executions/WholeGenomeGermlineSingleSample/00090ef9-5211-4f18-9de9-daf3de791408/call-SumFloats/execution/stdout") (reason 1 of 1): For input string: "15.0 15.0"

The stdout file does indeed contain "15.0 15.0". Running with the Local backend produces no error. Contents of the conf file:

backend {
  default = "SLURM"
  providers {
    Local {
      actor-factory = "cromwell.backend.impl.sfs.config.ConfigBackendLifecycleActorFactory"
      config {
        include required(classpath("reference_local_provider_config.inc.conf"))
        concurrent-job-limit = 30
      }
    }
    SLURM {
      actor-factory = "cromwell.backend.impl.sfs.config.ConfigBackendLifecycleActorFactory"
      config {
        runtime-attributes = """
        Int runtime_minutes = 600
        Int cpu = 1
        Int requested_memory_mb_per_core = 8000
        Int memory_mb = 4000
        String queue = "short"
        String? docker
        """

        submit = """
            sbatch -J ${job_name} -D ${cwd} -o ${out} -e ${err} -t ${runtime_minutes} -p ${queue} \
            ${"-c " + cpu} \
            --mem ${memory_mb} \
            --wrap "/bin/bash ${script}"
        """

        submit-docker = """
            docker pull ${docker}

            sbatch -J ${job_name} -D ${cwd} -o ${cwd}/execution/stdout -e ${cwd}/execution/stderr -t ${runtime_minutes} -p ${queue} \
            ${"-c " + cpu} \
            --mem ${memory_mb} \
            --wrap "docker run -v ${cwd}:${docker_cwd} ${docker} ${job_shell} ${docker_cwd}/execution/script"
        """

        kill = "scancel ${job_id}"
        check-alive = "scontrol show job ${job_id}"
        job-id-regex = "Submitted batch job (\\d+).*"
      }
    }
  }
}

Any thoughts?

SauDan commented 1 year ago

I find that the wrapper bash script (.../execution/script) that Cromwell generates captures stdout and stderr in a convoluted way, roughly like this (excerpt; variable names and paths are approximate):
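
# Excerpt from the generated .../execution/script (approximate; variable
# names and paths simplified). Cromwell tees the task's output both into
# the execution/stdout file and onto the script's own stdout:
out="${tmpDir}/out.$$" err="${tmpDir}/err.$$"
mkfifo "$out" "$err"
trap 'rm "$out" "$err"' EXIT
touch .../execution/stdout .../execution/stderr
tee .../execution/stdout < "$out" &
tee .../execution/stderr < "$err" >&2 &
(
  # the task's command block runs here
) > "$out" 2> "$err"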

The tee writes one copy directly into .../execution/stdout, while its own stdout, which sbatch -o ${out} captures into the very same file, carries the second copy. So both copies generated by "tee" end up in .../execution/stdout and the output is duplicated! This causes problems with subsequent steps in the WDL script.
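
The effect is easy to reproduce outside Cromwell (illustrative only, not Cromwell's actual code):

out=stdout.txt
: > "$out"
# tee appends one copy into the file; its stdout -- standing in here for
# sbatch's -o capture, which points at the SAME file -- appends the other
echo "15.0" | tee -a "$out" >> "$out"
cat "$out"   # prints 15.0 twice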

To work around this, I've changed the -o and -e options to:

-o ${out}.slurm -e ${err}.slurm

noting that ${out} has the same value as ${cwd}/execution/stdout in my environment.
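
Applied to the SLURM submit block from the config above, the workaround looks roughly like this; Slurm's own capture then goes to the .slurm files, while Cromwell's tee still writes ${out} and ${err} exactly once:

submit = """
    sbatch -J ${job_name} -D ${cwd} -o ${out}.slurm -e ${err}.slurm -t ${runtime_minutes} -p ${queue} \
    ${"-c " + cpu} \
    --mem ${memory_mb} \
    --wrap "/bin/bash ${script}"
"""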