broadinstitute / wdl-runner

Easily run WDL workflows on GCP
BSD 3-Clause "New" or "Revised" License

Wdl-runner hanging on Whole Genome Sequence test #21

Open johan-grupodot opened 3 years ago

Hello Broad Institute team,

I have been following this tutorial to run GATK Best Practices in Google Cloud Platform using Google Cloud Life Sciences:

https://cloud.google.com/life-sciences/docs/tutorials/gatk

This tutorial runs the workflow PairedEndSingleSampleWf.wdl that can be found in:

https://github.com/gatk-workflows/broad-prod-wgs-germline-snps-indels

Without any changes, the workflow finished successfully. However, I need to run the same workflow on Whole Genome Sequencing (WGS) data. To do this, I changed the key "PairedEndSingleSampleWorkflow.flowcell_unmapped_bams" in the input file "PairedEndSingleSampleWf.hg38.inputs.json" to point to the WGS read groups in the Broad Institute public bucket:

"PairedEndSingleSampleWorkflow.flowcell_unmapped_bams": [
  "gs://broad-public-datasets/NA12878/unmapped/HJYFJCCXX.4.Pond-492100.unmapped.bam",
  "gs://broad-public-datasets/NA12878/unmapped/HJYFJCCXX.5.Pond-492100.unmapped.bam",
  "gs://broad-public-datasets/NA12878/unmapped/HJYFJCCXX.6.Pond-492100.unmapped.bam",
  "gs://broad-public-datasets/NA12878/unmapped/HJYFJCCXX.7.Pond-492100.unmapped.bam",
  "gs://broad-public-datasets/NA12878/unmapped/HJYFJCCXX.8.Pond-492100.unmapped.bam",
  "gs://broad-public-datasets/NA12878/unmapped/HJYN2CCXX.1.Pond-492100.unmapped.bam",
  "gs://broad-public-datasets/NA12878/unmapped/HK35MCCXX.1.Pond-492100.unmapped.bam",
  "gs://broad-public-datasets/NA12878/unmapped/HK35MCCXX.2.Pond-492100.unmapped.bam",
  "gs://broad-public-datasets/NA12878/unmapped/HK35MCCXX.3.Pond-492100.unmapped.bam",
  "gs://broad-public-datasets/NA12878/unmapped/HK35MCCXX.4.Pond-492100.unmapped.bam",
  "gs://broad-public-datasets/NA12878/unmapped/HK35MCCXX.5.Pond-492100.unmapped.bam",
  "gs://broad-public-datasets/NA12878/unmapped/HK35MCCXX.6.Pond-492100.unmapped.bam",
  "gs://broad-public-datasets/NA12878/unmapped/HK35MCCXX.7.Pond-492100.unmapped.bam",
  "gs://broad-public-datasets/NA12878/unmapped/HK35MCCXX.8.Pond-492100.unmapped.bam",
  "gs://broad-public-datasets/NA12878/unmapped/HK35NCCXX.1.Pond-492100.unmapped.bam",
  "gs://broad-public-datasets/NA12878/unmapped/HK35NCCXX.2.Pond-492100.unmapped.bam",
  "gs://broad-public-datasets/NA12878/unmapped/HK3T5CCXX.1.Pond-492100.unmapped.bam",
  "gs://broad-public-datasets/NA12878/unmapped/HK3T5CCXX.2.Pond-492100.unmapped.bam",
  "gs://broad-public-datasets/NA12878/unmapped/HK3T5CCXX.3.Pond-492100.unmapped.bam",
  "gs://broad-public-datasets/NA12878/unmapped/HK3T5CCXX.4.Pond-492100.unmapped.bam",
  "gs://broad-public-datasets/NA12878/unmapped/HK3T5CCXX.5.Pond-492100.unmapped.bam",
  "gs://broad-public-datasets/NA12878/unmapped/HK3T5CCXX.6.Pond-492100.unmapped.bam",
  "gs://broad-public-datasets/NA12878/unmapped/HK3T5CCXX.7.Pond-492100.unmapped.bam",
  "gs://broad-public-datasets/NA12878/unmapped/HK3T5CCXX.8.Pond-492100.unmapped.bam"
]

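For anyone reproducing this, editing a 24-entry list by hand is easy to get wrong, so here is a minimal Python sketch of making the same change programmatically. The key name is the one from the inputs file above; the function names, and the check that every entry is a gs:// URI ending in .unmapped.bam, are my own assumptions for illustration and not part of wdl-runner or the tutorial.

```python
import json

# Key taken from PairedEndSingleSampleWf.hg38.inputs.json as quoted above.
INPUT_KEY = "PairedEndSingleSampleWorkflow.flowcell_unmapped_bams"


def set_unmapped_bams(inputs, bam_uris):
    """Return a copy of the inputs dict with the read-group BAM list replaced.

    The sanity checks are assumptions for this sketch: every entry must be a
    gs:// URI ending in .unmapped.bam, and no entry may appear twice.
    """
    bad = [u for u in bam_uris
           if not (u.startswith("gs://") and u.endswith(".unmapped.bam"))]
    if bad:
        raise ValueError("unexpected BAM URIs: %s" % bad)
    if len(set(bam_uris)) != len(bam_uris):
        raise ValueError("duplicate BAM URIs in the list")
    updated = dict(inputs)  # shallow copy so the caller's dict is untouched
    updated[INPUT_KEY] = list(bam_uris)
    return updated


def rewrite_inputs_file(path, bam_uris):
    """Load the inputs JSON at `path`, replace the BAM list, write it back."""
    with open(path) as fh:
        inputs = json.load(fh)
    with open(path, "w") as fh:
        json.dump(set_unmapped_bams(inputs, bam_uris), fh, indent=2)
```

All other keys in the inputs file are left as they were, which matches the change described above: only the read-group list differs from the tutorial run.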
First attempt (Failed)

I ran the workflow with the new input file, leaving the options file and the WDL file unchanged. It failed in the task ValidateSamFile:


2020-12-03 10:22:42,549 cromwell-system-akka.dispatchers.backend-dispatcher-17814 INFO  - PipelinesApiAsyncBackendJobExecutionActor [UUID(1abfc19a)PairedEndSingleSampleWorkflow.ValidateCram:NA:3]: Status change from Running to Success
2020-12-03 10:22:45,913 cromwell-system-akka.dispatchers.engine-dispatcher-35 INFO  - WorkflowManagerActor Workflow 1abfc19a-905b-4632-b90c-4d1be258bc5b failed (during ExecutingWorkflowState): java.lang.Exception: The compute backend terminated the job. If this termination is unexpected, examine likely causes such as preemption, running out of disk or memory on the compute instance, or exceeding the backend's maximum job duration. 
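Since the failure message above does not say which call was terminated, one way to narrow it down is to scan the Cromwell log for the last reported status of every call. Below is a minimal Python sketch that assumes the log lines follow the "Status change from X to Y" format shown in the excerpt. The function names are hypothetical, and the set of terminal status names is an assumption: only Running and Success actually appear in my log.

```python
import re

# Matches lines like the one in the excerpt above:
#   ... [UUID(1abfc19a)PairedEndSingleSampleWorkflow.ValidateCram:NA:3]: Status change from Running to Success
STATUS_RE = re.compile(
    r"\[UUID\((?P<workflow>[0-9a-f]+)\)(?P<call>[\w.]+):(?P<shard>\w+):(?P<attempt>\d+)\]: "
    r"Status change from (?P<old>\w+) to (?P<new>\w+)"
)

# Assumption: statuses treated as final. Only "Success" is confirmed by the log above.
TERMINAL = {"Success", "Failed", "Cancelled"}


def latest_statuses(lines):
    """Map (call, shard) to the (attempt, status) of the last status line seen."""
    latest = {}
    for line in lines:
        m = STATUS_RE.search(line)
        if m:
            latest[(m["call"], m["shard"])] = (int(m["attempt"]), m["new"])
    return latest


def unfinished_calls(lines):
    """Return the (call, shard) pairs whose last reported status is not terminal."""
    return sorted(
        key for key, (_, status) in latest_statuses(lines).items()
        if status not in TERMINAL
    )
```

Running unfinished_calls over the full log (for example, over open("cromwell.log"), where the file name is hypothetical) would list the calls that were still in flight when the workflow was marked as failed.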

Debugging ValidateSamFile

Changing the pre-emptible attempts to zero, I ran ValidateSamFile alone inside another workflow and got the following error:

ValidateCramWorkflow.ValidateSamFile:NA:1 failed. The job was stopped before the command finished. PAPI error code 10. The assigned worker has failed to complete the operation

Then I changed the task ValidateSamFile to request more memory and to make no attempts on pre-emptible machines, and it completed successfully:

task ValidateSamFile {
  File input_bam
  File? input_bam_index
  String report_filename
  File ref_dict
  File ref_fasta
  File ref_fasta_index
  Int? max_output
  Array[String]? ignore
  Boolean? is_outlier_data
  Float disk_size
  Int preemptible_tries

  command {
    java -Xms10000m -Xmx10000m -jar /usr/gitc/picard.jar \
      ValidateSamFile \
      INPUT=${input_bam} \
      OUTPUT=${report_filename} \
      REFERENCE_SEQUENCE=${ref_fasta} \
      ${"MAX_OUTPUT=" + max_output} \
      IGNORE=${default="null" sep=" IGNORE=" ignore} \
      MODE=VERBOSE \
      ${default='SKIP_MATE_VALIDATION=false' true='SKIP_MATE_VALIDATION=true' false='SKIP_MATE_VALIDATION=false' is_outlier_data} \
      IS_BISULFITE_SEQUENCED=false
  }
  runtime {
    preemptible: preemptible_tries
    memory: "16 GB"
    disks: "local-disk " + sub(disk_size, "\\..*", "") + " HDD"
  }
  output {
    File report = "${report_filename}"
  }
}
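To make the memory change above easier to reason about, here is a small Python sketch that compares the JVM heap requested in the command (-Xmx) with the memory requested in the runtime block. It assumes the JVM "m" suffix means MiB and that "GB" in the runtime attribute is a decimal gigabyte; the second point is my assumption about how the backend interprets the unit. The function names are hypothetical.

```python
import re

# JVM size suffixes are binary multiples.
JVM_UNITS = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}
# Assumption: decimal units for MB/GB, binary units for MiB/GiB.
RUNTIME_UNITS = {"MB": 10 ** 6, "GB": 10 ** 9, "MiB": 2 ** 20, "GiB": 2 ** 30}


def jvm_max_heap_bytes(command):
    """Extract the -Xmx value (with a k/m/g suffix) from a java command line."""
    m = re.search(r"-Xmx(\d+)([kKmMgG])", command)
    if m is None:
        raise ValueError("no -Xmx option with a unit suffix found")
    return int(m.group(1)) * JVM_UNITS[m.group(2).lower()]


def runtime_memory_bytes(memory):
    """Parse a runtime memory string such as "16 GB" into bytes."""
    m = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*(MB|GB|MiB|GiB)\s*", memory)
    if m is None:
        raise ValueError("unrecognized memory string: %r" % memory)
    return int(float(m.group(1)) * RUNTIME_UNITS[m.group(2)])


def heap_headroom_bytes(command, memory):
    """Bytes left on the VM for everything outside the JVM heap."""
    return runtime_memory_bytes(memory) - jvm_max_heap_bytes(command)


if __name__ == "__main__":
    cmd = "java -Xms10000m -Xmx10000m -jar /usr/gitc/picard.jar ValidateSamFile"
    print(heap_headroom_bytes(cmd, "16 GB"))
```

Under these assumptions, the task above leaves roughly 5.5 GB between the 10000m heap and the 16 GB requested for the machine, which is the margin available to the JVM's non-heap memory and the rest of the container.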

Second attempt (Cancelled)

I applied the same changes to the ValidateSamFile task inside the complete workflow and ran it again.

The temporary outputs were written to Cloud Storage and were complete after 23 hours 15 minutes. When I checked 2 days later, the final output folder still did not contain the complete set of final output files (even the .g.vcf present among the temporary files was missing from it), and the initial Virtual Machine instance created by Google Cloud Life Sciences with wdl_runner was still running.

I had to kill the workflow because it kept consuming resources (the initial Virtual Machine instance was at 1% CPU). The log file was never created.

Questions

  1. Is there any way to troubleshoot a workflow that hangs without finishing?
  2. Do I need to make additional changes to the workflow to run it on complete WGS data?

Additional remarks

  1. The modified workflow ran successfully on the original inputs from the tutorial.
  2. After the 2 days, the wdl-runner monitoring tool gave the following message: Transitioning to next stage or copying final output

Thank you for your attention.

Greetings,

Johan