ENCODE-DCC / chip-seq-pipeline2

ENCODE ChIP-seq pipeline

Pipeline on GCP fails with "Error: pipeline dependencies not found" #224

Open amtseng opened 3 years ago

amtseng commented 3 years ago

Describe the bug

I've submitted a good number (25) of ChIP-seq jobs to Caper, and the jobs begin running, but somehow halfway through, the Caper server dies suddenly. Examining the logs and grepping for "error", I find that all of the job logs (in cromwell-workflow-logs/) contain "Error: pipeline dependencies not found".

I have consulted Issue #172, but I have verified that I activated the encode-chip-seq-pipeline environment both when launching the Caper server and when submitting the jobs. I am also experiencing these issues on GCP, not on macOS, so I felt it was prudent to create a new issue for this.
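
For concreteness, this is roughly how I'm launching the server and submitting jobs (chip.wdl is the pipeline's WDL from this repo; input.json here is just a stand-in for each job's actual input file, shown below):

$ conda activate encode-chip-seq-pipeline
$ caper server --port 8000
$ caper submit chip.wdl -i input.json --port 8000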

OS/Platform

GCP

Caper configuration file

backend=gcp
gcp-prj=gbsc-gcp-lab-kundaje
tmp-dir=/data/tmp_amtseng
singularity-cachedir=/data/singularity_cachedir_amtseng
file-db=/data/caper_db/caper_file_db_amtseng
db-timeout=120000
max-concurrent-tasks=1000
max-concurrent-workflows=50
use-google-cloud-life-sciences=True
gcp-region=us-central1

Input JSON file

Here, I'm showing one of the 25 jobs submitted.

{
  "chip.title": "A549_cJun_FLAG cells untreated",
  "chip.description": "A549_cJun_FLAG cells untreated",

  "chip.pipeline_type": "tf",

  "chip.aligner": "bowtie2",
  "chip.align_only": false,
  "chip.true_rep_only": false,

  "chip.genome_tsv": "https://storage.googleapis.com/encode-pipeline-genome-data/genome_tsv/v3/hg38.tsv",

  "chip.paired_end": false,
  "chip.ctl_paired_end": false,

  "chip.always_use_pooled_ctl": true,

  "chip.align_cpu": 4,
  "chip.call_peak_cpu": 4,

  "chip.fastqs_rep1_R1": [
    "gs://caper_in/amtseng/AP1/fastqs/SRR12090532.fastq.gz"
  ],
  "chip.fastqs_rep2_R1": [
    "gs://caper_in/amtseng/AP1/fastqs/SRR12090533.fastq.gz"
  ],
  "chip.fastqs_rep3_R1": [
    "gs://caper_in/amtseng/AP1/fastqs/SRR12090534.fastq.gz"
  ],

  "chip.ctl_fastqs_rep1_R1": [
    "gs://caper_in/amtseng/AP1/fastqs/SRR12090601.fastq.gz"
  ],
  "chip.ctl_fastqs_rep2_R1": [
    "gs://caper_in/amtseng/AP1/fastqs/SRR12090602.fastq.gz"
  ],
  "chip.ctl_fastqs_rep3_R1": [
    "gs://caper_in/amtseng/AP1/fastqs/SRR12090603.fastq.gz"
  ]
}

Troubleshooting result

Unfortunately, because the Caper server dies, I am unable to use caper troubleshoot {jobID} to diagnose. Instead, I've attached the Cromwell log for this job, along with cromwell.out:

workflow.3d1cb136-9b32-4514-9a33-3262d8303d6f.log
cromwell.out

Thanks!

leepc12 commented 3 years ago

I looked at the two files but couldn't find any helpful information for debugging. It looks like Cromwell got a SIGTERM and gracefully shut itself down:

2021-04-06 19:33:06,677  ERROR - Timed out trying to gracefully stop WorkflowStoreActor. Forcefully stopping it.

Can you upgrade Caper (which includes a Cromwell version upgrade, 52 -> 59) and try again? Please follow the upgrade instructions in Caper's release notes.

$ pip3 install autouri caper --upgrade
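
To confirm the upgrade took effect, you can check the installed versions afterwards (assuming caper is on your PATH):

$ caper -v
$ pip3 show caper
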
amtseng commented 3 years ago

I'll give that a try and report back. Thanks, Jin!

amtseng commented 3 years ago

I've upgraded Caper/Cromwell (and verified the version update). Running on the same 25 jobs, I still get the exact same errors, and the Caper server crashes.

I then tried running just one job. Intriguingly, it succeeded! That suggests to me that either a subset of the jobs is crashing and taking the entire Caper server (and the other jobs) down with it, or simply having too many jobs at once is causing trouble...

Very strange! Any ideas? In the meantime, I'm going to try running a few more jobs on their own and see how that goes...

leepc12 commented 3 years ago

How did you run the server? Did you use Caper's shell script to make a server instance? https://github.com/ENCODE-DCC/caper/tree/master/scripts/gcp_caper_server

amtseng commented 3 years ago

I started the server using this command in a tmux session:

caper server --port 8000 --gcp-loc-dir=gs://caper_out/amtseng/.caper_tmp --gcp-out-dir gs://caper_out/amtseng/

leepc12 commented 3 years ago

That command line looks good if your Google user account settings have enough permissions for GCE, GCS, the Google Life Sciences API, and so on.

Why don't you use the configuration file ~/.caper/default.conf? You can create a good template for it by running the following:

# This will overwrite the existing conf file. Please make a backup if you need one.
$ caper init gcp
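
After running it, edit ~/.caper/default.conf. A minimal GCP setup looks something like this (values taken from the config you posted above; I'm assuming the conf keys mirror the command-line flags):

backend=gcp
gcp-prj=gbsc-gcp-lab-kundaje
gcp-region=us-central1
gcp-out-dir=gs://caper_out/amtseng/
gcp-loc-dir=gs://caper_out/amtseng/.caper_tmp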

BTW, I strongly recommend using the above shell script, because ENCODE DCC runs thousands of pipelines without any problem on instances created by it.

I'm not sure if you have a service account with the correct permission settings. Please use the above script.

amtseng commented 3 years ago

I generated the default configuration file using caper init gcp, specifying only the gcp-prj and gcp-out-dir fields, and started a Caper server with just caper server in a tmux session. Caper still crashed, although the logs now contain not only the "pipeline dependencies not found" error but also java.lang.OutOfMemoryError: GC overhead limit exceeded errors.

I've attached cromwell.out and an example workflow log, again.

cromwell.out.txt
workflow.225a8edd-5ee7-45c2-b77f-d5123797d313.log.txt

leepc12 commented 3 years ago

It looks like a Java memory issue:

java.sql.SQLException: java.lang.OutOfMemoryError: GC overhead limit exceeded

That's why I recommend the shell script. It will create an instance with enough memory, and all Caper settings are configured automatically.
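
If you want to keep using your current instance in the meantime, one possible workaround (assuming a recent Caper version, which should have a flag for the server's Java heap size) is to give the server JVM a larger heap, e.g.:

$ caper server --java-heap-server 16G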

amtseng commented 3 years ago

Ah, I'm sorry. I misunderstood which script you were referring to. I'll try to create an instance using create_instance.sh instead of the pre-existing instance we have in the lab.
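
For the record, the invocation I'm planning is along these lines (argument order assumed from the script's README; I'll double-check against its usage message before running it):

$ bash create_instance.sh [INSTANCE_NAME] [GCP_PRJ] [GCP_SERVICE_ACCOUNT_KEY_JSON_FILE] [GCP_OUT_DIR]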