Open amtseng opened 3 years ago
Looked at two files but can't find any helpful information for debugging.
It looks like Cromwell received a SIGTERM
and shut itself down gracefully.
2021-04-06 19:33:06,677 ERROR - Timed out trying to gracefully stop WorkflowStoreActor. Forcefully stopping it.
Can you upgrade Caper (which includes a Cromwell version upgrade, 52->59) and try again? Please follow the upgrade instructions in Caper's release notes.
$ pip3 install autouri caper --upgrade
I'll give that a try and report back. Thanks, Jin!
I've upgraded Caper/Cromwell (and verified the version update). Running on the same 25 jobs, I still get the exact same errors, and the Caper server crashes.
I then tried running just one job. Intriguingly, it succeeded! That suggests to me that either a subset of the jobs is crashing and taking the entire Caper server (and the other jobs) down with it, or simply that having too many jobs at a time is causing trouble...
Very strange! Any ideas? In the meantime, I'm going to try running a few more jobs on their own and see how that goes...
How did you run the server? Did you use Caper's shell script to create a server instance? https://github.com/ENCODE-DCC/caper/tree/master/scripts/gcp_caper_server
I started the server using this command in a tmux session:
caper server --port 8000 --gcp-loc-dir=gs://caper_out/amtseng/.caper_tmp --gcp-out-dir gs://caper_out/amtseng/
That command line looks good, as long as your Google user account has sufficient permissions for GCE, GCS, the Google Life Sciences API, and so on.
Why don't you use a configuration file, ~/.caper/default.conf? You can generate a good template for it by running the following:
# This will overwrite the existing conf file. Please make a backup if you need it.
$ caper init gcp
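For reference, here is a minimal sketch of what the resulting ~/.caper/default.conf might contain for a GCP backend. The values are placeholders, and the key names follow the flags used earlier in this thread (gcp-prj, gcp-out-dir, gcp-loc-dir); check the template that `caper init gcp` actually writes on your version:

```ini
# Sketch of ~/.caper/default.conf for a GCP backend.
# All values below are placeholders; adjust to your project and bucket.
backend=gcp
gcp-prj=your-gcp-project-id
gcp-out-dir=gs://your-bucket/caper_out
gcp-loc-dir=gs://your-bucket/.caper_tmp
```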
BTW, I strongly recommend using the above shell script, because ENCODE DCC runs thousands of pipelines without any problem on instances created by that script.
I'm not sure whether you have a service account with the correct permission settings. Please use the above script.
I generated the default configuration file using caper init gcp, specifying only the gcp-prj and gcp-out-dir fields. I also started a Caper server using just caper server in a tmux session.
Caper still crashed, although the logs now contain not only the "pipeline dependencies not found" error but also java.lang.OutOfMemoryError: GC overhead limit exceeded errors.
I've attached cromwell.out and an example workflow log, again.
cromwell.out.txt workflow.225a8edd-5ee7-45c2-b77f-d5123797d313.log.txt
It looks like a Java memory issue?
java.sql.SQLException: java.lang.OutOfMemoryError: GC overhead limit exceeded
That's why I recommend the shell script. That script will create an instance with enough memory, and all Caper settings are configured automatically.
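If recreating the instance isn't an option right away, one thing worth trying is raising the Cromwell server's JVM heap in the Caper conf file. A sketch, assuming your Caper version supports a server heap key (the key name below is an assumption; check `caper server --help` for the equivalent flag on your version):

```ini
# In ~/.caper/default.conf: raise the Cromwell server's JVM heap, which
# can help with "GC overhead limit exceeded" under many concurrent workflows.
# Key name is an assumption; verify it against your Caper version.
java-heap-server=16G
```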
Ah, I'm sorry. I misunderstood which script you were referring to. I'll try to create an instance using create_instance.sh instead of the pre-existing instance we have in the lab.
Describe the bug
I've submitted a good number (25) of ChIP-seq jobs to Caper, and the jobs begin running, but somehow, halfway through, the Caper server dies suddenly. Examining the logs and grepping for "error", I find that all of the job logs (in cromwell-workflow-logs/) contain "Error: pipeline dependencies not found". I have consulted Issue #172, but I have verified that I activated the encode-chip-seq-pipeline environment both when launching the Caper server and when submitting the jobs. I am also experiencing these issues on GCP, not on macOS, so I felt it was prudent to create a new issue for this.
OS/Platform
Caper configuration file
Input JSON file
Here, I'm showing one of the 25 jobs submitted.
Troubleshooting result
Unfortunately, because the Caper server dies, I am unable to use caper troubleshoot {jobID} to diagnose. Instead, I've attached the Cromwell log for the job. The end of this log is:
I've also attached cromwell.out.
workflow.3d1cb136-9b32-4514-9a33-3262d8303d6f.log
cromwell.out
Thanks!