Closed — Baharehh closed this issue 4 years ago.
Change directory into your output directory (the out-dir, or the CWD where you ran the pipelines), then search for it:
$ cd [YOUR_OUTPUT_DIR]
$ find . -name "metadata.json"
Of course I did, but it doesn't find it. I also searched manually and cannot find it. I need to find it and post it for you so you can tell me why all the runs are unfinished. I attached my input JSON. This example file ran with the old pipeline, but now it doesn't run ...
test-tmuxPath-BH-iN-atac-s1s2-auto-adapto-true-remove-Jin2.txt
Please post a full screen log from caper run.
I am copy-pasting part of the output from an ongoing ATAC pipeline run that seems to be halted. I am also attaching the metadata.json from a ChIP run that didn't run with the new pipeline. Thanks! metadata.txt
2020-02-27 07:16:50,343 cromwell-system-akka.dispatchers.backend-dispatcher-93 INFO - BackgroundConfigAsyncJobExecutionActor [UUID(6ce55435)atac.align_mito:1:1]: executing: /bin/bash /farmshare/user_data/baharehh/atac-seq-pipeline/atac/6ce55435-e762-4b44-8924-61cb6e8de264/call-align_mito/shard-1/execution/script # if [ -z "$SINGULARITY_BINDPATH" ]; then export SINGULARITY_BINDPATH=; fi; if [ -z "$SINGULARITY_CACHEDIR" ]; then export SINGULARITY_CACHEDIR=; fi; singularity exec --cleanenv --home /farmshare/user_data/baharehh/atac-seq-pipeline/atac/6ce55435-e762-4b44-8924-61cb6e8de264/call-align_mito/shard-1 /bin/bash /farmshare/user_data/baharehh/atac-seq-pipeline/atac/6ce55435-e762-4b44-8924-61cb6e8de264/call-align_mito/shard-1/execution/script
2020-02-27 07:16:50,389 cromwell-system-akka.dispatchers.backend-dispatcher-96 INFO - BackgroundConfigAsyncJobExecutionActor [UUID(6ce55435)atac.align_mito:0:1]: executing: /bin/bash /farmshare/user_data/baharehh/atac-seq-pipeline/atac/6ce55435-e762-4b44-8924-61cb6e8de264/call-align_mito/shard-0/execution/script # if [ -z "$SINGULARITY_BINDPATH" ]; then export SINGULARITY_BINDPATH=; fi; if [ -z "$SINGULARITY_CACHEDIR" ]; then export SINGULARITY_CACHEDIR=; fi; singularity exec --cleanenv --home /farmshare/user_data/baharehh/atac-seq-pipeline/atac/6ce55435-e762-4b44-8924-61cb6e8de264/call-align_mito/shard-0 /bin/bash /farmshare/user_data/baharehh/atac-seq-pipeline/atac/6ce55435-e762-4b44-8924-61cb6e8de264/call-align_mito/shard-0/execution/script
2020-02-27 07:16:52,533 cromwell-system-akka.dispatchers.backend-dispatcher-100 INFO - BackgroundConfigAsyncJobExecutionActor [UUID(6ce55435)atac.align_mito:1:1]: job id: 26212
2020-02-27 07:16:52,538 cromwell-system-akka.dispatchers.backend-dispatcher-93 INFO - BackgroundConfigAsyncJobExecutionActor [UUID(6ce55435)atac.align:0:1]: job id: 26193
2020-02-27 07:16:52,773 cromwell-system-akka.dispatchers.backend-dispatcher-93 INFO - BackgroundConfigAsyncJobExecutionActor [UUID(6ce55435)atac.align_mito:1:1]: Status change from - to WaitingForReturnCode
2020-02-27 07:16:52,782 cromwell-system-akka.dispatchers.backend-dispatcher-94 INFO - BackgroundConfigAsyncJobExecutionActor [UUID(6ce55435)atac.align:0:1]: Status change from - to WaitingForReturnCode
2020-02-27 07:16:57,521 cromwell-system-akka.dispatchers.backend-dispatcher-91 INFO - BackgroundConfigAsyncJobExecutionActor [UUID(6ce55435)atac.align_mito:0:1]: job id: 26234
2020-02-27 07:16:58,004 cromwell-system-akka.dispatchers.backend-dispatcher-91 INFO - BackgroundConfigAsyncJobExecutionActor [UUID(6ce55435)atac.align_mito:0:1]: Status change from - to WaitingForReturnCode
2020-02-27 09:15:51,700 cromwell-system-akka.dispatchers.backend-dispatcher-195 INFO - BackgroundConfigAsyncJobExecutionActor [UUID(6ce55435)atac.align_mito:0:1]: Status change from WaitingForReturnCode to Done
2020-02-27 09:30:00,972 cromwell-system-akka.dispatchers.backend-dispatcher-196 INFO - BackgroundConfigAsyncJobExecutionActor [UUID(6ce55435)atac.align_mito:1:1]: Status change from WaitingForReturnCode to Done
caper run automatically runs a troubleshooter before closing, which parses metadata.json to find the reason for a failure. Your metadata.json says that:
"failures": [
{
"message": "Workflow failed",
"causedBy": [
{
"message": "Job chip.call_peak_pr1:1:2 exited with return code 1 which has not been declared as a valid return code. See 'continueOnReturnCode' runtime attribute for more details.",
"causedBy": []
},
{
"message": "Job chip.call_peak_ppr1:NA:2 exited with return code 1 which has not been declared as a valid return code. See 'continueOnReturnCode' runtime attribute for more details.",
"causedBy": []
},
{
"message": "Job chip.call_peak_pr2:0:2 exited with return code 1 which has not been declared as a valid return code. See 'continueOnReturnCode' runtime attribute for more details.",
"causedBy": []
},
{
"message": "Job chip.macs2_signal_track_pooled:NA:2 exited with return code 1 which has not been declared as a valid return code. See 'continueOnReturnCode' runtime attribute for more details.",
"causedBy": []
},
{
"causedBy": [],
"message": "Job chip.call_peak:0:2 exited with return code 1 which has not been declared as a valid return code. See 'continueOnReturnCode' runtime attribute for more details."
},
{
"causedBy": [],
"message": "Job chip.call_peak:1:2 exited with return code 1 which has not been declared as a valid return code. See 'continueOnReturnCode' runtime attribute for more details."
}
]
}
],
Look at the following STDERR files and check what happened.
macs2: /farmshare/user_data/baharehh/chip-seq-pipeline2/chip/90144a57-b1fc-4f25-a8cb-3885695d9eea/call-macs2_signal_track_pooled/execution/stderr
call-peak: /farmshare/user_data/baharehh/chip-seq-pipeline2/chip/90144a57-b1fc-4f25-a8cb-3885695d9eea/call-call_peak_ppr1/execution/stderr
I'm attaching stderr from a run that failed but progressed far enough to produce these 2 files; the run above didn't produce them. stderr-2.txt stderr.txt
Please take a look. It doesn't give me a reason why it failed.
I think nth=4 is a small number; could we increase the core count, say up to 8? I know I have ~16 in my system ...?
No, increase memory instead of CPUs (nth). Find the default memory for the following variables in the input documentation and try doubling them in your input JSON:
atac.call_peak_mem_mb
atac.macs2_signal_track_mem_mb
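As a sketch, the doubled values go straight into the input JSON alongside the existing parameters; the numbers below are illustrative placeholders, not the documented defaults, so look up the real defaults in the input documentation first:

```json
{
  "atac.call_peak_mem_mb": 32000,
  "atac.macs2_signal_track_mem_mb": 32000
}
```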
Thanks Jin, I fixed the memory issue, but the ATAC pipeline has still been halted since 2 pm with the "partial" output below (the ChIP pipeline also halted):
2020-02-27 14:21:36,524 cromwell-system-akka.dispatchers.backend-dispatcher-101 INFO - BackgroundConfigAsyncJobExecutionActor [UUID(0d355764)atac.align:0:1]: job id: 4567
2020-02-27 14:21:36,524 cromwell-system-akka.dispatchers.backend-dispatcher-102 INFO - BackgroundConfigAsyncJobExecutionActor [UUID(0d355764)atac.align_mito:1:1]: job id: 4594
2020-02-27 14:21:36,626 cromwell-system-akka.dispatchers.backend-dispatcher-97 INFO - BackgroundConfigAsyncJobExecutionActor [UUID(0d355764)atac.align_mito:0:1]: executing: /bin/bash /farmshare/user_data/baharehh/atac-seq-pipeline/atac/0d355764-5dc3-4791-ac03-b4155b727957/call-align_mito/shard-0/execution/script # if [ -z "$SINGULARITY_BINDPATH" ]; then export SINGULARITY_BINDPATH=; fi; if [ -z "$SINGULARITY_CACHEDIR" ]; then export SINGULARITY_CACHEDIR=; fi; singularity exec --cleanenv --home /farmshare/user_data/baharehh/atac-seq-pipeline/atac/0d355764-5dc3-4791-ac03-b4155b727957/call-align_mito/shard-0 /bin/bash /farmshare/user_data/baharehh/atac-seq-pipeline/atac/0d355764-5dc3-4791-ac03-b4155b727957/call-align_mito/shard-0/execution/script
2020-02-27 14:21:36,692 cromwell-system-akka.dispatchers.backend-dispatcher-106 INFO - BackgroundConfigAsyncJobExecutionActor [UUID(0d355764)atac.align_mito:1:1]: Status change from - to WaitingForReturnCode
2020-02-27 14:21:36,692 cromwell-system-akka.dispatchers.backend-dispatcher-99 INFO - BackgroundConfigAsyncJobExecutionActor [UUID(0d355764)atac.align:0:1]: Status change from - to WaitingForReturnCode
2020-02-27 14:21:41,524 cromwell-system-akka.dispatchers.backend-dispatcher-106 INFO - BackgroundConfigAsyncJobExecutionActor [UUID(0d355764)atac.align_mito:0:1]: job id: 4627
2020-02-27 14:21:41,526 cromwell-system-akka.dispatchers.backend-dispatcher-97 INFO - BackgroundConfigAsyncJobExecutionActor [UUID(0d355764)atac.align_mito:0:1]: Status change from - to WaitingForReturnCode
What is the cluster engine of Rice? Is it based on SLURM?
It's FarmShare. Here is the last part of a run that did not finish (even after increasing the memory), with stdout, the metadata.json and the input.json. Can you please let me know what you think the issue is? Thanks!
INFO 2020-02-27 22:50:55 MarkDuplicates Tracking 12 as yet unmatched pairs. 12 records in RAM.
INFO 2020-02-27 22:51:07 MarkDuplicates Read 12,000,000 records. Elapsed time: 00:06:36s. Time for last 1,000,000: 12s. Last read position: chr10:121,485,559
INFO 2020-02-27 22:51:07 MarkDuplicates Tracking 484 as yet unmatched pairs. 484 records in RAM.
/bin/bash: line 1: 26648 Killed java -Xmx40000M -XX:ParallelGCThreads=1 -jar /home/baharehh/atac_dnase_pipelines/yes/envs/encode-atac-seq-pipeline/share/picard-2.20.7-0/picard.jar MarkDuplicates INPUT=hs_bh_s2.R1.trim.filt.bam OUTPUT=hs_bh_s2.R1.trim.dupmark.bam METRICS_FILE=hs_bh_s2.R1.trim.dup.qc VALIDATION_STRINGENCY=LENIENT USE_JDK_DEFLATER=TRUE USE_JDK_INFLATER=TRUE ASSUME_SORTED=true REMOVE_DUPLICATES=false STDOUT=
stdout.txt metadata.txt template.full-all-no-apator-allMemor-increase.txt call-alighn-stdout.txt
Did you run it on a login node? Then it will fail. You have to be on a compute node (with enough memory, like the 40G that you requested) to run pipelines in local mode (you did caper init local, right?).
Oh, is that because I increased some of the memory settings to 32 in some parts? Yes, I am on the login node and don't know the memory. "cat /proc/meminfo" gives the below:
Shmem: 326752 kB
Slab: 7024352 kB
SReclaimable: 3580420 kB
SUnreclaim: 3443932 kB
KernelStack: 11648 kB
PageTables: 67136 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 40343884 kB
Committed_AS: 14258576 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 0 kB
VmallocChunk: 0 kB
HardwareCorrupted: 0 kB
AnonHugePages: 0 kB
ShmemHugePages: 0 kB
ShmemPmdMapped: 0 kB
CmaTotal: 0 kB
CmaFree: 0 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
DirectMap4k: 7165312 kB
DirectMap2M: 43157504 kB
and "free" gives the below:
              total        used        free      shared  buff/cache   available
Mem:       49443488     7450964    11716904      326724    30275620    37658480
Swap:      15622140      567800    15054344
How much memory can I increase to in my system? It seems like I can go up to 40G ...?!
Thanks
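For reference, the totals in the free output above are in kB, so a quick sanity check of the numbers pasted in this thread (a sketch using the Mem total figure):

```shell
# convert the kB total reported by free/meminfo to GB (integer division)
mem_kb=49443488
mem_gb=$((mem_kb / 1024 / 1024))
echo "${mem_gb} GB total"
```

That is roughly 47 GB of physical memory on the node, so a 40G Java heap plus everything else running on a shared login node leaves very little headroom.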
One thing is for sure: your cluster Rice killed the pipelines due to OOM (out-of-memory).
I don't know how your cluster Rice works, so I don't have much advice for you here. Just don't run anything on a login node; log in to a compute node and run it there. Ask your cluster admin how to do that.
caper run will take up to 9 x atac.call_peak_mem_mb of memory because it runs all call_peak tasks in parallel. So serialize them by using caper run .. --max-concurrent-tasks=1. Then your pipeline will take at most atac.call_peak_mem_mb of memory per pipeline.
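To put rough numbers on that (a sketch; 16000 MB is an assumed placeholder for the default, not the documented value):

```shell
# worst-case memory if up to 9 call_peak-type tasks run in parallel,
# versus one at a time with --max-concurrent-tasks=1
call_peak_mem_mb=16000   # placeholder; look up the real default in the input docs
n_parallel=9
echo "parallel: $(( call_peak_mem_mb * n_parallel / 1024 )) GB"
echo "serialized: $(( call_peak_mem_mb / 1024 )) GB"
```

With these placeholder values the parallel worst case is ~140 GB versus ~15 GB serialized, which is why serializing matters on a memory-limited node.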
Sure, will do. If I want to resume with metadata.json, why does it give this error (I cd'd into the folder with this file and re-initiated the run correctly):
[Caper] Error (womtool): WDL or input JSON is invalid or input JSON doesn't exist.
That means your input JSON has wrong parameters or it doesn't exist.
But the output JSON file comes from the pipeline itself, and now I have to somehow figure out where the pipeline's error is in its own output so I can re-initiate the run?! It's hard to fix ... metadata.txt
Bahareh
If a pipeline fails, then caper run automatically debugs it (by parsing metadata.json). You just need to look at the last few lines on your screen to find the failure reason.
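If the screen log has scrolled away, the same failure reasons can be pulled out of metadata.json directly; a minimal sketch with grep (the sample file below just mimics the "failures"/"causedBy" structure quoted earlier in this thread):

```shell
# write a tiny sample file with the same structure as a Cromwell metadata.json
cat > metadata_sample.json <<'EOF'
{
  "failures": [
    {
      "message": "Workflow failed",
      "causedBy": [
        { "message": "Job chip.call_peak:0:2 exited with return code 1", "causedBy": [] }
      ]
    }
  ]
}
EOF
# pull every "message" field out of the JSON
grep -o '"message": *"[^"]*"' metadata_sample.json
```

On a real metadata.json this prints the top-level "Workflow failed" message plus each per-job cause, which is usually enough to know which task's stderr to open next.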
@Baharehh Just a note: Stanford's rice server is a login node that, as Jin mentioned, is not intended for intensive compute tasks like the ENCODE pipelines. FarmShare2 uses a SLURM job submission system, so I think you should be able to log into rice and follow ENCODE's instructions to initiate workflows via SLURM: https://github.com/ENCODE-DCC/caper#running-pipelines-on-slurm-clusters. From what I can tell from the FarmShare2 documentation (https://srcc.stanford.edu/farmshare2), using SLURM to submit jobs from rice will actually use compute nodes on the wheat server. If you continue to have issues, it may be productive to email research-computing-support@stanford.edu with your question and include "FarmShare2" in the subject line. Support may be limited because FarmShare2 is intended for Stanford coursework and unfunded research.
Hi Jin, I was going to post my question and ask for your help with multiple runs that are failing, but I cannot find metadata.json. After updating the ChIP and ATAC pipelines, out of 20 runs at Stanford rice, only 1 of each ran (partially), and as I mentioned I cannot find the metadata in almost all of them.
Can you please help ... thank you. B