ENCODE-DCC / atac-seq-pipeline

ENCODE ATAC-seq pipeline

Cannot find metadata.json + runs don't finish at rice #222

Closed Baharehh closed 4 years ago

Baharehh commented 4 years ago

Hi Jin, I was going to post my question and ask for your help with multiple runs that are failing, but I cannot find metadata.json. After updating the chip and atac pipelines, out of 20 runs at Stanford rice, only 1 of each ran, and only partially, and as I mentioned I cannot find the metadata.json in almost all of them.

Can you please help? Thank you.

B

leepc12 commented 4 years ago

Change directory into your output directory (out-dir or CWD where you ran pipelines).

$ cd [YOUR_OUTPUT_DIR]
$ find . -name "metadata.json"
Baharehh commented 4 years ago

Of course I did, but it doesn't find it. I also searched manually and cannot find it. I need to find it and post it for you so you can tell me why all the runs are unfinished. I attached my input JSON. This example file ran with the old pipeline, but now it doesn't run...

test-tmuxPath-BH-iN-atac-s1s2-auto-adapto-true-remove-Jin2.txt

leepc12 commented 4 years ago

Please post a full screen log from caper run.

Baharehh commented 4 years ago

I am copy-pasting part of the output from an ongoing atac pipeline run which seems to be halted. I am also attaching metadata.json from a ChIP run that didn't run with the new pipeline. Thanks! metadata.txt

2020-02-27 07:16:50,343 cromwell-system-akka.dispatchers.backend-dispatcher-93 INFO - BackgroundConfigAsyncJobExecutionActor [UUID(6ce55435)atac.align_mito:1:1]: executing: /bin/bash /farmshare/user_data/baharehh/atac-seq-pipeline/atac/6ce55435-e762-4b44-8924-61cb6e8de264/call-align_mito/shard-1/execution/script # if [ -z "$SINGULARITY_BINDPATH" ]; then export SINGULARITY_BINDPATH=; fi; if [ -z "$SINGULARITY_CACHEDIR" ]; then export SINGULARITY_CACHEDIR=; fi; singularity exec --cleanenv --home /farmshare/user_data/baharehh/atac-seq-pipeline/atac/6ce55435-e762-4b44-8924-61cb6e8de264/call-align_mito/shard-1 /bin/bash /farmshare/user_data/baharehh/atac-seq-pipeline/atac/6ce55435-e762-4b44-8924-61cb6e8de264/call-align_mito/shard-1/execution/script
2020-02-27 07:16:50,389 cromwell-system-akka.dispatchers.backend-dispatcher-96 INFO - BackgroundConfigAsyncJobExecutionActor [UUID(6ce55435)atac.align_mito:0:1]: executing: /bin/bash /farmshare/user_data/baharehh/atac-seq-pipeline/atac/6ce55435-e762-4b44-8924-61cb6e8de264/call-align_mito/shard-0/execution/script # if [ -z "$SINGULARITY_BINDPATH" ]; then export SINGULARITY_BINDPATH=; fi; if [ -z "$SINGULARITY_CACHEDIR" ]; then export SINGULARITY_CACHEDIR=; fi; singularity exec --cleanenv --home /farmshare/user_data/baharehh/atac-seq-pipeline/atac/6ce55435-e762-4b44-8924-61cb6e8de264/call-align_mito/shard-0 /bin/bash /farmshare/user_data/baharehh/atac-seq-pipeline/atac/6ce55435-e762-4b44-8924-61cb6e8de264/call-align_mito/shard-0/execution/script
2020-02-27 07:16:52,533 cromwell-system-akka.dispatchers.backend-dispatcher-100 INFO - BackgroundConfigAsyncJobExecutionActor [UUID(6ce55435)atac.align_mito:1:1]: job id: 26212
2020-02-27 07:16:52,538 cromwell-system-akka.dispatchers.backend-dispatcher-93 INFO - BackgroundConfigAsyncJobExecutionActor [UUID(6ce55435)atac.align:0:1]: job id: 26193
2020-02-27 07:16:52,773 cromwell-system-akka.dispatchers.backend-dispatcher-93 INFO - BackgroundConfigAsyncJobExecutionActor [UUID(6ce55435)atac.align_mito:1:1]: Status change from - to WaitingForReturnCode
2020-02-27 07:16:52,782 cromwell-system-akka.dispatchers.backend-dispatcher-94 INFO - BackgroundConfigAsyncJobExecutionActor [UUID(6ce55435)atac.align:0:1]: Status change from - to WaitingForReturnCode
2020-02-27 07:16:57,521 cromwell-system-akka.dispatchers.backend-dispatcher-91 INFO - BackgroundConfigAsyncJobExecutionActor [UUID(6ce55435)atac.align_mito:0:1]: job id: 26234
2020-02-27 07:16:58,004 cromwell-system-akka.dispatchers.backend-dispatcher-91 INFO - BackgroundConfigAsyncJobExecutionActor [UUID(6ce55435)atac.align_mito:0:1]: Status change from - to WaitingForReturnCode
2020-02-27 09:15:51,700 cromwell-system-akka.dispatchers.backend-dispatcher-195 INFO - BackgroundConfigAsyncJobExecutionActor [UUID(6ce55435)atac.align_mito:0:1]: Status change from WaitingForReturnCode to Done
2020-02-27 09:30:00,972 cromwell-system-akka.dispatchers.backend-dispatcher-196 INFO - BackgroundConfigAsyncJobExecutionActor [UUID(6ce55435)atac.align_mito:1:1]: Status change from WaitingForReturnCode to Done

leepc12 commented 4 years ago

caper run automatically runs a troubleshooter before closing to parse metadata.json to find a reason for failure.
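
If you want to pull the failure messages out of a metadata.json by hand, a minimal sketch (assuming the jq tool is available on rice; run it from the directory containing metadata.json):

$ # list the failure messages recorded by Cromwell
$ jq '.failures[].causedBy[].message' metadata.json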

Your metadata.json says that

    "failures": [
        {
            "message": "Workflow failed",
            "causedBy": [
                {
                    "message": "Job chip.call_peak_pr1:1:2 exited with return code 1 which has not been declared as a valid return code. See 'continueOnReturnCode' runtime attribute for more details.",
                    "causedBy": []
                },
                {
                    "message": "Job chip.call_peak_ppr1:NA:2 exited with return code 1 which has not been declared as a valid return code. See 'continueOnReturnCode' runtime attribute for more details.",
                    "causedBy": []
                },
                {
                    "message": "Job chip.call_peak_pr2:0:2 exited with return code 1 which has not been declared as a valid return code. See 'continueOnReturnCode' runtime attribute for more details.",
                    "causedBy": []
                },
                {
                    "message": "Job chip.macs2_signal_track_pooled:NA:2 exited with return code 1 which has not been declared as a valid return code. See 'continueOnReturnCode' runtime attribute for more details.",
                    "causedBy": []
                },
                {
                    "causedBy": [],
                    "message": "Job chip.call_peak:0:2 exited with return code 1 which has not been declared as a valid return code. See 'continueOnReturnCode' runtime attribute for more details."
                },
                {
                    "causedBy": [],
                    "message": "Job chip.call_peak:1:2 exited with return code 1 which has not been declared as a valid return code. See 'continueOnReturnCode' runtime attribute for more details."
                }
            ]
        }
    ],

Look at the following STDERR files and check what happened.

macs2: /farmshare/user_data/baharehh/chip-seq-pipeline2/chip/90144a57-b1fc-4f25-a8cb-3885695d9eea/call-macs2_signal_track_pooled/execution/stderr

call-peak: /farmshare/user_data/baharehh/chip-seq-pipeline2/chip/90144a57-b1fc-4f25-a8cb-3885695d9eea/call-call_peak_ppr1/execution/stderr
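
If you are not sure which stderr files a failed run produced, a quick way to list them all (a sketch; substitute your own output directory, as above):

$ find [YOUR_OUTPUT_DIR] -name "stderr" -path "*/execution/*"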

Baharehh commented 4 years ago

I attach stderr from a run that failed but progressed far enough to produce these 2 files (the run above didn't produce them). stderr-2.txt stderr.txt

Please take a look. It doesn't give me a reason why it failed.

Baharehh commented 4 years ago

I think nth=4 is a very small number; could we increase the core count, say up to 8? I know I have ~16 cores in my system...?

leepc12 commented 4 years ago

No, increase memory instead of CPUs (nth). Find the default memory for the following variables in the input documentation and try doubling them in your input JSON, as sketched after the list below.

atac.call_peak_mem_mb
atac.macs2_signal_track_mem_mb
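
For example, a minimal sketch of what the added entries could look like alongside the existing keys in your input JSON (the numbers are illustrative, not the documented defaults):

    {
        "atac.call_peak_mem_mb": 32000,
        "atac.macs2_signal_track_mem_mb": 32000
    }
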
Baharehh commented 4 years ago

Thanks Jin, I fixed the memory issue, but the atac pipeline has still been halted since 2 pm with the "partial" output below (the chip pipeline also halted):

2020-02-27 14:21:36,524 cromwell-system-akka.dispatchers.backend-dispatcher-101 INFO - BackgroundConfigAsyncJobExecutionActor [UUID(0d355764)atac.align:0:1]: job id: 4567
2020-02-27 14:21:36,524 cromwell-system-akka.dispatchers.backend-dispatcher-102 INFO - BackgroundConfigAsyncJobExecutionActor [UUID(0d355764)atac.align_mito:1:1]: job id: 4594
2020-02-27 14:21:36,626 cromwell-system-akka.dispatchers.backend-dispatcher-97 INFO - BackgroundConfigAsyncJobExecutionActor [UUID(0d355764)atac.align_mito:0:1]: executing: /bin/bash /farmshare/user_data/baharehh/atac-seq-pipeline/atac/0d355764-5dc3-4791-ac03-b4155b727957/call-align_mito/shard-0/execution/script # if [ -z "$SINGULARITY_BINDPATH" ]; then export SINGULARITY_BINDPATH=; fi; if [ -z "$SINGULARITY_CACHEDIR" ]; then export SINGULARITY_CACHEDIR=; fi; singularity exec --cleanenv --home /farmshare/user_data/baharehh/atac-seq-pipeline/atac/0d355764-5dc3-4791-ac03-b4155b727957/call-align_mito/shard-0 /bin/bash /farmshare/user_data/baharehh/atac-seq-pipeline/atac/0d355764-5dc3-4791-ac03-b4155b727957/call-align_mito/shard-0/execution/script
2020-02-27 14:21:36,692 cromwell-system-akka.dispatchers.backend-dispatcher-106 INFO - BackgroundConfigAsyncJobExecutionActor [UUID(0d355764)atac.align_mito:1:1]: Status change from - to WaitingForReturnCode
2020-02-27 14:21:36,692 cromwell-system-akka.dispatchers.backend-dispatcher-99 INFO - BackgroundConfigAsyncJobExecutionActor [UUID(0d355764)atac.align:0:1]: Status change from - to WaitingForReturnCode
2020-02-27 14:21:41,524 cromwell-system-akka.dispatchers.backend-dispatcher-106 INFO - BackgroundConfigAsyncJobExecutionActor [UUID(0d355764)atac.align_mito:0:1]: job id: 4627
2020-02-27 14:21:41,526 cromwell-system-akka.dispatchers.backend-dispatcher-97 INFO - BackgroundConfigAsyncJobExecutionActor [UUID(0d355764)atac.align_mito:0:1]: Status change from - to WaitingForReturnCode

leepc12 commented 4 years ago

What is the cluster engine of Rice? Is it based on SLURM?

Baharehh commented 4 years ago

It's Farmshare. Here are the last lines of a run that did not finish (even after increasing the memory), along with the stdout, metadata.json, and input.json. Can you please let me know what you think the issue is? Thanks!

INFO 2020-02-27 22:50:55 MarkDuplicates Tracking 12 as yet unmatched pairs. 12 records in RAM.
INFO 2020-02-27 22:51:07 MarkDuplicates Read 12,000,000 records. Elapsed time: 00:06:36s. Time for last 1,000,000: 12s. Last read position: chr10:121,485,559
INFO 2020-02-27 22:51:07 MarkDuplicates Tracking 484 as yet unmatched pairs. 484 records in RAM.
/bin/bash: line 1: 26648 Killed java -Xmx40000M -XX:ParallelGCThreads=1 -jar /home/baharehh/atac_dnase_pipelines/yes/envs/encode-atac-seq-pipeline/share/picard-2.20.7-0/picard.jar MarkDuplicates INPUT=hs_bh_s2.R1.trim.filt.bam OUTPUT=hs_bh_s2.R1.trim.dupmark.bam METRICS_FILE=hs_bh_s2.R1.trim.dup.qc VALIDATION_STRINGENCY=LENIENT USE_JDK_DEFLATER=TRUE USE_JDK_INFLATER=TRUE ASSUME_SORTED=true REMOVE_DUPLICATES=false
STDOUT=

stdout.txt metadata.txt template.full-all-no-apator-allMemor-increase.txt call-alighn-stdout.txt

leepc12 commented 4 years ago

Did you run it on a login node? Then it would fail.

You have to be on a compute node (with enough memory, like the 40G you requested) to run pipelines in local mode (you did caper init local, right?).
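
A quick way to check where you are and what the node offers before starting a run (plain shell, nothing pipeline-specific):

$ hostname    # confirm whether you are on a login node or a compute node
$ free -h     # memory available on this node
$ nproc       # CPU cores available on this node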

Baharehh commented 4 years ago

Oh, is it because I increased some of the memory settings to 32G in some parts? Yes, I am on the login node and don't know the memory. "cat /proc/meminfo" gives the below:

Shmem: 326752 kB
Slab: 7024352 kB
SReclaimable: 3580420 kB
SUnreclaim: 3443932 kB
KernelStack: 11648 kB
PageTables: 67136 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 40343884 kB
Committed_AS: 14258576 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 0 kB
VmallocChunk: 0 kB
HardwareCorrupted: 0 kB
AnonHugePages: 0 kB
ShmemHugePages: 0 kB
ShmemPmdMapped: 0 kB
CmaTotal: 0 kB
CmaFree: 0 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
DirectMap4k: 7165312 kB
DirectMap2M: 43157504 kB

and "Free" gives below total used free shared buff/cache available Mem: 49443488 7450964 11716904 326724 30275620 37658480 Swap: 15622140 567800 15054344

How much can I increase the memory on my system? It seems like I can go up to 40G...?!

Thanks

leepc12 commented 4 years ago

One thing is for sure: your cluster Rice killed the pipelines due to OOM (out-of-memory).

I don't know how your cluster Rice works, so I don't have much advice for you here. Just don't run anything on a login node. Log in to a compute node and run it there. Ask your cluster admin how to do it.

caper run will take up to 9 x atac.call_peak_mem_mb of memory because it runs all call_peak tasks in parallel. Serialize them by using caper run .. --max-concurrent-tasks=1; then your pipeline will take at most atac.call_peak_mem_mb of memory at a time.

caper run .. --max-concurrent-tasks=1
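
For a rough sense of scale (numbers purely illustrative): if atac.call_peak_mem_mb were doubled to 32000, running the peak-calling tasks in parallel could request on the order of 9 x 32 GB = 288 GB at once, whereas with --max-concurrent-tasks=1 the peak requirement stays near 32 GB.
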
Baharehh commented 4 years ago

Sure, will do. If I want to resume from metadata.json, why does it give this error? (I cd'd into the folder with this file and re-initiated the run correctly):

[Caper] Error (womtool): WDL or input JSON is invalid or input JSON doesn't exist.

leepc12 commented 4 years ago

That means your input JSON has wrong parameters or it doesn't exist.

Baharehh commented 4 years ago

But the output JSON file comes from the pipeline itself, and now I have to somehow figure out where the pipeline's error is in its own output so I can re-initiate the run?!

It's hard to fix... metadata.txt

Bahareh

leepc12 commented 4 years ago

If the pipeline fails, then caper run automatically debugs it (by parsing metadata.json). You just need to look at the last few lines on your screen to find the failure reason.

nicolerg commented 4 years ago

@Baharehh Just a note, Stanford's rice server is a login node that, as Jin mentioned, is not intended for intensive compute tasks like the ENCODE pipelines. FarmShare2 uses a SLURM job submission system, so I think you should be able to log into rice and follow ENCODE's instructions to initiate workflows via SLURM: https://github.com/ENCODE-DCC/caper#running-pipelines-on-slurm-clusters. From what I can tell from the FarmShare2 documentation (https://srcc.stanford.edu/farmshare2), using SLURM to submit jobs from rice will actually use compute nodes on the wheat server. If you continue to have issues, it may be productive to email research-computing-support@stanford.edu with your question and include "FarmShare2" in the subject line. Support may be limited because it is intended for Stanford coursework and unfunded research.
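
As a rough sketch of what submitting through SLURM from rice could look like (the partition, memory, time, and file names here are assumptions; check the FarmShare2 documentation and use your actual WDL/input JSON paths):

$ sbatch -p normal --mem=48G --cpus-per-task=8 --time=24:00:00 \
    --wrap "caper run atac.wdl -i input.json --max-concurrent-tasks=1"

Alternatively, per the linked caper instructions, configuring caper's SLURM backend lets caper itself submit each task as a SLURM job rather than running everything on one node.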