Closed fangpingmu closed 4 years ago
I am sorry about the late response. Did you install/activate the pipeline's Conda env before running the pipeline?
$ conda activate encode-atac-seq-pipeline
Hello @leepc12 ,
Thank you for building a robust debugger and powerful tool! I'm running into a similar error as @fangpingmu. Please note, I am working on the UCSF Wytnon HPC, which uses the Son Grid Engine (SGE) scheduler.
Here are my steps after cloning atac-seq-pipeline v1.7, installing caper v0.6.5, and downloading ENCSR356KRQ_subsampled_caper.json:
1. Initialize caper and set up default.conf. For the purpose of debugging, I've initialized this for a local environment.
$ caper init local
$ mkdir /wynton/home/bruneau/ablair/test
$ vim ~/.caper/default.conf
backend=local
tmp-dir=/wynton/home/bruneau/ablair/test
2. Execute build using singularity.
$ cd /wynton/home/bruneau/ablair/test
$ caper run /wynton/home/bruneau/ablair/atac-seq-pipeline/atac.wdl -i /wynton/home/bruneau/ablair/ENCSR356KRQ_subsampled_caper.json --singularity --no-build-singularity
3. Use caper to assess why build failed
$ cd /wynton/home/bruneau/ablair/test/atac/2269c24e-2237-47be-a05b-20213ef0fc78
$ ls
call-align call-align_mito call-read_genome_tsv metadata.json
$ caper debug metadata.json
Found failures:
[
{
"message": "Workflow failed",
"causedBy": [
{
"causedBy": [
{
"causedBy": [],
"message": "Bad output 'align_mito.bam': Failed to find index Success(WomInteger(0)) on array:\n\nSuccess([])\n\n0"
},
{
"causedBy": [],
"message": "Bad output 'align_mito.bai': Failed to find index Success(WomInteger(0)) on array:\n\nSuccess([])\n\n0"
},
{
"causedBy": [],
"message": "Bad output 'align_mito.samstat_qc': Failed to find index Success(WomInteger(0)) on array:\n\nSuccess([])\n\n0"
},
{
"causedBy": [],
"message": "Bad output 'align_mito.non_mito_samstat_qc': Failed to find index Success(WomInteger(0)) on array:\n\nSuccess([])\n\n0"
},
{
"causedBy": [],
"message": "Bad output 'align_mito.read_len_log': Failed to find index Success(WomInteger(0)) on array:\n\nSuccess([])\n\n0"
},
{
"causedBy": [],
"message": "Bad output 'align_mito.read_len': key not found: read_len_log"
}
],
"message": "Failed to evaluate job outputs"
},
{
"message": "Failed to evaluate job outputs",
"causedBy": [
{
"causedBy": [],
"message": "Bad output 'align_mito.bam': Failed to find index Success(WomInteger(0)) on array:\n\nSuccess([])\n\n0"
},
{
"causedBy": [],
"message": "Bad output 'align_mito.bai': Failed to find index Success(WomInteger(0)) on array:\n\nSuccess([])\n\n0"
},
{
"causedBy": [],
"message": "Bad output 'align_mito.samstat_qc': Failed to find index Success(WomInteger(0)) on array:\n\nSuccess([])\n\n0"
},
{
"causedBy": [],
"message": "Bad output 'align_mito.non_mito_samstat_qc': Failed to find index Success(WomInteger(0)) on array:\n\nSuccess([])\n\n0"
},
"causedBy": [],
"message": "Bad output 'align_mito.read_len_log': Failed to find index Success(WomInteger(0)) on array:\n\nSuccess([])\n\n0"
},
{
"causedBy": [],
"message": "Bad output 'align_mito.read_len': key not found: read_len_log"
}
]
},
{
"message": "Failed to evaluate job outputs",
"causedBy": [
{
"message": "Bad output 'align.bam': Failed to find index Success(WomInteger(0)) on array:\n\nSuccess([])\n\n0",
"causedBy": []
},
{
"causedBy": [],
"message": "Bad output 'align.bai': Failed to find index Success(WomInteger(0)) on array:\n\nSuccess([])\n\n0"
},
{
"causedBy": [],
"message": "Bad output 'align.samstat_qc': Failed to find index Success(WomInteger(0)) on array:\n\nSuccess([])\n\n0"
},
{
"causedBy": [],
"message": "Bad output 'align.non_mito_samstat_qc': Failed to find index Success(WomInteger(0)) on array:\n\nSuccess([])\n\n0"
},
{
"causedBy": [],
"message": "Bad output 'align.read_len_log': Failed to find index Success(WomInteger(0)) on array:\n\nSuccess([])\n\n0"
},
{
"causedBy": [],
"message": "Bad output 'align.read_len': key not found: read_len_log"
}
]
},
{
"causedBy": [
{
"message": "Bad output 'align.bam': Failed to find index Success(WomInteger(0)) on array:\n\nSuccess([])\n\n0",
"causedBy": []
},
{
"causedBy": [],
"message": "Bad output 'align.bai': Failed to find index Success(WomInteger(0)) on array:\n\nSuccess([])\n\n0"
},
{
"causedBy": [],
"message": "Bad output 'align.samstat_qc': Failed to find index Success(WomInteger(0)) on array:\n\nSuccess([])\n\n0"
},
{
"causedBy": [],
"message": "Bad output 'align.non_mito_samstat_qc': Failed to find index Success(WomInteger(0)) on array:\n\nSuccess([])\n\n0"
},
{
"causedBy": [],
"message": "Bad output 'align.read_len_log': Failed to find index Success(WomInteger(0)) on array:\n\nSuccess([])\n\n0"
},
{
"causedBy": [],
"message": "Bad output 'align.read_len': key not found: read_len_log"
}
],
"message": "Failed to evaluate job outputs"
}
]
}
]
atac.align_mito Failed. SHARD_IDX=0, RC=None, JOB_ID=32732, RUN_START=2020-02-24T20:19:13.754Z, RUN_END=2020-02-24T20:20:30.780Z, STDOUT=/wynton/home/bruneau/ablair/test/atac/2269c24e-2237-47be-a05b-20213ef0fc78/call-align_mito/shard-0/execution/stdout, STDERR=/wynton/home/bruneau/ablair/test/atac/2269c24e-2237-47be-a05b-20213ef0fc78/call-align_mito/shard-0/execution/stderr STDERR_CONTENTS=
atac.align_mito Failed. SHARD_IDX=1, RC=None, JOB_ID=398, RUN_START=2020-02-24T20:19:15.751Z, RUN_END=2020-02-24T20:21:03.867Z, STDOUT=/wynton/home/bruneau/ablair/test/atac/2269c24e-2237-47be-a05b-20213ef0fc78/call-align_mito/shard-1/execution/stdout, STDERR=/wynton/home/bruneau/ablair/test/atac/2269c24e-2237-47be-a05b-20213ef0fc78/call-align_mito/shard-1/execution/stderr STDERR_CONTENTS=
atac.align Failed. SHARD_IDX=0, RC=None, JOB_ID=32595, RUN_START=2020-02-24T20:19:09.782Z, RUN_END=2020-02-24T20:52:39.729Z, STDOUT=/wynton/home/bruneau/ablair/test/atac/2269c24e-2237-47be-a05b-20213ef0fc78/call-align/shard-0/execution/stdout, STDERR=/wynton/home/bruneau/ablair/test/atac/2269c24e-2237-47be-a05b-20213ef0fc78/call-align/shard-0/execution/stderr STDERR_CONTENTS=
atac.align Failed. SHARD_IDX=1, RC=None, JOB_ID=32698, RUN_START=2020-02-24T20:19:11.755Z, RUN_END=2020-02-24T20:52:43.579Z, STDOUT=/wynton/home/bruneau/ablair/test/atac/2269c24e-2237-47be-a05b-20213ef0fc78/call-align/shard-1/execution/stdout, STDERR=/wynton/home/bruneau/ablair/test/atac/2269c24e-2237-47be-a05b-20213ef0fc78/call-align/shard-1/execution/stderr STDERR_CONTENTS=
4. Print out of the last atac.align failure's stderr.background report at: /wynton/home/bruneau/ablair/test/atac/2269c24e-2237-47be-a05b-20213ef0fc78/call-align/shard-1/execution/
ln: failed to create hard link '/wynton/home/bruneau/ablair/test/atac/2269c24e-2237-47be-a05b-20213ef0fc78/call-align/shard-1/execution/glob-3bcbe4e7489c90f75e0523ac6f3a9385/ENCFF641SFZ.subsampled.400.trim.merged.bam' => '/wynton/home/bruneau/ablair/test/atac/2269c24e-2237-47be-a05b-20213ef0fc78/call-align/shard-1/execution/ENCFF641SFZ.subsampled.400.trim.merged.bam': Operation not permitted ln: failed to create hard link '/wynton/home/bruneau/ablair/test/atac/2269c24e-2237-47be-a05b-20213ef0fc78/call-align/shard-1/execution/glob-6efbc60cb1e0959bab4e467327a9416c/ENCFF641SFZ.subsampled.400.trim.merged.bam.bai' => '/wynton/home/bruneau/ablair/test/atac/2269c24e-2237-47be-a05b-20213ef0fc78/call-align/shard-1/execution/ENCFF641SFZ.subsampled.400.trim.merged.bam.bai': Operation not permitted ln: failed to create hard link '/wynton/home/bruneau/ablair/test/atac/2269c24e-2237-47be-a05b-20213ef0fc78/call-align/shard-1/execution/glob-7b38d9959cf6f3deb83ac2bd156d8317/ENCFF641SFZ.subsampled.400.trim.merged.samstats.qc' => '/wynton/home/bruneau/ablair/test/atac/2269c24e-2237-47be-a05b-20213ef0fc78/call-align/shard-1/execution/ENCFF641SFZ.subsampled.400.trim.merged.samstats.qc': Operation not permitted ln: failed to create hard link '/wynton/home/bruneau/ablair/test/atac/2269c24e-2237-47be-a05b-20213ef0fc78/call-align/shard-1/execution/glob-bc1afa799665df5c7d6afd70d2ae2cb4/ENCFF641SFZ.subsampled.400.trim.merged.no_chrM.samstats.qc' => '/wynton/home/bruneau/ablair/test/atac/2269c24e-2237-47be-a05b-20213ef0fc78/call-align/shard-1/execution/non_mito/ENCFF641SFZ.subsampled.400.trim.merged.no_chrM.samstats.qc': Operation not permitted ln: failed to create hard link '/wynton/home/bruneau/ablair/test/atac/2269c24e-2237-47be-a05b-20213ef0fc78/call-align/shard-1/execution/glob-773fb92850749a2b4a829cf3c8c4de27/ENCFF641SFZ.subsampled.400.trim.merged.read_length.txt' => 
'/wynton/home/bruneau/ablair/test/atac/2269c24e-2237-47be-a05b-20213ef0fc78/call-align/shard-1/execution/ENCFF641SFZ.subsampled.400.trim.merged.read_length.txt': Operation not permitted
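In case it helps with triage, here is a quick check of whether the filesystem itself refuses hard links (a sketch of my own, not part of the pipeline; the script name and TMPDIR path are just examples):

```shell
# Quick check: does this filesystem allow hard links? Point TMPDIR at the
# Caper output tree before running, e.g.
#   TMPDIR=/wynton/home/bruneau/ablair/test sh hardlink_check.sh
tmpdir=$(mktemp -d)
echo data > "$tmpdir/src"
if ln "$tmpdir/src" "$tmpdir/hardlink" 2>/dev/null; then
  result="hard links OK"
else
  result="hard links NOT permitted (Cromwell glob outputs will fail here)"
fi
rm -rf "$tmpdir"
echo "$result"
```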
I'll try running this again with the default.conf configured for sge and let you know if I receive a similar report. If the hard link error persists, is there a way for us to use soft links instead?
Many thanks, Andrew
Hi @leepc12,
Here is my update after configuring caper to run on SGE.
1. Update caper default.conf
$ caper init sge
$ qconf -spl
mpi mpi-8 mpi_onehost smp
$ vim ~/.caper/default.conf
backend=sge
sge-pe=mpi
tmp-dir=/wynton/home/bruneau/ablair/test
2. Execute build using singularity.
$ caper run /wynton/home/bruneau/ablair/atac-seq-pipeline/atac.wdl -i /wynton/home/bruneau/ablair/ENCSR356KRQ_subsampled_caper.json --singularity --no-build-singularity
3. Print out of caper's debug
Found failures:
[
{
"message": "Workflow failed",
"causedBy": [
{
"message": "java.lang.RuntimeException: Could not find job ID from stdout file.Check the stderr file for possible errors: /wynton/home/bruneau/ablair/test/atac/7ba75f43-9abb-4202-a8dc-56396251d317/call-read_genome_tsv/execution/stderr.submit",
"causedBy": [
{
"message": "Could not find job ID from stdout file.Check the stderr file for possible errors: /wynton/home/bruneau/ablair/test/atac/7ba75f43-9abb-4202-a8dc-56396251d317/call-read_genome_tsv/execution/stderr.submit",
"causedBy": []
}
]
}
]
}
]
atac.read_genome_tsv Failed. SHARD_IDX=-1, RC=None, JOB_ID=None, RUN_START=2020-02-24T22:51:43.015Z, RUN_END=2020-02-24T22:51:45.800Z, STDOUT=/wynton/home/bruneau/ablair/test/atac/7ba75f43-9abb-4202-a8dc-56396251d317/call-read_genome_tsv/execution/stdout, STDERR=/wynton/home/bruneau/ablair/test/atac/7ba75f43-9abb-4202-a8dc-56396251d317/call-read_genome_tsv/execution/stderr STDERR_CONTENTS= time="2020-02-24T14:52:12-08:00" level=warning msg="\"/run/user/35073\" directory set by $XDG_RUNTIME_DIR does not exist. Either create the directory or unset $XDG_RUNTIME_DIR.: stat /run/user/35073: no such file or directory: Trying to pull image in the event that it is a public image." FATAL: Unable to handle docker://quay.io/encode-dcc/atac-seq-pipeline:v1.7.0 uri: failed to get SHA of docker://quay.io/encode-dcc/atac-seq-pipeline:v1.7.0: pinging docker registry returned: Get https://quay.io/v2/: proxyconnect tcp: dial tcp 172.19.0.250:80: i/o timeout
4. Print out of the atac.read_genome_tsv failure's stderr report at /wynton/home/bruneau/ablair/test/atac/7ba75f43-9abb-4202-a8dc-56396251d317/call-read_genome_tsv/execution
time="2020-02-24T14:52:12-08:00" level=warning msg="\"/run/user/35073\" directory set by $XDG_RUNTIME_DIR does not exist. Either create the directory or unset $XDG_RUNTIME_DIR.: stat /run/user/35073: no such file or directory: Trying to pull image in the event that it is a public image." FATAL: Unable to handle docker://quay.io/encode-dcc/atac-seq-pipeline:v1.7.0 uri: failed to get SHA of docker://quay.io/encode-dcc/atac-seq-pipeline:v1.7.0: pinging docker registry returned: Get https://quay.io/v2/: proxyconnect tcp: dial tcp 172.19.0.250:80: i/o timeout
I also tried submitting to one of our compute nodes. Here are the steps and error messages:
1. Submit job specifying the build to use singularity
echo "caper run /wynton/home/bruneau/ablair/atac-seq-pipeline/atac.wdl -i /wynton/home/bruneau/ablair/ENCSR356KRQ_subsampled_caper.json --singularity --no-build-singularity" | qsub -V -N test -l h_rt=01:00:00 -l mem_free=2G -l eth_speed=20
2. View error report
Traceback (most recent call last):
File "/wynton/home/bruneau/ablair/.local/bin/caper", line 13, in
Exception: cURL RC: 7, HTTP_ERR: 0, STDERR: % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 0 0 0 0 0 0 0 0 --:--:-- 0:02:06 --:--:-- 0curl: (7) Failed connect to 172.19.0.250:80; Connection timed out
Please let me know if you require any more information.
Thanks, Andrew
I think there are two failures.
1) Caper failed to fetch a docker image (to build a singularity image from it) from quay.io:
Unable to handle docker://quay.io/encode-dcc/atac-seq-pipeline:v1.7.0
Define a singularity image on Docker Hub instead: --singularity docker://encodedcc/atac-seq-pipeline:v1.7.0
BTW, does your cluster allow internet connection on compute/login nodes? If this also fails, then try the Conda method.
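A rough way to check registry connectivity from a compute node is the sketch below (my own, adjust URLs as needed). An HTTP status of 000 means curl never reached the registry, which matches the "i/o timeout" in the stderr above.

```shell
# Probe the two registries the pipeline may pull from. Any real HTTP code
# (even 401) means the node can reach the registry; 000 means it cannot.
for url in https://quay.io/v2/ https://registry-1.docker.io/v2/; do
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$url" || true)
  echo "$url -> HTTP ${code:-000}"
done
```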
2) "/run/user/35073" directory set by $XDG_RUNTIME_DIR does not exist. Either create the directory or unset $XDG_RUNTIME_DIR
What does this error message mean?
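If it is just the missing runtime directory that Singularity warns about, a possible workaround is the sketch below (an assumption on my part, not an official fix):

```shell
# If $XDG_RUNTIME_DIR points at a directory that does not exist on the
# compute node, create it when possible; otherwise unset the variable so
# Singularity stops warning about it.
if [ -n "${XDG_RUNTIME_DIR:-}" ] && [ ! -d "$XDG_RUNTIME_DIR" ]; then
  mkdir -p "$XDG_RUNTIME_DIR" 2>/dev/null || unset XDG_RUNTIME_DIR
fi
# after this, the variable is either unset or points at a real directory
```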
This is the issue with hardlinks in cromwell's glob outputs.
A solution is provided in https://github.com/ENCODE-DCC/chip-seq-pipeline2/issues/91
I was able to resolve the problem by downloading the Cromwell source code, modifying the globLinkCommand to use soft links instead of hard links: .getOrElse("( ln -sL GLOB_PATTERN GLOB_DIRECTORY 2> /dev/null ) || ( ln -s GLOB_PATTERN GLOB_DIRECTORY )"), and building a new Cromwell jar file per their instructions. After updating ~/.caper/default.conf to use this newly built jar, the workflow proceeded as expected.
I do recommend that Caper add an option to change hard links to soft links.
We have lost interest in Cromwell pipelines. We know that AWS and GCP require hard links. However, hard links do not work on most HPC POSIX file systems. Many people have reported this problem to the Cromwell developers, but the Cromwell community has never addressed these complaints.
@fangpingmu: I think Cromwell has a parameter glob-link-command to control it.
https://github.com/broadinstitute/cromwell/pull/5250
I can add it to Caper's next release, or you can make a backend file to override Caper's built-in backends:
caper run/server --backend-file your.backend.conf
your.backend.conf should look like the following. Define it for any backend you want.
backend {
  providers {
    Local {
      config {
        glob-link-command = "ln -sL GLOB_PATTERN GLOB_DIRECTORY"
      }
    }
    sge {
      config {
        glob-link-command = "ln -sL GLOB_PATTERN GLOB_DIRECTORY"
      }
    }
    slurm {
      config {
        glob-link-command = "ln -sL GLOB_PATTERN GLOB_DIRECTORY"
      }
    }
  }
}
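To illustrate what the override changes, here is a tiny self-contained sketch (file and directory names are invented for illustration): Cromwell collects glob outputs into a glob-* directory, and with the override the entries there become symlinks instead of hard links, which works even on filesystems that forbid hard links.

```shell
# Demonstrate the "ln -sL GLOB_PATTERN GLOB_DIRECTORY" form from the
# config above: it places a symlink into the glob directory.
workdir=$(mktemp -d)
mkdir "$workdir/glob-3bcbe4e7"
echo data > "$workdir/output.bam"
ln -sL "$workdir/output.bam" "$workdir/glob-3bcbe4e7/"
# a symlink was created, no hard-link permission needed
if [ -L "$workdir/glob-3bcbe4e7/output.bam" ]; then kind=symlink; else kind=other; fi
echo "$kind"
rm -rf "$workdir"
```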
Thanks for the fast responses. Regarding each reply:
I don't believe our compute nodes have an internet connection, but our dev node does, which is where I ran the local build. I'll reach out to our admin to confirm the compute node internet access.
I will also have to ask our admin about the $XDG_RUNTIME_DIR error. I can let you know what they say, but I suspect this is a user group permission setting.
How long does caper run/server --backend-file your.backend.conf usually run? I defined my backend conf for sge and it's been running since ~4pm yesterday.
Thanks, Andrew
Running caper init [YOUR_PLATFORM] on dev nodes will init Caper's conf file and also download Cromwell and Womtool locally, so your pipeline can work offline later on.
Okay.
It depends on the size of your data and your cluster's node resource/availability. For big samples, it can take > 1 day on my cluster (Stanford Sherlock).
Describe the bug
I tried to run atac-seq-pipeline using the example json file at https://raw.githubusercontent.com/ENCODE-DCC/atac-seq-pipeline/master/example_input_json/caper/ENCSR356KRQ_subsampled_caper.json
The log file is long and the first error appears as,
2020-01-13 17:13:51,325 cromwell-system-akka.dispatchers.engine-dispatcher-43 INFO - WorkflowManagerActor Workflow 667052e2-f822-4a44-86dd-2ad86bf348c7 failed (during ExecutingWorkflowState): cromwell.backend.standard.StandardAsyncExecutionActor$$anon$2: Failed to evaluate job outputs: Bad output 'align_mito.bam': Failed to find index Success(WomInteger(0)) on array: Success([]) 0 Bad output 'align_mito.bai': Failed to find index Success(WomInteger(0)) on array: Success([]) 0 Bad output 'align_mito.samstat_qc': Failed to find index Success(WomInteger(0)) on array: Success([]) 0 Bad output 'align_mito.non_mito_samstat_qc': Failed to find index Success(WomInteger(0)) on array: Success([]) 0 Bad output 'align_mito.read_len_log': Failed to find index Success(WomInteger(0)) on array: Success([]) 0 Bad output 'align_mito.read_len': key not found: read_len_log at cromwell.backend.standard.StandardAsyncExecutionActor.$anonfun$handleExecutionSuccess$1(StandardAsyncExecutionActor.scala:916) at scala.util.Success.$anonfun$map$1(Try.scala:255) at scala.util.Success.map(Try.scala:213) at scala.concurrent.Future.$anonfun$map$1(Future.scala:292) at scala.concurrent.impl.Promise.liftedTree1$1(Promise.scala:33) at scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:33) at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64) at akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:55) at akka.dispatch.BatchingExecutor$BlockableBatch.$anonfun$run$1(BatchingExecutor.scala:92) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:85) at akka.dispatch.BatchingExecutor$BlockableBatch.run(BatchingExecutor.scala:92) at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:41) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:49) at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) 
at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) cromwell.backend.standard.StandardAsyncExecutionActor$$anon$2: Failed to evaluate job outputs: Bad output 'align_mito.bam': Failed to find index Success(WomInteger(0)) on array: at cromwell.backend.standard.StandardAsyncExecutionActor.$anonfun$handleExecutionSuccess$1(StandardAsyncExecutionActor.scala:916) at scala.util.Success.$anonfun$map$1(Try.scala:255) at scala.util.Success.map(Try.scala:213) at scala.concurrent.Future.$anonfun$map$1(Future.scala:292) at scala.concurrent.impl.Promise.liftedTree1$1(Promise.scala:33) at scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:33) at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64) at akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:55) at akka.dispatch.BatchingExecutor$BlockableBatch.$anonfun$run$1(BatchingExecutor.scala:92) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:85) at akka.dispatch.BatchingExecutor$BlockableBatch.run(BatchingExecutor.scala:92) at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:41) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:49) at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
OS/Platform
Caper configuration file [defaults]
Input JSON file
https://raw.githubusercontent.com/ENCODE-DCC/atac-seq-pipeline/master/example_input_json/caper/ENCSR356KRQ_subsampled_caper.json
Error log
Caper automatically runs a troubleshooter for failed workflows. If it doesn't, then get a WORKFLOW_ID of your failed workflow with caper list, or directly use a metadata.json file in Caper's output directory.