broadinstitute / cromwell

Scientific workflow engine designed for simplicity & scalability. Trivially transition between one off use cases to massive scale production environments
http://cromwell.readthedocs.io/

Permission denied on AWS Batch workflow #4542

Open kmavrommatis opened 5 years ago

kmavrommatis commented 5 years ago

Hi, I am trying to run a workflow on AWS Batch using the genomics-ami. The AMI was built following the instructions in the relevant pages, and I have confirmed that it contains a /cromwell-root mount point and has rw access to the bucket we use. The AWS Batch backend was tested with the hello.wdl workflow, which ran successfully.

When I run the workflow on the local filesystem it completes without errors, but when I run it using the AWS Batch backend the first step fails with the following error:

[2019-01-11 20:27:06,80] [error] WorkflowManagerActor Workflow 8fa7a9e4-f30d-4c19-b8cb-68be6442f317 failed (during ExecutingWorkflowState): cromwell.engine.io.IoAttempts$EnhancedCromwellIoException: [Attempted 1 time(s)] - IOException: Could not read from s3://bucket/cwl_temp_file_8fa7a9e4-f30d-4c19-b8cb-68be6442f317.cwl/8fa7a9e4-f30d-4c19-b8cb-68be6442f317/call-bbmap/bbmap-rc.txt: s3://s3.amazonaws.com/bucket/cwl_temp_file_8fa7a9e4-f30d-4c19-b8cb-68be6442f317.cwl/8fa7a9e4-f30d-4c19-b8cb-68be6442f317/call-bbmap/bbmap-rc.txt
Caused by: java.io.IOException: Could not read from s3://bucket/cwl_temp_file_8fa7a9e4-f30d-4c19-b8cb-68be6442f317.cwl/8fa7a9e4-f30d-4c19-b8cb-68be6442f317/call-bbmap/bbmap-rc.txt: s3://s3.amazonaws.com/bucket/cwl_temp_file_8fa7a9e4-f30d-4c19-b8cb-68be6442f317.cwl/8fa7a9e4-f30d-4c19-b8cb-68be6442f317/call-bbmap/bbmap-rc.txt
    at cromwell.engine.io.nio.NioFlow$$anonfun$withReader$2.applyOrElse(NioFlow.scala:146)
    at cromwell.engine.io.nio.NioFlow$$anonfun$withReader$2.applyOrElse(NioFlow.scala:145)
    at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:34)
    at scala.util.Failure.recoverWith(Try.scala:232)
    at cromwell.engine.io.nio.NioFlow.withReader(NioFlow.scala:145)
    at cromwell.engine.io.nio.NioFlow.limitFileContent(NioFlow.scala:154)
    at cromwell.engine.io.nio.NioFlow.$anonfun$readAsString$1(NioFlow.scala:98)
    at cats.effect.internals.IORunLoop$.cats$effect$internals$IORunLoop$$loop(IORunLoop.scala:85)
    at cats.effect.internals.IORunLoop$RestartCallback.signal(IORunLoop.scala:336)
    at cats.effect.internals.IORunLoop$RestartCallback.apply(IORunLoop.scala:357)
    at cats.effect.internals.IORunLoop$RestartCallback.apply(IORunLoop.scala:303)
    at cats.effect.internals.IOShift$Tick.run(IOShift.scala:36)
    at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40)
    at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:44)
    at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: java.nio.file.NoSuchFileException: s3://s3.amazonaws.com/celgene-rnd-riku-researchanalytics/cwl_temp_file_8fa7a9e4-f30d-4c19-b8cb-68be6442f317.cwl/8fa7a9e4-f30d-4c19-b8cb-68be6442f317/call-bbmap/bbmap-rc.txt
    at org.lerch.s3fs.S3FileSystemProvider.newInputStream(S3FileSystemProvider.java:350)
    at java.nio.file.Files.newInputStream(Files.java:152)
    at better.files.File.newInputStream(File.scala:337)
    at cromwell.core.path.BetterFileMethods.newInputStream(BetterFileMethods.scala:240)
    at cromwell.core.path.BetterFileMethods.newInputStream$(BetterFileMethods.scala:239)
    at cromwell.filesystems.s3.S3Path.newInputStream(S3PathBuilder.scala:156)
    at cromwell.core.path.EvenBetterPathMethods.mediaInputStream(EvenBetterPathMethods.scala:94)
    at cromwell.core.path.EvenBetterPathMethods.mediaInputStream$(EvenBetterPathMethods.scala:91)
    at cromwell.filesystems.s3.S3Path.mediaInputStream(S3PathBuilder.scala:156)
    at cromwell.engine.io.nio.NioFlow.$anonfun$withReader$1(NioFlow.scala:145)
    at cromwell.util.TryWithResource$.$anonfun$tryWithResource$1(TryWithResource.scala:14)
    at scala.util.Try$.apply(Try.scala:209)
    at cromwell.util.TryWithResource$.tryWithResource(TryWithResource.scala:10)
    ... 14 more

[2019-01-11 20:27:06,80] [info] WorkflowManagerActor WorkflowActor-8fa7a9e4-f30d-4c19-b8cb-68be6442f317 is in a terminal state: WorkflowFailedState

Looking at the CloudWatch logs, it appears that the problem is with permissions on the node:


04:26:11  mkdir: cannot create directory '/cromwell_root/bucket/cwl_temp_file_8fa7a9e4-f30d-4c19-b8cb-68be6442f317.cwl': Permission denied
04:26:11  chmod: cannot access '': No such file or directory
04:26:11  mkfifo: cannot create fifo '/out.1': Permission denied
04:26:11  mkfifo: cannot create fifo '/err.1': Permission denied
04:26:11  /bin/bash: line 15: /out.1: No such file or directory
04:26:11  /bin/bash: line 16: /err.1: No such file or directory
04:26:11  /bin/bash: line 22: /out.1: Permission denied
04:26:11  /bin/bash: line 23: /cromwell_root/bbmap-rc.txt.tmp: Permission denied
04:26:11  mkdir: cannot create directory '/cromwell_root/glob-2e18d4d3f934d19c17412db2b66b70fa': Permission denied
04:26:11  /bin/bash: line 38: /cromwell_root/glob-2e18d4d3f934d19c17412db2b66b70fa/cromwell_glob_control_file: No such file or directory
04:26:11  ln: failed to access '/cromwell_root/*R?.fq.gz': No such file or directory
04:26:11  /bin/bash: line 44: /cromwell_root/glob-2e18d4d3f934d19c17412db2b66b70fa.list: Permission denied
04:26:11  ls: cannot access '/cromwell_root/glob-2e18d4d3f934d19c17412db2b66b70fa': No such file or directory
04:26:11  mkdir: cannot create directory '/cromwell_root/glob-560912e697c3494360223c7ca65aa3e8': Permission denied
04:26:11  /bin/bash: line 52: /cromwell_root/glob-560912e697c3494360223c7ca65aa3e8/cromwell_glob_control_file: No such file or directory
04:26:11  ln: failed to access '/cromwell_root/*.qcstats': No such file or directory
04:26:11  /bin/bash: line 58: /cromwell_root/glob-560912e697c3494360223c7ca65aa3e8.list: Permission denied
04:26:11  ls: cannot access '/cromwell_root/glob-560912e697c3494360223c7ca65aa3e8': No such file or directory
04:26:11  mkdir: cannot create directory '/cromwell_root/glob-b34dfc006a981a93d6da067cf50036fe': Permission denied
04:26:11  /bin/bash: line 66: /cromwell_root/glob-b34dfc006a981a93d6da067cf50036fe/cromwell_glob_control_file: No such file or directory
04:26:11  ln: failed to access '/cromwell_root/cwl.output.json': No such file or directory
04:26:11  /bin/bash: line 72: /cromwell_root/glob-b34dfc006a981a93d6da067cf50036fe.list: Permission denied
04:26:11  ls: cannot access '/cromwell_root/glob-b34dfc006a981a93d6da067cf50036fe': No such file or directory
04:26:11  mv: cannot stat '/cromwell_root/bbmap-rc.txt.tmp': No such file or directory
04:26:11  MIME-Version: 1.0
04:26:11  Content-Type: multipart/alternative; boundary=278185423cec5467d351ab751807c36a
04:26:11  --278185423cec5467d351ab751807c36a
04:26:11  Content-Type: text/plain
04:26:11  Content-Disposition: attachment; filename=rc.txt
04:26:11  cat: /cromwell_root/bbmap-rc.txt: No such file or directory
04:26:11  --278185423cec5467d351ab751807c36a
04:26:11  Content-Type: text/plain
04:26:11  Content-Disposition: attachment; filename=stdout.txt
04:26:11  cat: /cromwell_root/bbmap-stdout.log: No such file or directory
04:26:11  --278185423cec5467d351ab751807c36a
04:26:11  Content-Type: text/plain
04:26:11  Content-Disposition: attachment; filename=stderr.txt
04:26:11  cat: /cromwell_root/DA0000317_WSU-DLCL.qcstats: No such file or directory
04:26:11  --278185423cec5467d351ab751807c36a--
04:26:11  cat: /cromwell_root/bbmap-rc.txt: No such file or directory
04:26:11  rm: cannot remove '/out.1': No such file or directory
04:26:11  rm: cannot remove '/err.1': No such file or directory

I have also tried logging in to the node and explicitly setting 777 permissions on /cromwell-root, but the result was the same. Are there any specific considerations regarding the Docker image, or is any additional configuration required?

Thanks in advance for your help
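
For anyone debugging a similar failure, here is a minimal sketch of a writability check that can be run on the Batch node and again inside a container started from the task image. It is not part of Cromwell or the genomics AMI; the function name check_mount is made up for illustration, and the directory is passed as an argument. Note that the post above describes the mount point as /cromwell-root while the CloudWatch errors reference /cromwell_root, so it may be worth checking both spellings.

```shell
#!/bin/sh
# Hypothetical helper: report whether a directory exists and is writable
# by the current user. The path to check is passed as the first argument.
check_mount() {
    dir="$1"
    if [ ! -d "$dir" ]; then
        echo "missing: $dir"
        return 1
    fi
    # Create and remove a scratch file instead of trusting the mode bits,
    # which can mislead on read-only mounts or when the container runs
    # under a different UID than the host user.
    probe="$dir/.write_probe.$$"
    if ( : > "$probe" ) 2>/dev/null; then
        rm -f "$probe"
        echo "writable: $dir"
        return 0
    fi
    echo "not writable: $dir"
    return 1
}
```

For example, `check_mount /cromwell_root` prints one of "missing", "writable", or "not writable" followed by the path, and returns a non-zero status unless the directory is writable.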

geoffjentry commented 5 years ago

Can you see if an equivalent WDL runs in the AWS backend? CWL is not fully supported on AWS and my guess is you’ve run into one of the places where it doesn’t work at the moment

geoffjentry commented 5 years ago

And by equivalent WDL I just meant something minimal to replicate the same idea of the task at hand, not porting your full workflow. It’d be handy to ascertain if it is a general backend problem or hitting into CWL issues.

kmavrommatis commented 5 years ago

@geoffjentry Thanks for the quick response and the suggestion.

I wrote a quick WDL script that runs on Cromwell with the local backend, but when deployed on the AWS Batch backend it fails in the same way. The script exampleWorkflows/bbmap.wdl is:

task bbmaptask {
  File f1
  File f2
  command {
    reformat.sh maxcalledquality=40 in=${f1} in2=${f2} out=${f1}.ref.fq.gz out2=${f2}.ref.fq.gz
  }
  output {
    Array[File] response = glob("*R?.ref.fq.gz")
  }
  runtime {
    docker: "*********.dkr.ecr.us-east-1.amazonaws.com/ngs/bbmap:v37.64"
  }

}

workflow bbmapwf{
    call bbmaptask
}

The input file exampleInput/fastq.s3.wdl.json is:

{
"bbmapwf.bbmaptask.f1": "s3://bucket/fastq.20180820-150001/DA0000317_WSU-DLCL_R_02_01_02_S36_R1_001.fastq.gz",
"bbmapwf.bbmaptask.f2": "s3://bucket/fastq.20180820-150001/DA0000317_WSU-DLCL_R_02_01_02_S36_R2_001.fastq.gz"
}

I run Cromwell as:

java -Dconfig.file=awsbatch/aws.conf -jar cromwell-36.jar run -i exampleInput/fastq.s3.wdl.json exampleWorkflows/bbmap.wdl

Thanks for your help

wleepang commented 5 years ago

@kmavrommatis - Curious how your AWS Batch environment was set up. Did you use the Cfn templates provided here, or build it manually?

It is important that the job instance profile associated with the compute environment has the correct access permissions.
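
One way to inspect what the job instance profile actually grants is sketched below, assuming the AWS CLI is installed and configured with credentials that can read Batch and IAM. The function names and the compute environment name are placeholders, not anything defined by Cromwell or the Cfn templates. The sketch only lists managed policies attached to the role; inline policies would need a separate `aws iam list-role-policies` call.

```shell
#!/bin/sh
# Hypothetical helper: return the trailing name of an ARN such as
# arn:aws:iam::123456789012:instance-profile/ecsInstanceRole.
# A short name without a slash is returned unchanged.
name_from_arn() {
    printf '%s\n' "${1##*/}"
}

# Hypothetical helper: look up the instance profile of a Batch compute
# environment and list the managed policies attached to its role(s).
show_instance_role_policies() {
    ce_name="$1"
    profile=$(aws batch describe-compute-environments \
        --compute-environments "$ce_name" \
        --query 'computeEnvironments[0].computeResources.instanceRole' \
        --output text) || return 1
    profile_name=$(name_from_arn "$profile")
    roles=$(aws iam get-instance-profile \
        --instance-profile-name "$profile_name" \
        --query 'InstanceProfile.Roles[].RoleName' \
        --output text) || return 1
    for role in $roles; do
        echo "role: $role"
        aws iam list-attached-role-policies --role-name "$role" \
            --query 'AttachedPolicies[].PolicyName' --output text
    done
}
```

Calling `show_instance_role_policies my-compute-env` (with your own compute environment name) prints each role behind the instance profile followed by its attached policy names, which can then be compared against the S3 access the workflow needs.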

kmavrommatis commented 5 years ago

Thanks, changing the environment setup using your suggested Cfn template resolved the problem.

medcelerate commented 5 years ago

Does anyone have a concrete solution to this? We are also getting the permission denied error with our AWS Batch setup. We have even included chmod 777 in the cloud-init script to ensure that the directory is accessible.