broadinstitute / cromwell

Scientific workflow engine designed for simplicity and scalability. Trivially transition from one-off use cases to massive-scale production environments.
http://cromwell.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

AWS S3: Can't access data outside my region (Status Code: 301) #4731

Open · illusional opened 5 years ago

illusional commented 5 years ago

Hi!

I'm having some trouble requesting S3 objects that are outside my current region (I get a Status Code: 301).

Backend: AWS Batch
Filesystem: S3
Region: ap-southeast-2

I'm attempting to run a small genomics pipeline that requests some of the broad-references open data set on AWS S3. I can see that the open data set lives in us-east-1.

Specifically, I'm requesting s3://broad-references/hg38/v0/Homo_sapiens_assembly38.fasta, and I'm receiving the same error 5 times:

[2019-03-12 11:27:21,50] [error] WorkflowManagerActor Workflow 434834fb-cb24-4bd2-ba44-8a1c929b11f5 failed (during MaterializingWorkflowDescriptorState): cromwell.engine.workflow.lifecycle.materialization.MaterializeWorkflowDescriptorActor$$anon$1: Workflow input processing failed:
[Attempted 1 time(s)] - S3Exception: null (Service: S3Client; Status Code: 301; Request ID: null)
[Attempted 1 time(s)] - S3Exception: null (Service: S3Client; Status Code: 301; Request ID: null)
[Attempted 1 time(s)] - S3Exception: null (Service: S3Client; Status Code: 301; Request ID: null)
[Attempted 1 time(s)] - S3Exception: null (Service: S3Client; Status Code: 301; Request ID: null)
[Attempted 1 time(s)] - S3Exception: null (Service: S3Client; Status Code: 301; Request ID: null)

    at cromwell.engine.workflow.lifecycle.materialization.MaterializeWorkflowDescriptorActor.cromwell$engine$workflow$lifecycle$materialization$MaterializeWorkflowDescriptorActor$$workflowInitializationFailed(MaterializeWorkflowDescriptorActor.scala:217)
    at cromwell.engine.workflow.lifecycle.materialization.MaterializeWorkflowDescriptorActor$$anonfun$2.applyOrElse(MaterializeWorkflowDescriptorActor.scala:187)
    at cromwell.engine.workflow.lifecycle.materialization.MaterializeWorkflowDescriptorActor$$anonfun$2.applyOrElse(MaterializeWorkflowDescriptorActor.scala:182)
    at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:34)
    at akka.actor.FSM.processEvent(FSM.scala:684)
    at akka.actor.FSM.processEvent$(FSM.scala:681)
    at cromwell.engine.workflow.lifecycle.materialization.MaterializeWorkflowDescriptorActor.akka$actor$LoggingFSM$$super$processEvent(MaterializeWorkflowDescriptorActor.scala:138)
    at akka.actor.LoggingFSM.processEvent(FSM.scala:820)
    at akka.actor.LoggingFSM.processEvent$(FSM.scala:802)
    at cromwell.engine.workflow.lifecycle.materialization.MaterializeWorkflowDescriptorActor.processEvent(MaterializeWorkflowDescriptorActor.scala:138)
    at akka.actor.FSM.akka$actor$FSM$$processMsg(FSM.scala:678)
    at akka.actor.FSM$$anonfun$receive$1.applyOrElse(FSM.scala:672)
    at akka.actor.Actor.aroundReceive(Actor.scala:517)
    at akka.actor.Actor.aroundReceive$(Actor.scala:515)
    at cromwell.engine.workflow.lifecycle.materialization.MaterializeWorkflowDescriptorActor.aroundReceive(MaterializeWorkflowDescriptorActor.scala:138)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:588)
    at akka.actor.ActorCell.invoke(ActorCell.scala:557)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
    at akka.dispatch.Mailbox.run(Mailbox.scala:225)
    at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
    at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
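
As a sanity check that the bucket really does live in another region, a HEAD request against the bucket's global endpoint shows it: S3 reports the bucket's true region in the x-amz-bucket-region header even when the response itself is a redirect. A throwaway Scala sketch of my own (not Cromwell code):

import java.net.{HttpURLConnection, URL}

object BucketRegionCheck extends App {
  // HEAD the bucket's global endpoint; S3 includes the bucket's true
  // region in x-amz-bucket-region even on a 301/403 response.
  val conn = new URL("https://broad-references.s3.amazonaws.com/")
    .openConnection().asInstanceOf[HttpURLConnection]
  conn.setInstanceFollowRedirects(false)  // keep the original response
  conn.setRequestMethod("HEAD")
  println(s"HTTP ${conn.getResponseCode}")
  println(s"x-amz-bucket-region: ${conn.getHeaderField("x-amz-bucket-region")}")
  conn.disconnect()
}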

I'm basically using the standard AWS configuration file for Cromwell:

include required(classpath("application"))

aws {
  application-name = "cromwell"
  auths = [{
      name = "default"
      scheme = "default"  # default AWS credentials provider chain
  }]
  region = "ap-southeast-2"  # the region every AWS client (S3 included) is built with
}

engine { filesystems { s3 { auth = "default" } } }

backend {
  default = "AWSBATCH"
  providers {
    AWSBATCH {
      actor-factory = "cromwell.backend.impl.aws.AwsBatchBackendLifecycleActorFactory"
      config {
        numSubmitAttempts = 3
        numCreateDefinitionAttempts = 3
        root = "s3://$bucketName/cromwell-execution"
        auth = "default"
        concurrent-job-limit = 16
        default-runtime-attributes {
          queueArn = "arn:aws:batch:ap-southeast-2:$arn"
        }
        filesystems { s3 { auth = "default" } }
      }
    }
  }
}

I've contacted AWS Support to find out whether I could fully (region-)qualify the S3 locator (something like s3://us-east-1.amazonaws.com/broad-references/.../file).

AWS basically said no, and they directed me towards https://github.com/aws/aws-sdk-java/issues/1366 (their aws-sdk-java), which discusses an enableForceGlobalBucketAccess option on an AmazonS3Builder.
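
For reference, here's roughly how that flag is set when building a v1 client (a sketch only; the stack trace above mentions the v2 S3Client, so I'm not sure this carries over directly):

import com.amazonaws.services.s3.{AmazonS3, AmazonS3ClientBuilder}

// Sketch of the v1 aws-sdk-java option AWS Support pointed to: with
// force-global bucket access enabled, the client follows S3's 301
// redirect to the bucket's actual region instead of surfacing it.
val s3: AmazonS3 = AmazonS3ClientBuilder
  .standard()
  .withRegion("ap-southeast-2")              // the client's home region
  .withForceGlobalBucketAccessEnabled(true)  // permit buckets in other regions
  .build()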

I've tried searching through Cromwell to work out where this setting could be placed, but I'm a bit lost in the project structure and Scala.

danbills commented 5 years ago

Brain-dumping what I learned from Emil: this localization code is in the proxy, and it probably needs to use a force-global flag as in the Java SDK.

Looking forward, I also see an issue with call caching to files outside the compute region, since the filesystem copy does not use the force-global flag.
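
As far as I can tell, the v2 SDK (the S3Client in the stack trace) has no equivalent builder flag, so one hypothetical shape for such a fix is to catch the 301 and rebuild the client against the region S3 reports back in x-amz-bucket-region. A sketch only, not what Cromwell does today:

import software.amazon.awssdk.regions.Region
import software.amazon.awssdk.services.s3.S3Client
import software.amazon.awssdk.services.s3.model.{HeadBucketRequest, S3Exception}

// Hypothetical helper: probe the bucket from the home region and, on a
// 301 redirect, rebuild the client in the region S3 reports back.
def clientFor(bucket: String, homeRegion: Region): S3Client = {
  val client = S3Client.builder().region(homeRegion).build()
  try {
    client.headBucket(HeadBucketRequest.builder().bucket(bucket).build())
    client
  } catch {
    case e: S3Exception if e.statusCode == 301 =>
      client.close()
      val actual = e.awsErrorDetails.sdkHttpResponse
        .firstMatchingHeader("x-amz-bucket-region")
        .orElseThrow(() => e)
      S3Client.builder().region(Region.of(actual)).build()
  }
}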

danbills commented 5 years ago

Unable to reproduce; if you could post your CWL, it'd be appreciated.

mihirsamdarshi commented 2 years ago

Hi, this might be a little late, but I am having this issue too when running with AWS Batch. I configured my core environment on my own (without using the CloudFormation templates). I have a bucket located in us-west-2, while the instance running Cromwell (v59) and the Batch job queue are located in us-east-2. When I run a job, I get the same error that @illusional was getting.