broadinstitute / cromwell

Scientific workflow engine designed for simplicity & scalability. Trivially transition from one-off use cases to massive-scale production environments
http://cromwell.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

AWS S3 copy source too large #4805

vortexing closed this issue 5 years ago

vortexing commented 5 years ago

Backend: AWS

Workflow: https://github.com/FredHutch/reproducible-workflows/blob/master/WDL/unpaired-panel-consensus-variants-human/broad-containers-workflow.wdl
First input JSON: https://github.com/FredHutch/reproducible-workflows/blob/master/WDL/unpaired-panel-consensus-variants-human/broad-containers-parameters.json
Second input JSON is like the first, but refers to a batch of 100 input datasets: https://github.com/FredHutch/reproducible-workflows/blob/master/WDL/unpaired-panel-consensus-variants-human/broad-containers-batchofOne.json

Config:
Installed the Cromwell version from PR #4790.

Error:

        "callCaching": {
          "allowResultReuse": true,
          "hit": false,
          "result": "Cache Miss",
          "effectiveCallCachingMode": "ReadAndWriteCache",
          "hitFailures": [
            {
              "dd860da7-bed8-4e70-812c-227f4e6fead8:Panel_BWA_GATK4_Samtools_Var_Annotate_Split.SamToFastq:0": [
                {
                  "causedBy": [
                    {
                      "causedBy": [],
                      "message": "The specified copy source is larger than the maximum allowable size for a copy source: 5368709120 (Service: S3, Status Code: 400, Request ID: AE0D7E6A63C706E5)"
                    }
                  ],
                  "message": "[Attempted 1 time(s)] - S3Exception: The specified copy source is larger than the maximum allowable size for a copy source: 5368709120 (Service: S3, Status Code: 400, Request ID: AE0D7E6A63C706E5)"
                }
              ]
            }
          ]
        }

This version of Cromwell does successfully access and copy a cached file from a previous workflow, at least for the first task in a shard. The workflow is essentially a batch: each row of a batch file is passed to its own shard, the tasks then run independently on each input dataset, and the shards are never gathered. However, once the files get larger than the single test dataset, call caching apparently cannot reach the previous file in order to determine whether there is a hit.
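
For context, the size in the error (5368709120 bytes) is exactly 5 GiB, the maximum S3 accepts as the source of a single CopyObject request; anything larger has to be copied with a multipart UploadPartCopy. A minimal sketch of the size check that distinguishes the two cases, written against the AWS SDK v2 for Java from Scala; the object and method names are illustrative, not Cromwell's own code:

import software.amazon.awssdk.services.s3.S3Client
import software.amazon.awssdk.services.s3.model.HeadObjectRequest

object S3CopyLimitCheck {
  // S3's CopyObject API rejects copy sources larger than 5 GiB (5368709120 bytes),
  // which is the number reported in the error above.
  val MaxCopyObjectSourceBytes: Long = 5L * 1024 * 1024 * 1024

  // Illustrative helper: decide whether a cached output can be copied in a single
  // CopyObject request or needs a multipart copy instead.
  def needsMultipartCopy(s3: S3Client, bucket: String, key: String): Boolean = {
    val head = s3.headObject(HeadObjectRequest.builder().bucket(bucket).key(key).build())
    head.contentLength().longValue > MaxCopyObjectSourceBytes
  }
}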

dtenenba commented 5 years ago

Here is the config file used for the above.

include required(classpath("application"))

   "workflow_failure_mode": "ContinueWhilePossible"

webservice {
  port = 2525
}

system.file-hash-cache=true

system {
  job-rate-control {
    jobs = 1
    per = 2 second
  }
}

call-caching {
    enabled = true
    invalidate-bad-cache-results = true
}

database {
  profile = "slick.jdbc.MySQLProfile$"
  db {
#    driver = "com.mysql.jdbc.Driver"
    driver = "com.mysql.cj.jdbc.Driver"
    url = "jdbc:mysql://xxxxxx:xxxx/xxx?rewriteBatchedStatements=true&useSSL=false"
    user = "xxx"
    password = "xxx"
    connectionTimeout = 120000
  }
}

aws {
  application-name = "cromwell"
  auths = [
    {
      name = "default"
      scheme = "default"
    }
    {
        name = "assume-role-based-on-another"
        scheme = "assume_role"
        base-auth = "default"
        role-arn = "arn:aws:iam::xx:role/xxx"
    }
  ]
  // diff 1:
  # region = "us-west-2" // uses region from ~/.aws/config set by aws configure command,
  #                    // or us-east-1 by default
}
engine {
  filesystems {
    s3 {
      auth = "assume-role-based-on-another"
    }
  }
}
backend {
  default = "AWSBATCH"
  providers {
    AWSBATCH {
      actor-factory = "cromwell.backend.impl.aws.AwsBatchBackendLifecycleActorFactory"
      config {
        // Base bucket for workflow executions
        root = "s3://xxx/cromwell-output"
        // A reference to an auth defined in the `aws` stanza at the top.  This auth is used to create
        // Jobs and manipulate auth JSONs.
        auth = "default"
        // diff 2:
        numSubmitAttempts = 1
        // diff 3:
        numCreateDefinitionAttempts = 1
        default-runtime-attributes {
          queueArn: "arn:aws:batch:us-west-2:xxx:job-queue/xxx"
        }
        filesystems {
          s3 {
            // A reference to a potentially different auth for manipulating files via engine functions.
            auth = "default"
          }
        }
      }
    }
  }
}
geoffjentry commented 5 years ago

It was noted by @dtenenba that this is likely caused by the need to use multipart upload when copying large files.
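
For reference, a multipart copy splits the source into byte ranges, copies each range server-side with UploadPartCopy, and assembles the parts with CompleteMultipartUpload. Below is a rough Scala sketch against the AWS SDK v2 for Java; the 1 GiB part size, the names, and the overall structure are illustrative assumptions, not the change that eventually landed in Cromwell:

import software.amazon.awssdk.services.s3.S3Client
import software.amazon.awssdk.services.s3.model._

object MultipartCopySketch {
  // 1 GiB parts keep every UploadPartCopy request well under the 5 GiB per-request limit
  // (parts must be at least 5 MB, except the last one, and at most 5 GiB).
  val PartSize: Long = 1L * 1024 * 1024 * 1024

  def copyLargeObject(s3: S3Client, srcBucket: String, srcKey: String,
                      dstBucket: String, dstKey: String): Unit = {
    val size = s3.headObject(
      HeadObjectRequest.builder().bucket(srcBucket).key(srcKey).build()).contentLength().longValue

    val uploadId = s3.createMultipartUpload(
      CreateMultipartUploadRequest.builder().bucket(dstBucket).key(dstKey).build()).uploadId()

    // Copy each byte range of the source server-side as one part of the destination object.
    // (The source key is assumed not to need URL encoding here.)
    val completedParts = (0L until size by PartSize).zipWithIndex.map { case (offset, idx) =>
      val lastByte = math.min(offset + PartSize, size) - 1
      val resp = s3.uploadPartCopy(
        UploadPartCopyRequest.builder()
          .copySource(s"$srcBucket/$srcKey")
          .copySourceRange(s"bytes=$offset-$lastByte")
          .bucket(dstBucket).key(dstKey)
          .uploadId(uploadId)
          .partNumber(idx + 1)
          .build())
      CompletedPart.builder().partNumber(idx + 1).eTag(resp.copyPartResult().eTag()).build()
    }

    // Assemble the copied parts into the final object.
    s3.completeMultipartUpload(
      CompleteMultipartUploadRequest.builder()
        .bucket(dstBucket).key(dstKey).uploadId(uploadId)
        .multipartUpload(CompletedMultipartUpload.builder().parts(completedParts: _*).build())
        .build())
  }
}

Because each UploadPartCopy request stays under the 5 GiB per-request limit, the total object size is no longer constrained by CopyObject.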

danbills commented 5 years ago

Closing this issue as #4828 covers this situation.