broadinstitute / cromwell

Scientific workflow engine designed for simplicity & scalability. Trivially transition from one-off use cases to massive-scale production environments
http://cromwell.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

call caching consistently disabled (AWS Batch back end) #4563

Closed · dtenenba closed this issue 5 years ago

dtenenba commented 5 years ago

Backend: AWS Batch

Workflow: https://github.com/FredHutch/reproducible-workflows/blob/master/WDL/unpaired-panel-consensus-variants-human/frankenstein.wdl

Input file: https://github.com/FredHutch/reproducible-workflows/blob/master/WDL/unpaired-panel-consensus-variants-human/map-variantcall-hg38.json

Possibly related to #4412 but not sure as I don't see the same error message.

When submitting a workflow via the Cromwell server, we consistently see a failure to hash some items in S3, resulting in call caching being disabled for the run. We have seen this for a number of workflows; here we include just one.

Call caching is a hugely important feature for us, and if it is not available we may have to reconsider using Cromwell.

I think I have discussed with @ruchim the fact that all objects in S3 already have a hash computed (the ETag header), so there should not be timeouts in computing these hashes: they are available via a HEAD request, without downloading the whole object.
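For reference, a minimal sketch (not Cromwell's code; the region, bucket, and key below are placeholders) of reading an object's ETag via a HEAD request with the AWS SDK for Java v2:

import software.amazon.awssdk.regions.Region
import software.amazon.awssdk.services.s3.S3Client
import software.amazon.awssdk.services.s3.model.HeadObjectRequest

object ETagLookup extends App {
  // Region, bucket, and key are placeholders.
  val s3 = S3Client.builder().region(Region.US_WEST_2).build()

  // HeadObject returns only metadata; the object body is never downloaded.
  val request = HeadObjectRequest.builder()
    .bucket("bucketname")
    .key("cromwell-tests/smallTestData.hg38.recal.bam")
    .build()

  // S3 precomputes the ETag (an MD5 for single-part uploads), so this is a cheap lookup.
  println(s3.headObject(request).eTag())

  s3.close()
}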

Error message (extract from /metadata output):

     "callCaching": {
          "hashFailures": [
            {
              "causedBy": [],
              "message": "Hashing request timed out for: s3://bucketname/cromwell-tests/Panel_BWA_GATK4_Samtools_Var_Annotate/162c863f-c22a-4b7c-bb37-f5195b329b36/call-ApplyBQSR/shard-0/smallTestData.hg38.recal.bam"
            }
          ],
          "allowResultReuse": false,
          "hit": false,
          "result": "Cache Miss",
          "effectiveCallCachingMode": "CallCachingOff"
        },

Config file:

include required(classpath("application"))

call-caching {
    enabled = true
    invalidate-bad-cache-results = true
}

database {
  # Store metadata in a file on disk that can grow much larger than RAM limits.
    profile = "slick.jdbc.HsqldbProfile$"
    db {
      driver = "org.hsqldb.jdbcDriver"
      url = "jdbc:hsqldb:file:aws-database;shutdown=false;hsqldb.tx=mvcc"
      connectionTimeout = 3000
    }
}

aws {
  application-name = "cromwell"
  auths = [
    {
      name = "default"
      scheme = "default"
    }
    {
        name = "assume-role-based-on-another"
        scheme = "assume_role"
        base-auth = "default"
        role-arn = "arn:aws:iam::xxx:role/fbucketname"
    }
  ]
  // diff 1:
  # region = "us-west-2" // uses region from ~/.aws/config set by aws configure command,
  #                    // or us-east-1 by default
}
engine {
  filesystems {
    s3 {
      auth = "assume-role-based-on-another"
    }
  }
}
backend {
  default = "AWSBATCH"
  providers {
    AWSBATCH {
      actor-factory = "cromwell.backend.impl.aws.AwsBatchBackendLifecycleActorFactory"
      config {
        // Base bucket for workflow executions
        root = "s3://bucketname/cromwell-tests"
        // A reference to an auth defined in the `aws` stanza at the top.  This auth is used to create
        // Jobs and manipulate auth JSONs.
        auth = "default"
        // diff 2:
        numSubmitAttempts = 1
        // diff 3:
        numCreateDefinitionAttempts = 1
        default-runtime-attributes {
          queueArn: "arn:aws:batch:us-west-2:xxx:job-queue/GenomicsHighPriorityQue-xxx"
        }
        filesystems {
          s3 {
            // A reference to a potentially different auth for manipulating files via engine functions.
            auth = "default"
          }
        }
      }
    }
  }
}
danbills commented 5 years ago

Hi @dtenenba , we fully appreciate the importance of call caching and are looking into this.

Can I confirm a few things? This will help me narrow down what is going on.

To be clear, this should be working and we are aware that hashing is not a manual process but a simple value lookup.

vortexing commented 5 years ago

I have run:

Workflow: https://github.com/FredHutch/reproducible-workflows/blob/master/WDL/unpaired-panel-consensus-variants-human/broad-containers-workflow.wdl
Inputs: https://github.com/FredHutch/reproducible-workflows/blob/master/WDL/unpaired-panel-consensus-variants-human/broad-containers-inputs.json

Now, three times in a row with the same input data, every single time and for every single task (so the file that is the result of the first task from a previous run of this workflow does not get reused for the second task of the current run, and so on for all the tasks in the entire workflow), I have gotten this:

"callCaching": {
          "allowResultReuse": true,
          "hit": false,
          "result": "Cache Miss",
          "hashes": {
            "output count": "...",
            "runtime attribute": {
              "docker": "...",
              "continueOnReturnCode": "...",
              "failOnStderr": "..."
            },
            "output expression": {
              "File output_fastq": ".."
            },
            "input count": "..",
            "backend name": "...",
            "command template": "...",
            "input": {
              "String base_file_name": "...",
              "File input_bam": "..."
            }
          },
          "effectiveCallCachingMode": "ReadAndWriteCache"

So it's not timing out anymore (I replaced the hashes with '...'), but I am never, ever getting a "hit": true.

vortexing commented 5 years ago

I'm also not sure what a "non-trivial size" is, but this test data/workflow isn't a Hello World workflow, and it also isn't a 15-step, 3-whole-genome variant-calling workflow. I have never, in my testing, seen a "hit": true. Perhaps someone else here has.

vortexing commented 5 years ago

Also I just compared two of the workflow runs:

{
  "callA": {
    "executionStatus": "Done",
    "allowResultReuse": false,
    "callFqn": "Panel_BWA_GATK4_Samtools_Var_Annotate_Split.BwaMem",
    "jobIndex": 0,
    "workflowId": "X"
  },
  "callB": {
    "executionStatus": "Done",
    "allowResultReuse": true,
    "callFqn": "Panel_BWA_GATK4_Samtools_Var_Annotate_Split.BwaMem",
    "jobIndex": 0,
    "workflowId": "Y"
  },
  "hashDifferential": []
}

???? And yet no data reuse.
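(For anyone reproducing this comparison: the output above looks like Cromwell's call-cache diff endpoint. A minimal sketch of fetching it, assuming a server at localhost:8000 and the placeholder workflow IDs X and Y from the output:)

import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

object CallCacheDiff extends App {
  // Workflow IDs, call names, and shard indices are placeholders taken from the output above.
  val uri = URI.create(
    "http://localhost:8000/api/workflows/v1/callcaching/diff" +
      "?workflowA=X&callA=Panel_BWA_GATK4_Samtools_Var_Annotate_Split.BwaMem&indexA=0" +
      "&workflowB=Y&callB=Panel_BWA_GATK4_Samtools_Var_Annotate_Split.BwaMem&indexB=0")

  val response = HttpClient.newHttpClient().send(
    HttpRequest.newBuilder(uri).GET().build(),
    HttpResponse.BodyHandlers.ofString())

  // Prints the JSON comparison, including any hashDifferential entries.
  println(response.body())
}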

danbills commented 5 years ago

Thanks for your help, this is great info. I see a part of our hashing code that is likely causing an issue here; I will report back with more early next week.

vortexing commented 5 years ago

This might be conflating two issues, but in case it is related: another error we are consistently seeing seems to depend on which Docker container we use (which may be a red herring, but that's all I can narrow it down to). We'll run something and get this error:

"callCaching": {
          "hashFailures": [
            {
              "message": "[Attempted 1 time(s)] - NoSuchFileException: s3://s3.amazonaws.com/some-bucket/cromwell-tests/Kraken2_test_input.fastq.gz",
              "causedBy": [
                {
                  "causedBy": [],
                  "message": "s3://s3.amazonaws.com/some-bucket/cromwell-tests/Kraken2_test_input.fastq.gz"
                }
              ]
            }
          ],

Meanwhile, the input location of the input file is this:

 "inputs": {
          "input_fastq": {
            "format": null,
            "location": "s3://some-bucket/cromwell-tests/Kraken2_test_input.fastq.gz",
            "size": null,
            "secondaryFiles": [],
            "contents": null,
            "checksum": null,
            "class": "File"
          },

So it's being given a valid S3 URL, but when it tries to get the hash it looks at an invalid S3 URL (the one with s3.amazonaws.com isn't valid and wasn't supplied by us). Thoughts? Is this a separate issue?

danbills commented 5 years ago

@vortexing This is definitely related and is part of the code I'm addressing.

danbills commented 5 years ago

I'm caught in a release cycle, but I believe the relevant code is here, for others watching:

https://github.com/broadinstitute/cromwell/blob/fc43c6954eb47912930e4c28f73781a112fdfbf8/engine/src/main/scala/cromwell/engine/io/nio/NioFlow.scala#L109
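For anyone following along, here is a simplified illustration of the idea (not the actual NioFlow implementation): prefer a precomputed checksum such as an S3 ETag when the filesystem can supply one, and only fall back to reading the file's bytes when it can't.

import java.nio.file.{Files, Path}
import java.security.MessageDigest

object HashStrategySketch {
  // Prefer a cheap, precomputed hash (e.g. an S3 ETag) when one is available;
  // otherwise fall back to reading the file and computing an MD5 locally.
  def fileHash(path: Path, precomputedETag: Option[String]): String =
    precomputedETag.getOrElse {
      val md5 = MessageDigest.getInstance("MD5")
      md5.digest(Files.readAllBytes(path)).map("%02x".format(_)).mkString
    }
}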

vortexing commented 5 years ago

New thoughts/additional info.

I just tried a workflow that evaluates all the way to the end successfully (it had been run before on the same data, and no, call caching didn't happen, as is expected with the AWS backend), with the exception of adding workflow options to specify the output and log directories for the final results.

Interestingly enough, the new prefixes were generated but the files were not transferred over, EXCEPT for the log. The workflow log transferred just fine. In the log there appear to be no errors or any indication that the intended outputs were not successfully transferred.

I'm looking at the workflow status: all the files were made correctly (so all tasks completed successfully), but the workflow as a whole failed because it knows it failed to transfer the output data. Again, however, there are no errors in the metadata indicating why no files were copied.

I'm wondering whether this too would be expected to be a hashing failure. Are the identities of the files intended as outputs defined by the hashing? Would this behavior be expected given the current issues with call caching, or is this a new issue?

TimurIs commented 5 years ago

When are you going to fix it? It is quite hard to work on AWS without call caching working properly...

vortexing commented 5 years ago

This is preventing us from making progress on AWS. Between call caching not working AND Cromwell's inability to stage input and output data from S3, this breaks it all. @wleepang, is this data staging issue partly something AWS can provide some guidance on? It seems specific to AWS as the backend.

wleepang commented 5 years ago

@vortexing - task input and output data staging is handled by the ecs-proxy container that is installed when you create a custom AMI with "cromwell" settings. If you are not seeing data move in/out, a good place to check for errors is the CloudWatch log for a task that didn't have its data staged correctly. Append -proxy to the job's CloudWatch log URL to get the logging generated by the ecs-proxy.
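(If it helps, a rough sketch of pulling those proxy logs programmatically with the AWS SDK for Java v2, assuming the default /aws/batch/job log group and a placeholder log stream name for the job:)

import software.amazon.awssdk.services.cloudwatchlogs.CloudWatchLogsClient
import software.amazon.awssdk.services.cloudwatchlogs.model.GetLogEventsRequest

object ProxyLogs extends App {
  val logs = CloudWatchLogsClient.create()

  // Log group and stream name are placeholders; per the comment above, the proxy
  // container's stream is the job's log stream name with "-proxy" appended.
  val request = GetLogEventsRequest.builder()
    .logGroupName("/aws/batch/job")
    .logStreamName("my-job-log-stream-proxy")
    .startFromHead(true)
    .build()

  logs.getLogEvents(request).events().forEach(e => println(e.message()))

  logs.close()
}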

dtenenba commented 5 years ago

Hi @wleepang , I looked at the -proxy logs for one of the jobs that failed with this error and saw (among other things):

download failed: s3://fh-ctr-public-reference-data/workflow_testing_data/WDL/unpaired-panel-consensus-variants-human/smallTestData.unmapped.bam to ../cromwell_root/fh-ctr-public-reference-data/workflow_testing_data/WDL/unpaired-panel-consensus-variants-human/smallTestData.unmapped.bam [Errno 28] No space left on device

So it seems like maybe this is a scratch space issue? I thought that Cromwell/AWS Batch just automatically created more scratch space when it was needed, but that doesn't seem to be happening. Any suggestions for troubleshooting the problem?

Thanks.

wleepang commented 5 years ago

@dtenenba - the space on the scratch mount point (for Cromwell it is /cromwell_root) is managed by a monitoring tool, ebs-autoscale, which is installed when creating a custom AMI configured for Cromwell; you then reference that AMI when creating Batch compute environments.

Running out of space points to one or more of the following:

If you've created a custom AMI, I suggest launching an instance with it and checking that the monitor is watching the correct location. Do this by checking the log: /var/log/ebs-autoscale.log

If it's not, you'll need to recreate both the AMI and the Batch Compute Environment, and associate the new CE with your Job Queue.

dtenenba commented 5 years ago

Thanks @wleepang ,

Here is what's in that log:

[Fri Mar  1 20:11:08 UTC 2019] Starting Docker EBS autoscaling
[Fri Mar  1 20:11:11 UTC 2019] EBS Autoscaling mountpoint: /scratch
[Fri Mar  1 20:11:11 UTC 2019] Region = us-west-2.
[Fri Mar  1 20:11:11 UTC 2019] Threshold -> 50 :: Used% -> 1%

So it looks like it is watching the /scratch mount point. However, in the job definition that is being run, I see the following:

 "volumes": [
            {
                "host": {
                    "sourcePath": "/cromwell_root/Panel_BWA_GATK4_Samtools_Var_Annotate_Split/SamToFastq/57f71856-abbf-4442-a80d-3f532e4eac01/Some(0)/1"
                },
                "name": "local-disk"
            }
        ],

So it looks like it's writing data to /cromwell_root, not /scratch. How do I get jobs to write their data to /scratch?

Should I create a new AMI and change the scratch mount point from /scratch to /cromwell_root?

dtenenba commented 5 years ago

Uh, sorry, /cromwell_root is the container path, not the host path. I'm not sure what the host path is. The string scratch does not occur in this job definition, so I assume regardless that /scratch is not being used....

dtenenba commented 5 years ago

I created a new AMI using the stack at https://s3.amazonaws.com/aws-genomics-workflows/templates/aws-genomics-ami.template.yaml and I specified /cromwell_root as the scratch mount point.


However, when I then ran an instance of the new AMI, the log shows the same result as before:

[Fri Mar  1 20:58:15 UTC 2019] EBS Autoscaling mountpoint: /scratch

There is a /scratch directory but no /cromwell_root directory.

dtenenba commented 5 years ago

Now I am having trouble running the stack again (I've edited it to change the default of ScratchMountPoint to /cromwell_root). I deleted the old stacks. I'm getting a failure when trying to create the EC2 instance, but there is no helpful error as to why.

Is it possible to change the value of AWS_CROMWELL_LOCAL_DISK? Where do I change that? In the config file somewhere? If I could change that from /cromwell_root to /scratch then things ought to work with my existing AMI....

dtenenba commented 5 years ago

Sorry for the noise. I've succeeded in creating an AMI where the scratch partition is called /cromwell_root. I have submitted a job that previously failed with the error under discussion, and I'm waiting to see what happens....

dtenenba commented 5 years ago

So it appears that what I did (creating a new AMI with the Scratch Mount Point set to /cromwell_root instead of the default /scratch) cleared up this particular issue.

However, I don't think we should close the issue yet because it doesn't appear to be documented anywhere that this is what you need to do. Until the documentation is updated I'd like to see the issue remain open.

Thanks.

vortexing commented 5 years ago

The solution mentioned by @dtenenba actually addresses a slightly different issue that we thought might be related to call caching, but wires were crossed here. The problem Dan addressed was one we affectionately referred to as the I/O problem; Dan B thought it might be related to call caching not working, but it seems it is not. I'd say the original issue is still very much an issue, but the tangent we went on here is resolved from a Cromwell perspective, though it needs better docs from the AWS/@wleepang perspective.

dtenenba commented 5 years ago

Whoops, yeah, I thought I was writing in an issue that was exclusively about the I/O stuff, but I neglected to scroll up. Never mind. Side issue solved (pending doc update); main issue not yet solved.

wleepang commented 5 years ago

@dtenenba , @vortexing - The docs for creating the genomics workflow environment (i.e. AWS Batch and related resources) have been updated. Use of custom AMIs has been deprecated in favor of using EC2 Launch Templates. There's also additional parameter validation under the hood around setting up an environment for Cromwell to avoid these configuration errors.

geoffjentry commented 5 years ago

@dtenenba @vortexing What's the current state of this? I believe I've enabled call caching with PR #4412, but I can't replicate this problem in order to demonstrate that I've resolved it. Further, the inputs JSON linked above doesn't seem to exist (and/or I don't have permissions). It seems like the issue shifted a bit over time, so perhaps I resolved the current problem instead of the previous one?

vortexing commented 5 years ago

We will test (and I will fix that one missing JSON; basically I broke it up into two input JSONs, one for parameters and one indicating the batch file). I'll report back with the fixed links for what I run.

dtenenba commented 5 years ago

@geoffjentry I had trouble building this PR:

[error] /work/engine/src/main/scala/cromwell/webservice/SwaggerService.scala:3:35: imported `CromwellApiService' is permanently hidden by definition of object CromwellApiService in package webservice
[error] import cromwell.webservice.routes.CromwellApiService
[error]                                   ^

Any ideas?

geoffjentry commented 5 years ago

@dtenenba how are you building this, including things like the Scala version? And did you do an sbt clean if this checkout had been used for previous versions?

dtenenba commented 5 years ago

Sorry for the noise; sbt clean did seem to do the trick. We will test and let you know.

geoffjentry commented 5 years ago

No worries, I frequently run afoul of that myself. Let me know how it goes.

vortexing commented 5 years ago

First test (@dtenenba built PR #4412 and I tested it):

Workflow: https://github.com/FredHutch/reproducible-workflows/blob/master/WDL/unpaired-panel-consensus-variants-human/broad-containers-workflow.wdl
First input json: https://github.com/FredHutch/reproducible-workflows/blob/master/WDL/unpaired-panel-consensus-variants-human/broad-containers-parameters.json
Second input json: https://github.com/FredHutch/reproducible-workflows/blob/master/WDL/unpaired-panel-consensus-variants-human/broad-containers-batchofOne.json

and... drumroll please...... IT WORKED!!!!!!!!!!!

        "callCaching": {
          "allowResultReuse": true,
          "hit": true,
          "result": "Cache Hit: 98bc2232-f147-419f-9351-49a07daa1720:Panel_BWA_GATK4_Samtools_Var_Annotate_Split.SamToFastq:0",

And the workflow is "generating" the files WAY faster than it should be if it were doing it de novo, so we seem to be getting the correct outputs moved into the new workflow directory as well.

Caveats:
I did test it with an actual batch and it failed with the job definition error. But as long as PR #4412 was not intended to fix THAT issue as well, I can say it appears on first pass that call caching with the AWS backend might very well be working in an outside test!!

vortexing commented 5 years ago

ETA: It seems that the FIRST task did successfully pick up that it had been run before and managed to reuse its output, BUT every one of the subsequent tasks failed to realize that the output of the same task in the previous identical workflow already existed. So, baby steps?