broadinstitute / cromwell

Scientific workflow engine designed for simplicity and scalability. Trivially transition from one-off use cases to massive-scale production environments.
http://cromwell.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

Failed to evaluate job outputs - IOException: Could not read from s3... #4687

Open doron-st opened 5 years ago

doron-st commented 5 years ago

While testing cromwell-36 with AWS Batch, I was able to reproduce this error:

2019-02-25 09:38:52,508 cromwell-system-akka.dispatchers.engine-dispatcher-24 ERROR - WorkflowManagerActor Workflow b6b9322c-3929-4b72-9598-45d97dfb858d failed (during ExecutingWorkflowState): cromwell.backend.standard.StandardAsyncExecutionActor$$anon$2: Failed to evaluate job outputs:
Bad output 'print_nach_nachman_meuman.out': [Attempted 1 time(s)] - IOException: Could not read from s3://nrglab-cromwell-genomics/cromwell-execution/run_multiple_tests/b6b9322c-3929-4b72-9598-45d97dfb858d/call-test_cromwell_on_aws/shard-61/SingleTest.test_cromwell_on_aws/f8ecf673-ed61-4b06-b1d6-c20f7efe986e/call-print_nach_nachman_meuman/print_nach_nachman_meuman-stdout.log: Cannot access file: s3://s3.amazonaws.com/nrglab-cromwell-genomics/cromwell-execution/run_multiple_tests/b6b9322c-3929-4b72-9598-45d97dfb858d/call-test_cromwell_on_aws/shard-61/SingleTest.test_cromwell_on_aws/f8ecf673-ed61-4b06-b1d6-c20f7efe986e/call-print_nach_nachman_meuman/print_nach_nachman_meuman-stdout.log
        at cromwell.backend.standard.StandardAsyncExecutionActor.$anonfun$handleExecutionSuccess$1(StandardAsyncExecutionActor.scala:867)

The error occurs when running many sub-workflows within a single wrapping workflow. The environment is configured correctly, and the test usually passes when running fewer than 30 sub-workflows.

Here are the workflows:

run_multiple_test.wdl

import "three_task_sequence.wdl" as SingleTest

workflow run_multiple_tests {
    scatter (i in range(30)){
        call SingleTest.three_task_sequence{}
    }
}

three_task_sequence.wdl

workflow three_task_sequence {
    call print_nach

    call print_nach_nachman {
        input:
            previous = print_nach.out
    }

    call print_nach_nachman_meuman {
        input:
            previous = print_nach_nachman.out
    }

    output {
        Array[String] out = print_nach_nachman_meuman.out
    }
}

task print_nach {
    command {
        echo "nach"
    }
    output {
        Array[String] out = read_lines(stdout())
    }
    runtime {
        docker: "ubuntu:latest"
        maxRetries: 3
    }
}

task print_nach_nachman {
    Array[String] previous

    command {
        echo ${sep=' ' previous} " nachman"
    }
    output {
        Array[String] out = read_lines(stdout())
    }
    runtime {
        docker: "ubuntu:latest"
        maxRetries: 3
    }
}

task print_nach_nachman_meuman {
    Array[String] previous

    command {
        echo ${sep=' ' previous} " meuman"
    }
    output {
        Array[String] out = read_lines(stdout())
    }
    runtime {
        docker: "ubuntu:latest"
        maxRetries: 3
    }
}

Here is the Cromwell config:

// aws.conf
include required(classpath("application"))

webservice {
  port = 8001
  interface = 0.0.0.0
}

aws {
  application-name = "cromwell"
  auths = [{
      name = "default"
      scheme = "default"
  }]
  region = "us-east-1"
}

engine {
  filesystems {
    s3 { auth = "default" }
  }
}

backend {
  default = "AWSBATCH"
  providers {
    AWSBATCH {
      actor-factory = "cromwell.backend.impl.aws.AwsBatchBackendLifecycleActorFactory"
      config {
        root = "s3://nrglab-cromwell-genomics/cromwell-execution"
        auth = "default"

        numSubmitAttempts = 3
        numCreateDefinitionAttempts = 3

        concurrent-job-limit = 100

        default-runtime-attributes {
          queueArn: "arn:aws:batch:us-east-1:66:job-queue/GenomicsDefaultQueue"
        }

        filesystems {
          s3 {
            auth = "default"
          }
        }
      }
    }
  }
}

system {
  job-rate-control {
    jobs = 1
    per = 1 second
  }
}
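
For anyone tuning this further: if I'm reading Cromwell's reference.conf correctly, there is also an engine-level I/O throttle/retry block under system.io, so raising the number of read attempts might look like the sketch below (the values are illustrative, not verified defaults for cromwell-36):

system {
  io {
    # Engine-wide throttle on I/O requests (illustrative values)
    number-of-requests = 100000
    per = 100 seconds
    # How many times an I/O operation (such as the S3 read above) is
    # attempted before Cromwell gives up and fails output evaluation
    number-of-attempts = 10
  }
}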

Would appreciate help on this. I wonder whether Cromwell has ever been tested with many parallel sub-workflows running on AWS?

Thanks!

caaespin commented 4 years ago

Hey, did you ever manage to find a workaround for this error?

geoffjentry commented 4 years ago

@caaespin I'm assuming that means you still see this. Are you using a recent Cromwell version? (42+)

caaespin commented 4 years ago

@geoffjentry yes. My current deployment is v42.

If you have access to the GATK forums, I put more details in my post there: https://gatkforums.broadinstitute.org/wdl/discussion/24268/aws-batch-randomly-fails-when-running-multiple-workflows/p1?new=1

marpiech commented 4 years ago

+1. I have a similar error.

caaespin commented 4 years ago

@geoffjentry From inspecting the logs and the AWS Batch console, I think what is happening is that the jobs fail because Cromwell shuts down the VMs earlier than expected, so a shard that hasn't finished is unable to upload to S3, which is where the problem occurs. Anyway, this is a hypothesis based on what I saw; hopefully it's helpful.
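
One way to check this hypothesis is to list the failing shard's directory and see whether the stdout/rc files ever reached S3; a sketch using the AWS CLI, with the path pattern copied from the original report (so illustrative, not yours):

# List everything the failing shard wrote to S3
aws s3 ls --recursive s3://nrglab-cromwell-genomics/cromwell-execution/run_multiple_tests/<workflow-id>/call-test_cromwell_on_aws/shard-61/

If the directory contains the script but no stdout/rc files, the task container most likely died before delocalization finished.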

alexwaldrop commented 4 years ago

@geoffjentry Any movement on this? I'm having this same issue sporadically (v48 + AWS backend) with workflows that contain large scatter operations.

geoffjentry commented 4 years ago

@alexwaldrop NB that I don't work there anymore and sadly haven't had the energy to actively contribute. Perhaps @aednichols can chime in

blindmouse commented 4 years ago

I am having the same error with the "Using Data on S3" example from https://docs.opendata.aws/genomics-workflows/orchestration/cromwell/cromwell-examples/. I changed the S3 bucket name in the .json file to my bucket name, but the run still failed, and after it reported the failure I got the same error message. I am using cromwell-48. The S3 bucket allows full public access, and I was logged in as the admin in two terminal windows, one running the server and the other submitting the job. The previous two hello-world examples were successful. There is no log file in the bucket, and under cromwell-execution the only file created was the script; no rc, stderr, or stdout was created.

sripaladugu commented 3 years ago

@blindmouse Were you able to resolve your issue? I am encountering the same problem. Thanks.

markjschreiber commented 3 years ago

This can happen if the job fails, meaning that an rc.txt file isn't created. It would be worth looking at the CloudWatch log for the Batch job.
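
Something along these lines should fetch it (a sketch assuming the default /aws/batch/job log group; <your-batch-job-id> is a placeholder for the failed job's ID, visible in the Batch console or in Cromwell's metadata):

# Look up the job's CloudWatch log stream, then dump its messages
JOB_ID=<your-batch-job-id>
STREAM=$(aws batch describe-jobs --jobs "$JOB_ID" \
  --query 'jobs[0].container.logStreamName' --output text)
aws logs get-log-events --log-group-name /aws/batch/job \
  --log-stream-name "$STREAM" --query 'events[].message' --output text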

On Tue, Jul 21, 2020 at 4:07 PM Sri Paladugu wrote:

Is there any progress on this issue? I am getting the following exception: IOException: Could not read from s3:///results/ReadFile/5fec5c4a-2e3f-49ed-8f9e-6d9d2d759449/call-read_file/read_file-rc.txt Caused by: java.nio.file.NoSuchFileException: s3://s3.amazonaws.com/s3bucketname/results/ReadFile/5fec5c4a-2e3f-49ed-8f9e-6d9d2d759449/call-read_file/read_file-rc.txt

sripaladugu commented 3 years ago

CloudWatch logs contained the following message: "/bin/bash: /var/scratch/fetch_and_run.sh: Is a directory"

markjschreiber commented 3 years ago

It may be that you're running Cromwell 52 or later against infrastructure built from an older AWS CloudFormation template. Can you share which build of Cromwell you're using and the build/version/origin of the CloudFormation template?

mderan-da commented 3 years ago

Hi @markjschreiber, I'm also running into this error. I am using Cromwell 53 with a custom CDK stack based on the CloudFormation infrastructure described here: https://docs.opendata.aws/genomics-workflows/

Are modifications needed for compatibility with newer versions of Cromwell? Are these documented somewhere?

markjschreiber commented 3 years ago

Attached is some documentation that works for v52 and should work for v53.

mderan-da commented 3 years ago

Hi @markjschreiber, thanks, but it looks like the attachment didn't come through.

yaomin commented 3 years ago

@markjschreiber I'm running into the same error for both v52 and v53.1. I am using the same CloudFormation setup @mderan-da mentioned. I'd appreciate your newer documentation on this.

markjschreiber commented 3 years ago

Documentation can be downloaded from here: https://cromwell-share-ad485.s3.us-east-2.amazonaws.com/InstallingGenomicsWorkflowCoreWithCromwel52.pdf

dfeinzeig commented 3 years ago

Cloudwatch logs contained the following message: "/bin/bash: /var/scratch/fetch_and_run.sh: Is a directory"

Also have this error. Anyone figure out what the issue is?

geertvandeweyer commented 3 years ago

Also have this error, using Cromwell 52, installed using this manual:

https://aws-genomics-workflows.s3.amazonaws.com/Installing+the+Genomics+Workflow+Core+and+Cromwell.pdf

Logs say: "fetch_and_run.sh: Is a directory".

geertvandeweyer commented 3 years ago

Extra info: cloning the job and resubmitting it through the AWS console runs fine, so it seems to be a transient issue.

sscho commented 3 years ago

Hmm, still stuck on this. Any updates on your end? I tried cloning and resubmitting, and I'm still getting the same error.

ptdtan commented 3 years ago

Still getting this error today.

alimayy commented 1 year ago

I get this error almost every time I run workflows that scatter more samples than usual (e.g. 96). Cromwell version: 60-6048d0e-SNAP.

Is there a workaround to this?