broadinstitute / cromwell

Scientific workflow engine designed for simplicity & scalability. Trivially transition from one-off use cases to massive-scale production environments.
http://cromwell.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

error calling RegisterJobDefinition when running workflow on AWS Batch #4496

Open dtenenba opened 5 years ago

dtenenba commented 5 years ago

Backend: AWS Batch

Workflow

Input file

Configuration file

Running this workflow on AWS Batch (with cromwell-36.jar) consistently fails at the same point.

It gets through most (looks like all but one iteration) of the scatter loop that calls the BaseRecalibrator task. Then Cromwell just sits for a long time (~1 hour) with no Batch jobs running (or runnable, or starting). Then Cromwell calls the RegisterJobDefinition API of AWS Batch, and it always fails with the following error message:

2018-12-15 23:39:03,360 cromwell-system-akka.dispatchers.backend-dispatcher-258 ERROR - AwsBatchAsyncBackendJobExecutionActor [UUID(8adb5141)PreProcessingForVariantDiscovery_GATK4.BaseRecalibrator:1:1]: Error attempting to Execute
software.amazon.awssdk.services.batch.model.ClientException: arn:aws:batch:us-west-2:064561331775:job-definition/PreProcessingForVariantDiscovery_GATK4-BaseRecalibrator not found or versions do not match (Service: null; Status Code: 404; Request ID: 9914238b-00c2-11e9-a13d-cdc28a8016c8)

Looking at CloudTrail, here is the event associated with that request ID:

Event

If I pull out just the contents of the requestParameters section and call RegisterJobDefinition using the AWS CLI as follows, it works fine.

aws batch register-job-definition --cli-input-json file://event_history.json
{
    "jobDefinitionArn": "arn:aws:batch:us-west-2:064561331775:job-definition/PreProcessingForVariantDiscovery_GATK4-BaseRecalibrator:207",
    "jobDefinitionName": "PreProcessingForVariantDiscovery_GATK4-BaseRecalibrator",
    "revision": 207
}

Each run of this workflow creates many new revisions of this job definition, so I'm not sure why this particular call fails.
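(For reference, the number of active revisions that have accumulated can be checked with something like the following; the job definition name is the one from the error above:)

aws batch describe-job-definitions \
    --job-definition-name PreProcessingForVariantDiscovery_GATK4-BaseRecalibrator \
    --status ACTIVE \
    --query 'length(jobDefinitions)'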

I tried a couple things to mitigate this, neither of which worked:

  1. I deleted all revisions of the job definition. However, a newly created revision still starts at N+1 if there were N revisions before the deletion (a deregistration loop along these lines is sketched after this list).
  2. I renamed the workflow (in the .wdl and .json files) and ran it, which creates a new job definition with revisions starting at 1.
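A sketch of deleting all revisions via the CLI (attempt 1) — deregistered revisions only become INACTIVE, which is consistent with the revision counter never resetting:

for rev in $(aws batch describe-job-definitions \
      --job-definition-name PreProcessingForVariantDiscovery_GATK4-BaseRecalibrator \
      --status ACTIVE \
      --query 'jobDefinitions[].revision' --output text); do
  # Deregister each active revision; AWS Batch marks it INACTIVE rather than deleting it,
  # so new registrations keep incrementing the revision number.
  aws batch deregister-job-definition \
      --job-definition "PreProcessingForVariantDiscovery_GATK4-BaseRecalibrator:${rev}"
done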

This is basically blocking my progress on porting the gatk4-data-processing workflow to AWS Batch.

@wleepang @geoffjentry @ruchim @bshifaw @vdauwera

ruchim commented 5 years ago

Hey Dan!

  1. Did you happen to save the metadata file from this run?
  2. It seems like the outputs were generated for the BaseRecalibrator task, but did the logs ever say “....transitioned from running to Done” for any of the BaseRecalibrator shards?
dtenenba commented 5 years ago

Hi Ruchi,

  1. I have the metadata from a later run that failed in the same way; it's here.

  2. Yes, it seems like most of the BaseRecalibrator tasks transitioned from running to Done. I see this in the logs, right before the error:

2018-12-17 17:11:25,368 cromwell-system-akka.dispatchers.backend-dispatcher-3452 INFO  - AwsBatchAsyncBackendJobExecutionActor [UUID(2951ea9d)PreProcessingForVariantDiscovery_GATK4.BaseRecalibrator:5:1]: Status change from Running to Succeeded
2018-12-17 17:11:25,437 cromwell-system-akka.dispatchers.backend-dispatcher-3452 INFO  - AwsBatchAsyncBackendJobExecutionActor [UUID(2951ea9d)PreProcessingForVariantDiscovery_GATK4.BaseRecalibrator:10:1]: Status change from Running to Succeeded
2018-12-17 17:11:27,559 cromwell-system-akka.dispatchers.backend-dispatcher-3452 INFO  - AwsBatchAsyncBackendJobExecutionActor [UUID(2951ea9d)PreProcessingForVariantDiscovery_GATK4.BaseRecalibrator:1:1]: Status change from Initializing to Running
2018-12-17 17:11:36,844 cromwell-system-akka.dispatchers.backend-dispatcher-3452 INFO  - AwsBatchAsyncBackendJobExecutionActor [UUID(2951ea9d)PreProcessingForVariantDiscovery_GATK4.BaseRecalibrator:9:1]: Status change from Initializing to Running
2018-12-17 17:11:51,970 cromwell-system-akka.dispatchers.backend-dispatcher-3452 INFO  - AwsBatchAsyncBackendJobExecutionActor [UUID(2951ea9d)PreProcessingForVariantDiscovery_GATK4.BaseRecalibrator:15:1]: Status change from Initializing to Running
2018-12-17 17:11:53,801 cromwell-system-akka.dispatchers.backend-dispatcher-3452 INFO  - AwsBatchAsyncBackendJobExecutionActor [UUID(2951ea9d)PreProcessingForVariantDiscovery_GATK4.BaseRecalibrator:11:1]: Status change from Running to Succeeded
2018-12-17 17:11:55,351 cromwell-system-akka.dispatchers.backend-dispatcher-3452 INFO  - AwsBatchAsyncBackendJobExecutionActor [UUID(2951ea9d)PreProcessingForVariantDiscovery_GATK4.BaseRecalibrator:7:1]: Status change from Running to Succeeded
2018-12-17 17:21:01,047 cromwell-system-akka.dispatchers.backend-dispatcher-3472 INFO  - AwsBatchAsyncBackendJobExecutionActor [UUID(2951ea9d)PreProcessingForVariantDiscovery_GATK4.BaseRecalibrator:3:1]: Status change from Running to Succeeded
2018-12-17 17:22:17,501 cromwell-system-akka.dispatchers.backend-dispatcher-3465 INFO  - AwsBatchAsyncBackendJobExecutionActor [UUID(2951ea9d)PreProcessingForVariantDiscovery_GATK4.BaseRecalibrator:9:1]: Status change from Running to Succeeded
2018-12-17 17:22:18,916 cromwell-system-akka.dispatchers.backend-dispatcher-3508 INFO  - AwsBatchAsyncBackendJobExecutionActor [UUID(2951ea9d)PreProcessingForVariantDiscovery_GATK4.BaseRecalibrator:8:1]: Status change from Running to Succeeded
2018-12-17 17:26:22,872 cromwell-system-akka.dispatchers.backend-dispatcher-3462 INFO  - AwsBatchAsyncBackendJobExecutionActor [UUID(2951ea9d)PreProcessingForVariantDiscovery_GATK4.BaseRecalibrator:15:1]: Status change from Running to Succeeded
2018-12-17 17:27:14,124 cromwell-system-akka.dispatchers.backend-dispatcher-3459 INFO  - AwsBatchAsyncBackendJobExecutionActor [UUID(2951ea9d)PreProcessingForVariantDiscovery_GATK4.BaseRecalibrator:14:1]: Status change from Running to Succeeded
2018-12-17 17:28:11,722 cromwell-system-akka.dispatchers.backend-dispatcher-3495 INFO  - AwsBatchAsyncBackendJobExecutionActor [UUID(2951ea9d)PreProcessingForVariantDiscovery_GATK4.BaseRecalibrator:4:1]: Status change from Running to Succeeded
2018-12-17 17:28:40,579 cromwell-system-akka.dispatchers.backend-dispatcher-3495 INFO  - AwsBatchAsyncBackendJobExecutionActor [UUID(2951ea9d)PreProcessingForVariantDiscovery_GATK4.BaseRecalibrator:0:1]: Status change from Running to Succeeded
2018-12-17 17:29:06,703 cromwell-system-akka.dispatchers.backend-dispatcher-3495 INFO  - AwsBatchAsyncBackendJobExecutionActor [UUID(2951ea9d)PreProcessingForVariantDiscovery_GATK4.BaseRecalibrator:1:1]: Status change from Running to Succeeded
vortexing commented 5 years ago

This has not improved post-v37 and is still killing the great majority of our jobs. All of us have now hit essentially this same problem on nearly all workflows. It is actively preventing us from using Cromwell with AWS (the absence of call caching already makes it inefficient, but this makes it IMPOSSIBLE to get jobs run at all). Is this something AWS can help with, @wleepang?

dtenenba commented 5 years ago

OK, per @wleepang's suggestion we changed the job throttling rate from

system {
  job-rate-control {
    jobs = 5
    per = 1 second
  }
}

to

system {
  job-rate-control {
    jobs = 1
    per = 2 seconds
  }
}

and it seems to have fixed this; at least we haven't yet seen the error at the point where we normally would.

But we have a concern: if we had a ton of quick tasks that we wanted to run in parallel, they would effectively run serially, because only one job would start every 2 seconds. Is there a sweet spot that's a little faster than this and might work better? Support for array jobs (as tracked in #4707) would be a nice solution.

Any other advice for maintaining parallelism given a slower job submission rate?

Thanks!

vortexing commented 5 years ago

With the rate throttled, BOTH the jobDefinition errors AND the Too Many Requests errors have been eliminated for a workflow that has 133 independent shards. However, severe throttling of the job rate seems like something users need to be told about up front, since it is only acceptable for long-running jobs with independent processes.
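(For a rough sense of scale: 133 shards at one job every 2 seconds is about 4.5 minutes just to submit everything, which is negligible for shards that each run for an hour but would dominate the runtime of short tasks.)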

wleepang commented 5 years ago

The issue at play is that the AWS backend makes about 3-4 Batch API calls when Cromwell submits a job, and Cromwell submits multiple jobs concurrently. Altogether, this hits the Batch API request limit pretty quickly.
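As a rough illustration (assuming ~4 API calls per submission, per the above): at the earlier setting of 5 jobs per second, Cromwell can generate on the order of 20 Batch API requests per second before any status polling, whereas 1 job every 2 seconds works out to roughly 2 requests per second.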

It's worth noting that the requests from Cromwell are just to get jobs onto a Batch Job Queue. If you are starting with a "cold" Batch environment, it will take a couple of minutes for an instance to spin up; Batch will then place as many of the queued jobs as it can on instances. So in effect you should still get plenty of parallelism, despite the jobs having been queued at a throttled rate initially.