dtenenba opened this issue 5 years ago (status: Open)
Hey Dan!
Hi Ruchi,
I have the metadata from a later run which failed in the same way, it's here.
Yes, it seems like most of the BaseRecalibrator tasks transitioned from running to Done. I see this in the logs, right before the error:
2018-12-17 17:11:25,368 cromwell-system-akka.dispatchers.backend-dispatcher-3452 INFO - AwsBatchAsyncBackendJobExecutionActor [UUID(2951ea9d)PreProcessingForVariantDiscovery_GATK4.BaseRecalibrator:5:1]: Status change from Running to Succeeded
2018-12-17 17:11:25,437 cromwell-system-akka.dispatchers.backend-dispatcher-3452 INFO - AwsBatchAsyncBackendJobExecutionActor [UUID(2951ea9d)PreProcessingForVariantDiscovery_GATK4.BaseRecalibrator:10:1]: Status change from Running to Succeeded
2018-12-17 17:11:27,559 cromwell-system-akka.dispatchers.backend-dispatcher-3452 INFO - AwsBatchAsyncBackendJobExecutionActor [UUID(2951ea9d)PreProcessingForVariantDiscovery_GATK4.BaseRecalibrator:1:1]: Status change from Initializing to Running
2018-12-17 17:11:36,844 cromwell-system-akka.dispatchers.backend-dispatcher-3452 INFO - AwsBatchAsyncBackendJobExecutionActor [UUID(2951ea9d)PreProcessingForVariantDiscovery_GATK4.BaseRecalibrator:9:1]: Status change from Initializing to Running
2018-12-17 17:11:51,970 cromwell-system-akka.dispatchers.backend-dispatcher-3452 INFO - AwsBatchAsyncBackendJobExecutionActor [UUID(2951ea9d)PreProcessingForVariantDiscovery_GATK4.BaseRecalibrator:15:1]: Status change from Initializing to Running
2018-12-17 17:11:53,801 cromwell-system-akka.dispatchers.backend-dispatcher-3452 INFO - AwsBatchAsyncBackendJobExecutionActor [UUID(2951ea9d)PreProcessingForVariantDiscovery_GATK4.BaseRecalibrator:11:1]: Status change from Running to Succeeded
2018-12-17 17:11:55,351 cromwell-system-akka.dispatchers.backend-dispatcher-3452 INFO - AwsBatchAsyncBackendJobExecutionActor [UUID(2951ea9d)PreProcessingForVariantDiscovery_GATK4.BaseRecalibrator:7:1]: Status change from Running to Succeeded
2018-12-17 17:21:01,047 cromwell-system-akka.dispatchers.backend-dispatcher-3472 INFO - AwsBatchAsyncBackendJobExecutionActor [UUID(2951ea9d)PreProcessingForVariantDiscovery_GATK4.BaseRecalibrator:3:1]: Status change from Running to Succeeded
2018-12-17 17:22:17,501 cromwell-system-akka.dispatchers.backend-dispatcher-3465 INFO - AwsBatchAsyncBackendJobExecutionActor [UUID(2951ea9d)PreProcessingForVariantDiscovery_GATK4.BaseRecalibrator:9:1]: Status change from Running to Succeeded
2018-12-17 17:22:18,916 cromwell-system-akka.dispatchers.backend-dispatcher-3508 INFO - AwsBatchAsyncBackendJobExecutionActor [UUID(2951ea9d)PreProcessingForVariantDiscovery_GATK4.BaseRecalibrator:8:1]: Status change from Running to Succeeded
2018-12-17 17:26:22,872 cromwell-system-akka.dispatchers.backend-dispatcher-3462 INFO - AwsBatchAsyncBackendJobExecutionActor [UUID(2951ea9d)PreProcessingForVariantDiscovery_GATK4.BaseRecalibrator:15:1]: Status change from Running to Succeeded
2018-12-17 17:27:14,124 cromwell-system-akka.dispatchers.backend-dispatcher-3459 INFO - AwsBatchAsyncBackendJobExecutionActor [UUID(2951ea9d)PreProcessingForVariantDiscovery_GATK4.BaseRecalibrator:14:1]: Status change from Running to Succeeded
2018-12-17 17:28:11,722 cromwell-system-akka.dispatchers.backend-dispatcher-3495 INFO - AwsBatchAsyncBackendJobExecutionActor [UUID(2951ea9d)PreProcessingForVariantDiscovery_GATK4.BaseRecalibrator:4:1]: Status change from Running to Succeeded
2018-12-17 17:28:40,579 cromwell-system-akka.dispatchers.backend-dispatcher-3495 INFO - AwsBatchAsyncBackendJobExecutionActor [UUID(2951ea9d)PreProcessingForVariantDiscovery_GATK4.BaseRecalibrator:0:1]: Status change from Running to Succeeded
2018-12-17 17:29:06,703 cromwell-system-akka.dispatchers.backend-dispatcher-3495 INFO - AwsBatchAsyncBackendJobExecutionActor [UUID(2951ea9d)PreProcessingForVariantDiscovery_GATK4.BaseRecalibrator:1:1]: Status change from Running to Succeeded
This has not improved post v37 and is still killing the great majority of our jobs. All of us have now hit this same problem on nearly all workflows. This issue is actively preventing us from using Cromwell with AWS (even setting aside the absence of call caching, which makes it inefficient, this makes it IMPOSSIBLE to get jobs to run). Is this something AWS can help with, @wleepang?
OK, per @wleepang's suggestion we changed the job throttling rate from
system {
  job-rate-control {
    jobs = 5
    per = 1 second
  }
}
to
system {
  job-rate-control {
    jobs = 1
    per = 2 second
  }
}
and it seems to have fixed this; at least, we haven't seen the error at the point where we normally would.
But we have a concern: if we had a ton of quick tasks that we wanted to run in parallel, they would effectively run serially, because only one job would start every 2 seconds. Is there a sweet spot that's a little faster than this that might work better? Support for array jobs (as tracked in #4707) would be a nice solution.
Any other advice for maintaining parallelism given a slower job submission rate?
Thanks!
With the rate throttled, BOTH the jobDefinition errors AND the Too Many Requests errors have been eliminated for a workflow that has 133 independent shards. However, severe throttling of the job rate seems like something users need to be told about up front, as it is only acceptable for long-running jobs with independent processes.
The issue at play is that the AWS backend makes about 3-4 Batch API calls each time Cromwell submits a job, and Cromwell submits multiple jobs concurrently. At the original setting of 5 job submissions per second, that works out to roughly 15-20 Batch API requests per second, which hits the Batch API request limit pretty quickly.
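One way to sanity-check that request volume, if you want to, is to count the Batch events CloudTrail recorded during the submission window. This is only a rough sketch; the time window below is a placeholder and should be set to when the scatter was being submitted:

    # Rough check: list Batch SubmitJob events CloudTrail recorded in a short window.
    aws cloudtrail lookup-events \
        --lookup-attributes AttributeKey=EventName,AttributeValue=SubmitJob \
        --start-time 2018-12-17T17:00:00Z \
        --end-time 2018-12-17T17:05:00Z \
        --query 'Events[].EventTime' \
        --output text
    # Repeat with AttributeValue=RegisterJobDefinition to see the job-definition traffic.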
It's worth noting that the requests from Cromwell only place jobs onto a Batch job queue. If you are starting with a "cold" Batch environment, it will take a couple of minutes for an instance to spin up; Batch will then place as many of the queued jobs as it can on instances. So in effect you should still get plenty of parallelism, despite the jobs having been queued at a throttled rate initially.
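If you want to watch this happen, one rough way (the queue name below is a placeholder) is to poll the per-state job counts on the queue while instances come up; the RUNNABLE count should drain into RUNNING once capacity appears:

    # Sketch: show how many jobs sit in each state on the queue (first page of results only).
    for status in SUBMITTED PENDING RUNNABLE STARTING RUNNING; do
        echo -n "$status: "
        aws batch list-jobs --job-queue my-cromwell-queue --job-status "$status" \
            --query 'length(jobSummaryList)' --output text
    done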
Backend: AWS Batch
Workflow
Input file
Configuration file
Running this workflow on AWS Batch (with cromwell-36.jar) consistently fails at the same point each time.
It gets through most (looks like all but one iteration) of the scatter loop that calls the BaseRecalibrator task. Then Cromwell just sits for a long time (~1 hr) with no Batch jobs running (or runnable, or starting). Then Cromwell calls the RegisterJobDefinition API of AWS Batch, and it always fails with the following error message:
Looking at CloudTrail, here is the event associated with that request ID:
Event
If I pull out just the contents of the requestParameters section and call RegisterJobDefinition using the AWS CLI as follows, it works fine.
Each run of this workflow creates many new revisions of this job definition, so I'm not sure why this particular call fails.
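The exact CLI invocation from the original report isn't preserved above; a minimal reconstruction, assuming the requestParameters object from the CloudTrail event was saved to a local JSON file (the filename here is made up), would be:

    # Hypothetical replay of the failing call: request-parameters.json holds the
    # requestParameters object from the CloudTrail event
    # (jobDefinitionName, type, containerProperties, ...).
    aws batch register-job-definition --cli-input-json file://request-parameters.json

That the same payload succeeds when replayed by hand suggests the failure is about request timing or volume rather than the payload itself, which is consistent with the throttling discussion above.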
I tried a couple of things to mitigate this, neither of which worked:
This is basically blocking my progress on porting the gatk4-data-processing workflow to AWS Batch.
@wleepang @geoffjentry @ruchim @bshifaw @vdauwera