broadinstitute / cromwell

Scientific workflow engine designed for simplicity and scalability. Trivially transition from one-off use cases to massive-scale production environments
http://cromwell.readthedocs.io/

User-defined retries #1991

Open · kshakir opened this issue 7 years ago

kshakir commented 7 years ago

Briefest of discussions with Jose. NOTE: all naming is up in the air.

Enable a runtime attribute such as retryOnStderrPattern, paired with a counter (retryAttempt/retry/retryCount/retry_count/etc.) that tracks the current attempt. This would enable tasks such as:

task mytask {
  command {
    mycommand.sh
  }
  runtime {
    # Proposed: retry whenever stderr matches this pattern.
    retryOnStderrPattern = "(OutOfMemoryError|disk quota exceeded)"
    # Proposed counter; assuming it starts at 1, each retry scales
    # the memory and disk requests linearly.
    memory = (6 * retryAttempt) + " GB"
    disk = "local-disk " + (100 * retryAttempt) + " SSD"
    docker = "myrepo/myimage"
  }
}

When stderr matches the specified regular expression, the job should be retried with the counter incremented.

Not discussed, as far as I know: how to limit the number of attempts. Another runtime attribute, a backend config value, both, or something else?
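
For illustration, a minimal sketch of one option, using the proposed (not implemented) names from above plus a hypothetical maxRetries-style cap, and assuming retryAttempt starts at 1:

task mytask {
  command {
    mycommand.sh
  }
  runtime {
    retryOnStderrPattern = "(OutOfMemoryError|disk quota exceeded)"
    # Hypothetical cap: give up after 3 attempts.
    maxRetries = 3
    memory = (6 * retryAttempt) + " GB"
  }
}

A backend config value could then supply a global default, with the runtime attribute acting as a per-task override.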

katevoss commented 7 years ago

Adding @vdauwera's comment about adding error codes to GATK from DSDE-docs #1742:

We may be able to put in error codes for things like this in GATK4. Should ask David Roazen or Louis Bergelson.

kcibul commented 7 years ago

Just a thought -- what if we added a "retryOn" attribute that took a boolean expression? Then we could add a function like grep(pattern, var/file, mode), which would probably be similar work and much more reusable. Then Jose could write something like

retryOn = grep("(foo|bar)", $stderr, "any")

But it could also check any other element. Syntax is lousy -- typing from a tiny keyboard.
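
Spelled out in a full task, that might look like the sketch below. Everything here is hypothetical syntax from this comment -- retryOn, grep(pattern, file, mode), and $stderr are proposals, not implemented WDL:

task mytask {
  command {
    mycommand.sh
  }
  runtime {
    # Retry if any line of stderr matches the pattern ("any" mode).
    retryOn = grep("(foo|bar)", $stderr, "any")
    docker = "myrepo/myimage"
  }
}

Since retryOn takes an arbitrary boolean expression, the same mechanism could inspect the return code, an output file, or anything else visible to the runtime block.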


katevoss commented 7 years ago

This is de-prioritized for the time being.

droazen commented 7 years ago

Could I propose that this be re-prioritized? It would help us deal with transient GCS hiccups in production (e.g., connections suddenly getting closed). Individual tools in GATK and Picard can't possibly catch every exception across every library involved, so an execution-framework-level retry at the job level would help enormously.

lbergelson commented 7 years ago

This would be extremely useful for us. We're currently having to deal with several problems that would be helped by an automated retry ability.

The first problem is what David said: we have tools that can sometimes fail due to GCS issues, and being able to restart when that happens would be useful. We're working on making our code more robust to that, but it's difficult to completely fix the problem. Having to restart a workflow with tens of thousands of jobs because 2 failed is pretty annoying.

The second problem is out-of-memory errors. We have thousands of jobs, and most will run with a small amount of memory, but some of them will need more. It's difficult to predict ahead of time which shards will need more, since it's a function of the data rather than of the file size. Having a way to automatically retry these shards with increased memory would be really valuable, since it would let us provision for the average shard rather than the worst case.
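
Under the syntax proposed at the top of this issue (still hypothetical, and assuming retryAttempt starts at 1), that provision-for-the-average pattern might look like:

task shard_task {
  command {
    process_shard.sh
  }
  runtime {
    # Most shards succeed at 4 GB; only the rare outlier escalates.
    retryOnStderrPattern = "OutOfMemoryError"
    memory = (4 * retryAttempt) + " GB"
    docker = "myrepo/myimage"
  }
}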

LeeTL1220 commented 7 years ago

@katevoss This is actually really important, not just for @droazen and @lbergelson ... This issue has cost the Broad $$$ and analysts a lot of time, and not just the people on this issue. And putting retry code into the GATK (or any task, for that matter) is a bit arduous and actually a more expensive solution, especially when some random code path is missed.

Also, retry on memory should do a lot to help us reduce costs.

kcibul commented 7 years ago

+1 on this feature (or one like it) -- it's really helpful for writing robust and cheap workflows

geoffjentry commented 7 years ago

FWIW this has been discussed as a key feature for a WDL push next quarter

gsaksena commented 7 years ago

Also, note that Google PDs (persistent disks) can be expanded on the fly in seconds, even while the VM is still running under load. I've done this manually on non-FireCloud VMs via the script below. Using this approach combined with a disk-space-monitoring process (and a size cap!) would allow the job to pass the first time, avoiding a retry. And if it also ran during the algorithm, not just during data download, this could eradicate both disk-space errors and disk over-provisioning.

https://github.com/broadinstitute/firecloud_developer_toolkit/blob/master/gce/expand_disk.sh

Unfortunately I don't know of a way to hot-swap RAM into the VM.

katevoss commented 7 years ago

From #1449, make sure "custom retries can increase memory, disk, AND bootDiskInGb".
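
In the proposed syntax, that requirement might look like this (hypothetical; bootDiskSizeGb is the PAPI-style attribute name assumed here):

runtime {
  retryOnStderrPattern = "(OutOfMemoryError|disk quota exceeded)"
  memory = (6 * retryAttempt) + " GB"
  disk = "local-disk " + (100 * retryAttempt) + " SSD"
  bootDiskSizeGb = 10 + (10 * retryAttempt)
}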

katevoss commented 7 years ago

From #1729, this is also important to @ktibbett and the Emerald Empire.

katevoss commented 7 years ago

As a workflow runner, I want Cromwell to automatically retry my workflow with increased memory or disk, or on a specific error code, so that I can get my workflow to complete without having to manually intervene.

droazen commented 6 years ago

Just wanted to check in to inquire about the status of this ticket. Has it been added to any milestones yet?

Although we've packed Google's GCS library full of retries, at considerable expense of developer time and effort, there are still blips in production where something like a host name lookup will randomly fail, as @jsotobroad can attest. Since it's difficult/impossible to bake in retries for every network operation in every library involved, as discussed above, this ticket is our best hope of dealing with these annoying little blips once and for all.

katevoss commented 6 years ago

@droazen I wish I had better news for you but unfortunately this has not made it into our priorities. In early December we will have more flexibility to take on additional features, and I will note this one as a possible item.

@geoffjentry do you have an estimate for the effort it would take? Could it be a User Driven Development project if a team member is so motivated?

LeeTL1220 commented 6 years ago

This issue causes us a lot of pain and lost time.


snovod commented 6 years ago

Just wanted to add that this causes Ops LOTS of pain. If this could be addressed it would be a huge help.

geoffjentry commented 6 years ago

If the implementation of this involves WDL it's no longer an "us" thing. There are other possible ways to implement this, but to date I've only seen/heard WDL-based ones (and it seems like the right path to do it that way)

droazen commented 6 years ago

@katevoss Ah, that's too bad. As @snovod and @LeeTL1220 mention above, this is a massive pain point for both Ops and Methods right now, and wastes a lot of our time, so if it could somehow get prioritized in the near future we'd be grateful!

katevoss commented 6 years ago

In that case, I shall bring this to the attention of Commodore WDL (@cjllanwarne)!

geoffjentry commented 6 years ago

I'd encourage interested parties (e.g. @droazen) to directly interface w/ OpenWDL.

eitanbanks commented 6 years ago

What does that mean? How do we get you guys to prioritize work on it? (This is directed at Jeff's comment that we should interface elsewhere)

geoffjentry commented 6 years ago

@eitanbanks To modify the WDL requires a change to the WDL spec. There's a process that's not directly controlled by the Cromwell team (or the Broad): https://github.com/openwdl/wdl/blob/master/GOVERNANCE.md#rfc-process

IOW, on our side of the fence we can try to work on getting it in, but there's no guarantee of it ever happening even if we want it to (note: that's a worst case; I don't expect that'd really happen).

geoffjentry commented 6 years ago

The other relevant bit I should mention is that there's a separate WDL team starting next quarter (currently just @cjllanwarne, but hopefully by then it'll be 2 ppl) -- we've been talking about managing it in a manner similar to the field engineering team, but that's still kind of up in the air at the moment.

eitanbanks commented 6 years ago

Right, understood about the WDL spec being open. But it's not helpful for us to have to advocate to an external committee to get features added in. The hope is that this team (or the new WDL team) can represent us in those discussions. Our goal is to try to convince Kate that this is important enough to prioritize.

geoffjentry commented 6 years ago

Sure, but the second point I made is that that's also likely not how prioritization will work for the WDL team

eitanbanks commented 6 years ago

So bribery, like with Field Eng?

LeeTL1220 commented 6 years ago

Seriously: What is the prioritization process here?

Not seriously: Do I need to go to an ATM?


droazen commented 6 years ago

Methods and Ops could put together a bribe package if necessary...

snovod commented 6 years ago

We should stick to our respective roles. You put it together and we will ensure it gets delivered.

awacs commented 6 years ago

This would be extremely helpful.

LeeTL1220 commented 6 years ago

Ping! We still would like this. Is it on the roadmap?

guma44 commented 4 years ago

Hey, is there any info on this ticket? It is really desirable and it has been hanging for two years now.

gemmalam commented 4 years ago

@guma44 we have moved to Jira and this ticket is in review at the moment on our new board. Please check out this ticket https://broadworkbench.atlassian.net/browse/BA-5933 or this Pull Request for updates https://github.com/broadinstitute/cromwell/pull/5180

aednichols commented 4 years ago

The PR that merged addresses some, but not all, of the scope of this issue. It specifically targets running out of memory. I believe it may lay a helpful foundation for future work, however.
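
For anyone arriving here later, my rough understanding of the merged behavior (names from memory -- verify against the current Cromwell docs and PR #5180) is that the retry only triggers when maxRetries is set, with the out-of-memory signatures and the multiplier supplied outside the WDL:

task shard_task {
  command {
    process_shard.sh
  }
  runtime {
    memory = "4 GB"
    # Retry-with-more-memory only applies when maxRetries > 0.
    maxRetries = 2
    # Server config (not WDL) lists the stderr keys treated as OOM, e.g.
    #   system.memory-retry-error-keys = ["OutOfMemory", "Killed"]
    # and the workflow options supply "memory_retry_multiplier": 2.0,
    # which scales the memory request on each such retry.
    docker = "myrepo/myimage"
  }
}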

To set some expectations, @guma44: we get many more issues than we have time to work on, and our institutional sponsors (who pay the rent) get first dibs on selecting the most important ones.

We are always happy to consider PRs if you feel like contributing yourself.

guma44 commented 4 years ago

@gemmalam, @aednichols Thanks for the reply. The PR looks very promising for our setup; it would probably solve our problems for now. Concerning contributing: I would do it if I knew more Java/Scala, but I have not programmed in those languages for years.