Closed KevinDuringWork closed 2 years ago
Our bioinformatics team have been reporting a single retry after preemptible attempts have been exhausted.
To clarify, is Cromwell retrying preemptibles the specified number of times and then running one more time on non-preemptible?
As of today that is the expected behavior because it is assumed that a user isn't going to completely give up on their analysis just because it got interrupted repeatedly:
Take an Int as a value that indicates the maximum number of times Cromwell should request a preemptible machine for this task before defaulting back to a non-preemptible one.
A change to categorically disable this behavior would break existing users and can't be merged, but what might work is a boolean runtime attribute that skips the regular VM. That said, the team must think carefully about increasing the configuration surface area of the product and I can't promise that such a PR would be accepted.
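To make the current behavior and the proposed opt-out concrete, here is a minimal Python sketch of the attempt sequence in the worst case, where every preemptible attempt is preempted. It only models the documented rule quoted above and is not Cromwell's actual code. The function name `attempt_plan` and the `fallback_to_regular` flag are hypothetical; no such runtime attribute exists today.

```python
from typing import List


def attempt_plan(preemptible: int, fallback_to_regular: bool = True) -> List[str]:
    """Return the VM types tried, in order, if every preemptible attempt is preempted.

    `preemptible` mirrors the runtime attribute of the same name.
    `fallback_to_regular` is a HYPOTHETICAL switch standing in for the
    boolean runtime attribute floated in this thread; it does not exist.
    """
    if preemptible < 0:
        raise ValueError("preemptible must be non-negative")
    plan = ["preemptible"] * preemptible
    if fallback_to_regular:
        # Current behavior: one final attempt on a regular VM.
        plan.append("non-preemptible")
    return plan


print(attempt_plan(3))
# ['preemptible', 'preemptible', 'preemptible', 'non-preemptible']
print(attempt_plan(3, fallback_to_regular=False))
# ['preemptible', 'preemptible', 'preemptible']
```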
Hi @aednichols,
For us there's a large price difference between regular and Spot VMs on GCP, hence the pursuit of purely preemptible pipelines.
You could set preemptible very high to minimize the chance of preemption. I don't think there would be any issue setting it to 10 or even more.
That said, it can be a bit of a false economy because failed attempts still cost real money. It may even be the case that falling back to non-preemptible saves money.
Let's say preemptibles are $1 an hour and normal VMs are $3.
If you run a 12-hour task that gets preempted 6 times at the 6-hour mark, that's 6 x 6 x $1 = $36 down the drain, a day and a half of wall clock time, and no results to show for it. Whereas a single non-preemptible run would be 12 x $3 = $36 and you'd have your results.
Obviously this math will vary widely by use case and you will have to observe your preemption rates in practice to come up with the optimal balance.
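The arithmetic above can be reproduced with a short Python sketch, which makes it easy to plug in your own rates and preemption pattern. The prices are the illustrative $1 and $3 per hour from the example, not real GCP pricing, and the function names are made up for this illustration.

```python
def wasted_preemptible_cost(preemptions: int, hours_before_preemption: float,
                            preemptible_rate: float) -> float:
    """Money spent on preemptible attempts that never finished."""
    return preemptions * hours_before_preemption * preemptible_rate


def regular_cost(task_hours: float, regular_rate: float) -> float:
    """Cost of one uninterrupted run on a non-preemptible VM."""
    return task_hours * regular_rate


# Illustrative rates from the example above (assumed, not real price-list values).
PREEMPTIBLE_RATE = 1.0  # dollars per hour
REGULAR_RATE = 3.0      # dollars per hour

wasted = wasted_preemptible_cost(preemptions=6, hours_before_preemption=6,
                                 preemptible_rate=PREEMPTIBLE_RATE)
wall_clock_hours = 6 * 6  # six attempts, each lost at the 6-hour mark
regular = regular_cost(task_hours=12, regular_rate=REGULAR_RATE)

print(f"wasted on preemptibles: ${wasted:.2f} over {wall_clock_hours} hours")
print(f"single regular run:     ${regular:.2f} over 12 hours")
```

With these numbers both figures come out to $36, but only the regular run produces results.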
Thanks for an interesting discussion, I had never thought about the "only preemptible" use case before.
Closing issue:
I'm likely going to soft-fork internally for certain projects and gather some hard numbers.
Sounds good, would be interested to see your results.
A category of feature we've brainstormed (but that isn't currently on the roadmap) is "only run when the price is below $X", which would pull price lists from AWS/GCP.
Hello Cromwell Team,
Our bioinformatics team have been reporting a single retry after preemptible attempts have been exhausted. They've added logic in the task itself that introspects the VM and promptly exits if the job ends up on a non-preemptible VM. This isn't ideal, as starting a VM still incurs cost.
I've made the following changes in:
and tested with a trivial WDL and tasks such as (trying out multiple preemptible / maxRetries values):
Let me know if I'm going in the right direction for a pull request.
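For readers wondering what the in-task workaround mentioned above might look like, here is a rough Python sketch. The thread does not say how the team introspects the VM, so this is an assumption: it reads the `instance/scheduling/preemptible` key from the GCE metadata server, which is expected to return `TRUE` or `FALSE`. The function names are hypothetical, and the fetcher is injectable so the logic can be exercised off GCP.

```python
import urllib.request
from typing import Callable

# Assumed endpoint: the GCE metadata server's scheduling/preemptible key.
METADATA_URL = (
    "http://metadata.google.internal/computeMetadata/v1/"
    "instance/scheduling/preemptible"
)


def fetch_preemptible_flag(timeout: float = 2.0) -> str:
    """Read the raw flag from the metadata server (only works on a GCE VM)."""
    request = urllib.request.Request(
        METADATA_URL, headers={"Metadata-Flavor": "Google"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


def is_preemptible(fetch: Callable[[], str] = fetch_preemptible_flag) -> bool:
    """Return True if the flag reads TRUE, ignoring case and whitespace."""
    return fetch().strip().upper() == "TRUE"


# Intended use at the top of a task script (not executed here):
#
#     import sys
#     if not is_preemptible():
#         sys.exit("refusing to run on a non-preemptible VM")
```

As noted in the original report, a guard like this only limits the damage: the regular VM has already started and been billed by the time the check runs.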