HumanCellAtlas / secondary-analysis

Secondary Analysis Service of the Human Cell Atlas Data Coordination Platform
https://pipelines.data.humancellatlas.org/ui/
BSD 3-Clause "New" or "Revised" License

Add maxRetries to HCA workflow options #720

Closed kbergin closed 5 years ago

kbergin commented 5 years ago

The Optimus workflow occasionally fails due to transient errors that can be fixed with automated retries. A default value for the maxRetries runtime parameter can be configured in the workflow options file.

Have Lira set a default value based on what is specified in its config, so that it is automatically applied to every task. This also requires explicitly setting maxRetries to zero in the submit WDL tasks that should not be retried.
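
The need for an explicit zero can be sketched as a precedence rule: a value set in a task's runtime block wins over the workflow-level default. Below is a minimal Python model of that rule, assuming the default sits under Cromwell's `default_runtime_attributes` key in the options file. The function name and the sample values are hypothetical and are not taken from Lira.

```python
def effective_max_retries(task_runtime, workflow_options):
    """Return the retry count a task would get under the assumed precedence."""
    # A value set explicitly in the task's runtime block wins, including 0.
    if "maxRetries" in task_runtime:
        return int(task_runtime["maxRetries"])
    # Otherwise fall back to the workflow-level default, if there is one.
    defaults = workflow_options.get("default_runtime_attributes", {})
    return int(defaults.get("maxRetries", 0))


options = {"default_runtime_attributes": {"maxRetries": 2}}  # illustrative value
print(effective_max_retries({}, options))                 # task inherits the default
print(effective_max_retries({"maxRetries": 0}, options))  # submit task opts out
```

Under this model a submit task that leaves maxRetries unset would silently inherit the default, which is why the opt-out has to be written into the task itself.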

Issue is synchronized with this Jira Story

kbergin commented 5 years ago

➤ Charles Yan commented:

Currently, Lira downloads the Cromwell options for a workflow and passes them directly to Cromwell. To implement this change, I believe we’ll need to modify their contents before passing them along. A couple of questions come to mind:
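
One way the modification step could look, as a rough sketch rather than Lira's actual code: parse the downloaded options JSON, add a default maxRetries, and re-serialize before submission. The helper name `with_default_max_retries` is hypothetical, and placing the value under `default_runtime_attributes` is an assumption about how the options file is laid out.

```python
import json


def with_default_max_retries(options_text, max_retries):
    """Return options JSON with a default maxRetries added (hypothetical helper)."""
    options = json.loads(options_text)
    defaults = options.setdefault("default_runtime_attributes", {})
    # Keep a value already present in the downloaded file; only fill the gap.
    defaults.setdefault("maxRetries", max_retries)
    return json.dumps(options)


downloaded = '{"read_from_cache": false}'  # illustrative options file content
print(with_default_max_retries(downloaded, 1))
```

Whether a value from Lira's config should override one already present in the options file is a design choice; this sketch keeps the existing value.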

kbergin commented 5 years ago

➤ Saman Ehsan commented:

These PRs are ready for another round of review!

https://github.com/HumanCellAtlas/lira/pull/168
https://github.com/HumanCellAtlas/skylab/pull/226
https://github.com/HumanCellAtlas/pipeline-tools/pull/151

kbergin commented 5 years ago

➤ Saman Ehsan commented:

Note: We shouldn’t merge in the Lira changes unless the “workflow-hash-label” commit is reverted or the notification time-out error is fixed.

kbergin commented 5 years ago

➤ Saman Ehsan commented:

QA notes:

  1. Find an AdapterOptimus workflow id via job manager (e.g. “e5f67eac-1826-49c0-b518-6ebe9d456475”)
  2. Go to https://cromwell.caas-prod.broadinstitute.org/swagger/index.html?url=/swagger/cromiam.yaml#/ and use the “metadata” endpoint to retrieve workflow metadata
  3. Look at the number of maxRetries defined in “options”. Confirm that this is the same number used for maxRetries in the task runtime options.
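
The comparison in step 3 could be scripted along these lines. This is a sketch under assumptions about the metadata layout: the submitted options are taken to be a JSON string under `submittedFiles.options`, and each call attempt is taken to carry a `runtimeAttributes` map. The function name, call names, and sample values are hypothetical.

```python
import json


def mismatched_calls(metadata):
    """List calls whose runtime maxRetries differs from the options default."""
    # Assumed layout: options are a JSON string under submittedFiles.options.
    options = json.loads(metadata["submittedFiles"]["options"])
    expected = int(options["default_runtime_attributes"]["maxRetries"])
    mismatches = []
    for call_name, attempts in metadata.get("calls", {}).items():
        for attempt in attempts:
            actual = attempt.get("runtimeAttributes", {}).get("maxRetries")
            if actual is None or int(actual) != expected:
                mismatches.append((call_name, actual))
    return mismatches


sample = {  # hand-written stand-in for a metadata response
    "submittedFiles": {
        "options": json.dumps({"default_runtime_attributes": {"maxRetries": 1}})
    },
    "calls": {
        "Workflow.analysis_task": [{"runtimeAttributes": {"maxRetries": "1"}}],
        "Workflow.submit_task": [{"runtimeAttributes": {"maxRetries": "0"}}],
    },
}
print(mismatched_calls(sample))
```

Tasks that deliberately set maxRetries to zero will show up in the result, so a reported mismatch is only a problem for tasks that were meant to inherit the default.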

kbergin commented 5 years ago

➤ Chengchen Wang commented:

QAed, looks good to me!