NASA-IMPACT / csdap-cumulus

SmallSat Cumulus Deployment

Make MapRuns redriveable to allow failed discovery workflow to be restarted from a failure point #303

Closed chuckwondo closed 10 months ago

chuckwondo commented 10 months ago

Since our DiscoverAndQueueGranules workflow now uses Distributed Map states for concurrency, we can also take advantage of the "redrive" feature of MapRuns, which lets us restart a failed workflow from where it left off. This is handy when the failure is likely due to a transient issue.
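For reference, a redrive can be requested through the Step Functions API as well as the console. The following is a minimal sketch, assuming a boto3 Step Functions client is passed in; the helper name `redrive_failed_execution` is hypothetical, and the call only succeeds if the caller is allowed to perform `states:RedriveExecution` on the execution.

```python
from typing import Any


def redrive_failed_execution(sfn_client: Any, execution_arn: str) -> Any:
    """Ask Step Functions to redrive a failed execution.

    `sfn_client` is assumed to be a boto3 Step Functions client, e.g.
    boto3.client("stepfunctions"). It is injected rather than created here
    so that the helper can be exercised without AWS credentials.
    """
    # RedriveExecution resumes the execution from its failed state(s)
    # instead of starting again from the first state.
    response = sfn_client.redrive_execution(executionArn=execution_arn)
    # The response reports when the redrive was accepted.
    return response["redriveDate"]
```

With real credentials, this would be called as `redrive_failed_execution(boto3.client("stepfunctions"), arn)`, where `arn` is the ARN of the failed DiscoverAndQueueGranules execution.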

To allow us to redrive a failed MapRun, we must give our workflows permission to perform states:RedriveExecution actions.
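As an illustration of the permission involved, here is a sketch that builds the corresponding IAM policy statement. The resource pattern follows the shape AWS documents for redriving Distributed Map child executions, as best I recall it, and should be checked against the current IAM documentation. The function name and all values passed to it (region, account ID, state machine name, Map Run label) are placeholders, not values from this repository.

```python
import json
from typing import Any, Dict


def redrive_policy_statement(
    partition: str,
    region: str,
    account_id: str,
    state_machine_name: str,
    map_run_label: str,
) -> Dict[str, Any]:
    """Build an IAM statement allowing redrive of a Map Run's child executions."""
    # Assumed ARN shape for child executions of a Distributed Map state:
    #   execution:<state machine name>/<map run label>:<child execution id>
    resource = (
        f"arn:{partition}:states:{region}:{account_id}:"
        f"execution:{state_machine_name}/{map_run_label}:*"
    )
    return {
        "Effect": "Allow",
        "Action": ["states:RedriveExecution"],
        "Resource": resource,
    }


# Placeholder values only; substitute the real deployment's identifiers.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        redrive_policy_statement(
            "aws",
            "us-west-2",
            "123456789012",
            "ExampleDiscoverAndQueueGranules",
            "ExampleMapRunLabel",
        )
    ],
}
print(json.dumps(policy, indent=2))
```

In this deployment the equivalent statement would presumably live in the Terraform that defines the workflow role rather than in Python; the sketch only shows the action and the shape of the resource.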

This problem cropped up during some "discover only" scalability testing with "duplicateHandling" set to "skip". The goal was to see how much longer discovery takes than when "duplicateHandling" is set to "replace", since "skip" incurs a high load of DB queries to check for the existence of each granule discovered. Using "replace" makes no such DB queries because it doesn't care about any existing granules: it simply ingests everything it finds, regardless of current state.

One of the DiscoverGranules MapRuns timed out, and since we currently have ToleratedFailurePercentage set to 0, the first MapRun execution to fail causes the entire workflow to fail. After examining the timeout failure, it appeared to be transient because the particular day being discovered does not have an unusually high number of granules. In fact, it has only 751, far fewer than other days that were successfully discovered well within the 15-minute time limit.

Unfortunately, since the policy is not set to allow states:RedriveExecution, it is not possible to redrive the failed workflow from the failed MapRun, which is the reason for this issue.

In addition, we should set ToleratedFailurePercentage to a value greater than 0 so that a single failure does not fail the entire workflow. A value of 3 should suffice to allow for an occasional transient MapRun failure, such that a redrive won't be necessary. The individual MapRun failures can then be examined, and perhaps rerun in isolation via a 1-day rule.
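To make the effect of the threshold concrete, here is a small sketch of the check as I understand it: the Map Run fails once the percentage of failed or timed-out items exceeds ToleratedFailurePercentage. The function name and the 365-item run are hypothetical, and the exact comparison used by Step Functions should be confirmed against the AWS documentation.

```python
def map_run_exceeds_tolerance(
    failed_items: int,
    total_items: int,
    tolerated_failure_percentage: float,
) -> bool:
    """Return True if the failed share of items exceeds the tolerated percentage.

    Assumption: the Map Run fails only when the failure percentage is
    strictly greater than the threshold.
    """
    if total_items <= 0:
        raise ValueError("total_items must be positive")
    # Compare failed/total * 100 > threshold without dividing,
    # to avoid floating-point rounding at the boundary.
    return failed_items * 100 > tolerated_failure_percentage * total_items


# Hypothetical run with one item per day of a year and a single timeout.
print(map_run_exceeds_tolerance(1, 365, 0))  # threshold 0: any failure fails the run
print(map_run_exceeds_tolerance(1, 365, 3))  # threshold 3: one failure is tolerated
```

Under this reading, a threshold of 3 on a 365-item run would tolerate up to 10 failed items before the workflow as a whole fails.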

Acceptance criteria: