Since our DiscoverAndQueueGranules workflow now uses Distributed Map states for concurrency, we can also take advantage of the "redrive" feature of MapRuns, which lets us restart a failed workflow from where it left off. This is handy when the failure might have been due to a transient issue.
To allow us to redrive a failed MapRun, we must give our workflows permission to perform states:RedriveExecution actions.
The problem surfaced during "discover only" scalability testing with "duplicateHandling" set to "skip". The goal was to see how much longer discovery takes than with "duplicateHandling" set to "replace", given the high load of DB queries that "skip" requires in order to check for the existence of each granule discovered. Using "replace" makes no such DB queries because it doesn't care about any existing granules: it simply ingests everything it finds, regardless of current state.
One of the DiscoverGranules MapRuns timed out, and since we currently have ToleratedFailurePercentage set to 0, the first MapRun execution to fail causes the entire workflow to fail. On examination, the timeout appeared to be transient because the particular day being discovered does not have an unusually high number of granules. In fact, it has only 751, far fewer than other days that were successfully discovered well within the 15-minute time limit.
Unfortunately, because the policy does not currently allow states:RedriveExecution, it is not possible to redrive the failed workflow from the failed MapRun, which is the reason for this issue.
In addition, we should set ToleratedFailurePercentage to a value greater than 0 to avoid failing the entire workflow. A value of 3 should suffice to allow for an occasional transient MapRun failure, such that a redrive won't be necessary. The individual MapRun failures can then be examined, and perhaps rerun in isolation via a 1-day rule.
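As a rough illustration of the tolerance setting described above, the following Terraform sketch shows where ToleratedFailurePercentage would sit on a Map state. The local names are hypothetical, and every other field of the actual DiscoverGranulesMap state is omitted because it is not given in this issue; treat this as a sketch, not the real workflow definition.

```hcl
# Hypothetical sketch: only the fields relevant to failure tolerance are shown.
locals {
  discover_granules_map = {
    Type = "Map"

    # Tolerate up to 3% failed items before the Map Run itself is marked failed.
    ToleratedFailurePercentage = 3
  }

  # JSON rendering of the fragment, in the form used by a state machine definition.
  discover_granules_map_json = jsonencode(local.discover_granules_map)
}
```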
Acceptance criteria:
[x] ToleratedFailurePercentage is set to 3 for DiscoverGranulesMap
[x] states:RedriveExecution is added to the "allow" list within the first statement block of data "aws_iam_policy_document" "allow_sfn_distributed_maps" within app/stacks/cumulus/iam.tf to allow for MapRun redrives
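For the second criterion, the change might look something like the sketch below. The data source name, the variables, and the resource ARN pattern are assumptions made for illustration only; in the real change, states:RedriveExecution would be added to the existing action list in allow_sfn_distributed_maps, whose current contents are not shown in this issue.

```hcl
# Hypothetical inputs; the real stack likely derives these values elsewhere.
variable "aws_region" { type = string }
variable "aws_account_id" { type = string }
variable "state_machine_name" { type = string }

# Illustrative stand-in for the first statement block of
# data "aws_iam_policy_document" "allow_sfn_distributed_maps".
data "aws_iam_policy_document" "redrive_example" {
  statement {
    effect = "Allow"

    actions = [
      # The new permission requested by this issue.
      "states:RedriveExecution",
    ]

    # Assumed pattern covering executions of the workflow and its Map Runs.
    resources = [
      "arn:aws:states:${var.aws_region}:${var.aws_account_id}:execution:${var.state_machine_name}/*",
    ]
  }
}
```

The resource pattern should be checked against the AWS guidance on IAM policies for Distributed Map states before being relied upon.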