chango / inferno

A rule-based map-reduce scheduling framework

If the tag operation fails in the archiver, the job is run again regardless of result processor successes #17

oldmantaiter closed this issue 10 years ago

oldmantaiter commented 10 years ago

Currently, if the archive operation fails, the job is re-run even though result processor actions have already run successfully and committed their data.

We need to keep local state in the inferno master that records which actions have succeeded or failed, so that we don't re-run completed actions when a failure occurs in the result processor or in the job cleanup/final stages.

pooya commented 10 years ago

A checkmark can be used to keep track of what has been done. See pull #18.
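
A minimal sketch of the checkmark idea, not the code from pull #18: it assumes a hypothetical `Checkmarks` class that persists the names of completed steps to a JSON file, so that a re-run can skip steps that have already committed. The class, method, and step names are illustrative and are not part of inferno.

```python
import json
import os
import tempfile


class Checkmarks:
    """Hypothetical helper: records completed post-job steps in a JSON file."""

    def __init__(self, path):
        self.path = path
        self.done = set()
        # Reload checkmarks left behind by an earlier run, if any.
        if os.path.exists(path):
            with open(path) as f:
                self.done = set(json.load(f))

    def run_once(self, name, action):
        """Run action() unless name is already checkmarked.

        Returns True if the action ran, False if it was skipped.
        """
        if name in self.done:
            return False
        action()  # if this raises, the step stays unmarked
        self.done.add(name)
        self._save()
        return True

    def _save(self):
        # Write to a temporary file and rename it over the target, so a
        # crash during the write cannot leave a half-written checkmark file.
        directory = os.path.dirname(os.path.abspath(self.path))
        fd, tmp = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w") as f:
            json.dump(sorted(self.done), f)
        os.replace(tmp, self.path)
```

With something like this, a re-run after a failed archive tag would skip a result processor that is already checkmarked. A crash between the action finishing and the checkmark being saved would still cause that step to run again.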

pooya commented 10 years ago

Instead of using checkmarks, we added an exponential back-off mechanism to inferno to retry these operations. If all of the retries fail, inferno gives up. Note that if a segmentation fault, an OOM kill, or anything else prevents inferno from retrying, the result processor might still be executed more than once.
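
The retry behavior described above could look roughly like the following sketch. It is not inferno's actual implementation; `retry_with_backoff`, its parameters, and the default delays are assumptions for illustration.

```python
import time


def retry_with_backoff(operation, max_retries=5, base_delay=1.0,
                       factor=2.0, sleep=time.sleep):
    """Call operation() until it succeeds or the retries are exhausted.

    Returns (True, result) on success, or (False, last_exception) after
    giving up. All names and defaults here are hypothetical.
    """
    delay = base_delay
    last_error = None
    # One initial attempt plus up to max_retries retries.
    for attempt in range(max_retries + 1):
        try:
            return True, operation()
        except Exception as exc:
            last_error = exc
            if attempt < max_retries:
                sleep(delay)      # wait before the next retry
                delay *= factor   # grow the wait exponentially
    return False, last_error
```

The `sleep` parameter is injectable only so the delays can be observed in tests without waiting. Because the retry loop lives inside the process, it cannot help if the process itself is killed, which is the limitation noted above.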