collectivemedia / celos

Scriptable scheduler for periodical Hadoop workflows
Apache License 2.0

Automatic rerunning of failed workflows #10

Closed. manuel closed this issue 10 years ago.

manuel commented 10 years ago

Automatic (or not) rerunning of failed workflows.

References: https://github.com/collectivemedia/celos/issues/7#issuecomment-28228992

collectivemedia/tracker#36 @collectivemedia/syn-datapipe2

manuel commented 10 years ago

@malov wrote: One thing we probably don't want to do is let the Scheduler re-run killed (and perhaps failed) jobs automatically. In the majority of current cases, status "killed" means that something is wrong with the cluster (or HBase, or ZooKeeper, etc.), so someone first needs to go and make sure that re-running makes sense. In addition, re-running multiple Pythia jobs simultaneously, though possible in theory, would bring the cluster to its knees: there should never be more than two Pythia instances running at the same time. Which probably means we need some sort of "ready" status for jobs scheduled for re-run.

On the other hand, for some jobs (e.g. flume ready checks) it would be very convenient to retry them automatically. They typically fail due to transient issues (e.g. nn01 being down) and succeed when simply rerun.
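For illustration only, a policy along these lines might look like the sketch below. None of these types, statuses, or names exist in Celos itself; they are hypothetical stand-ins for the behavior described above (auto-rerun failures but not kills, and park rerun candidates in a "ready" state rather than launching them all at once).

```java
// Hypothetical sketch; none of these types or names are part of the Celos codebase.
public class RerunPolicy {

    public enum Status { SUCCESS, FAILURE, KILLED, READY_FOR_RERUN }

    private final boolean rerunOnFailure;    // e.g. true for flume ready checks
    private final boolean rerunOnKill;       // usually false: "killed" often means cluster trouble
    private final int maxConcurrentReruns;   // e.g. 2 for heavy Pythia-style jobs

    public RerunPolicy(boolean rerunOnFailure, boolean rerunOnKill, int maxConcurrentReruns) {
        this.rerunOnFailure = rerunOnFailure;
        this.rerunOnKill = rerunOnKill;
        this.maxConcurrentReruns = maxConcurrentReruns;
    }

    /** When a slot finishes badly, mark it READY_FOR_RERUN instead of relaunching it right away. */
    public Status onFinished(Status status) {
        if ((status == Status.FAILURE && rerunOnFailure)
                || (status == Status.KILLED && rerunOnKill)) {
            return Status.READY_FOR_RERUN;
        }
        return status; // leave it for a human to investigate
    }

    /** The scheduler only promotes READY_FOR_RERUN slots while a rerun slot is free. */
    public boolean mayLaunchRerun(int rerunsInFlight) {
        return rerunsInFlight < maxConcurrentReruns;
    }
}
```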

ivorwilliams commented 10 years ago

As per: https://github.com/collectivemedia/celos/issues/7#issuecomment-28229364:

@malov Maybe that should be a per-job setting. Just rerunning Sibyl jobs (once or twice) seems to fix the strange, transient errors we see in the Sqoop or Hive-based extract jobs.

malov commented 10 years ago

To clarify what I'm saying: I agree that we might want the ability to automatically re-run a killed or failed workflow, and I might even agree to make it the default. However, we should also have a way NOT to re-run a killed/failed job automatically, and we should have the ability to schedule re-runs one at a time.
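Continuing the hypothetical sketch from above, the per-job choice could then be expressed at configuration time, with a one-at-a-time cap for the jobs that need it. Again, these workflow names and settings are illustrative assumptions, not actual Celos configuration:

```java
// Hypothetical per-workflow settings using the RerunPolicy sketched above; not actual Celos config.
RerunPolicy flumeReadyCheck = new RerunPolicy(true,  false, 2); // auto-retry cheap transient failures
RerunPolicy pythia          = new RerunPolicy(false, false, 1); // never auto-rerun; reruns stay manual
RerunPolicy sqoopExtract    = new RerunPolicy(true,  false, 1); // auto-retry, but strictly one at a time
```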

andry1 commented 10 years ago

I'm hoping that this new system, together with a better-put-together cluster environment, will start making these all-too-common "rerun it a few times and it'll probably work" cases a thing of the past, but there are still valid cases for automatic retries IMHO. The flume ready checks sort of rely on that to work in the first place (although that would be a case of an input dependency check rather than a whole workflow), and even in an ideal environment things will still happen where an automatic rerun or two might save us some headaches. We'd want to avoid letting it get into a situation where it endlessly reruns failed jobs, though.
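To avoid the endless-rerun situation mentioned above, any automatic rerun mechanism would also need a per-slot attempt cap. A minimal sketch, again with hypothetical names rather than anything that exists in Celos:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a bounded-retry guard; not part of the Celos codebase.
public class RetryBudget {

    private final int maxAutomaticReruns;                        // e.g. 2
    private final Map<String, Integer> attempts = new HashMap<>();

    public RetryBudget(int maxAutomaticReruns) {
        this.maxAutomaticReruns = maxAutomaticReruns;
    }

    /** Returns true only while the given slot still has automatic reruns left. */
    public boolean tryConsume(String slotId) {
        int used = attempts.getOrDefault(slotId, 0);
        if (used >= maxAutomaticReruns) {
            return false; // budget exhausted: stop rerunning and wait for a human
        }
        attempts.put(slotId, used + 1);
        return true;
    }
}
```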