collectivemedia / celos

Scriptable scheduler for periodical Hadoop workflows
Apache License 2.0

Database #8

manuel closed 10 years ago

manuel commented 11 years ago

Keeps track of data returned by triggers #2 and the status of executed workflows #3.

collectivemedia/tracker#36 @collectivemedia/syn-datapipe2

ivorwilliams commented 11 years ago

Why do we need to keep track of the status of executed workflows? Isn't that (redundant) data that we get from Oozie?

Is the scheduler the only process that needs to know anything about the trigger data? If so, then to get started we could use an in-memory model of the data, one that reads in serialized data on startup and writes it out on shutdown.

Hide this behind some sort of repository facade and we can replace it with something more sophisticated later. It will let us get to the hard parts of the project sooner.

ivorwilliams commented 11 years ago

In addition, we should make the repository implementation pluggable. This will simplify the required CI environment (#6).
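A minimal sketch of what such a facade could look like, assuming a simple key/value view of workflow state. The names `StateRepository` and `InMemoryRepository`, the JSON file format, and the example workflow id are all hypothetical and not taken from the project; the point is only that the scheduler talks to the interface, so the implementation behind it can be swapped later.

```python
import json
import os
import tempfile
from abc import ABC, abstractmethod


class StateRepository(ABC):
    """Hypothetical facade: the scheduler only ever talks to this interface."""

    @abstractmethod
    def get(self, workflow_id, scheduled_time): ...

    @abstractmethod
    def put(self, workflow_id, scheduled_time, state): ...


class InMemoryRepository(StateRepository):
    """Keeps everything in a dict; reads a file on startup, writes it on close."""

    def __init__(self, path):
        self.path = path
        self.data = {}
        if os.path.exists(path):
            # Startup: read in the previously serialized data, if any.
            with open(path) as f:
                self.data = json.load(f)

    def get(self, workflow_id, scheduled_time):
        return self.data.get(workflow_id, {}).get(scheduled_time)

    def put(self, workflow_id, scheduled_time, state):
        self.data.setdefault(workflow_id, {})[scheduled_time] = state

    def close(self):
        # Shutdown: write the whole model back out.
        with open(self.path, "w") as f:
            json.dump(self.data, f)


# Usage: state written before shutdown is visible after a restart.
path = os.path.join(tempfile.mkdtemp(), "state.json")
repo = InMemoryRepository(path)
repo.put("workflow-1", "2013-09-01T00:00Z", "SUCCESS")  # illustrative values
repo.close()

restored = InMemoryRepository(path)
print(restored.get("workflow-1", "2013-09-01T00:00Z"))  # prints SUCCESS
```

A CI environment could plug in this implementation with a temporary file, while a later version implements the same two methods against a real database.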

malov commented 11 years ago

@ivorwilliams When running GC, we constantly need to answer why X happened on date Y, where Y could be up to a few months back (we haven't gone that far, but could). The typical procedure is to get the date, get the hour, look at how the workflow ran, etc. More than once I have had to check the Oozie status for date Y: logs, dependencies picked, etc. So I would prefer to keep a history of runs, at least for a few weeks (a month?) back, with trigger status, logs, etc., to help with similar investigations.

manuel commented 11 years ago

The states of workflows the database needs to store:

states

manuel commented 11 years ago

Configuration and dynamic state:

config-db

malov commented 11 years ago

Just a quick note: a workflow can be in the FAILED state or in the KILLED state. The former means a re-try is possible (from Oozie's point of view); the latter, also from Oozie's perspective, means a re-try is not possible. How we treat these in our system is entirely up to us, but we should probably distinguish between the two.

manuel commented 11 years ago

Thanks for bringing this up. I think we should just ignore these additional Oozie states for now. Retries should be configurable on a per-workflow basis #10 and if a job is configured as retryable it will simply be retried, regardless of its specific Oozie status.
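As a rough sketch of that rule, assuming a per-workflow `retryable` flag (the `WorkflowConfig` type, the `should_retry` helper, and the workflow names are hypothetical, and #10 may well define the setting differently): FAILED and KILLED are both treated simply as "did not succeed", and only the workflow's own configuration decides whether it runs again.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkflowConfig:
    """Hypothetical per-workflow configuration."""
    name: str
    retryable: bool


# Both Oozie states are lumped together for now; the distinction is ignored.
UNSUCCESSFUL_STATES = {"FAILED", "KILLED"}


def should_retry(config, oozie_status):
    """Retry any unsuccessful run of a retryable workflow, whatever the status."""
    return config.retryable and oozie_status in UNSUCCESSFUL_STATES


retryable_wf = WorkflowConfig(name="workflow-a", retryable=True)
one_shot_wf = WorkflowConfig(name="workflow-b", retryable=False)

print(should_retry(retryable_wf, "FAILED"))     # True
print(should_retry(retryable_wf, "KILLED"))     # True
print(should_retry(one_shot_wf, "FAILED"))      # False
print(should_retry(retryable_wf, "SUCCEEDED"))  # False
```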

malov commented 11 years ago

We could do that, I agree. However (for further expansion), FAILED might mean that only one step of a workflow failed and that this step can be re-tried. Sometimes that might be cheaper than re-trying the full workflow.

ivorwilliams commented 11 years ago

A few observations:

malov commented 11 years ago

I'm not sure I understand "retries can be handled in memory, so retry information doesn't need to be persisted". As I mentioned earlier, Pythia and some related workflows must not be retried automatically, so that information must be available to the Scheduler at the time of the workflow start.

ivorwilliams commented 11 years ago

There are two types of retry:

  1. Where data is missing. I think that even Pythia should retry if data is absent at the nominal time for a run. Am I right?
  2. Where the execution fails. I think that this should be optional, on a per-workflow basis. You don't want this to happen for Pythia, I believe.

I see both of these retries handled within the actor system, by queuing up a message using the scheduler in Akka. If we wanted to preserve the state of these queues (I don't think we do) some form of durable mailbox would probably be the best option.
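To make the two cases concrete, here is a small sketch that stands in for the Akka scheduler with a plain in-memory priority queue. This is not Akka code; `InMemoryRetryQueue`, `on_run_outcome`, the outcome labels, the message names, and the 300-second delay are all assumptions made for illustration. Nothing in the queue is persisted, so pending retries are lost on restart, which matches the "I don't think we do" above.

```python
import heapq
import itertools


class InMemoryRetryQueue:
    """Delayed messages kept only in memory (a stand-in for a real scheduler)."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker keeps insertion order

    def schedule(self, due_time, message):
        heapq.heappush(self._heap, (due_time, next(self._counter), message))

    def pop_due(self, now):
        """Return every message whose due time has been reached."""
        due = []
        while self._heap and self._heap[0][0] <= now:
            due.append(heapq.heappop(self._heap)[2])
        return due


def on_run_outcome(queue, now, workflow, outcome, retry_on_failure, delay=300):
    if outcome == "DATA_MISSING":
        # Type 1: always check again later, for every workflow.
        queue.schedule(now + delay, ("check-trigger", workflow))
    elif outcome == "EXECUTION_FAILED" and retry_on_failure:
        # Type 2: only for workflows that opted in.
        queue.schedule(now + delay, ("resubmit", workflow))


queue = InMemoryRetryQueue()
on_run_outcome(queue, 0, "pythia", "DATA_MISSING", retry_on_failure=False)
on_run_outcome(queue, 0, "pythia", "EXECUTION_FAILED", retry_on_failure=False)
on_run_outcome(queue, 0, "other-workflow", "EXECUTION_FAILED", retry_on_failure=True)

early = queue.pop_due(299)
later = queue.pop_due(300)
print(early)  # []
print(later)  # [('check-trigger', 'pythia'), ('resubmit', 'other-workflow')]
```

Pythia waits for its data but is never resubmitted after a failed execution, while the opted-in workflow is.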

malov commented 11 years ago

Got it, and agree with both.

malov commented 11 years ago

One thing we could keep in mind for future expansion: manual re-try of a group of related workflows all together. For example, if either Avroify workflow fails, it ultimately causes Pythia-Main to fail, then Pythia-Hive, then Edge2. Once Avroify is good to go, it would be so much easier if we could start all four for the same hour, as they provide inputs for each other.

manuel commented 11 years ago

@malov: manual re-try of a group of related workflows all together

I think this can be handled in the monitoring user interface #4. Basically we'd just need a convenient "multiple select" to select a bunch of workflows and then resubmit them.
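Behind such a multiple select, the resubmit itself could be as small as the sketch below. `resubmit_group` and the `submit` callback are hypothetical names; the sketch assumes the UI hands over the selected workflows in the order they should be started, and it does not wait for one workflow to finish before starting the next.

```python
def resubmit_group(selected, hour, submit):
    """Resubmit every selected workflow for the same hour, in the given order."""
    return [submit(name, hour) for name in selected]


# Stand-in for the real submission call, which this sketch does not model.
submitted = []


def fake_submit(name, hour):
    submitted.append((name, hour))
    return f"{name}@{hour}"


# Illustrative hour value; workflow names are the ones from the example above.
result = resubmit_group(
    ["Avroify", "Pythia-Main", "Pythia-Hive", "Edge2"],
    "2013-09-01T05:00Z",
    fake_submit,
)
print(result[0])       # Avroify@2013-09-01T05:00Z
print(len(submitted))  # 4
```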

manuel commented 10 years ago

Implemented FS-based database for initial integration testing with Oozie.