Why do we need to keep track of the status of executed workflows? Isn't that (redundant) data that we get from Oozie?
Is the scheduler the only process that needs to know anything about the trigger data? If so, then to get started we could use an in-memory model of the data, one that reads in serialized data on startup and writes it out on shutdown.
Hide this behind some sort of repository facade and we can replace it with something more sophisticated later. It will let us get to the hard parts of the project sooner.
In addition, we should make the repository implementation pluggable. This will simplify the required CI environment (#6).
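Something like the following is what I have in mind for the facade. This is only a sketch, and the names here (`StateRepository`, `InMemoryStateRepository`, `WorkflowStatus`) are illustrative rather than an actual API:

```java
import java.io.*;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical status values; the exact set of states is discussed below. */
enum WorkflowStatus { RUNNING, SUCCEEDED, FAILED, KILLED }

/** Repository facade: the scheduler only ever talks to this interface. */
interface StateRepository {
    WorkflowStatus getStatus(String workflowId, String scheduledHour);
    void putStatus(String workflowId, String scheduledHour, WorkflowStatus status);
    void load() throws IOException;  // read serialized state on startup
    void save() throws IOException;  // write it back out on shutdown
}

/** First, simple implementation: everything in memory, one serialized file on disk. */
class InMemoryStateRepository implements StateRepository {
    private final File file;
    private Map<String, WorkflowStatus> statuses = new ConcurrentHashMap<String, WorkflowStatus>();

    InMemoryStateRepository(File file) {
        this.file = file;
    }

    private String key(String workflowId, String scheduledHour) {
        return workflowId + "@" + scheduledHour;
    }

    public WorkflowStatus getStatus(String workflowId, String scheduledHour) {
        return statuses.get(key(workflowId, scheduledHour));
    }

    public void putStatus(String workflowId, String scheduledHour, WorkflowStatus status) {
        statuses.put(key(workflowId, scheduledHour), status);
    }

    @SuppressWarnings("unchecked")
    public void load() throws IOException {
        if (!file.exists()) {
            return;
        }
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(file))) {
            statuses = (Map<String, WorkflowStatus>) in.readObject();
        } catch (ClassNotFoundException e) {
            throw new IOException(e);
        }
    }

    public void save() throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(file))) {
            out.writeObject(statuses);
        }
    }
}
```

Swapping in something more sophisticated later then just means providing another `StateRepository` implementation, which should also keep the CI environment (#6) simple.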
@ivorwilliams Running GC, we constantly need to answer questions about why X happened on date Y, where Y could be up to a few months back (it hasn't gone that far back yet, but it could). In such cases the typical procedure is to get the date, get the hour, look at how the workflow ran, etc. More than once I had to check the Oozie status for date Y: logs, dependencies picked up, and so on. So I would prefer to keep a history of runs, at least for a few weeks (a month?) back, with trigger status, logs, etc., to help with similar investigations.
The workflow states the database needs to store:
Configuration and dynamic state:
Just a quick note: a workflow can be in the FAILED state, and a workflow can be in the KILLED state. The former assumes a retry is possible (from Oozie's point of view); the latter, also from Oozie's perspective, that a retry is not possible. How we treat them in our system is entirely up to us, but we should probably distinguish between the two.
Thanks for bringing this up. I think we should just ignore these additional Oozie states for now. Retries should be configurable on a per-workflow basis (#10), and if a job is configured as retryable, it will simply be retried regardless of its specific Oozie status.
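To make that concrete, a minimal sketch of what the per-workflow retry decision could look like; the class and method names are made up, and the real configuration format from #10 may well differ:

```java
/** Hypothetical per-workflow settings (roughly what #10 might end up looking like). */
class WorkflowConfiguration {
    private final String workflowId;
    private final boolean retryable;
    private final int maxRetries;

    WorkflowConfiguration(String workflowId, boolean retryable, int maxRetries) {
        this.workflowId = workflowId;
        this.retryable = retryable;
        this.maxRetries = maxRetries;
    }

    boolean isRetryable() { return retryable; }
    int getMaxRetries() { return maxRetries; }
}

class RetryPolicy {
    /**
     * The Oozie status is only used to detect "did not succeed"; whether we
     * retry comes entirely from the workflow's own configuration.
     */
    static boolean shouldRetry(WorkflowConfiguration config, String oozieStatus, int attemptsSoFar) {
        boolean didNotSucceed = "FAILED".equals(oozieStatus) || "KILLED".equals(oozieStatus);
        return didNotSucceed && config.isRetryable() && attemptsSoFar < config.getMaxRetries();
    }
}
```

A workflow that must never be retried automatically would then simply be configured with `retryable = false`.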
That we could do, agreed. However (for future expansion), FAILED might mean that only one step of a workflow failed and can be retried. Sometimes that might be cheaper than retrying the full workflow.
A few observations:

- … `celos` …, and that we should disable all Oozie-based retry functionality.
- `celos` retries can be handled in memory, so retry information doesn't need to be persisted. That isn't much of a downside, IMO.

I'm not sure I understand: "retries can be handled in memory, so retry information doesn't need to be persisted". As I mentioned earlier, Pythia and some related workflows must not be retried automatically, so that information must be available to the Scheduler at the time the workflow starts.
There are two types of retry:
I see both of these retries handled within the actor system, by queuing up a message using the scheduler in Akka. If we wanted to preserve the state of these queues (I don't think we do) some form of durable mailbox would probably be the best option.
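For illustration, a rough sketch of queuing up such a retry message via the Akka scheduler (classic Akka Java API; the actor, message class, and delay are hypothetical):

```java
import akka.actor.ActorRef;
import akka.actor.ActorSystem;
import scala.concurrent.duration.Duration;
import java.io.Serializable;
import java.util.concurrent.TimeUnit;

/** Hypothetical retry message; not an actual class in the code base. */
class RetryWorkflow implements Serializable {
    final String workflowId;
    final String scheduledHour;

    RetryWorkflow(String workflowId, String scheduledHour) {
        this.workflowId = workflowId;
        this.scheduledHour = scheduledHour;
    }
}

class RetryScheduling {
    /**
     * Ask the Akka scheduler to deliver a retry message to the workflow actor
     * after a delay. Neither the pending scheduled send nor the actor's mailbox
     * is persisted, so these queued retries are lost on restart unless a
     * durable mailbox (or similar) is configured.
     */
    static void scheduleRetry(ActorSystem system, ActorRef workflowActor,
                              String workflowId, String scheduledHour) {
        system.scheduler().scheduleOnce(
                Duration.create(15, TimeUnit.MINUTES),
                workflowActor,
                new RetryWorkflow(workflowId, scheduledHour),
                system.dispatcher(),
                ActorRef.noSender());
    }
}
```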
Got it, and agree with both.
One thing we could keep in mind for future expansion: manual retry of a group of related workflows all together. For example, if either Avroify workflow fails, it ultimately causes Pythia-Main to fail, then Pythia-Hive, then Edge2. Once Avroify is good to go, it would be so much easier if we could start all four for the same hour, as they provide inputs for each other.
@malov: manual re-try of a group of related workflows all together
I think this can be handled in the monitoring user interface #4. Basically we'd just need a convenient "multiple select" to select a bunch of workflows and then resubmit them.
Implemented FS-based database for initial integration testing with Oozie.
Keeps track of data returned by triggers #2 and the status of executed workflows #3.
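For completeness, a rough sketch of one way such an FS-based database could lay out its files, with one small status file per workflow per scheduled hour; the class name and paths here are hypothetical, not necessarily what was actually implemented:

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

/**
 * Illustration of one possible layout: a small status file per workflow per
 * scheduled hour, e.g. <root>/status/<workflowId>/2013-12-01T14.
 * The actual implementation may store different data and paths.
 */
class FileSystemStateSketch {
    private final File root;

    FileSystemStateSketch(File root) {
        this.root = root;
    }

    private File statusFile(String workflowId, String scheduledHour) {
        return new File(new File(new File(root, "status"), workflowId), scheduledHour);
    }

    void writeStatus(String workflowId, String scheduledHour, String status) throws IOException {
        File f = statusFile(workflowId, scheduledHour);
        f.getParentFile().mkdirs();
        Files.write(f.toPath(), status.getBytes(StandardCharsets.UTF_8));
    }

    String readStatus(String workflowId, String scheduledHour) throws IOException {
        byte[] bytes = Files.readAllBytes(statusFile(workflowId, scheduledHour).toPath());
        return new String(bytes, StandardCharsets.UTF_8);
    }
}
```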
collectivemedia/tracker#36 @collectivemedia/syn-datapipe2