Shopify / pyoozie

Library for querying and scheduling with Apache Oozie
https://py-oozie.readthedocs.io
MIT License

[WIP] Prototype workflow builder interface #25

Closed: cfournie closed this 7 years ago

cfournie commented 7 years ago

Here's what I'd like to use as an API for the workflow builder that allows people to specify:

This would be the final PR merged to add the XML workflow generation feature. Before it, there will need to be PRs that:

  1. Add a basic Workflow tag/graph to pyoozie.tags; and
  2. Add some basic transformation functions (and the graph functions required to implement them).

For an initial review, I'd like design comments and questions.

honkfestival commented 7 years ago

I think this would be simpler if you remove the concept of a layout entirely and only support actions with dependencies.

Does it really matter how the actions are laid out if all the dependencies are met? A topological sort on actions with their dependencies as edges should suffice.
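As a minimal sketch of the suggestion above (not pyoozie code; the action names and the `sequential_order` helper are hypothetical), the standard-library `graphlib` module in Python 3.9+ can produce one valid run order from dependency information alone:

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical actions; each key maps to the set of actions it depends on.
dependencies = {
    "extract": set(),
    "clean": {"extract"},
    "aggregate": {"clean"},
    "report": {"aggregate", "clean"},
}


def sequential_order(deps):
    """Return one run order in which every action follows its dependencies."""
    # static_order() raises graphlib.CycleError if the dependencies contain a cycle.
    return list(TopologicalSorter(deps).static_order())


order = sequential_order(dependencies)
print(order)
```

An action with an empty dependency set plays the role of an initial action; no separate layout is supplied.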

cfournie commented 7 years ago

> I think this would be simpler if you remove the concept of a layout entirely and only support actions with dependencies.

At least one action must have no dependencies (as an initial action). I'm trying to offer a way for someone to specify:

> Does it really matter how the actions are laid out if all the dependencies are met? A topological sort on actions with their dependencies as edges should suffice.

There are two main ways that we can lay out work actions:

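The fork-join DAG could be laid out in a wide variety of ways. We'll provide a naive implementation that relies only on dependency information to create the graph, but you may want to define your own strategy, which may:

  • Take into account median job runtime to rearrange the workflow to minimize the critical path;
  • Multiply schedule the same job within a workflow and use external synchronization to minimize the critical path.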

Both are strategies that we've discussed investigating.

honkfestival commented 7 years ago

> Sequentially using a topological sort as you suggested; or

Using a topological sort doesn't require that we run the jobs sequentially. Rosetta Code has an example in Python of how to sort so that you know which groups can be run in parallel.
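A rough sketch of that idea follows. It is not the Rosetta Code listing and not pyoozie code; `parallel_groups` and the action names are hypothetical. Each pass collects every action whose dependencies are already satisfied, so the actions within one group could run side by side:

```python
def parallel_groups(deps):
    """Group actions into batches; each batch depends only on earlier batches."""
    remaining = {action: set(needs) for action, needs in deps.items()}
    # Dependencies that never appear as keys are treated as actions with no dependencies.
    for needs in list(remaining.values()):
        for need in needs:
            remaining.setdefault(need, set())

    groups = []
    while remaining:
        # Everything with no unmet dependencies can run in this batch.
        ready = sorted(a for a, needs in remaining.items() if not needs)
        if not ready:
            raise ValueError("cyclic dependency among: %s" % sorted(remaining))
        groups.append(ready)
        for action in ready:
            del remaining[action]
        for needs in remaining.values():
            needs.difference_update(ready)
    return groups


# Hypothetical workflow: two independent loads feed a join, a report and an audit.
example = {
    "load_users": set(),
    "load_orders": set(),
    "join": {"load_users", "load_orders"},
    "report": {"join"},
    "audit": {"load_orders"},
}
groups = parallel_groups(example)
print(groups)
# [['load_orders', 'load_users'], ['audit', 'join'], ['report']]
```

Read this way, each group boundary is a natural place for a join followed by a fork, derived purely from the dependencies.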

> • Take into account median job runtime to rearrange the workflow to minimize the critical path;

Minimizing the critical path necessarily means breaking up the DAG, which would mean breaking one flow into multiple flows. I don't see how manually specifying fork/join points would help (but I'm probably missing some context).
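To make "critical path" concrete for this discussion, here is a small sketch that finds the longest runtime-weighted chain of dependencies in a DAG. The `critical_path` helper, the action names, and the runtimes are all invented for illustration and do not come from pyoozie or from measured jobs:

```python
from functools import lru_cache


def critical_path(deps, runtime):
    """Return (total_runtime, path) for the slowest chain of dependent actions.

    Assumes `deps` is acyclic and `runtime` has an entry for every action.
    """

    @lru_cache(maxsize=None)
    def longest_to(action):
        # Slowest chain ending at `action`: its own runtime plus the
        # slowest chain among the actions it depends on.
        best_time, best_path = 0, ()
        for need in deps.get(action, ()):
            time, path = longest_to(need)
            if time > best_time:
                best_time, best_path = time, path
        return best_time + runtime[action], best_path + (action,)

    return max((longest_to(a) for a in deps), key=lambda result: result[0])


# Hypothetical median runtimes in minutes.
deps = {
    "load_users": set(),
    "load_orders": set(),
    "join": {"load_users", "load_orders"},
    "report": {"join"},
    "audit": {"load_orders"},
}
runtime = {"load_users": 5, "load_orders": 20, "join": 10, "report": 2, "audit": 30}

total, path = critical_path(deps, runtime)
print(total, path)
# 50 ('load_orders', 'audit')
```

With these made-up numbers the workflow cannot finish in under 50 minutes however the forks and joins are arranged, because `audit` must wait for `load_orders`.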

> • Multiply schedule the same job within a workflow and use external synchronization to minimize the critical path.

Sorry, but I don't understand this. Let's chat IRL to make sure I'm not missing something.

cfournie commented 7 years ago

Talked IRL and I'll begin prototyping a different approach.