framed-data / overseer

Overseer is a library for building and running data pipelines in Clojure.
Eclipse Public License 1.0
97 stars 10 forks source link

(Experimental) Store job args in EDN #42

Closed andrewberls closed 8 years ago

andrewberls commented 9 years ago

Previously, any user-specified arguments to jobs needed to already exist as attributes in Datomic. This is undesirable for several reasons, namely that it is inflexible, and frustratingly slow to iterate/change as all possible attributes have to be enumerated and transacted into the DB. Furthermore it unnecessarily couples that aspect of the system to Datomic, although backend indifference is not a design goal yet.

It is much better to store arbitrary job arguments in a single system-defined EDN attribute, as arguments were just maps of primitive datatypes anyways. This gives total flexibility to present and future callers. The amount of code required to accomplish this in Overseer is surprisingly small, although it will cause breaking changes in all callers (both at job injection site and handlers) that will require work to fix before rollout.

This is labeled as experimental because it is a low priority issue and intended to serve for discussion purposes - for example, is modifying the signature of :process handlers to be [job args] the right call? All jobs have :job/args stored on them, but it seems lame to require users to destruct/parse that themselves. Additionally, the output of pre-process is no longer passed into process as in practice harnesses serve the purpose of providing additional context to handlers, and pre-processing should really be for setting up pre-requisite state (and requiring the user to return a job object is strange)

elliot42 commented 9 years ago

In order to make this not a completely chaotic breaking change, I would consider, at least temporarily, using the fact that a handler declaration can be a map, and having the contract be that :process entries conform to the old contract, and :process-edn (or some other more appropriate name) be the new contract. Then the transition process is:

  1. enable new process-edn support
  2. transition all old job handlers to process-edn
  3. when there are no users of process, switch process over to process-edn contract (at which point both process and process-edn have the same contract)
  4. rename all process-edn handlers to process handlers
  5. remove support for process-edn
elliot42 commented 9 years ago

Also if args wasn't a map, but was instead a vector, then the function signature of job handlers can be literally anything you want, because you would then be able to treat it like:

(apply handler system job args)

So all your function signatures could be:

(defn handler [system job any other args you want]
  ...)
andrewberls commented 9 years ago

I sort of disagree on strategy - I don't think rolling out temporary code in the framework is the right call. I'm willing to bet that we can:

1) Pretty trivially modifying existing process calls to be [job args]

2) Change our job arg accessors to be like (job.util/->stream db args) I think we're lucky in that we have a very small/self-contained number of accessors and that the strong emphasis on run-standalone means the footprint here should be pretty small

3) Change job injection to use appropriately-shaped args - entity ids and so on

Of course all of this can and should be extensively tested locally before deployment, so all in all that's why I don't think introducing temporary code into the framework is the right idea.

elliot42 commented 9 years ago

What are you saving by doing that? The framework is not open source, there is zero cost to having both things work at once, and there is a bunch of complexity and risk of breaking everything by having the strategy require a cross-repo atomic change-everything-at-once switch. It is far more likely that we will break everything at once in multiple places having to change all of this than it will just work perfectly in all places in one switch.

andrewberls commented 9 years ago

They are just different strategies. In my mind, a partial rollout with process-edn represents a bunch of code churn across both the framework and the app as well as more error footprint (does the actual implementation work correctly? and does the temporary strat code work correctly?)

vs the other strategy, which is basically akin to chasing down compile errors until you have a unit you are confident in. The reason I'm advocating for this is because we can test this thing to basically an arbitrary degree of sophistication entirely locally - injecting and running a comprehensive job graph against a local db until it succeeds end to end. In which case I have the same degree of confidence in the end result, without the code churn in the framework itself.