framed-data / overseer

Overseer is a library for building and running data pipelines in Clojure.
Eclipse Public License 1.0

First pass at priority-based job selection #27

Closed: andrewberls closed this 9 years ago

andrewberls commented 9 years ago

This adds the ability for users to specify job priorities at insertion time, such that higher priority jobs are run before others.

The job selection algorithm is modified such that out of all possible jobs that are ready to run, instead of selecting a random one, jobs will first be selected on the basis of manually-specified priority, if present. Priorities are completely optional; if unspecified, jobs will now be selected based on their creation date, oldest jobs first.
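The selection order described above can be sketched over plain maps. This is only an illustration, not the code in this PR: `ready-jobs`, `selection-key`, and `select-job` are hypothetical names, the jobs are in-memory maps rather than Datomic entities, and it assumes that any job with a priority outranks every job without one.

```clojure
;; Hypothetical ready jobs; in Overseer these would be Datomic entities.
(def ready-jobs
  [{:job/id "a" :job/created-at #inst "2015-06-01T10:00:00Z"}
   {:job/id "b" :job/priority 5 :job/created-at #inst "2015-06-01T12:00:00Z"}
   {:job/id "c" :job/priority 0 :job/created-at #inst "2015-06-01T11:00:00Z"}
   {:job/id "d" :job/created-at #inst "2015-06-01T09:00:00Z"}])

(defn selection-key
  "Sort key for a job: prioritized jobs first (assumption), then lower
  priority number, then oldest creation time."
  [job]
  [(if (contains? job :job/priority) 0 1)
   (or (:job/priority job) 0)
   (.getTime ^java.util.Date (:job/created-at job))])

(defn select-job
  "Returns the job that sorts first under selection-key, or nil if there
  are no ready jobs."
  [jobs]
  (first (sort-by selection-key jobs)))

;; Usage: with the data above, job "c" (priority 0) is selected first.
(select-job ready-jobs)
```

Because the choice is deterministic, every caller given the same set of ready jobs gets the same answer.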

Two new attributes are added to the core Datomic schema:

:job/priority - An optional numeric priority. Lower numbers = higher priority, so 0 = top priority.

:job/created-at - The time the job was inserted. This seems useful for a wide variety of status/introspection functions, and its first use is the date-based selection.
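For illustration, the two attributes might be declared roughly as follows. The value types (`:db.type/long`, `:db.type/instant`), the cardinality, and the name `job-schema-sketch` are assumptions; the PR text does not state them. Older Datomic versions also require `:db/id` and `:db.install/_attribute` on each attribute map, which this sketch omits.

```clojure
;; Assumed attribute definitions; types and cardinality are guesses,
;; not taken from the Overseer schema.
(def job-schema-sketch
  [{:db/ident       :job/priority
    :db/valueType   :db.type/long
    :db/cardinality :db.cardinality/one
    :db/doc         "Optional numeric priority. Lower numbers run first; 0 is top priority."}
   {:db/ident       :job/created-at
    :db/valueType   :db.type/instant
    :db/cardinality :db.cardinality/one
    :db/doc         "The time the job was inserted."}])
```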

To specify a priority, the user passes an attribute map to api/->graph-txn:

(overseer/->graph-txn my-job-graph {:job/priority 0})

elliot42 commented 9 years ago

Can we talk about what prompted this?

I don't think introducing the notion of priority into the system is the right solution for occasionally needing to run individual customers right now, which we shouldn't need to do anyway.

Furthermore, selecting the oldest job first means that in a pool of N workers, every worker will try to grab the same oldest job instead of spreading out across the pool of jobs that are ready. This breaks the ability of the system to scale out properly.

elliot42 commented 9 years ago

The system is intentionally not an ordered priority queue, and if we wanted one of those, we shouldn't be reinventing it with this system.

andrewberls commented 9 years ago

This was just a for-fun on a plane. Good point about the date stuff; it's pretty trivial to go back to the random selection when manual priority is not present. And so far, running individual customers is an ongoing occurrence which is what brought it to mind.
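A minimal sketch of that fallback, assuming jobs are plain maps: use the manual priority when any ready job has one, and otherwise pick at random so that workers spread across the ready pool. The name `select-job-random-fallback` is hypothetical, and breaking ties between equal priorities at random is an added assumption, not something decided in this thread.

```clojure
(defn select-job-random-fallback
  "If any ready job has a :job/priority, returns a random job among those
  with the lowest priority number. Otherwise returns a uniformly random
  ready job. Returns nil when there are no ready jobs."
  [jobs]
  (when (seq jobs)
    (if-let [prioritized (seq (filter :job/priority jobs))]
      ;; 0 is truthy in Clojure, so priority-0 jobs pass the filter above.
      (let [top (apply min (map :job/priority prioritized))]
        (rand-nth (filterv #(= top (:job/priority %)) prioritized)))
      (rand-nth (vec jobs)))))

;; Usage: with no priorities set, repeated calls spread across both jobs.
(frequencies
  (repeatedly 100 #(:job/id (select-job-random-fallback
                              [{:job/id "a"} {:job/id "b"}]))))
```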