dchackett / taxi

Lightweight portable workflow management system for MCMC applications
MIT License
3 stars 1 forks source link

Separation of taxi/dispatcher concepts #22

Open dchackett opened 7 years ago

dchackett commented 7 years ago

As it stands, the code is not the most faithful representation of a "taxi-dispatcher" model. The naive/direct implementation of the taxi-dispatcher model would have a dispatcher program that delivers orders to taxis upon being queried. The dispatcher contains all logic about which task to run next, dependencies, etc. The taxis are unintelligent and do the job they're assigned.

The functionality of the taxi and dispatcher are currently mixed together in taxi.py. Dispatcher functionality includes determining the next job to run, marking jobs complete/etc. Taxi functionality is: run task; report task is done; ask for new task.

We can separate these out better by implementing a "Taxi" class and a "Dispatcher" class. Each 'taxi.py' will create an instance of each class. The Taxi class handles all the control logic/"main loop", and knows how to call scripts, etc. The Dispatcher class handles all of the book-keeping for the SQL DB, determining the next task to run, etc.

The Dispatcher class can also solve another structural flaw by providing an encapsulating interface for the backend SQL database.

There are still some conceptual issues with having each taxi have its own instance of dispatcher, but maybe the way to think about this is as each taxi having an interface with the abstract/distal dispatcher. (c.f., taxis call the dispatcher using phones? Don't name anything phone.)

etneil commented 7 years ago

I am now digging in to this on the "modularize" branch, including a couple of other classes as we've discussed offline: "Pool" will handle the set of available Taxis, their properties and status, and interfacing with the queue. "BatchQueue" abstracts an interface to the queueing system, currently handled by local_taxi.py.

We also discussed a "DBBackend" class, but so far I'm not seeing any nice way to do that (after staring at the code, it seemed to me like this class would pretty much just be a re-implementation of pieces of SQLAlchemy.) Instead, my working model is that the parts of Pool and Dispatcher that talk to the backend will be abstract interfaces, to be filled in by child classes. I've begun to implement "SQLitePool" as an example of how this works.

dchackett commented 7 years ago

I am hesitant about this approach, because it may violate one of the design principles: this should all be usable straight out of the box with virtually no thinking about how it works or what it's doing. I much prefer the plugged-in backend model just because there can be a default and nobody ever thinks about it unless they want to change it, at which point they've already been thinking about what's going on in the backend.

I agree that we would have to duplicate limited functionality from SQLAlchemy, but that's the price paid for Python 2.6.6 compatibility (unless there's some version of it available for 2.6.6?)

I think any DBBackend object only needs to implement the following methods:

etneil commented 7 years ago

Point well-taken about simplicity and the DB backend organization, although with a common interface, the only place the end user should have to care is one line where they instantiate a SQLitePool or a PostgresPool or HDF5Pool or whatever. And since it's the same line where they will have to specify the remote connection info or file path anyway...

I partly went in this direction after looking at some of the existing SELECT queries with non-trivial filters on them in taxi.py, since they seemed difficult to fit into the pattern you outline above (especially with appropriate input sanitizing.) But maybe with further thought it can be made to work nicely.

dchackett commented 6 years ago

We may be able to move entirely over to the paradigm of storing objects (Taxis and Tasks) in some DB-specific format, then reconstructing those objects immediately so that more abstract Pool and Dispatcher methods can work with them. For instance, when Dispatcher queries from the Dispatch DB, it renders the rows from the DB in to reconstructed Task classes before any other method in Dispatcher sees them. After a taxi has run the task, the completed task object is reencoded and stored back in the DB. If we were to switch entirely over to using this scheme, we could implement a general backend that reconstructs objects from DB rows/whatever.

This would come with some limitations. Specifically, we wouldn't be able to do fast one-off queries like the last minute race-condition check on whether another taxi has claimed a task. Everything would have to be object check-outs (read from DB) and check-ins (write to DB). Additionally, a completely abstract backend might require pool and dispatcher DBs to have more structure in common than is graceful. There are also some tables presently, like "imports" in the Dispatcher DB, that don't neatly fit in to this scheme (although they could likely be shoehorned in).