dchackett / taxi

Lightweight portable workflow management system for MCMC applications
MIT License

Smarter taxi management -- Trunks #10

Closed by dchackett 6 years ago

dchackett commented 7 years ago

Currently, new taxis are launched to work on a dispatch when certain "spawn" tasks are reached in the job forest. If the taxi subsequently dies unexpectedly, relaunching it is tedious.

Instead, each taxi can check how many taxis should be working on a dispatch at any given time. This is easily accomplished by marking some tasks as "trunk" tasks (as in a trunk or subtrunk of a tree in the task forest, versus a small branch). The number of taxis that should be running is then simply the number of active "trunks", i.e., the number of trunk tasks that are ready or active.
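A minimal sketch of the counting rule, for illustration only. The `Task` dataclass, its field names, and the status strings below are hypothetical stand-ins, not taxi's actual classes:

```python
from dataclasses import dataclass


@dataclass
class Task:
    """Hypothetical stand-in for a task in the forest (not taxi's real class)."""
    name: str
    status: str            # assumed values: 'pending', 'ready', 'active', 'complete'
    is_trunk: bool = False


def count_active_trunks(tasks):
    """Number of trunk tasks that are ready or active."""
    return sum(1 for t in tasks if t.is_trunk and t.status in ("ready", "active"))


# Illustrative forest: three HMC streams still in play, one finished,
# plus a non-trunk measurement task.
forest = [
    Task("hmc_stream_a", "active", is_trunk=True),
    Task("hmc_stream_b", "ready", is_trunk=True),
    Task("hmc_stream_c", "pending", is_trunk=True),   # not yet runnable
    Task("hmc_stream_d", "complete", is_trunk=True),  # already finished
    Task("spectro_a_100", "ready"),                   # ready, but not a trunk
]

# Only the active and ready trunks count toward the number of taxis wanted.
n_taxis_wanted = count_active_trunks(forest)
```

Here only `hmc_stream_a` and `hmc_stream_b` count: pending and completed trunks, and non-trunk tasks, do not call for a taxi of their own.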

Trunk tasks are, in general, the tasks that take up the bulk of the run time. In lattice applications, this will almost always mean HMC/gauge generation tasks. It could also be applied to, say, some long-running measurement task like computing overlap propagators.

Tasks are already marked as trunk or non-trunk ("trunkliness") in the abstract Task classes, which the "dispatch" class uses to figure out where to put "spawn" tasks while compiling. However, trunkliness is not presently stored in the dispatch DB file.

If trunkliness is stored in the dispatch DB file, then taxis just need to count the number of active trunk tasks periodically (between tasks) and launch new taxis as necessary. "Spawn" tasks will no longer be necessary.
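As a rough sketch of what this could look like, assuming the dispatch is kept in an SQLite file. The table name, column names, and status strings are assumptions made for the example, not taxi's actual schema:

```python
import sqlite3


def make_demo_dispatch():
    """Build an in-memory stand-in for a dispatch DB (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tasks ("
        " id INTEGER PRIMARY KEY,"
        " task_type TEXT,"
        " status TEXT,"
        " is_trunk INTEGER NOT NULL DEFAULT 0)"  # trunkliness stored per task
    )
    conn.executemany(
        "INSERT INTO tasks (task_type, status, is_trunk) VALUES (?, ?, ?)",
        [
            ("hmc", "active", 1),
            ("hmc", "ready", 1),
            ("hmc", "pending", 1),
            ("spectro", "ready", 0),
            ("copy", "pending", 0),
        ],
    )
    conn.commit()
    return conn


def count_active_trunks(conn):
    """Count trunk tasks that are ready or active."""
    row = conn.execute(
        "SELECT COUNT(*) FROM tasks"
        " WHERE is_trunk = 1 AND status IN ('ready', 'active')"
    ).fetchone()
    return row[0]


def taxis_to_launch(conn, n_running):
    """How many extra taxis a taxi would launch when it checks between tasks."""
    return max(0, count_active_trunks(conn) - n_running)
```

A taxi would call something like `taxis_to_launch(conn, n_running)` after finishing each task. How the number of running taxis is determined is left out here, since the issue does not specify it.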

This will also simplify dispatch compilation, and make dispatches slightly less "implementation-specific".

etneil commented 7 years ago

This is a great idea and will vastly improve the reliability of leaving Taxi running hands-off. Currently, if you don't check and relaunch things by hand, the number of taxis slowly decays to zero whenever jobs fail for one reason or another.

dchackett commented 6 years ago

Implemented in the present taxi v0.2 code.

There is some ambiguity about what to do when the task forest has no trunks. This occurs, for example, when performing a batch of measurements with a flat/trivial dependency structure (e.g., running spectroscopy on a bunch of pre-existing gauge files), or when all trunk tasks have been completed and only wrap-up measurement tasks and copy tasks remain. In this case, taxi will launch up to as many taxis as there are ready tasks, but will not create any additional taxis in the pool, which limits the total number of taxis that can be submitted to the queue.
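The fallback can be summarized in a few lines. This is a sketch of the rule as described above, not the v0.2 code; the function name and arguments are hypothetical, and capping the trunk case by the pool size as well is an assumption:

```python
def target_taxi_count(n_active_trunks, n_ready_tasks, pool_size):
    """Number of taxis that should be running (illustrative rule only).

    n_active_trunks: trunk tasks that are ready or active
    n_ready_tasks:   all tasks that are ready to run
    pool_size:       taxis that already exist in the pool (never grown here)
    """
    if n_active_trunks > 0:
        wanted = n_active_trunks   # normal case: one taxi per active trunk
    else:
        wanted = n_ready_tasks     # trunkless forest: one taxi per ready task
    # Assumption: the pool size caps both cases, not only the trunkless one.
    return min(wanted, pool_size)
```

For example, with no active trunks, ten ready spectroscopy tasks, and a pool of four taxis, this rule asks for four taxis rather than ten.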