A natural extension for labtech would be to support running tasks across multiple machines. This will require careful design and some refactoring of existing code in `lab.py`, so I'm opening this issue as a place to log ideas and plans.
The most important design goal is to make this approachable for researchers who have never had to do distributed computing before. Giving them tutorials and infrastructure-automation scripts to get up and running as easily as possible will be a big win.
To my knowledge, most data-focused distributed computation libraries (e.g. Dask, Spark) are about scaling operations against a single large dataset, whereas our goal would be to distribute many individual tasks across machines.
A better fit is likely an architecture built around a task queue:
```mermaid
flowchart LR
    C(Coordinator) --> Q[Task Queue: rabbitmq, redis]
    Q --> W1(Worker)
    Q --> W2(Worker)
    Q --> W3(Worker)
    W1 <--> S[Shared Storage: NFS, S3]
    W2 <--> S
    W3 <--> S
    S --> C
```
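As a rough illustration of that split (not labtech's actual API, and the task/function names are hypothetical), a minimal sketch using Celery with a Redis broker might look like this, with workers writing results to shared storage and returning only the storage keys:

```python
# Minimal sketch, assuming Celery + Redis as the task queue.
from celery import Celery

app = Celery(
    "labtech_workers",
    broker="redis://localhost:6379/0",
    backend="redis://localhost:6379/1",
)

@app.task
def run_task(task_spec: dict) -> str:
    # Hypothetical worker-side logic: run the task and persist the
    # result to shared storage, returning the storage key.
    result = task_spec["param_a"] * 2  # placeholder for real task logic
    storage_key = f"results/{task_spec['task_id']}"
    # save_to_shared_storage(storage_key, result)  # e.g. NFS path or S3 key
    return storage_key

# Coordinator side: enqueue many independent tasks, then wait for them.
if __name__ == "__main__":
    async_results = [run_task.delay({"task_id": i, "param_a": i}) for i in range(10)]
    keys = [r.get(timeout=60) for r in async_results]
```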
We'll need to be careful about the task-completion guarantees provided by different task queue libraries. One simplifying factor is that a single Coordinator node can be responsible for ensuring dependency tasks are completed before dependent tasks are scheduled.
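One way this could work (purely a sketch, with `submit` and `wait_for_any` as hypothetical hooks onto whichever queue library we pick) is for the coordinator to only enqueue tasks whose dependencies have already completed, so the queue itself never needs to understand the dependency graph:

```python
# Coordinator-side dependency handling sketch: tasks maps each task id
# to the list of task ids it depends on.
def schedule(tasks: dict[str, list[str]], submit, wait_for_any):
    completed: set[str] = set()
    pending: set[str] = set()
    remaining = dict(tasks)
    while remaining or pending:
        # Submit every task whose dependencies are all complete.
        ready = [t for t, deps in remaining.items() if set(deps) <= completed]
        for t in ready:
            submit(t)
            pending.add(t)
            del remaining[t]
        if not ready and not pending:
            raise RuntimeError("dependency cycle detected")
        # Block until one of the pending tasks finishes, then mark it done.
        done = wait_for_any(pending)
        pending.discard(done)
        completed.add(done)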
The `Storage` backend would need to be shared across workers, so a networked backend like an NFS share or an S3 bucket would be required (unless the user has already mounted a shared filesystem at the same path on every worker).
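A sketch of what an S3-backed shared storage could look like behind a minimal save/load interface (class and method names are assumptions, not labtech's existing `Storage` API); an NFS-backed equivalent would just read and write files under a mount point shared by all workers:

```python
# Minimal sketch of an S3-backed shared storage, using boto3.
import boto3

class S3Storage:
    def __init__(self, bucket: str, prefix: str = "labtech/"):
        self._s3 = boto3.client("s3")
        self._bucket = bucket
        self._prefix = prefix

    def save(self, key: str, data: bytes) -> None:
        # Store raw result bytes under a namespaced key.
        self._s3.put_object(Bucket=self._bucket, Key=self._prefix + key, Body=data)

    def load(self, key: str) -> bytes:
        obj = self._s3.get_object(Bucket=self._bucket, Key=self._prefix + key)
        return obj["Body"].read()
```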
Ideally, there should be minimal configuration required on each worker. It does seem like the `Task` definitions and `Lab` (including `context`) construction will need to be present on each worker, but it would be nice if the coordinator could send all of that over the wire instead.
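If we do want to send definitions over the wire, something like cloudpickle could serialize the task class and context on the coordinator and rebuild them on the worker; the function names and the `run(context=...)` method below are hypothetical, just to show the shape of the idea:

```python
# Sketch of shipping a task class plus context to workers with cloudpickle.
import cloudpickle

# Coordinator side: serialize the task class and any context it needs.
def pack_for_worker(task_cls, context: dict) -> bytes:
    return cloudpickle.dumps({"task_cls": task_cls, "context": context})

# Worker side: rebuild the task class and context, then run a task instance.
def run_packed(payload: bytes, params: dict):
    unpacked = cloudpickle.loads(payload)
    task = unpacked["task_cls"](**params)
    return task.run(context=unpacked["context"])  # assumes a run(context=...) method
```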