We can build a graph as a Python dict on our own and have dask schedule and execute it across multiple threads, processes, or (via distributed) multiple nodes
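A minimal sketch of that hand-built graph idea (the task names here are made up): a dask graph is just a dict mapping keys to values or `(callable, *argument_keys)` tuples, and any of the schedulers can execute it.

```python
import dask
from dask.threaded import get as threaded_get

def load():
    return list(range(10))

def clean(data):
    return [x for x in data if x % 2 == 0]

def summarize(data):
    return sum(data)

# keys -> (callable, *argument keys); this dict *is* the dask graph
dsk = {
    "load": (load,),
    "clean": (clean, "load"),
    "summary": (summarize, "clean"),
}

print(dask.get(dsk, "summary"))        # single-threaded, synchronous scheduler
print(threaded_get(dsk, "summary"))    # same graph on the threaded scheduler
```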
This will serve as a thread for thoughts on the comparison for this project
Many other scientific communities have data-processing pipeline software they use (e.g., dozens of tools from bioinformatics, nipype in neuroimaging, etc.)
Many of these pipelines are very file-oriented, using the existence of files for dependency checking and writing out files as task outputs
Dask is better suited to a computation pipeline where we only do I/O at the start or end
Dask "delayed" could wrap up large swaths of imperative code (the tasks I want to execute) and chain them together, but should also allow for dask graphs within a task?
Supporting a diversity of executors (sync, threads, processes, distributed) is very important for supporting a diversity of parallelization strategies (see the sketch after these examples)
I might see this code run on our cluster, where I'd use distributed from 1 node connecting to 100s of workers
Someone else might run it on our cluster but use many single-task, synchronous jobs that they kick off themselves (ncpu=32, ntask=32, task/cpu=1)
In some environments it's probably easier to submit fewer jobs, or otherwise run the CLI script on fewer nodes, and parallelize internally across all the CPU slots on each node (ntask=32 and ncpu=2, task/cpu=16)
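To illustrate what that executor diversity buys us, the same delayed pipeline can be pointed at different schedulers at compute time. The distributed lines assume a dask.distributed scheduler is reachable; the address is a placeholder.

```python
import dask

@dask.delayed
def work(x):
    return x * x

total = dask.delayed(sum)([work(i) for i in range(8)])

total.compute(scheduler="synchronous")   # single-threaded, easy to debug
total.compute(scheduler="threads")       # thread pool in this process
total.compute(scheduler="processes")     # local process pool

# distributed: connect a Client and it becomes the default scheduler
# from dask.distributed import Client
# client = Client("tcp://scheduler-address:8786")  # placeholder address
# total.compute()
```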
Currently using another library to toposort task dependencies (after turning them into keys) before creating the delayed pipeline, but might revisit
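As a point of reference, the stdlib graphlib (Python 3.9+) can do that toposort of keyed dependencies; just a sketch with hypothetical task keys, not necessarily what the code uses today:

```python
from graphlib import TopologicalSorter

# task key -> set of keys it depends on (hypothetical tasks)
deps = {
    "clean": {"load"},
    "fit": {"clean"},
    "report": {"fit", "clean"},
}

order = list(TopologicalSorter(deps).static_order())
print(order)  # dependencies always come before dependents, starting with 'load'
```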
Considering two frameworks for turning a sequence of tasks specified by the user into a compute graph that can be executed (item 3 in #90):
Luigi: luigi.run() to run a set of tasks
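For comparison, a minimal Luigi version of a two-step pipeline; this is just an illustrative sketch of Luigi's file-target style with made-up task names, using luigi.build() as the in-process counterpart to running tasks via luigi.run() from the command line.

```python
import luigi

class Load(luigi.Task):
    def output(self):
        return luigi.LocalTarget("load.txt")

    def run(self):
        with self.output().open("w") as f:
            f.write("1 2 3\n")

class Summarize(luigi.Task):
    def requires(self):
        return Load()

    def output(self):
        return luigi.LocalTarget("summary.txt")

    def run(self):
        with self.input().open() as f:
            total = sum(int(x) for x in f.read().split())
        with self.output().open("w") as f:
            f.write(str(total))

if __name__ == "__main__":
    # dependency checking is file-based: tasks whose output files exist are skipped
    luigi.build([Summarize()], local_scheduler=True)
```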