We can build a graph as a Python dict on our own and have dask schedule and execute it across multiple threads, processes, or (via distributed) multiple nodes
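A minimal sketch of that hand-built graph idea (the task names here are made up): a dask graph is just a dict mapping keys to values or `(callable, *argument_keys)` tuples, and any of the schedulers can execute it.

```python
import dask
from dask.threaded import get as threaded_get

def load():
    return list(range(10))

def clean(data):
    return [x for x in data if x % 2 == 0]

def summarize(data):
    return sum(data)

# keys -> (callable, *argument keys); this dict *is* the dask graph
dsk = {
    "load": (load,),
    "clean": (clean, "load"),
    "summary": (summarize, "clean"),
}

print(dask.get(dsk, "summary"))        # single-threaded, synchronous scheduler
print(threaded_get(dsk, "summary"))    # same graph on the threaded scheduler
```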
This will serve as a thread for thoughts on the comparison for this project
Many other scientific communities have data-processing pipeline software they use (e.g., dozens of tools from bioinformatics, nipype in neuroimaging, etc.)
Many of these pipelines are very file-oriented, using the existence of files for dependency checking and writing out files as task outputs
Dask is better suited to a computation pipeline where we only do I/O at the start or end
Dask "delayed" could wrap up large swaths of imperative code (the tasks I want to execute) and chain them together, but should also allow for dask graphs within a task?
Supporting a diversity of executors (sync, threads, processes, distributed) is very important for supporting a diversity of parallelization strategies (see the sketch after these examples)
I might see this code run on our cluster, where I'd use distributed from 1 node connecting to 100s of workers
Someone else might run it on our cluster but use many single-task, synchronous jobs that they kick off themselves (ncpu=32, ntask=32, task/cpu=1)
In some environments it's probably easier to submit fewer jobs, or otherwise run the CLI script on fewer nodes, and parallelize internally across all the CPU slots on each node (ntask=32 and ncpu=2, task/cpu=16)
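To illustrate what that executor diversity buys us, the same delayed pipeline can be pointed at different schedulers at compute time. The distributed lines assume a dask.distributed scheduler is reachable; the address is a placeholder.

```python
import dask

@dask.delayed
def work(x):
    return x * x

total = dask.delayed(sum)([work(i) for i in range(8)])

total.compute(scheduler="synchronous")   # single-threaded, easy to debug
total.compute(scheduler="threads")       # thread pool in this process
total.compute(scheduler="processes")     # local process pool

# distributed: connect a Client and it becomes the default scheduler
# from dask.distributed import Client
# client = Client("tcp://scheduler-address:8786")  # placeholder address
# total.compute()
```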
Currently using another library to toposort task dependencies (after turning them into keys) before creating the delayed pipeline, but might revisit
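As a point of reference, the stdlib graphlib (Python 3.9+) can do that toposort of keyed dependencies; just a sketch with hypothetical task keys, not necessarily what the code uses today:

```python
from graphlib import TopologicalSorter

# task key -> set of keys it depends on (hypothetical tasks)
deps = {
    "clean": {"load"},
    "fit": {"clean"},
    "report": {"fit", "clean"},
}

order = list(TopologicalSorter(deps).static_order())
print(order)  # dependencies always come before dependents, starting with 'load'
```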
Considering two frameworks for turning a sequence of tasks specified by the user into a compute graph that can be executed (item 3 in #90):
Luigi: luigi.run() to run a set of tasks
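For comparison, a minimal Luigi version of a two-step pipeline; this is just an illustrative sketch of Luigi's file-target style with made-up task names, using luigi.build() as the in-process counterpart to running tasks via luigi.run() from the command line.

```python
import luigi

class Load(luigi.Task):
    def output(self):
        return luigi.LocalTarget("load.txt")

    def run(self):
        with self.output().open("w") as f:
            f.write("1 2 3\n")

class Summarize(luigi.Task):
    def requires(self):
        return Load()

    def output(self):
        return luigi.LocalTarget("summary.txt")

    def run(self):
        with self.input().open() as f:
            total = sum(int(x) for x in f.read().split())
        with self.output().open("w") as f:
            f.write(str(total))

if __name__ == "__main__":
    # dependency checking is file-based: tasks whose output files exist are skipped
    luigi.build([Summarize()], local_scheduler=True)
```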