Settle on a pipeline framework

ebrahimebrahim commented 8 months ago

There is pydra which is supposedly a successor to nipype. We should consider this and other options for our pipeline framework. There are many options out there. We should also keep in mind the requirement that each step of the pipeline be easily turned into a command line call that can in turn easily be wrapped into a Slicer command line module.

ebrahimebrahim commented 8 months ago

Other ones to consider are Nipype and ruffus. Do we want something specific to neuroimaging or is a generic thing better

allemangD commented 8 months ago

Some thoughts copied from Google Spaces discussion for posterity:

(@allemangD) As far as pipelining our own work... generally I like the ethos of Ruffus but I'm hesitant to pull it in from the lack of recent development. Of the options I've seen it looks like pydra is better suited, better maintained for recent versions of Python etc, even if it is maybe a heavier library.

If we want to keep things lightweight, though, the fundamentals of invoking tasks in order based on timestamps is not really all that complex - graphlib and subprocess are builtin modules. The benefit of Pydra is it already has abstractions for different compute backends (function call, subprocess, docker, etc). In theory someone could compose one of our workflows into their larger Pydra workflow but I'm not sure how much of a selling point that lock-in really is if we have a decent API.

In short my sense is to either keep things simple and just do it ourselves based on timestamps, or else just grab Pydra since it's actively maintained by the nipype folks.

(@ebrahimebrahim) I'd like to better understand exactly what it is that pydra adds to "just doing it ourselves". If it's adding enough savings in overhead and making our code more maintainable and readable by others, then based on what you're saying I am leaning to pydra

(@allemangD)I have one more general question that I think informs the decision: what do we want the API(s) to look like?

If we adopt a particular pipelining tool at the library level, then the API must be that pipelining tool. I don't think that's a good option, especially not with Ruffus.

If we produce a set of CLI tools, the API may be a shell, makefile, or any pipelining tool that supports CLI steps (most of them). We could even include some "pydra compat" functions that produce appropriate pydra steps.

If we produce a set of plain Python functions, the API is plain Python. Thin CLI entrpoints with argparse or click is trivial. Creating pydra compat is trivial - Pydra supports python-function steps out of the box.

From what I can tell, the only thing we really give up with plain CLI and/or plain Python is caching. It's not hard to stat a file: unless the user passes a flag to ignore cache, if the output's creation time is after the input's modified time, do nothing. I think this solves most cases with minimal overhead, and leaves room for the tool to be consumed by any task runner. If it turns out we do need something from eg Pydra, we already have those plain-python-functions that can be easily added to a Pydra pipeline.

(@ebrahimebrahim) I love the idea of plain python API fundamentally with Pydra used for putting together pipelines. I don't think we should do the caching and stat stuff ourselves -- it seems like too much of an already-solved and standard problem.

(@allemangD) If we added that logic, it would probably belong in a thin CLI entrypoint. I wouldn't expect a plain python call to secretly be a no-op based on the filesystem, and it would probably complicate Pydra or other runners usage.

For reference on what I mean by "entry point": https://setuptools.pypa.io/en/latest/userguide/entry_point.html

And "thin CLI wrapper" would probably be easiest with click, but we could do it with argparse if we wanted to avoid the dependency. https://click.palletsprojects.com/en/8.1.x/

Just point the entrypoint to a click-decorated function that loads files, calls the appropriate python library function on the data, then writes files back out to disk.

ebrahimebrahim commented 8 months ago

It sounds like we are leaning towards:

pure python api as the foundation
thin cli wrappers for the python api, via click
cli entrypoints to end-to-end pipelines with smart caching and all, via pydra

Pushing the pydra integration toward the end, but maybe also trying it out early on just so we make sure we haven't missed anything about what demands it places on the design

allemangD commented 8 months ago

I'm closing this issue and moving forward with https://github.com/brain-microstructure-exploration-tools/abcd-microstructure-pipelines/issues/2#issuecomment-1999881936 as the near-term plan. I'll be sure that the plain python API is compatible with Pydra - if there are any red flags I'll create a new issue, or re-open this one if there's some dealbreaker.

ebrahimebrahim commented 7 months ago

Revised plan per our meeting today:

We are ditching pydra. The documentation is obtuse, so it doesn't make it easier for users to create workflows. It isn't handling caching. And we can handle caching; that is not difficult.
Our replacement for pydra: nothing. Just use plain python to the extent possible. We thought about creating our own task abstraction and using python's built-in graph sorting to achieve parallelism, but the performance gain is minimal compared to just using multiprocess.Pool to internally parallelize individual steps over many subject cases. And the loss is huge for transparency and maintainability of code and making it easy for users of the API to compose their own workflows. Nothing beats plain python for that.

So the current plan is to

achieve caching by our own decorator. #17
achieve parallelism by whatever means makes sense internally to each component function we create. So for example many steps can simply use multiprocess.Pool. Other steps like the HD-BET mask generation can do something else. This way we lose out on global parallelism of the end-to-end pipeline, but that is okay because (a) there are bottlenecks in the pipeline that make end-to-end parallelism not that much better than per-component parallelism, and (b) we can always go back and combine components into better-parallelized chunks later on if we want to.
achieve workflow definitions via plain python. Just a function that calls a series of other functions.

brain-microstructure-exploration-tools / abcd-microstructure-pipelines

Settle on a pipeline framework #2