brain-microstructure-exploration-tools / abcd-microstructure-pipelines

Processing pipelines to extract brain microstructure from ABCD Study dMRI
Apache License 2.0

Settle on a pipeline framework #2

ebrahimebrahim closed this issue 8 months ago

ebrahimebrahim commented 8 months ago

There is pydra, which is supposedly the successor to nipype. We should consider it alongside the many other options out there for our pipeline framework. We should also keep in mind the requirement that each step of the pipeline be easily turned into a command-line call that can in turn easily be wrapped into a Slicer command-line module.

ebrahimebrahim commented 8 months ago

Other options to consider are Nipype and Ruffus. Do we want something specific to neuroimaging, or is a generic tool better?

allemangD commented 8 months ago

Some thoughts copied from Google Spaces discussion for posterity:

(@allemangD) As far as pipelining our own work... generally I like the ethos of Ruffus, but I'm hesitant to pull it in given the lack of recent development. Of the options I've seen, it looks like pydra is better suited and better maintained for recent versions of Python, even if it is maybe a heavier library.

If we want to keep things lightweight, though, the fundamentals of invoking tasks in order based on timestamps are not really all that complex: graphlib and subprocess are built-in modules. The benefit of Pydra is that it already has abstractions for different compute backends (function call, subprocess, docker, etc.). In theory someone could compose one of our workflows into their larger Pydra workflow, but I'm not sure how much of a selling point that lock-in really is if we have a decent API.
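As a rough illustration of the stdlib-only approach mentioned above, a task graph can be ordered with `graphlib` and each step run via `subprocess` when stale. The step names here are hypothetical stand-ins, not actual pipeline steps:

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline steps; each maps to the set of steps it depends on.
graph = {
    "denoise": set(),
    "mask": {"denoise"},
    "fit_dti": {"denoise", "mask"},
}

# static_order() yields steps with dependencies first; each step could then
# be executed with subprocess.run(...) only when its outputs are out of date.
order = list(TopologicalSorter(graph).static_order())
```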

In short my sense is to either keep things simple and just do it ourselves based on timestamps, or else just grab Pydra since it's actively maintained by the nipype folks.

(@ebrahimebrahim) I'd like to better understand exactly what it is that pydra adds to "just doing it ourselves". If it adds enough savings in overhead and makes our code more maintainable and readable by others, then based on what you're saying I am leaning toward pydra.

(@allemangD) I have one more general question that I think informs the decision: what do we want the API(s) to look like?

If we adopt a particular pipelining tool at the library level, then the API must be that pipelining tool. I don't think that's a good option, especially not with Ruffus.

If we produce a set of CLI tools, the API may be a shell, makefile, or any pipelining tool that supports CLI steps (most of them). We could even include some "pydra compat" functions that produce appropriate pydra steps.

If we produce a set of plain Python functions, the API is plain Python. Thin CLI entrypoints with argparse or click are trivial. Creating pydra compat is also trivial: Pydra supports python-function steps out of the box.

From what I can tell, the only thing we really give up with plain CLI and/or plain Python is caching. It's not hard to stat a file: unless the user passes a flag to ignore the cache, if the output's creation time is after the input's modified time, do nothing. I think this solves most cases with minimal overhead, and leaves room for the tool to be consumed by any task runner. If it turns out we do need something from e.g. Pydra, we already have those plain-python functions that can be easily added to a Pydra pipeline.
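A minimal sketch of that timestamp check (the function name, signature, and `force` flag are made up for illustration):

```python
from pathlib import Path


def is_stale(output: Path, *inputs: Path, force: bool = False) -> bool:
    """Return True if `output` is missing, or older than any of its inputs.

    `force` corresponds to a user flag that bypasses the cache entirely.
    """
    if force or not output.exists():
        return True
    out_mtime = output.stat().st_mtime
    return any(inp.stat().st_mtime > out_mtime for inp in inputs)
```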

(@ebrahimebrahim) I love the idea of plain python API fundamentally with Pydra used for putting together pipelines. I don't think we should do the caching and stat stuff ourselves -- it seems like too much of an already-solved and standard problem.

(@allemangD) If we added that logic, it would probably belong in a thin CLI entrypoint. I wouldn't expect a plain python call to secretly be a no-op based on the filesystem, and it would probably complicate usage with Pydra or other runners.

For reference on what I mean by "entry point": https://setuptools.pypa.io/en/latest/userguide/entry_point.html

And "thin CLI wrapper" would probably be easiest with click, but we could do it with argparse if we wanted to avoid the dependency. https://click.palletsprojects.com/en/8.1.x/

Just point the entrypoint to a click-decorated function that loads files, calls the appropriate python library function on the data, then writes files back out to disk.
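A hypothetical sketch of such a thin wrapper, shown here with argparse (the no-dependency option mentioned above); `compute_mask` is a made-up stand-in for a real library function:

```python
import argparse


def compute_mask(values):
    # Stand-in for the actual plain-python library function, which would
    # operate on in-memory data and know nothing about the filesystem.
    return [v > 0 for v in values]


def main(argv=None):
    parser = argparse.ArgumentParser(
        description="Thin wrapper: load files, call the library, write files."
    )
    parser.add_argument("input")
    parser.add_argument("output")
    args = parser.parse_args(argv)

    with open(args.input) as f:
        values = [float(line) for line in f if line.strip()]

    mask = compute_mask(values)

    with open(args.output, "w") as f:
        f.writelines(f"{int(m)}\n" for m in mask)


if __name__ == "__main__":
    main()
```

The console-script entry point would then target `main`, keeping all I/O at the boundary so the library function stays usable from plain Python or Pydra.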

ebrahimebrahim commented 8 months ago

It sounds like we are leaning towards:

Pushing the pydra integration toward the end, but maybe also trying it out early on just so we make sure we haven't missed anything about what demands it places on the design

allemangD commented 8 months ago

I'm closing this issue and moving forward with https://github.com/brain-microstructure-exploration-tools/abcd-microstructure-pipelines/issues/2#issuecomment-1999881936 as the near-term plan. I'll be sure that the plain python API is compatible with Pydra - if there are any red flags I'll create a new issue, or re-open this one if there's some dealbreaker.

ebrahimebrahim commented 7 months ago

Revised plan per our meeting today:

So the current plan is to