MATLAB-Community-Toolboxes-at-INCF / neuropixels-toolkit

(Under construction) Neuropixel Toolkit is a set of unified Matlab tools for manipulating Neuropixel and Kilosort datasets
MIT License

Design, implement, and utilize an object-oriented pipeline architecture based on prior reference #1

Open vijayiyer05 opened 2 years ago

vijayiyer05 commented 2 years ago

The Janelia implementation of the Allen Institute's pipeline is the suggested prior reference architecture: https://github.com/jenniferColonell/ecephys_spike_sorting

As discussed at the kick-off meeting, this issue captures the "abstract" component of the implementation, whereas #2 captures the first concrete implementation.

For this issue, you might consider the "pipes and filters" ideas from the design patterns community (mostly Java programmers in my experience). You might also consider ideas from the dataflow programming community.

As mentioned, I could also imagine a good solution being quite lightweight and straightforward.

vijayiyer05 commented 2 years ago

@Yambottle, @iamamutt, @dimitri-yatsenko...

Just briefly capturing some of the first observations from our review meeting last week:

  • Overall package name should be more identifying (see Andrew's comments) e.g. +neuropixels
  • For the Use Case Diagram, could depict the possibility for N probes, not just 1 or 2

@apjanke and I had a good further discussion this week. Stay tuned for more feedback shortly!

yambottle commented 2 years ago

@Yambottle, @iamamutt, @dimitri-yatsenko...

Just briefly capturing some of the first observations from our review meeting last week:

  • Overall package name should be more identifying (see Andrew's comments) e.g. +neuropixels
  • For the Use Case Diagram, could depict the possibility for N probes, not just 1 or 2

@apjanke and I had a good further discussion this week. Stay tuned for more feedback shortly!

Thanks Vijay! I've updated these points that you suggested last time. I'll update the pending PR tomorrow or early next Monday.

apjanke commented 2 years ago

I'd recommend using something like +npxtoolkit instead of +neuropixels as the top-level package name, because Neuropixels themselves (the probe manufacturer, not the authors of the Neuropixels Utils library) publish their own Matlab code too, and we should probably let them keep their main name to themselves.

The two important things about the top-level package name are:

It should try to be unique across the whole world of Matlab code to avoid collisions, since the Matlab package namespace is a single global one, and who knows what other code a user of your library might be pulling in from elsewhere? There's no central name registry for Matlab M-code package names to coordinate this name selection, so you just have to guess and do your best. Try to pick a name that nobody else would choose on their own. I bet the Neuropixels probe manufacturer might well pick +neuropixels.

The first couple characters should be as unique and distinguishing as possible, amongst all global Matlab identifiers (the set of package names, function names, and class names), to make tab-completion work well for users. (So your users can just type in a short prefix, hit tab, and find your top-level package name quickly.) This is important because Matlab's import statement is function-scoped, not file-scoped, so it doesn't get used all that much, and people using package-qualified Matlab code end up typing in the package name a lot.

The "ne" prefix has a ton of matches:

[screenshot: tab-completion list showing many existing "ne" matches]

The "np" prefix name space is wide open:

[screenshot: tab-completion list showing no existing "np" matches]

yambottle commented 2 years ago

Thanks for your suggestion, Andrew! I'll update the package name.

apjanke commented 2 years ago

Hi, folks! A couple thoughts on the initial design:

Minor things

In the Class Diagram:

The Config -> Pipeline relationship is labeled as (1, 0..1). Should that be 1-to-* instead, since a Config can produce multiple Pipelines? Or should I expect that the launch.m function will have multiple Config objects, one per Pipeline that it is producing?

I read the Use Case diagram as saying that a Pipeline can have multiple Stages, and a Stage can have multiple probes, which I assume correspond to Jobs. I think that means the "->" associations in the class diagram from Pipeline -> Stage and from Stage -> JobBase should be 1-to-*, not 1-to-0..1.

Or does a Stage only have one Job, and one Job can act on multiple Probes?

The top-level package name in the Class Diagram should probably be +npxtoolkit or something like that, not +src. Typically, src/ is just a directory in the repo or distribution under which the actual packages live.
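To illustrate that last point (directory and file names here are just for illustration, not the actual repo layout), the src/ directory would be a plain folder on the path, with the package folder underneath it:

```
neuropixels-toolkit/
  src/                 <- added to the Matlab path; not itself a package
    +npxtoolkit/       <- top-level package; referenced as npxtoolkit.*
      Pipeline.m
      Stage.m
      +internal/       <- implementation details, not public API
```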

Design Patterns

Like Vijay mentioned related to pipes and filters, I think Gang of Four style Software Design Patterns are a useful framework for thinking about this design. Looking at the initial class design at https://github.com/vathes/neuropixels-toolkit, I'd say this employs the Strategy pattern - a Job object (like CatGT, KiloSort, etc) is effectively a Strategy that is plugged in to a sequence in the Stage object - and a Stage effectively follows the Visitor pattern to fire off a sequence of jobs.

I don't think this implies any changes to do now; it's just a useful way of thinking about and discussing things.
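To make the Strategy analogy concrete, here is a minimal Matlab sketch (class and method names are illustrative, not the actual toolkit API): each concrete Job supplies its own run() implementation, and a Stage fires off whatever Jobs it has been given without knowing what they do.

```matlab
% Hypothetical sketch of the Strategy pattern in the Stage/Job design.
% (In practice, each classdef goes in its own .m file.)
classdef (Abstract) JobBase < handle
    methods (Abstract)
        run(obj)   % each concrete Job (CatGT, KiloSort, ...) supplies this
    end
end

classdef Stage < handle
    properties
        jobs (1,:) cell = {}   % cell array of JobBase objects
    end
    methods
        function addJob(obj, job)
            obj.jobs{end+1} = job;
        end
        function run(obj)
            % The Stage is agnostic about what each Job actually does.
            for k = 1:numel(obj.jobs)
                obj.jobs{k}.run();
            end
        end
    end
end
```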

Assembly class

In the use case diagram, it looks like the process of fanning processing out to multiple pipelines is called "assemble". There's no corresponding class in the class design. I think adding an Assembler or Assembly class that manages multiple, possibly concurrent, pipelines would be a good thing to add. The initial implementation of the "assemble" operation might be just a simple parfor loop. But you might want to do something more sophisticated later, like using asynchronous Parallel Computing Toolbox tasks for the pipelines, so that the main controller logic can monitor their progress, provide management controls, and exercise finer control over parallelism behavior. If you start out with an Assembly class in the initial design that wraps the pipeline-parallelism logic, you could begin with a parfor and then add support for more sophisticated controls later, without introducing a change to the neuropixels-toolkit public API design that users/clients would have to deal with.
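As a sketch of that idea (class and method names hypothetical): the parfor lives behind one method, so it can later be swapped for parfeval-based asynchronous tasks without touching the public API.

```matlab
% Hypothetical Assembly wrapper; assumes Pipeline objects expose a run() method.
classdef Assembly < handle
    properties
        pipelines (1,:) cell = {}
    end
    methods
        function addPipeline(obj, p)
            obj.pipelines{end+1} = p;
        end
        function run(obj)
            % Initial implementation: a simple parallel loop. This could
            % later be replaced by parfeval futures for progress monitoring
            % and cancellation, with no change to this method's signature.
            pipes = obj.pipelines;
            parfor k = 1:numel(pipes)
                pipes{k}.run();
            end
        end
    end
end
```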

Or am I misreading that Use Case diagram? Does it mean that the assemble() operation is what produces a Pipeline from a Config in the first place?

The fact that the arrow from "make Config objects" fans out to multiple Pipelines - does this mean that multiple pipelines exist and can run in parallel in one Matlab project?

Logging

This is a relatively complex project: it incorporates library code and possibly user-supplied code, and it runs nontrivial batch processing. I think you might want to have some logging in here.

Matlab doesn't provide a logging framework, so you'll need to bring your own. You could roll a custom one from scratch, or you could take an existing FLOSS Matlab logging library like log4m or my own SLF4M.

log4m is a simple, kind of minimal logging framework written in native Matlab.

SLF4M is a more powerful, but more heavyweight framework that sits on top of the Java SLF4J and log4j logging libraries. This could be an issue: this means that by using SLF4M, you're introducing a dependency on Java. Not a big deal now, but MathWorks has announced that future versions of Matlab will not ship with a bundled Java JVM, so users will have to install and reference their own. (This event is called "Jexit".) Given that, you may not want to make neuropixels-toolkit dependent on Java, so SLF4M is probably the wrong choice.

Using third-party libraries inside other libraries in Matlab is kind of a challenge: Matlab doesn't have a package manager or dependency management tool for handling this. And you definitely don't want to tell your users "also go install this logging library and get it on your Matlab path before neuropixels-toolkit will work". So you'll want to redistribute the logging framework with neuropixels-toolkit itself by incorporating its source code into the neuropixels-toolkit repo. (This is sometimes called "vendoring" a dependency.)

Just grabbing the log4m library and including it in a lib directory with neuropixels-toolkit presents another problem, though: Now neuropixels-toolkit includes a specific version of log4m, which could cause problems for any other Matlab code that also uses log4m and that users want to run in the same Matlab process. I think the best way around this is to repackage the library into an internal library for your library's own use: Take the log4m (or other) source code, and move it into your own src/ tree under a +npxtoolkit/+internal/+logging package under your own package namespace and adjust the references to it. Then there won't be any conflicts, because two different vendored versions of log4m would be separate, independent pieces of code.
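For instance (package path from the suggestion above; the getLogger/info-style call syntax is assumed from log4m's singleton-style interface), code inside the toolkit would then reference only the vendored copy:

```matlab
% Hypothetical usage of a vendored logger from inside the toolkit.
% The vendored copy lives under +npxtoolkit/+internal/+logging, so it
% never collides with a user's own copy of log4m elsewhere on the path.
logger = npxtoolkit.internal.logging.log4m.getLogger('npxtoolkit.log');
logger.info('npxtoolkit.Stage', 'starting stage');
```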

I see there's a logs() method in the Util class. How's that intended to behave?

apjanke commented 2 years ago

A note on GitHub workflows: The https://github.com/vathes/neuropixels-toolkit repo is a GitHub fork of this neuropixels-toolkit repo. You might want to use a non-fork repo as the main place for your in-progress work, so you can make a bunch of commits without having to worry about keeping your commit history tidy, and then use a fork repo as a place for "staging" Pull Requests with a tidier commit history and organization.

vijayiyer05 commented 2 years ago

Thanks to @apjanke for all his input!

I quite like the idea of having an Assembly class, i.e. adding one more level of hierarchy to the class diagram. It might be a place to implement parallel computing aspects as Andrew notes. For this cycle's goals, it might be primarily scaffolding. One thing to consider is whether adding such a class might replace the need for a static utilities class. Note the Assembly class could have static methods, in addition to per-assembly methods.

Regarding the logging suggestions, I guess this came up in reaction to the logging function noted in the current utility static class. That's beyond the scope of the current agreement, though we could consider it as another option for the stretch goals this cycle. There are various tools to build from for implementation, as Andrew mentions. Besides this, one more general thing to consider is whether the logging implementation would be distributed across the various classes. If so, this might figure into the class design, e.g. each class might have a log() function, and then the (proposed) Assembly class could roll these up.

yambottle commented 2 years ago

Thanks @vijayiyer05 @apjanke for all your suggestions! I've made several changes in the above PR; it responds to most of your requests:

  1. Change package name from +neuropixels to +npxtoolkit
  2. Class Diagram: changed the 1 - 0..1 relationships to 1 - *; also updated the Use Case Diagram to resolve some confusion
  3. Add an Assembly class as the top level object, also changed Pipeline to Session in order to avoid concept conflicts
  4. Now the object level will be Assembly->Session->Stage->JobBase
  5. Change function assemble() to add_session() to avoid confusion
  6. Include log4m at +npxtoolkit/+internal/+logging/

You may want to take a look at the updated diagrams first.

There are a few requests that I haven't addressed yet:

  1. Making a non-fork repo for in-progress work (I'll do this today)
  2. Parallelism: I was planning to use parfor from the very beginning, although I'm temporarily using a plain for loop in the current implementation. I'll switch to the Parallel Computing Toolbox after the Jobs implementation.
  3. Logging: I'll add logging while implementing specific Jobs. Is there any logging message format/standard/template that you require?

apjanke commented 2 years ago

Cool!

Can you talk a bit about why you chose the name "Session" instead of "Pipeline"? I actually thought "Pipeline" was a pretty good name; maybe I'm not familiar enough with the domain terminology to understand why "Session" is better.

All the other changes you've made so far look straightforwardly good to me.

About the "non-fork repo" thing: If you just email GitHub Support and ask them to "convert my vathes/neuropixels-toolkit repo from a fork to a 'canonical' repo" they can just flip a switch on the back and and your repo becomes a non-fork repo with zero work for you. Then you can create a separate "PR staging" fork repo later when you're ready to send code upstream.

About logging: The big thing you need to decide on is what hierarchy of logger names you're going to use for your messages. Often people just pick the package-qualified class names of the objects sending the messages as the logger names. But that's a bit of an issue for a reusable library like npxtoolkit: some classes are part of the public API, but some are going to be internal implementation details, and you don't want to leak those names via logger names. So you might want to choose logger names by mapping private/internal classes onto classes that are part of the public API, or just define a logger name hierarchy that's similar to, but independent of, your package & class name hierarchy.
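As a sketch of the "independent logger hierarchy" option (all class and logger names here are illustrative): internal classes map onto a small, stable set of public-facing logger names rather than exposing their own names.

```matlab
% Hypothetical helper: map internal class names onto a stable, public
% logger-name hierarchy so implementation details don't leak into logs.
function name = loggerNameFor(className)
    switch className
        case {'npxtoolkit.internal.CatGTRunner', ...
              'npxtoolkit.internal.KilosortRunner'}
            name = 'npxtoolkit.jobs';
        case 'npxtoolkit.internal.ConfigParser'
            name = 'npxtoolkit.config';
        otherwise
            name = 'npxtoolkit';   % safe default for anything unmapped
    end
end
```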

apjanke commented 2 years ago

Oh, I totally overlooked something: naming conventions for class/method/property names!

In the class diagram you have, you're using snake_case for method and property names. What little contemporary OOP Matlab stuff I've seen lately seems to have settled on Java-style camelCase for method and property names. That's also what the existing Neuropixel Utils code is currently using. Maybe switch to camelCase? For example, "stageInfo", "currentJob", "addStage" instead of "stage_info", "current_job", and "add_stage".

yambottle commented 2 years ago

@apjanke Thanks for following up!

djoshea commented 2 years ago

@apjanke definitely apologize for the hodgepodge of camelCase and snake_case throughout the original neuropixel_utils code. It's a bit of my switching rapidly between Python / Julia and MATLAB showing through.

Glad to see so much progress on a (far better designed) neuropixel toolkit!

yambottle commented 2 years ago

Hi @apjanke ,

About the 'non-fork' repo: apart from the tidy commit history, are there any other concerns that you have?

If tidy commit history is the only concern, I think squash and merge may be the best solution for us.

vijayiyer05 commented 2 years ago

Good question!

I think we can favor what we think is clearest, rather than mirroring the names from the reference architecture.

For me "Pipeline-->Stage" seems for sure the clearest for the middle two levels. I actually think "Session" might work better than "Assembly" at the outer level, but I could go either way.

On the inner level, I suggest going with "Task" over "Job". The parallelization implementation is TBD -- it could use parfor, parfeval, or batch submission. If it uses the batch submission approach, these independent calculations would correspond to the task construct in the Parallel Computing Toolbox.

yambottle commented 2 years ago

@apjanke Thanks for following up!

  • "Session" or "Pipeline" naming: Maybe this is a question for @vijayiyer05 to answer from a user perspective. I'm good with either of them; Jennifer's ecephys pipeline uses session (Pipeline -> Session -> Probe -> Stage), so I thought this name might be better. There are a few other options as well:
    1. Assembly -> Session -> Stage -> Jobs
    2. Assembly -> Pipeline -> Stage -> Jobs
    3. Pipeline -> Session -> Stage -> Jobs
    4. Assembly -> Workflow -> Stage -> Jobs
  • Non-fork repo: Good to know we can do it in that way!
  • I'll work on the non-fork repo and camelCase names.

yambottle commented 2 years ago

Thanks for your update! @vijayiyer05

Sure, I'll finally go with "Session->Pipeline->Stage->Task". I'll check the batch submission approach as well.

apjanke commented 2 years ago

About the 'non-fork' repo, apart from the tidy commit history, is there any other concerns that you have?

After having a closer look at how Dan's own neuropixels-toolkit repo is set up, I think my concerns here were misplaced, and actually a fork repo is just fine. I had gotten the djoshea/neuropixels-toolkit and djoshea/neuropixel-utils repos confused, and thought that this vathes/neuropixels-toolkit was intended to be a major reorg of the neuropixel-utils repo under the "Toolkit" framework and would eventually produce a separate publishing destination, but actually the djoshea/neuropixels-toolkit repo is itself a separate repo set up specifically to receive the PRs from this one, and it will be the canonical destination for presenting the finished work to the public. So having vathes/neuropixels-toolkit be a fork of djoshea/neuropixels-toolkit is perfectly appropriate, and "squash and merge" is a good approach. Nothing to see here; move along. Sorry! :)

apjanke commented 2 years ago

Sure, I'll finally go with "Session->Pipeline->Stage->Task"

I like this naming choice, too. Makes good sense to me from a purely programmer's perspective.

apjanke commented 2 years ago

Oh! I just ran into this article yesterday, which explicitly lays out some things about logging that I agree with but that had just been sitting inarticulately in the back of my head. Have a read through and tell me what you think.

https://tuhrig.de/my-logging-best-practices/

From the perspective of Neuropixels Toolkit, the researchers/neuroscience people running it are the "business" users who'll be reading the INFO-level logs, and the developers working on Neuropixels Toolkit/Utils itself, or IT staff supporting researchers' use of the Neuropixels-related code, are the "technology" people who'll be reading DEBUG-level logs.

The big exception here is that because npxtoolkit stuff will be dispatching long-running tasks in interactive Matlab sessions, I think it should probably log both before and after doing stuff.
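So a long-running step might bracket itself like this (a sketch; the logger object and its info() signature are assumed, per the log4m discussion above):

```matlab
% Hypothetical before/after logging around a long-running job, so users in
% an interactive Matlab session can tell a hung job from a slow one.
logger.info('npxtoolkit.Stage', 'starting KiloSort on probe 1 of 4');
t0 = tic;
job.run();
logger.info('npxtoolkit.Stage', ...
    sprintf('finished KiloSort on probe 1 of 4 (%.1f s)', toc(t0)));
```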

yambottle commented 2 years ago

@apjanke Nice article, thanks for sharing! The INFO/DEBUG level separation is what I always do as well. I'll take your suggestion to log before and after.

yambottle commented 2 years ago

@apjanke @vijayiyer05 This PR has been updated based on our discussion, please take a look. Let me know if you want me to squash and merge it, thanks! https://github.com/djoshea/neuropixels-toolkit/pull/11