Initial OOP design | able to call .py module | env setup | launch script implementation

MATLAB-Community-Toolboxes-at-INCF / neuropixels-toolkit

(Under construction) Neuropixel Toolkit is a set of unified Matlab tools for manipulating Neuropixel and Kilosort datasets

MIT License

2 stars 3 forks source link

Initial OOP design | able to call .py module | env setup | launch script implementation #11

Closed yambottle closed 2 months ago

vijayiyer05 commented 2 years ago

@Yambottle Good to see this progress!

Here are some comments:

Request to locate all third-party functions (such as log4m) in a clearly marked third-party package. Suggested name would be external. And then each third-party library in its own subpackage folder (with license as you've done).
If not too much effort, prefer arrays to cell arrays for the stages property in Pipeline. This may require the use of the heterogeneous array mixin for the Stage class, so you can concatenate arrays with mixed types
Somewhat prefer to use CamelCase style for property names, which many shipping MathWorks classes do, which gives a sense that like classes they are a "thing", while camelCase methods are "actions".
Suggest to avoid redundancies in property names, e.g. pipelineInfo in Pipeline could be just Info
Are there any data concepts that could use an object model? E.g. Probes in Config? Consider implementing where sensible. No need to update the class diagram further as you do.

I will also follow-up offline on some additional points.

apjanke commented 2 years ago

Request to locate all third-party functions (such as log4m) in a clearly marked third-party package. Suggested name would be external.

I would actually advocate that it be a package under an +internal subpackage of +npxtoolkit rather than +external: as far as I know, +internal is the only special package name with semantics that Matlab and other IDEs recognize, and +external has no special meaning. And I think this is appropriate here: by taking the code of the other package and transforming/copy-pasting it in to your own namespace, you are internalizing this particular implementation of it, in that it is an internal implementation detail that should only be visible to other internal code of this library, and you are intentionally forking/vendoring this body of code and now have no dependency on a commonly-visible namespace.

vijayiyer05 commented 2 years ago

Good points @apjanke, Mainly I think it would be good to have a single root package for all the 3P stuff. This could live under +internal as you suggest. Perhaps something like +internal\+fork or +internal\+sub? Or maybe long-form ( +internal\+thirdparty) might be preferred.

Request to locate all third-party functions (such as log4m) in a clearly marked third-party package. Suggested name would be external.

I would actually advocate that it be a package under an +internal subpackage of +npxtoolkit rather than +external: as far as I know, +internal is the only special package name with semantics that Matlab and other IDEs recognize, and +external has no special meaning. And I think this is appropriate here: by taking the code of the other package and transforming/copy-pasting it in to your own namespace, you are internalizing this particular implementation of it, in that it is an internal implementation detail that should only be visible to other internal code of this library, and you are intentionally forking/vendoring this body of code and now have no dependency on a commonly-visible namespace.

apjanke commented 2 years ago

Perhaps something like +internal+fork or +internal+sub? Or maybe long-form ( +internal+thirdparty) might be preferred.

Any of these sound fine to me; just a matter of y'all's preferences here.

yambottle commented 2 years ago

Thanks for your inputs here!

I'll go with the /+internal/+thirdparty solution.
I'll update the array syntax and property names.
The current Config class is just a temporary conceptual example to show you guys what kinds of things I will dump in there. The detailed Config class implementation will come in next PR.

yambottle commented 2 years ago

Hi @apjanke @vijayiyer05 ,

Hope y'all doing well!

I have a question about log4m for parallel computing: Since log4m also logs to file as well, and when future parallel computing kicks in, I'm worrying about the log file write lock.

I'm not sure about the best practice, but my solution will be to have a main thread logging files with several separated fork thread logging files. If prefer single log file for easier logging management, maybe concatenate those log files at the end of the entire process.

Let me know your idea, thanks!

Just realized it's using singleton pattern, which means we can only have one logger pointing to one log file. But it seems working well in parallel threads.

import npxtoolkit.internal.thirdparty.logging.log4m
%% main logger
logger = log4m.getLogger();
logger.setLogLevel(logger.DEBUG);

parfor i=1:5
    logger.info("test.m", strcat("thread", num2str(i)));
end

I'll move on implementing logging with it then, let me know if you have any suggestions!

apjanke commented 2 years ago

You are completely correct to be worried about the behavior of logging under parallel/concurrent programming scenarios. And since log4m is a simple Matlab-based library, I'll bet it's never been tested under parallel scenarios.

Keep in mind that multithreading and multiprocessing are two different forms of concurrency, and they will each have their own distinct issues, and those issues will manifest differently on different operating systems. (E.g. on Windows you might have issues with different processes fighting over a file lock; on Unix, you'll have problems because there probably won't be file locks, and that means the different processes may be doing unsafe concurrent access to a single file, with resulting lost or corrupted log data.) In Matlab in general and npxutils in particular, you might be encountering both types of concurrency and may wish to account for both.

The example you give of calling logger methods inside a parfor loop on top of a shared log4m logger object – I honestly don't know exactly how that works! And it will depend significantly on whether your Parallel Computing Toolbox is running against a parpool that is a local "threads" model pool, a local traditional processes model pool, or a pool running on Matlab Parallel Server. (I suspect that what's happening under a "threads" model local pool is that you're getting simple copies of the log4m Logger objects, they have filehandles represented as doubles, and those end up using C style IO on a single filehandle from multiple threads in a single process. This may work fine under simple tests, but is technically unsafe. On the other hand logging operations are supposed to be relatively low frequency, so maybe you'll seldom hit the race condition in practice. I bet this breaks badly if you use a Parallel Server pool and probably for a local multi-process pool too.)

I think your idea of separate log files per worker/process/thread is probably the easiest and safest way to do logging under concurrent execution. To log from multiple workers, especially in separate processes, to a single file, you really need a separate logging "server" process that is the only one with direct access to that file, and it multiplexes log messages collected from multiple clients to do safe and non-lossy interleaved output to the file. (The alternative is to have each logger individually open, exclusive-lock, write, and then close the log file for every single log message, which is safe (because it correctly serializes concurrent access), but absolutely terrible for performance.)

yambottle commented 2 years ago

Awesome break-down!

Since our parallelism will be submitted/handled under certain levels, which means workers will have different levels depending on the workload it received, then I think we can have certain levels of logging files as well: For the entire workflow, we have:

main(entrypoint)
Session
Pipeline 1 ... Pipeline N
Stages are sequential
Task 1 ... Task M per stage, tasks in stages can be in parallel and this tasks are basically referring to each probe

Then the logging files will be

main.log: for main worker logging, it will take care to Session level
pipeline.log * N
Task.log * M

So the total concurrent logging processes will be NM+1, the final logging files will be N(num of stages)*M+1

Even though we go with this above solution, there could be the performance issue you mentioned when there are large amount of pipelines and tasks. Maybe this is a question for @vijayiyer05 , what will be the scale of the entire workflow? Will it be 2 pipeline, 10 tasks/probes? Or 10 pipeline, 100 tasks/probes or even 1000 * tasks/probes etc..

I'll move on to other remaining issues at first. I'll come back later maybe to refactor the log4m.

apjanke commented 2 years ago

Even though we go with this above solution, there could be the performance issue you mentioned when there are large amount of pipelines and tasks.

I wouldn't worry about that: the cost of logging operations should be small compared to the cost of actual work being done. The performance issue I mentioned above is only a concern specifically when opening, locking, and closing a log file for every single logging call in a mutexed manner. If each worker is logging to a separate file, and keeping that file open as it writes multiple log messages (so they get buffered), this should be fine.

yambottle commented 2 years ago

OK, make sense to me as well! Thanks!

apjanke commented 2 years ago

Hi, all.

A couple procedure/workflow level comments on the Pull Request.

It's conventional to do work for a PR on a separate branch coming off of main, even if it's targeting main in the parent repo and you're not using multiple branches at once. This lets you cleanly distinguish commits in the PR from commits in the prior history, update your PR branch relative to updates that show up in the parent repo, do squashes & rebases, and so on. Because this is a special-purpose repo that other people aren't working on, you might not have these problems here. But it's generally good practice.

There are currently 49 commits in this repo. They're not cleanly broken out by topic, and it looks like several of them represent work-in-progress stages, instead of distinct areas of code, and aren't intended to show up in the final commit history of the target repo after merging. The most useful organization for commits in a PR is as a sequence of topics or transformation steps that make sense from the perspective of the reviewer. You might want to squash these commits into a single big commit, or a few larger commits that break things up by topic area. One big commit might be the most appropriate due to the interdependency of various changes here.

This is a large enough PR: +1,000/-2,000 lines, and introduces significant new classes, that it would be nice to have a prose writeup of the changes in the PR, either in the PR description here, a document in the committed code, or both.

apjanke commented 2 years ago

I'd suggest moving all the source code – the +npxtoolkit subdir plus launch.m and test.m files – under a separate src/ or Mcode/ subdirectory. As is, the root directory of the project is also the root directory of its source code tree (the directory that goes on the Matlab path). While this is a common practice for Matlab library projects, I don't think it's ideal: it doesn't separate out things like README.md, the docs/ subdirectory, and so on which aren't part of the source codek, doesn't give you a place to put auxiliary source code which is not part of the library itself (like development tools and so on), and it's not clear to me whether py_modules and configs are required to be located somewhere relative to the code path root.

The launch.m and test.m scripts are sitting at the top level and taking up that name in the overall Matlab namespace, which will collide with any other library or user code that defines functions named launch or test. Depending on how they're intended to be used, I'd suggest either sticking them under the +npxtoolkit package or renaming them to npxtoolkit_launch, launch_npxtoolkit, or something similar that identifies where they come from and avoids collision.

apjanke commented 2 years ago

Would be nice to have some prose in the README.md that describes:

What all the code in this repo is for.
How to install and use it, and what its requirements (dependencies, preferably including supported Matlab versions) are.
- This should include a list of what needs to go on the Matlab path to enable usage of this library, plus any other steps needed to initialize the library.
License info. You've got the MIT-LICENSE.md file there, but it's nice to have that info included in the main README so potential users will see it when getting an overview. Plus you've got portions of the repo that are covered under a different license – that log4m copy looks like BSD 2-Clause License instead of MIT, right? – so you really should call out that this repo contains mixed-license code and the MIT-LICENSE.md file does not cover the entire repo.
Contact info pointing at its project home page (which would probably just be the djoshea/neuropixels-toolkit GitHub repo page at this point.
A brief user guide giving a tour through the classes and how to use them. The diagrams are nice, but not sufficient on their own.
An overview of the repo structure, so users (and developers) know what all these subdirectories are for. there's no standard directory layout for Matlab projects or libraries, so it's nice to tell users what your layout is. i.e. what are the configs, docs, and py_modules subdirs, and how should they be used?

It might also be nice to have a developer-facing document describing how developers working on the nuropixels-toolkit code itself (as opposed to users of it) are supposed to use and maintain it. Could include stuff like:

Repo layout, including details of stuff that are only used by the library developers and not of interest to library users.
Brief code style & practice guidelines.
Maintainer contact info.
Bug report and pull request guidelines. I like to stick this developer-facing doco in a doc-project/ subdirectory, and create a .github subdirectory with PR templates, contact info, and so on, which will be recognized by the GitHub infrastructure.

I turned my own repo structuring practices into a template called MatlabProjectTemplate; you may find it interesting. Note that that's just my own preferred Matlab project template, and not official or standard in any way.

apjanke commented 2 years ago

I'd recommend making the test.m and launch.m files functions instead of scripts. Scripts are kind of sloppy because they end up interacting and modifying the base workspace. And for a reusable/redistributable library, you have no idea what else is going to be in the base workspace. Functions would be more appropriate, and more testable.

Including import statements in a script seems particularly problematic to me, because those imports will stick around and affect the interactive Matlab session's name resolution, and users calling those scripts may not expect that.

apjanke commented 2 years ago

Some of the fields in the classes are using cell arrays to contain objects of other +npxtoolkit classes. For example, Session's pipelines field, Pipeline's stages field, and Stage's taskQueue. The code should probably document and enforce what types of objects are allowed to go in those cell arrays. Enforcement could be done with set.* setter methods.

Maybe some or all of these fields could use object arrays instead of cell arrays? That could provide some additional type safety and more convenient access to their data. The Session.pipelines field looks like it contains Pipeline objects, Pipeline.stages contains Stage objects, and Stage.taskQueue contains TaskBase subclass objects. The ones that contain objects of homogeneous type could be converted to object arrays as is. The ones that contain heterogeneous objects (e.g. Stage.taskQueue can contain a mix of different subclasses of TaskBase) would need to stay cell arrays, or have their contained objects converted to heterogeneous-array object types using matlab.mixin.Heterogeneous. I have not actually used Matlab Heterogeneous classes, so I can't say how well they work.

Are Pipeline and Stage classes final classes that won't have subclasses? Or are they intended for subclassing and extension by users of the library or npxutils-toolkit's internal code?

apjanke commented 2 years ago

The biggest issue I'm running in to with evaluating this code is that I can't actually run the launch.m or test.m test scripts to see it in action, because they depend on datasets that aren't available in this repo, and external tools with hardcoded paths.

It would be really nice if these test scripts were runnable by any user, developer or reviewer. Maybe the data sets they depend on could be made available in some way, like a separate npxutils-test-data repo (maybe with Git LFS) or a download somewhere? And their external tool references made configurable, and with some degree of discovery?

It's much easier to both evaluate and learn to use some code if you can run locally it against a known test data set.

apjanke commented 2 years ago

I'd suggest adding support for a zero-argument calling form of the constructor for Pipeline and all the other objects. That form may not be used in constructing "real" objects, but it's basically the only practical way to support construction of empty or pre-allocated arrays containing dummy values for Matlab objects.

vijayiyer05 commented 2 months ago

This fork/branch was merged to a branch of the root repo in #12