DataInterface nesting structure

CodyCBakerPhD commented 3 years ago

Pertaining to PR #149, I'd like to open a discussion between maintainers @luiztauffer @alejoe91 @bendichter for everyone's input on evaluating the nesting structure for the data interfaces. @sneakers-the-rat has proposed shifting from a datainterfaces/formatname.py approach to interfaces/extractortype/formatname.py. The datainterfaces/formatname.py approach was inherited originally from the spikeextractors nesting format, which was adopted by roiextractors, and then by us to keep things consistent.

Ultimately, we want to (a) minimize the code necessary to run the repo functional for easier maintainability but hopefully not at the cost of (b) being too hard for users to explore. While (b) could certainly be combated independently through enhanced documentation, it may be worth asking if this proposed methodology might help as well. One downside to the extractortype nesting is it duplicates the overall number of files for any format that has multiple extractor types (in cases like Neuroscope, this could generate 3-4x as many files, and the extra lines of import code required on the top).

There could also be a nesting structure of interfaces/datatype/formatname.py where datatype in ["ecephys", "ophys"] that could be worth throwing in the mix for discussion. Going this route could solve the issue of code file amplification (since no formats are both types, I think) and could also make managing imports from each package more convenient/localized as well. (ecephys imports drawing from only spikeextractors in the init, etc).

Anyway, just wanted to see what everyone's thoughts are.

sneakers-the-rat commented 3 years ago

Yes, happy to discuss further :)

Here is the structure of the library in the current master branch:

./nwb_conversion_tools/
├── __init__.py
├── auto_qc.py
├── basedatainterface.py
├── baseimagingextractorinterface.py
├── baselfpextractorinterface.py
├── baserecordingextractorinterface.py
├── basesegmentationextractorinterface.py
├── basesortingextractorinterface.py
├── conversion_tools.py
├── datainterfaces
│   ├── __init__.py
│   ├── blackrockdatainterface.py
│   ├── ceddatainterface.py
│   ├── cellexplorerdatainterface.py
│   ├── imagingextractorinterface.py
│   ├── intandatainterface.py
│   ├── interface_utils
│   │   ├── __init__.py
│   │   └── brpylib.py
│   ├── neuroscopedatainterface.py
│   ├── openephysdatainterface.py
│   ├── phydatainterface.py
│   ├── roiextractordatainterface.py
│   ├── sipickledatainterfaces.py
│   └── spikeglxdatainterface.py
├── json_schema_utils.py
├── metafile.schema.json
├── nwbconverter.py
├── schemas
│   ├── source_example.json
│   └── source_schema.json
└── utils.py

and this is how i've refactored it

./nwb_conversion_tools
├── interfaces
│   ├── imaging
│   │   ├── __init__.py
│   │   ├── base_imaging.py
│   │   └── imaging.py
│   ├── interface_utils
│   │   ├── __init__.py
│   │   └── brpylib.py
│   ├── recording
│   │   ├── lfp
│   │   │   ├── __init__.py
│   │   │   ├── base_lfp.py
│   │   │   ├── neuroscope.py
│   │   │   └── spike_glx.py
│   │   ├── __init__.py
│   │   ├── base_recording.py
│   │   ├── blackrock.py
│   │   ├── ced.py
│   │   ├── intan.py
│   │   ├── neuroscope.py
│   │   ├── open_ephys.py
│   │   ├── spike_glx.py
│   │   └── spike_interface.py
│   ├── segmentation
│   │   ├── __init__.py
│   │   ├── base_segmentation.py
│   │   └── roi.py
│   ├── sorting
│   │   ├── __init__.py
│   │   ├── base_sorting.py
│   │   ├── blackrock.py
│   │   ├── cell_explorer.py
│   │   ├── neuroscope.py
│   │   ├── open_ephys.py
│   │   ├── phy.py
│   │   └── spike_interface.py
│   ├── __init__.py
│   └── base_data.py
├── schemas
│   ├── source_example.json
│   └── source_schema.json
├── spec
│   └── __init__.py
├── __init__.py
├── auto_qc.py
├── conversion_tools.py
├── json_schema_utils.py
├── metafile.schema.json
├── nwbconverter.py
└── utils.py

As is, the hierarchy is essentially flat with the base classes in the root directory and the interfaces in datainterfaces. There are a few downsides to that:

Dev & User Accessibility - As a new user it was a little challenging to figure out the structure of the library. I had to go through a bunch of the source until I was able to see the hierarchical structure of the data interfaces, to parse that all the base...interface.py were related, that baseimaging... inherited from basedatainterface.py and so on. I feel like this might be a structure that y'all have (completely understandably) gotten used to, it being your project, but for an outside developer this is a lot of overhead. This could be mitigated, as you say, with documentation, but in my opinion it's ultimately easier in the long run both for introducing people to the library and for writing documentation to make the directory structure of the project match the logical structure of the code: objects with direct inheritance relationships are nested within each other. Code structure is ultimately a developer-oriented problem, as any user-oriented structure can be abstracted with, eg. imports (a interfaces/format/open_ephys.py file that imports the converters), runtime aliasing (eg. on import, interfaces/__init__.py constructs a formats dictionary that crawls interface objects and groups them according to some FORMAT class attribute. i've done this before here, for example), or a completely abstract get_interface('recording', 'open_ephys') where the entire backend could change but it would remain the same for the user.
Expandability & Maintainability - Where would an interface for behavioral data (an official part of the nwb schema) go? What about an interface without an explicit format, eg. if users had their data as some field in a homebrew MATLAB struct? In these cases, one might want to implement a BehaviorDataExtractor metaclass and then some PoseDataExtractor for markerless tracking points. In that case we would have a chimeric naming structure where some things are named according to their data format, but others are named according to their data type. Grouping first by extractor/interface metaclass is more general, where a company-named format is a subset of an interface type. The question is more general -- when you have multiple people contributing to the code, having one, obvious place to put things makes it so future PRs like these are less cumbersome to deal with. Speaking from experience, it's a chore to maintain a bunch of explicit imports like this and a list of available objects like this, but it's challenging to programmatically enumerate/group objects without a computable structure and you'll end up writing a bunch of code on top of it to patch over it. In my opinion, the absolute number of files is less important than a clear structure: having 5, 1000-line files is harder to manage & maintain than 20, 250-line files in a logical directory structure.
Abstraction - Speaking of programmatic access to interfaces, imo that should be a design goal for a package targeted at a general audience. Currently the user is intended to use the whole library, making their own subclasses, accessing interfaces explicitly, etc. so the target audience seems to be programmers/folks with intimate knowledge of the library. To abstract the nitty-gritty of the library and give the user a well-defined point of interaction, it needs to be possible to, eg. choose from available 'spike sorting' interfaces, etc. which is made much easier with a code structure that reflects object hierarchy. eg. see my implementation of list_interfaces which depends on _recurse_subclasses, which in turn requires any subclasses to be imported _recursive_import. This isn't to say that implementation is perfect, but just an example of the way a computable code structure makes certain things easier.
Documentation - Autogenerated docs from docstrings are salvation, but it would be pretty cumbersome to maintain them as is. See the API docs example:

Where interfaces are neatly sorted by type, which is implemented in rst as:

Recording
==========

.. automodule:: nwb_conversion_tools.interfaces.recording

Base Recording
---------------

.. automodule:: nwb_conversion_tools.interfaces.recording.base_recording

Blackrock
---------------

.. automodule:: nwb_conversion_tools.interfaces.recording.blackrock

CED
---------------

.. automodule:: nwb_conversion_tools.interfaces.recording.ced

Intan
---------------

.. automodule:: nwb_conversion_tools.interfaces.recording.intan

Neuroscope
---------------

.. automodule:: nwb_conversion_tools.interfaces.recording.neuroscope

Open Ephys
---------------

.. automodule:: nwb_conversion_tools.interfaces.recording.open_ephys

Spike GLX
---------------

.. automodule:: nwb_conversion_tools.interfaces.recording.spike_glx

Spike Interface
---------------

.. automodule:: nwb_conversion_tools.interfaces.recording.spike_interface

Doing API docs for the current structure would result in the top-level tree under "interfaces" as a big flat long list, or else would require a lot of manual curation at the level of individual classes, rather than modules, as is done here in the existing documentation (whose end structure mirrors the structure proposed here of recording interfaces grouped together separately from sorting interfaces underneath ephys and ophys): https://github.com/catalystneuro/nwb-conversion-tools/blob/master/docs/converting_data_to_nwb.md

I like the interfaces/ecephys/... and interfaces/ophys/... structure, as that would mirror the nwb spec more closely, and think that would also be a good way to refactor the extractor interfaces: perhaps the recording extractor class has subclasses ecephys etc. My proposal ultimately is not "lets do this exact format," but that a path structure that mirrors the logical structure of the code would be good, so anything under that heading is good with me :)

So anyway, ty for entertaining this PR, don't mean to ride in and be very bossy, just a proposal. I come in peace and goodwill and intention to work!

bendichter commented 3 years ago

Thanks for breaking this down, @sneakers-the-rat! It's worth thinking about restructuring, especially if it will help with documentation and accessibility. Our current structure is similar to SpikeExtractors, though I agree that (1) their structure may not be ideal in the first place and (2) we may need a bit of a different structure because our scope is broader. I can see how having the folder structure reflect the class structure would be helpful for a developer.

Let's think about the pros and cons of our available options. I think separating first by modality makes the most sense, especially if we are going to use this for generating documentation, because that would be a very natural way for a user to navigate the available tools. I would wonder if organizing by acquisition system might have certain advantages. (1) Would users interested in a SortingInterface for a certain acquisition system also be interested in the RecordingInterface for the same system? If so, it might make sense to have them together in the docs (even though their class hierarchy is different) (2) Are there any Recording and Sorting Interfaces for the same acquisition system that would share code, for instance to retrieve metadata? If so, it might be better to have them in the same folder or file.

All of that being said, I think the structure you propose may be worthwhile. If we do move things around, most important will be to be very careful not to break existing imports for downstream repos, and to smoothly deprecate if any imports change.

luiztauffer commented 3 years ago

@sneakers-the-rat I’m impressed by the scope of this first contribution, thanks for that! Overall I agree with those points, reproducing the inheritance structure in the directory organization would be advantageous to everyone, new and old developers.

I like Cody suggestion for interfaces/datatype/formatname.py where datatype in ["ecephys", “icephys”, "ophys", “behavior”, “misc”]. In fact, that’s how nwb-conversion-tools used to be before a big refactor last year. This options aligns well with pynwb api organization.

I don’t see much advantage in separating recording from sorting though, since most interfaces for electrophys formats will likely have both, and as @CodyCBakerPhD pointed out will multiply files and scatter attention of what imo belong together.

Those changes are likely to break all current lab-specific code, but I feel this would be quick to fix, since it’s simply changing the import headers (we can also fix nwb-conversion-tools==0.7.2). I suggest we go adding all proposed changes in a separated branch (not on master) before we have time to carefully test for all current lab projects.

sneakers-the-rat commented 3 years ago

fabulous! Y'all have better perspective on recording vs sorting than I do, I was just following the inheritance, so yes by all means combine them!

feel free to edit the PR, pull to a different branch, etc. :)

CodyCBakerPhD commented 3 years ago

As @luiztauffer indicated, as of early November last year we used a ./ecephys and ./ophys nesting structure, but apparently it wasn't within a datainterfaces/. umbrella, they were just out in the open. We also only supported a few formats at that time as it was a very early prototyping stage.

Since the consensus seems to be to rearrange the interfaces according to a datatype in ["ecephys", “icephys”, "ophys", “behavior”, “misc”] structure, I'll make a branch with just that particular change in place.

CodyCBakerPhD commented 3 years ago

Moving forward on this!

catalystneuro / nwb-conversion-tools

DataInterface nesting structure #153