Closed CodyCBakerPhD closed 3 years ago
Yes, happy to discuss further :)
Here is the structure of the library in the current master
branch:
./nwb_conversion_tools/
├── __init__.py
├── auto_qc.py
├── basedatainterface.py
├── baseimagingextractorinterface.py
├── baselfpextractorinterface.py
├── baserecordingextractorinterface.py
├── basesegmentationextractorinterface.py
├── basesortingextractorinterface.py
├── conversion_tools.py
├── datainterfaces
│ ├── __init__.py
│ ├── blackrockdatainterface.py
│ ├── ceddatainterface.py
│ ├── cellexplorerdatainterface.py
│ ├── imagingextractorinterface.py
│ ├── intandatainterface.py
│ ├── interface_utils
│ │ ├── __init__.py
│ │ └── brpylib.py
│ ├── neuroscopedatainterface.py
│ ├── openephysdatainterface.py
│ ├── phydatainterface.py
│ ├── roiextractordatainterface.py
│ ├── sipickledatainterfaces.py
│ └── spikeglxdatainterface.py
├── json_schema_utils.py
├── metafile.schema.json
├── nwbconverter.py
├── schemas
│ ├── source_example.json
│ └── source_schema.json
└── utils.py
and this is how i've refactored it
./nwb_conversion_tools
├── interfaces
│ ├── imaging
│ │ ├── __init__.py
│ │ ├── base_imaging.py
│ │ └── imaging.py
│ ├── interface_utils
│ │ ├── __init__.py
│ │ └── brpylib.py
│ ├── recording
│ │ ├── lfp
│ │ │ ├── __init__.py
│ │ │ ├── base_lfp.py
│ │ │ ├── neuroscope.py
│ │ │ └── spike_glx.py
│ │ ├── __init__.py
│ │ ├── base_recording.py
│ │ ├── blackrock.py
│ │ ├── ced.py
│ │ ├── intan.py
│ │ ├── neuroscope.py
│ │ ├── open_ephys.py
│ │ ├── spike_glx.py
│ │ └── spike_interface.py
│ ├── segmentation
│ │ ├── __init__.py
│ │ ├── base_segmentation.py
│ │ └── roi.py
│ ├── sorting
│ │ ├── __init__.py
│ │ ├── base_sorting.py
│ │ ├── blackrock.py
│ │ ├── cell_explorer.py
│ │ ├── neuroscope.py
│ │ ├── open_ephys.py
│ │ ├── phy.py
│ │ └── spike_interface.py
│ ├── __init__.py
│ └── base_data.py
├── schemas
│ ├── source_example.json
│ └── source_schema.json
├── spec
│ └── __init__.py
├── __init__.py
├── auto_qc.py
├── conversion_tools.py
├── json_schema_utils.py
├── metafile.schema.json
├── nwbconverter.py
└── utils.py
As is, the hierarchy is essentially flat with the base classes in the root directory and the interfaces in datainterfaces
. There are a few downsides to that:
base...interface.py
were related, that baseimaging...
inherited from basedatainterface.py
and so on. I feel like this might be a structure that y'all have (completely understandably) gotten used to, it being your project, but for an outside developer this is a lot of overhead. This could be mitigated, as you say, with documentation, but in my opinion it's ultimately easier in the long run both for introducing people to the library and for writing documentation to make the directory structure of the project match the logical structure of the code: objects with direct inheritance relationships are nested within each other. Code structure is ultimately a developer-oriented problem, as any user-oriented structure can be abstracted with, eg. imports (a interfaces/format/open_ephys.py
file that imports the converters), runtime aliasing (eg. on import, interfaces/__init__.py
constructs a formats
dictionary that crawls interface objects and groups them according to some FORMAT
class attribute. i've done this before here, for example), or a completely abstract get_interface('recording', 'open_ephys')
where the entire backend could change but it would remain the same for the user.BehaviorDataExtractor
metaclass and then some PoseDataExtractor
for markerless tracking points. In that case we would have a chimeric naming structure where some things are named according to their data format, but others are named according to their data type. Grouping first by extractor/interface metaclass is more general, where a company-named format is a subset of an interface type. The question is more general -- when you have multiple people contributing to the code, having one, obvious place to put things makes it so future PRs like these are less cumbersome to deal with. Speaking from experience, it's a chore to maintain a bunch of explicit imports like this and a list of available objects like this, but it's challenging to programmatically enumerate/group objects without a computable structure and you'll end up writing a bunch of code on top of it to patch over it. In my opinion, the absolute number of files is less important than a clear structure: having 5, 1000-line files is harder to manage & maintain than 20, 250-line files in a logical directory structure.Where interfaces are neatly sorted by type, which is implemented in rst as:
Recording
==========
.. automodule:: nwb_conversion_tools.interfaces.recording
Base Recording
---------------
.. automodule:: nwb_conversion_tools.interfaces.recording.base_recording
Blackrock
---------------
.. automodule:: nwb_conversion_tools.interfaces.recording.blackrock
CED
---------------
.. automodule:: nwb_conversion_tools.interfaces.recording.ced
Intan
---------------
.. automodule:: nwb_conversion_tools.interfaces.recording.intan
Neuroscope
---------------
.. automodule:: nwb_conversion_tools.interfaces.recording.neuroscope
Open Ephys
---------------
.. automodule:: nwb_conversion_tools.interfaces.recording.open_ephys
Spike GLX
---------------
.. automodule:: nwb_conversion_tools.interfaces.recording.spike_glx
Spike Interface
---------------
.. automodule:: nwb_conversion_tools.interfaces.recording.spike_interface
Doing API docs for the current structure would result in the top-level tree under "interfaces" as a big flat long list, or else would require a lot of manual curation at the level of individual classes, rather than modules, as is done here in the existing documentation (whose end structure mirrors the structure proposed here of recording
interfaces grouped together separately from sorting
interfaces underneath ephys
and ophys
): https://github.com/catalystneuro/nwb-conversion-tools/blob/master/docs/converting_data_to_nwb.md
I like the interfaces/ecephys/...
and interfaces/ophys/...
structure, as that would mirror the nwb spec more closely, and think that would also be a good way to refactor the extractor interfaces: perhaps the recording
extractor class has subclasses ecephys
etc. My proposal ultimately is not "lets do this exact format," but that a path structure that mirrors the logical structure of the code would be good, so anything under that heading is good with me :)
So anyway, ty for entertaining this PR, don't mean to ride in and be very bossy, just a proposal. I come in peace and goodwill and intention to work!
Thanks for breaking this down, @sneakers-the-rat! It's worth thinking about restructuring, especially if it will help with documentation and accessibility. Our current structure is similar to SpikeExtractors, though I agree that (1) their structure may not be ideal in the first place and (2) we may need a bit of a different structure because our scope is broader. I can see how having the folder structure reflect the class structure would be helpful for a developer.
Let's think about the pros and cons of our available options. I think separating first by modality makes the most sense, especially if we are going to use this for generating documentation, because that would be a very natural way for a user to navigate the available tools. I would wonder if organizing by acquisition system might have certain advantages. (1) Would users interested in a SortingInterface for a certain acquisition system also be interested in the RecordingInterface for the same system? If so, it might make sense to have them together in the docs (even though their class hierarchy is different) (2) Are there any Recording and Sorting Interfaces for the same acquisition system that would share code, for instance to retrieve metadata? If so, it might be better to have them in the same folder or file.
All of that being said, I think the structure you propose may be worthwhile. If we do move things around, most important will be to be very careful not to break existing imports for downstream repos, and to smoothly deprecate if any imports change.
@sneakers-the-rat I’m impressed by the scope of this first contribution, thanks for that! Overall I agree with those points, reproducing the inheritance structure in the directory organization would be advantageous to everyone, new and old developers.
I like Cody suggestion for interfaces/datatype/formatname.py
where datatype in ["ecephys", “icephys”, "ophys", “behavior”, “misc”]. In fact, that’s how nwb-conversion-tools used to be before a big refactor last year. This options aligns well with pynwb api organization.
I don’t see much advantage in separating recording from sorting though, since most interfaces for electrophys formats will likely have both, and as @CodyCBakerPhD pointed out will multiply files and scatter attention of what imo belong together.
Those changes are likely to break all current lab-specific code, but I feel this would be quick to fix, since it’s simply changing the import headers (we can also fix nwb-conversion-tools==0.7.2). I suggest we go adding all proposed changes in a separated branch (not on master) before we have time to carefully test for all current lab projects.
fabulous! Y'all have better perspective on recording vs sorting than I do, I was just following the inheritance, so yes by all means combine them!
feel free to edit the PR, pull to a different branch, etc. :)
As @luiztauffer indicated, as of early November last year we used a ./ecephys
and ./ophys
nesting structure, but apparently it wasn't within a datainterfaces/.
umbrella, they were just out in the open. We also only supported a few formats at that time as it was a very early prototyping stage.
Since the consensus seems to be to rearrange the interfaces according to a datatype in ["ecephys", “icephys”, "ophys", “behavior”, “misc”]
structure, I'll make a branch with just that particular change in place.
Moving forward on this!
Pertaining to PR #149, I'd like to open a discussion between maintainers @luiztauffer @alejoe91 @bendichter for everyone's input on evaluating the nesting structure for the data interfaces. @sneakers-the-rat has proposed shifting from a
datainterfaces/formatname.py
approach tointerfaces/extractortype/formatname.py
. Thedatainterfaces/formatname.py
approach was inherited originally from the spikeextractors nesting format, which was adopted by roiextractors, and then by us to keep things consistent.Ultimately, we want to (a) minimize the code necessary to run the repo functional for easier maintainability but hopefully not at the cost of (b) being too hard for users to explore. While (b) could certainly be combated independently through enhanced documentation, it may be worth asking if this proposed methodology might help as well. One downside to the
extractortype
nesting is it duplicates the overall number of files for any format that has multiple extractor types (in cases like Neuroscope, this could generate 3-4x as many files, and the extra lines of import code required on the top).There could also be a nesting structure of
interfaces/datatype/formatname.py
wheredatatype in ["ecephys", "ophys"]
that could be worth throwing in the mix for discussion. Going this route could solve the issue of code file amplification (since no formats are both types, I think) and could also make managing imports from each package more convenient/localized as well. (ecephys imports drawing from only spikeextractors in theinit
, etc).Anyway, just wanted to see what everyone's thoughts are.