catalystneuro / neuroconv

Create NWB files by converting and combining neural data in proprietary formats and adding essential metadata.
https://neuroconv.readthedocs.io
BSD 3-Clause "New" or "Revised" License

[Feature]: Standardizing Conversion Config Options #280

Closed CodyCBakerPhD closed 1 month ago

CodyCBakerPhD commented 1 year ago

What would you like to see added to NeuroConv?

@bendichter

Summary of conversation in PR #272.

I've broken it into several expandable sections as each point relates to multiple comments.

I've linked to original comments whenever they contain more information than the summary.

Conversations about this began back in #116 between @h-mayorquin and me.

The Problem

It has become annoying to compile various fields in what we refer to as the conversion_options of run_conversion calls. These overlap with their corresponding parts of the neuroconv.tools.

They tended to evolve differently, which led to some splits in how identical fields are specified between certain types of interfaces.
In our ecephys data interfaces and tools we have

```yaml
conversion_options:
  stub_test:
  compression:
  compression_opts:
  iterator_type:
  iterator_opts:
```

In our ophys data interfaces and tools we have

```yaml
conversion_options:
  stub_test:
  stub_frames:
  compression_options:  # takes a field 'compression' to specify type
  iterator_type:
  iterator_options:
```

The behavior interfaces can be all over the place in which convention they choose, as raised in the older issue #42.
NeuroConv is not always consistent with external APIs (nor should it be)
From https://github.com/catalystneuro/neuroconv/pull/272#issuecomment-1370331255, in response to https://github.com/catalystneuro/neuroconv/pull/272#issuecomment-1370261918:

- NeuroConv, like SpikeInterface, is a universal API for data ingestion from multiple sources.
- However, upstream methods that do similar things can have different call signatures across different libraries.
- For consistency, SpikeInterface and NeuroConv alias upstream `kwargs` with common references.
  - Example: SpikeInterface aliases Neo RawIO kwargs to `file_path` and `folder_path` + `stream_name`.
  - Example: Neo aliases most of its dependencies to `filename` and `dirname`.
  - Example: NeuroConv maps SI/ROIExtractors/Neo(icephys)/other behavior dependencies (like OpenCV) to `file_path` and `folder_path`.
- Even the names of `DataIO` arguments like `compression`/`compression_opts` change between HDF5 and Zarr. (A sketch of the aliasing pattern is shown below.)
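For illustration only, here is a minimal sketch of the aliasing pattern, with hypothetical class and argument names (not NeuroConv's actual interfaces): the user-facing interface exposes `file_path`, and the upstream reader's own kwarg names are produced internally.

```python
from pathlib import Path


class ExampleRecordingInterface:
    """Hypothetical interface that aliases the common `file_path` name to an upstream `filename` kwarg."""

    def __init__(self, file_path: str, stream_name: str = "default"):
        # User-facing, NeuroConv-style argument names
        self.file_path = Path(file_path)
        self.stream_name = stream_name

    def _get_upstream_kwargs(self) -> dict:
        # Remapped here to the call signature the upstream reader actually expects
        return dict(filename=str(self.file_path), stream=self.stream_name)
```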
Readability of argument names is important
- Abbreviations in argument names (and variable names in general) can actually make code unnecessarily hard to read.
  - Example: `_opts` could mean any of `optics`, `optical_series`, `operations`, or it is itself [an actual word](https://www.merriam-webster.com/dictionary/opt). This would likely be even more confusing for a coding novice who may never have seen that abbreviation used that way before. This is also only one of many examples; I can find more if you like.
- It only takes a fraction of a second to clarify this by writing the full name, `_options`. Use of tab completion during development also completely negates the drawback of using longer variable names.
- NeuroConv presents an opportunity for us to alias such things in a way that maximizes user-friendly readability.
The primary issue is not just reaching agreement on standard option names, but also reflecting on the practice of adding more and more arguments every time we want to propagate something new
[From Heberto](https://github.com/catalystneuro/neuroconv/pull/272):

> I find function signatures with a lot of arguments distracting and more intimidating at first glance.

Szonja indicated a similar feeling at an in-person meeting.

- If our own developers feel this way, it's quite likely that inexperienced users would as well.
- NeuroConv is focusing more and more on being easy to use not just for Python novices, but also for non-coding individuals in general.
- This echoes the [code smell of too many parameters](https://en.wikipedia.org/wiki/Code_smell#:~:text=Too%20many%20parameters%3A%20a%20long,a%20more%20clean%2Dcut%20way.) pointed out by Heberto.
Too many parameters is also the main cause of repeated docstrings #278
- The total number of docstring repetitions grows roughly like (number of repeated arguments) × (number of times each argument is repeated from the top level).
- To add a new argument, we need to replicate its docstring in each sub-function.
- To do a manual check for consistency across the repo, we need to search a total of (number of arguments that get repeated) × (number of times each argument gets repeated) docstrings.
- That amount of time is not trivial, and it is the main thing SpikeInterface and matplotlib are trying to avoid with their dynamic docstrings (a sketch of that idea is shown below).
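As a side note, the dynamic-docstring approach mentioned above can be as simple as the following sketch. This is only the general idea, not SpikeInterface's or matplotlib's actual machinery, and the names are illustrative.

```python
# Shared parameter descriptions live in one dictionary...
COMMON_PARAMETER_DOCS = dict(
    stub_test="stub_test : bool, default: False\n        If True, only convert a small portion of the data.",
    iterator_opts="iterator_opts : dict, optional\n        Options forwarded to the data iterator.",
)


def fill_doc(function):
    """...and get substituted into each docstring's '{placeholders}' at import time."""
    if function.__doc__ is not None:
        function.__doc__ = function.__doc__.format(**COMMON_PARAMETER_DOCS)
    return function


@fill_doc
def write_recording(recording, stub_test=False, iterator_opts=None):
    """Write a recording to an NWB file.

    Parameters
    ----------
    {stub_test}
    {iterator_opts}
    """
```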
Keyword arguments, as used in the YAML, are not ordered
From https://github.com/catalystneuro/neuroconv/pull/272#issuecomment-1370154477, the current design allows the possibility of doing things like

```yaml
conversion_options:
  SomeInterface:
    compression_options: 4
    iterator_options:
      buffer_gb: 5.0
    some_other_options: ...
    ...
    compression_type: gzip
    even_more_options: ...
    ...
    iterator_type: v1  # would actually cause an error, but can you easily see how to fix it?
```

which causes an error and takes longer than necessary to debug because the interdependent options are not visually linked.
It only gets worse when we want to add new features
From https://github.com/catalystneuro/neuroconv/pull/272#issuecomment-1370331255:

- Better support for a Zarr backend and data IO has been requested (#202) and would be a great thing to add in NeuroConv.
  - The names of compression-related arguments differ with Zarr.
  - The compression options and even iteration/iterator options are handled differently in that case as well.
  - It also introduces `filter_type` and `filter_options` (these apply before the compression in Zarr, such as a Delta).
- We'd also like to eventually add more stub options:
  - ROIs for segmentation
  - channels for recordings

For both of these additions, if we just add new pairs of arguments to the outer level, we make the above problems even worse.

In conclusion, @h-mayorquin, @weiglszonja, and I all see this as a problem for consistency of the internal NeuroConv API, maintainability of the existing codebase and documentation, and scalability of adding new types of arguments as features.

Also, as per the argument that users are not expected to use any of these options anyway and instead defer to defaults, that may well be true for the compression field but it is definitely not true for iteration (which we often need to adjust based on the system resources and conversion setup) and stubbing (which we actively tell users to enable the first time they try to run a pipeline).

Proposed Solutions

Original: Multiple Nested Dictionaries
- Condense each argument pair into a single argument `*_options`, which is a dict-of-dicts with
  - one required key called `method`, which specifies the 'type' of whatever option is being used
  - one optional level called `method_options`, which is the dictionary of arguments that get passed dynamically into whatever that type is

Example:

```yaml
conversion_options:
  compression: gzip
  compression_opts: 4
  iterator_type: v2
  iterator_options:
    buffer_gb: 5
    progress_bar: True
```

becomes

```yaml
conversion_options:
  compression_options:
    method: gzip
    method_options:
      level: 4
  iterator_options:
    method: v2
    method_options:
      buffer_gb: 5
      progress_bar: True
```

#### Pro's

- strong consistency in how all options are specified. You know how each type of option will be referenced, `{name_of_option}_options`, and you know how the sub-fields of that will be referenced, `method` and `method_options`, as catch-alls.
- easy to read and communicate via a YAML structure
- easy to implement validation with Pydantic (see the sketch after this section)
- consistent with NeuroConv's general practice of using dict-of-dicts to describe things like `source_data`, `metadata`, and `conversion_options`

#### Con's

- hard to read, type, and communicate in Python when using the API
- validation is brittle, but I'd say that's possibly a good thing: only a single key/value would need to be out of place to trigger an error
- more nesting is typically not a good Python practice
- the aforementioned quadratic scaling is only alleviated by a constant factor (2) per number of 'types' of options
- this does not alleviate the docstring problem either
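For concreteness, here is a hedged sketch of what Pydantic validation of this nested layout could look like (model and field names are illustrative, not a proposed schema):

```python
from typing import Literal, Optional

from pydantic import BaseModel


class CompressionOptions(BaseModel):
    method: Literal["gzip", "lzf", "blosc"] = "gzip"
    method_options: Optional[dict] = None


class IteratorOptions(BaseModel):
    method: Literal["v1", "v2"] = "v2"
    method_options: Optional[dict] = None


class ConversionOptions(BaseModel):
    compression_options: CompressionOptions = CompressionOptions()
    iterator_options: IteratorOptions = IteratorOptions()


# A single out-of-place key or value raises a validation error, as noted in the con's.
options = ConversionOptions(
    compression_options=dict(method="gzip", method_options=dict(level=4)),
    iterator_options=dict(method="v2", method_options=dict(buffer_gb=5, progress_bar=True)),
)
```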
Heberto's Suggestion: Multiple Flat Dictionaries
From https://github.com/catalystneuro/neuroconv/pull/272#issuecomment-1370157938:

- Similar to the nested dictionary, but instead of nesting, just pass any options dynamically along with the `method`.
- This is how [Pandas does things](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html).

Example:

```yaml
conversion_options:
  compression: gzip
  compression_opts: 4
  iterator_type: v2
  iterator_options:
    buffer_gb: 5
    progress_bar: True
```

becomes

```yaml
conversion_options:
  compression_options:
    method: gzip
    level: 4
  iterator_options:
    method: v2
    buffer_gb: 5
    progress_bar: True
```

#### Pro's

- strong consistency in how all options are specified. You know how each type of option will be referenced, `{name_of_option}_options`, and you know how the sub-fields of that will be referenced, `method` and `method_options`, as catch-alls.
- easy to read and communicate via a YAML structure
- easy to implement validation with Pydantic (a flat-variant sketch follows this section)

#### Con's

- validation is brittle, but not as bad as in the nested case
- the aforementioned quadratic scaling is only alleviated by a constant factor (2) per number of 'types' of options
- this does not alleviate the docstring problem either
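A hedged sketch of the flat variant (again with illustrative names, written against Pydantic v1-style config): only `method` is validated strictly, and any remaining keys are kept and forwarded dynamically.

```python
from typing import Literal

from pydantic import BaseModel


class FlatCompressionOptions(BaseModel):
    method: Literal["gzip", "lzf", "blosc"] = "gzip"

    class Config:
        extra = "allow"  # keep any additional, method-specific keys


options = FlatCompressionOptions(method="gzip", level=4)
method_kwargs = {key: value for key, value in options.dict().items() if key != "method"}
# method_kwargs == {"level": 4}, ready to forward to whichever compression method was chosen
```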
Config Class
This is a really good idea that @weiglszonja came up with in a meeting the other day.

- Similar to the nested dictionary, but rather than being a hard-coded dictionary-of-dictionaries in the API, it's a Python dataclass (or even a Pydantic Model, which is basically an enhanced version with self-validating features).
- The YAML representation would appear the same as in either of the above approaches.
- The API would restrict available options to pre-specified literals.
- We would choose one of several different designs for how to communicate available options (auto-tab completion for attribute fields, or class methods like `get_options`, similar to `hdf5plugin.get_available_filters`).
- I'll leave exact implementation discussion for a separate issue if this is what we decide to go with.

This is similar to how [scikit-learn](https://scikit-learn.org/stable/modules/gaussian_process.html#gaussian-process-kernel-api) handles classes that take complex configurations. [Pydantic itself](https://docs.pydantic.dev/usage/model_config/) uses Config classes just like this throughout its core model structure.

Additionally, this means that instead of passing a series of argument pairs (as we do right now on `main`), and instead of passing multiple `*_options` arguments into our functions, each function would simply take a `Config` as a single argument, and that class would specify everything the conversion needs to know. We would only have to write the rich docstring once, for the single Config class, which reduces duplication of that information across the rest of the repo. (A minimal sketch of this idea follows below.)

#### Pro's

- strong consistency in how all options are specified. You know how each type of option will be referenced within the Config class, `{name_of_option}_options`, and you know how the sub-fields of that will be referenced, `method` and `method_options`, as catch-alls.
- easy to read and communicate via a YAML structure
- easy to implement validation with Pydantic (especially if the Config itself gets implemented as a Pydantic model)
- the only solution so far that solves the scalability and docstring problems

#### Con's

- ?
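To make the idea concrete, here is a minimal sketch using plain dataclasses. All names are hypothetical; a Pydantic version would add the self-validation discussed above.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class CompressionConfig:
    method: str = "gzip"
    method_options: Optional[dict] = None


@dataclass
class IteratorConfig:
    method: str = "v2"
    method_options: dict = field(default_factory=lambda: dict(buffer_gb=1.0))


@dataclass
class ConversionConfig:
    """The single place where all conversion options (and their rich docstring) live."""

    compression: CompressionConfig = field(default_factory=CompressionConfig)
    iterator: IteratorConfig = field(default_factory=IteratorConfig)


# Instead of many `*_options` argument pairs, a function would take the one object, e.g.
# run_conversion(..., config=ConversionConfig(compression=CompressionConfig(method="lzf")))
```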

Do you have any interest in helping implement the feature?

Yes.


h-mayorquin commented 1 year ago

I favor the configuration class as well, but the problem with it is that it will be slightly more complex to develop and, if we do it wrong, it could hamper usability. I am uncertain about whether this will reduce or increase our maintenance costs. I was trying to make the same point before, but I don't think I explained myself very well. That said, I still think it is the best option.

I think that a data class is not sufficient, though. Take compression, for example: given the inconsistent signature of H5DataIO, I think the class should have a wrapping method that provides a consistent API for wrapping data with compression enabled. I have implemented a WIP of this in #281 to be more concrete.

This is also valid for the other conversion configurations. Here is the full table:

1. `compression_strategy`: validates the configuration of name and options and wraps the data in H5DataIO.
2. `iterative_writing_strategy`: validates the configuration and wraps the data in the appropriate iterator.
3. `stubbing_strategy`: validates the configuration and instantiates the right sub-extractor (slices in time and/or channels).

All of them have a validation part that is done internally and an operation, called in our writing pipeline, that wraps or reshapes the data for the desired end. Encapsulating this logic in our pipelines makes it easier to test, and I think it makes the code more readable (the programmer reading the code knows that this is where stubbing, compression, or iterative write configuration happens but is not shown the details). Having classes for configuration also allows a single object to be passed around, working as a global configuration, if that is desirable. Finally, this means that we only need to provide rich documentation in one place, as mentioned in #278. A rough sketch of the `compression_strategy` idea is below.
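As a rough illustration of the `compression_strategy` idea (hypothetical names and validation rules, not the actual implementation in #281):

```python
from hdmf.backends.hdf5 import H5DataIO

_KNOWN_COMPRESSION_METHODS = ("gzip", "lzf")


class CompressionStrategy:
    """Validates its configuration up front and wraps data for writing on request."""

    def __init__(self, method: str = "gzip", level: int = 4):
        if method not in _KNOWN_COMPRESSION_METHODS:
            raise ValueError(f"Unknown compression method '{method}'; choose from {_KNOWN_COMPRESSION_METHODS}.")
        self.method = method
        self.level = level

    def wrap_data(self, data):
        # The writing pipeline only calls this; the H5DataIO details stay encapsulated here.
        compression_opts = self.level if self.method == "gzip" else None  # lzf takes no level
        return H5DataIO(data=data, compression=self.method, compression_opts=compression_opts)
```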

h-mayorquin commented 1 year ago

Concerning this: @CodyCBakerPhD

Also, as per the argument that users are not expected to use any of these options anyway and instead defer to defaults, that may well be true for the compression field but it is definitely not true for iteration (which we often need to adjust based on the system resources and conversion setup) and stubbing (which we actively tell users to enable the first time they try to run a pipeline).

I think we can alleviate things on the usability front somewhat by allowing a two-tier level of configuration. Take stubbing, for example. We can have two simple possible options where stubbing=True enables stubbing with reasonable defaults (say 100 frames and no more than 100 channels) and stubbing=False just switches off stubbing completely. This is the simple tier. We also provide the option of passing stubbing=stubbing_strategy_object, which allows for finer control at the cost of higher API complexity (users need to instantiate a class). This is the fine-tuning tier.

I think this division can work for compression, where the simple tier is deciding between standard gzip level=4 compression and no compression at all, whereas the fine-tuning tier is implemented by passing a compression_strategy_object that allows for choosing another method and its parameters.

I am less certain how this can work with the iterative data write specification. What is the thing that needs to be changed, @CodyCBakerPhD ? I thought that here the simple tier could be choosing between simple defaults with 1GB chunks and not using iterative write at all whereas the fine tuning tier would be allowing control of chunks, shapes, axis, etc.
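A sketch of how the two tiers could be resolved internally (helper and class names here are made up for illustration; the 100-frame/100-channel defaults are just the numbers suggested above):

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class StubbingStrategy:
    max_frames: int = 100
    max_channels: int = 100


def resolve_stubbing(stubbing: Union[bool, StubbingStrategy]) -> Optional[StubbingStrategy]:
    """Simple tier: a bool with reasonable defaults. Fine-tuning tier: pass a strategy object."""
    if stubbing is True:
        return StubbingStrategy()  # reasonable defaults
    if stubbing is False:
        return None  # stubbing disabled entirely
    return stubbing  # already a fully specified strategy object
```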

What do you think? Does this make sense or am I missing something important?

CodyCBakerPhD commented 1 year ago

We can have two simple possible options where stubbing=True enables stubbing with reasonable defaults

This is precisely the way H5DataIO already behaves for the compression option, so we don't need to build anything for that ourselves if we just map compression=True (ditto for compressor=True in Zarr).

The only reason I don't do that all the time is just to communicate more easily in the code 'what' that default equates to (otherwise you have to go and read through a bunch of upstream documentation to determine that it's level 4 GZIP).

Having classes for configuration also allows a single object to be passed around working as a global configuration if that is desirable.

Yep, this is the biggest point at the end of the day for resolving the concerns that originated this discussion.

All of them have a validation part that is done internally and an operation that wraps or reshapes the data for the desired end which is called in our writing pipeline.

This goes beyond what I was imagining, which was just a data object that passes kwargs to the respective methods. But I can also see how this could provide an opportunity to standardize how those methods are wrapped through the tools, which can sometimes take a bit of extra code that has already been refactored once or twice across SI/ROIExtractors.

Will have to think on that more... For now it might be best for a first PR to simply pass parameters and a follow-up PR to introduce the more data-interacting side.

I am less certain how this can work with the iterative data write specification. What is the thing that needs to be changed, @CodyCBakerPhD ? I thought that here the simple tier could be choosing between simple defaults with 1GB chunks and not using iterative write at all whereas the fine tuning tier would be allowing control of chunks, shapes, axis, etc.

The simple kwarg passing is especially true for iteration, so nothing would need to be changed. Each 'type' of iterator takes and passes the respective kwargs of either the DataChunkIterator or a subtype of GenericDataChunkIterator, and the 'options' are all the same for those across the tools. The iterator classes themselves also perform input validation, so at least to start with, our communication classes wouldn't have to do anything other than pass them along to the methods.
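To illustrate the pass-through, here is a sketch; the wrapper function and the `GenericDataChunkIterator` subclass below are illustrative, not existing NeuroConv code.

```python
from hdmf.data_utils import DataChunkIterator, GenericDataChunkIterator


class ArrayChunkIterator(GenericDataChunkIterator):
    """Minimal GenericDataChunkIterator subclass over an in-memory array, for illustration only."""

    def __init__(self, array, **kwargs):
        self._array = array
        super().__init__(**kwargs)

    def _get_data(self, selection):
        return self._array[selection]

    def _get_maxshape(self):
        return self._array.shape

    def _get_dtype(self):
        return self._array.dtype


def wrap_with_iterator(data, iterator_type="v2", iterator_options=None):
    # The 'type' selects the class; the options dict is forwarded unchanged, since the
    # iterator classes validate their own inputs.
    iterator_options = iterator_options or dict()
    if iterator_type == "v1":
        return DataChunkIterator(data=data, **iterator_options)
    return ArrayChunkIterator(data, **iterator_options)  # e.g. buffer_gb=5.0
```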

bendichter commented 1 year ago

I think we have spent too much time on this and I'd like to wrap it up.

I think one challenge here is building a specification that works across the different use-cases: built-ins, hdf5plugin, and Zarr. Another is that the allowable args for the options change depending on the type of compression. We could formally specify the input options as they change for each different compressor, which would require us to build a class for each compressor, but honestly I don't think this is really complex enough to require the class approach. Besides, I think that we would need to create (n_compression_methods x n_backends) classes, which I think is too many.

Thanks @h-mayorquin for pointing out the class-based specification of the kernels in scikit-learn. That's an interesting case study; however, I don't think that approach fits here. There, they have a few different methods with lots of interdependencies between the parameters. The complexity is in the parameterization of each of a few different configuration options, so there are a few classes with complex logic in each. Here, we have lots of different types of configs but not much complexity within each, which would translate to many small classes.

I don't see much benefit in a single class for all compression options.

I'll cede the point that a dictionary is more explicit than a list of args and a list of args may have been overfitting to the hdf5plugin API. How about this:

AVAILABLE_COMPRESSION_METHODS = [
    "gzip",
    "blosc",
    ...
]

def run_conversion(
    ...,
    compression_method: Literal[True, False, *AVAILABLE_COMPRESSION_METHODS] = True,
    compression_options: Optional[dict] = None,
):
    """
    Parameters
    ----------
    ...
    compression_method: {"gzip", "blosc", ..., True, False}
    compression_options: dict
        changes depending on the compression_method. See docs on compression [here].
    """

Then all of the complications regarding how compression_options changes based on the compression_method and the data backend can be off-loaded to a docs page. It's too complicated to try to do in a docstring.

I know most of you wanted to group the method and the options into one dictionary, but I prefer my way for a few reasons.

  1. This is more pythonic:

     > "Flat is better than nested"

  2. This allows us to specify the available methods explicitly on the call signature without relying on JSON Schema.
  3. This is similar to other popular APIs where the options of a method are defined in a separate dictionary, e.g.:
     - matplotlib: `subplots` with `{subplot, gridspec}_kw`
     - seaborn: `pairplot` with `{plot, diag, grid}_kws`

Here are a few examples of different configs and how they would be translated.

Example 1: auto

compression: True

would translate to:


H5DataIO(data, compression=True) # gzip, lvl 4 by default in h5py

Example 2: gzip with custom level

compression_method: gzip
compression_options:
    clevel: 4

would translate to:

H5DataIO(data, compression="gzip", compression_opts=4)

Example 3: hdf5plugin

compression_method: blosc
compression_options:
    cname: zstd
    clevel: 4
    shuffle: 2

would translate to:

For HDF5:

filter = hdf5plugin.get_filters("blosc")[0]
H5DataIO(data, allow_plugin_filters=True, **filter(cname="zstd", clevel=4, shuffle=2))

For Zarr:

from numcodecs import Blosc
ZarrDataIO(data, compressor=Blosc(cname="zstd", clevel=4, shuffle=2))

I don't think this would be too hard to do with a simple function and a few if/else clauses. Basically something like:

import hdf5plugin
from functools import partial

from hdmf.backends.hdf5 import H5DataIO


def configure_compression(compression_method, compression_options, output_type="h5"):
    # Returns a callable that wraps `data` in the appropriately configured DataIO.
    if output_type == "h5":
        if compression_method in ("gzip",):
            return partial(H5DataIO, compression="gzip", compression_opts=compression_options["clevel"])
        elif compression_method in ("blosc", ...):
            filter = hdf5plugin.get_filters("blosc")[0]
            filter_kws = filter(**compression_options)
            return partial(H5DataIO, allow_plugin_filters=True, **filter_kws)
    ...

...

wrap_data = configure_compression(compression_method, compression_options)

TimeSeries(
    ...,
    data=wrap_data(data),
)

I am imagining for the GUI forms we could have a drop-down for the available compressor types, and do a little bit of custom JavaScript to load the compression_options form based on what compression method is selected.

Am I missing something? Are there any problems with this approach that I am not seeing? Unless there are any problems, I would like to move forward with this.

CodyCBakerPhD commented 1 year ago

Many of these ideas and/or the demonstration in #281 could be incorporated into a DatasetConfig class, with the main new feature being the specification of dataset-level options for interfaces that span multiple datasets, such as multi-segment recording extractors.

from dataclasses import dataclass
from typing import Optional, Tuple, Union


@dataclass
class DatasetConfig:
    stub_shape: Optional[Tuple[int, ...]] = None  # specifying this is the same as saying `stub=True`
    compression: Union[bool, int, str] = True
    compression_options: Optional[dict] = None
    iteration: bool = True  # TODO: remove old v1 type; argument becomes a simple enable/disable of the wrapper
    iterator_options: Optional[dict] = None
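Hypothetical usage of the sketch above (the dataset names and keying scheme are made up for illustration):

```python
dataset_configs = {
    "ElectricalSeriesRaw": DatasetConfig(stub_shape=(1_000, 32), compression="gzip", compression_options=dict(level=4)),
    "ElectricalSeriesLFP": DatasetConfig(iteration=True, iterator_options=dict(buffer_gb=2.0)),
}
```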

This is a low-priority feature to add, however.

h-mayorquin commented 1 month ago

I am going to close this. With all the work on the backend configuration, all the compression options are now specified:

https://neuroconv.readthedocs.io/en/main/user_guide/backend_configuration.html

There is still the issue of stubbing, metadata keys, iterative writing configuration, and the like, but I think we can start those as issues of their own.