NeurodataWithoutBorders / pynwb

A Python API for working with Neurodata stored in the NWB Format
https://pynwb.readthedocs.io

Centralize data migrations definitions and handling #1222

Open yarikoptic opened 4 years ago

yarikoptic commented 4 years ago

Is your feature request related to a problem? Please describe.

NWB is still evolving and quite frequently requires data migrations. At the moment, migrations are coded in various places/data types, often with duplicated code to handle them (see e.g. https://github.com/NeurodataWithoutBorders/pynwb/pull/1219). There is no way for a user to discover what would actually change in an existing file if they were to migrate to a new version of NWB. validate currently uses the schema stored along with the file, without alerting the user (with some kind of INFO or WARNING) that some values would change in a new version of NWB (#1216, based on the #1219 use case).

Describe the solution you'd like

Provide a registry and supporting tooling for data migrations, so that

e.g. for #1219 I could see a migration described with properties like the following (if expressed as YAML):

codename: hardcoded_units_210
description: Since NWB 2.1.0 units for those neural data types must not be specified, and will be hardcoded
nwb_versions: "< 2.1.0"  # some way to describe where applicable? but might be just an additional filter
callable: ensure_unit
stage: init  # describing when should be fixed up?  need more use cases to see what needed
objects:  # provide filters and arguments so support code could decide where to apply
 - ndtype: CurrentClampSeries
   unit: volts
 - ndtype: CurrentClampStimulusSeries
   unit: amperes
 ...

Then the underlying __init__ of the base class (or possibly even __new__, earlier) would simply check whether a registered data migration matches the class at hand and its properties (versions and values, maybe) and apply it. This might need more thought, but first I wanted to suggest the principle; you probably know of other cases where similar migrations are happening, so we could lay them out first to make the spec flexible enough to cover them.
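
A minimal sketch (in Python) of how such a registry could be consulted, assuming a hypothetical Migration dataclass and apply_migrations helper populated from entries like the YAML above; none of these names exist in pynwb/HDMF today, and version matching here just borrows packaging.specifiers for illustration:

# Hypothetical sketch of a centralized migration registry; none of these
# names are existing pynwb/HDMF API -- they only illustrate the proposed principle.
from dataclasses import dataclass, field
from packaging.specifiers import SpecifierSet

def ensure_unit(instance, unit):
    """Example fixup (cf. #1219): force the unit to the hardcoded value."""
    instance["unit"] = unit

@dataclass
class Migration:
    codename: str
    description: str
    nwb_versions: str      # e.g. "< 2.1.0"
    callable: object       # function performing the fixup
    stage: str             # e.g. "init"
    objects: list = field(default_factory=list)  # per-type filters/arguments

MIGRATIONS = [
    Migration(
        codename="hardcoded_units_210",
        description="Since NWB 2.1.0 units for these neural data types are hardcoded",
        nwb_versions="< 2.1.0",
        callable=ensure_unit,
        stage="init",
        objects=[
            {"ndtype": "CurrentClampSeries", "unit": "volts"},
            {"ndtype": "CurrentClampStimulusSeries", "unit": "amperes"},
        ],
    ),
]

def apply_migrations(instance, ndtype, file_nwb_version, stage="init"):
    """Apply every registered migration matching this type, version, and stage."""
    for migration in MIGRATIONS:
        if migration.stage != stage:
            continue
        if file_nwb_version not in SpecifierSet(migration.nwb_versions.replace(" ", "")):
            continue
        for obj in migration.objects:
            if obj["ndtype"] != ndtype:
                continue
            kwargs = {k: v for k, v in obj.items() if k != "ndtype"}
            migration.callable(instance, **kwargs)

# A base class __init__ could then call apply_migrations on itself, e.g.:
series = {"unit": "Volts"}   # stand-in for a real container object
apply_migrations(series, "CurrentClampSeries", "2.0.2")
assert series["unit"] == "volts"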

bendichter commented 4 years ago

@yarikoptic Interesting idea. This could help us work cross-language as well. So are you suggesting that we add the concept of deprecation to the HDMF schema language or create a separate system for migration?

yarikoptic commented 4 years ago

I don't know the hdmf vs pynwb separation well enough to offer an informed opinion. But if hdmf is meant to be used for anything other than pynwb, providing such migrations in hdmf itself could be a nice feature. I'm not sure how far that would go toward cross-language migration functionality, beyond choosing an existing language for the migration definitions (instead of crafting a new one) and making it possible for other languages to use it.

oruebel commented 4 years ago

I agree that this is a relevant need. I think there would likely be a split with common migration features in HDMF and some NWB-specific tooling in PyNWB. I'm honestly not sure right now what will be needed here, but I'd expect that this will require some significant effort.

bendichter commented 4 years ago

Looking more closely at @yarikoptic's original comment, he was referring to the enforcement of units of measurement across neurodata types as an example. This points to a need to migrate features of neurodata_types. The YAML that Yarik proposes IMO crosses into implementation: it doesn't define the functions, but it does say which functions to call and when. So technically, yes, it is declarative, but it doesn't have the language-agnostic benefit that most declarative solutions offer, since it would still depend on language-specific functions.

Perhaps a related and more common use-case would be deprecating an entire neurodata type in the schema in favor of another.

Either way, if we want to express deprecation concepts in the NWB schema YAML files, we have to do it through the schema language those files are written in. The NWB schema uses the HDMF schema language with minor modifications (data_type -> neurodata_type). So I guess our choices are:

1) Extend the NWB flavor of the HDMF schema language to handle deprecation
2) Extend the HDMF schema language itself to handle deprecation for the entire HDMF family

I'd prefer option 2, because this functionality doesn't seem neuroscience-specific, and other projects might also benefit from it.
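
For illustration of option 2 only, a hedged sketch of what consuming such a deprecation field could look like on the Python side; the deprecated/replaced_by keys and the helper below are assumptions, not part of the current HDMF schema language:

# Hypothetical: assume the schema language gained "deprecated"/"replaced_by" keys.
# Nothing below exists in HDMF today; it only sketches what option 2 could enable.
import warnings
import yaml

SPEC_SNIPPET = """
groups:
- neurodata_type_def: OldSeries
  deprecated: true            # hypothetical new schema-language key
  replaced_by: NewSeries      # hypothetical pointer to the successor type
  doc: An example type kept only for backward compatibility.
"""

def warn_on_deprecated_types(spec_yaml):
    """Emit a DeprecationWarning for every type the (extended) spec marks deprecated."""
    spec = yaml.safe_load(spec_yaml)
    for group in spec.get("groups", []):
        if group.get("deprecated"):
            name = group["neurodata_type_def"]
            replacement = group.get("replaced_by", "no replacement")
            warnings.warn(f"{name} is deprecated; use {replacement} instead.",
                          DeprecationWarning)

warn_on_deprecated_types(SPEC_SNIPPET)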