NeurodataWithoutBorders / nwb-schema

Data format specification schema for the NWB neurophysiology data format
http://nwb-schema.readthedocs.io
Other
53 stars 16 forks source link

Deprecate compound dtypes, `TimeSeriesReferenceVectorData` #588

Open sneakers-the-rat opened 2 months ago

sneakers-the-rat commented 2 months ago

implementing these now alongside dynamictables and it seems like these are mostly artifacts of history atp that persist bc of the relative rarity of intracellular ephys data nowadays.

Usage

compound dtypes are used 4 times in the schema:

TimeSeriesReferenceVectorData is used in 4 places (but actually 2)

Purpose

The function of compound dtypes are to group multiple values together into something that behaves like a tuple.

The function of TimeSeriesReferenceVectorData is to be a 1-d vector into arbitrary contiguous spans in arbitrary other datasets.

Degeneracy

compound dtypes are equivalently well-served by datasets with multiple attributes. There should be a single Position type and it should have x, y, z as attributes. the hdmf/nwb schema language is actually relatively uniquely able to do this with its *_type_inc that behaves both like a template inclusion as well as inheritance.

TimeSeriesReferenceVectorData is equivalent to DynamicTableRegion in functionality - Both are capable of representing ragged indexes into arbitrary objects.

Specifically this:

- name: timeseries
  neurodata_type_inc: TimeSeriesReferenceVectorData
  quantity: '?'

is functionally equivalent to this

- neurodata_type_inc: DynamicTableRegion
  quantity: '+'

with slightly different indexing semantics, and other more complex examples could make them entirely equivalent. This is especially true since DynamicTableRegion can also have a VectorIndex so it is a ragged array of indices into another table.

Problem

The many ways that it is possible to express references is the single greatest point of complexity in NWB and the nwb schema. it's possible to do a more or less linear translation of nwb schema language and nwb schema into other systems like linkml with the exception of references. Currently TimeSeriesReferenceVectorData is not even working in pynwb as far as i can tell, as an intracellular ephys response table just returns the literal HDMF dataset within a dataframe rather than the values of that dataset.

Relatively few backends support the idea of compound dtypes, causing problems like this where the special-cased compound dtype cause a tool-breaking perf problem. we should probably expect that to persist and hold us back if we want this format to survive into the future, since most other schema systems would just have a means of representing compound dtypes as a type with multiple attributes.

It seems to me like this was an older idea that has been surpassed by the dynamictable system, and keeping both around when they do the same thing probably multiplies the labor cost of maintaining the format by a nontrivial factor, as well as limits the number of contemporary formats it can be translated into. ik y'all are strapped for time but i think this would be one set of deprecations that would be net-positive for y'all in terms of "complex stuff for ya to deal with" <3

oruebel commented 2 months ago

Thanks for the detailed issue. Just a couple of quick comments to provide a little bit of additional background.

TimeSeriesReferenceVectorData is equivalent to DynamicTableRegion in functionality - Both are capable of representing ragged indexes into arbitrary objects.

DynamicTableRegion and TimeSeriesReferenceVectorData serve different purposes. Some of the key difference are:

Currently TimeSeriesReferenceVectorData is not even working in pynwb as far as i can tell, as an intracellular ephys response table just returns the literal HDMF dataset within a dataframe rather than the values of that dataset.

TimeSeriesReferenceVectorData is working in PyNWB to the best of my knowledge. When selecting elements in a TimeSeriesReferenceVectorData, it returnsTimeSeriesReferenceobjects to make it easy to slice directly into theTimeSeriesthat is being referenced based on the time range that is being specified in theTimeSeriesReference`.

It seems to me like this was an older idea that has been surpassed by the dynamictable system

The concept of referencing TimeSeries was first introduced when the TimeIntervals type (which is also DynamicTable) was added. When the new data structures for icephys where added we simply gave the existing column type in TimeIntervals a neurodata_type (i.e, TimeSeriesReferenceVectorData) to avoid duplication of code and functionality across TimeIntervals and the icephys metadata tables. I.e., TimeSeriesReferenceVectorData was introduced after DynamicTable and DynamicTableRegion. The main function TimeSeriesReferenceVectorData is to make it easy to indicate specifically the TimeSeries a user wants to annotate (or select data from) via explicit indices so we can directly access the data (whereas, TimeIntervals stores actual times in seconds that need to be converted to indices in time (the reason being that a single row in TimeIntervals table may apply to multiple TimeSeries)

sneakers-the-rat commented 2 months ago

ha i knew i should have made this two separate issues. my mistake.

the reason i linked the two things together here is mostly that TimeSeriesReferenceVectorData seems like the one place that compound dtypes are used in a way that is intrinsic to the class design (ie. the position uses seem like they could be done equivalently with an n * 3 array or 3 VectorData columns, etc.)

the goal here is to think about ways that we can collapse the number of schema language constructs to simplify implementation/increase interoperability/reduce maintenance burden, so to take account of the ways that references can be made:

type method source domain target range selection
schema links groups groups, datasets object
schema Reference dtype dataset, attribute group, dataset object, region
implementation DynamicTableRegion VectorData DynamicTable region of single object
implementation VectorIndex VectorData VectorData region of single object
implementation TimeSeriesReferenceVectorData VectorData TimeSeries region of many objects

any others?

So it seems like we currently have separate implementations for what amount to two fundamental reference operations:

currently several of the reference types are done by convention in the HDMF and pynwb implementation - not strictly based in the schema, but well defined nonetheless - could we refactor the notion of references in the schema lang to simplify them?

It seems like reftype: region is currently unused in the core schema, could it be?

HDF5 supports region references natively, and otherwise they can be serialized as a object_id, slice tuple.

So then rather than having a specific TimeSeriesReferenceVectorData class, it just becomes dtype: { reftype: region }

This simplifies the various types of references that exist in a way that could be compatible with existing API, be grounded in the schema, and have a single method of serialization in non-hdf5 formats.

So then compound dtypes can be removed, and TimeSeriesReferenceVectorData is not a special case but can be kept to it can be neurodata_type_inc'd as a shorthand for that pattern. (While we're at it we can remove links and turn those into attributes with an object reference. i am not sure if they are supposed to behave differently than object references, but they seem to have a sort of mixed implementation, sometimes attributes, sometimes groups, etc. but anyway that's another separate convo).

Currently I am not exactly sure how to implement a TimeSeriesReferenceVectorData with an _index (as per TimeIntervals), since it would be a (potentially) ragged array of (potentially) ragged arrays, which can then be referred to by another DynamicTableRegion column in a different table, etc. and the compositionality becomes hard with the switching between different modes of indexing. reducing to a single region reference type would simplify that considerably i think!

As usual i'm probably way off here, but just trying to brainstorm how we might simplify this bc it's definitely been the most complex part of implementing the format, and i'm thinking about the future of a very interoperable NWB, and ease of implementing bridges/adapters is key to that <3


TimeSeriesReferenceVectorData is working in PyNWB to the best of my knowledge. When selecting elements in a TimeSeriesReferenceVectorData, it returns TimeSeriesReferenceobjects

aha, i wasn't sure this was how it was supposed to be since VectorIndex directly materializes the values. i think the need for a specific TimeSeriesResponses class for this is sorta illustrative of what i'm talking about here (though i get why it exists).


The concept of referencing TimeSeries was first introduced when the TimeIntervals type (which is also DynamicTable) was added.

thanks for the historical note, makes sense and i always love hearing about how the needs and constraints were approached given the state of the format at the time <3