Open sneakers-the-rat opened 2 months ago
Thanks for the detailed issue. Just a couple of quick comments to provide a little bit of additional background.
TimeSeriesReferenceVectorData
is equivalent toDynamicTableRegion
in functionality - Both are capable of representing ragged indexes into arbitrary objects.
DynamicTableRegion
and TimeSeriesReferenceVectorData
serve different purposes. Some of the key difference are:
DynamicTableRegion
targets DynamicTable
objects to select strictly rows from a table while TimeSeriesReferenceVectorData
targets selection of ranges in time in a TimeSeries
DynamicTableRegion
targets a single DynamicTable
whereas in TimeSeriesReferenceVectorData
each element can target a different TimeSeries
. DynamicTableRegion
can, hence, store a single reference to a DynamicTable
as an attribute, whereas TimeSeriesReferenceVectorData
stores the reference as part of the elements of the dataset itself.Currently
TimeSeriesReferenceVectorData
is not even working in pynwb as far as i can tell, as an intracellular ephys response table just returns the literal HDMF dataset within a dataframe rather than the values of that dataset.
TimeSeriesReferenceVectorData
is working in PyNWB to the best of my knowledge. When selecting elements in a TimeSeriesReferenceVectorData, it returns
TimeSeriesReferenceobjects to make it easy to slice directly into the
TimeSeriesthat is being referenced based on the time range that is being specified in the
TimeSeriesReference`.
It seems to me like this was an older idea that has been surpassed by the dynamictable system
The concept of referencing TimeSeries
was first introduced when the TimeIntervals
type (which is also DynamicTable
) was added. When the new data structures for icephys where added we simply gave the existing column type in TimeIntervals
a neurodata_type
(i.e, TimeSeriesReferenceVectorData
) to avoid duplication of code and functionality across TimeIntervals
and the icephys metadata tables. I.e., TimeSeriesReferenceVectorData
was introduced after DynamicTable
and DynamicTableRegion
. The main function TimeSeriesReferenceVectorData
is to make it easy to indicate specifically the TimeSeries
a user wants to annotate (or select data from) via explicit indices so we can directly access the data (whereas, TimeIntervals
stores actual times in seconds that need to be converted to indices in time (the reason being that a single row in TimeIntervals
table may apply to multiple TimeSeries
)
ha i knew i should have made this two separate issues. my mistake.
the reason i linked the two things together here is mostly that TimeSeriesReferenceVectorData
seems like the one place that compound dtypes are used in a way that is intrinsic to the class design (ie. the position uses seem like they could be done equivalently with an n * 3
array or 3 VectorData
columns, etc.)
the goal here is to think about ways that we can collapse the number of schema language constructs to simplify implementation/increase interoperability/reduce maintenance burden, so to take account of the ways that references can be made:
type | method | source domain | target range | selection |
---|---|---|---|---|
schema | links |
groups | groups, datasets | object |
schema | Reference dtype |
dataset, attribute | group, dataset | object, region |
implementation | DynamicTableRegion |
VectorData | DynamicTable | region of single object |
implementation | VectorIndex |
VectorData | VectorData | region of single object |
implementation | TimeSeriesReferenceVectorData |
VectorData | TimeSeries | region of many objects |
any others?
So it seems like we currently have separate implementations for what amount to two fundamental reference operations:
*
-> whole object (group/dataset/attribute)*
-> region of object (either single value or ragged selection)currently several of the reference types are done by convention in the HDMF and pynwb implementation - not strictly based in the schema, but well defined nonetheless - could we refactor the notion of references in the schema lang to simplify them?
It seems like reftype: region
is currently unused in the core schema, could it be?
HDF5 supports region references natively, and otherwise they can be serialized as a object_id, slice
tuple.
So then rather than having a specific TimeSeriesReferenceVectorData
class, it just becomes dtype: { reftype: region }
This simplifies the various types of references that exist in a way that could be compatible with existing API, be grounded in the schema, and have a single method of serialization in non-hdf5 formats.
VectorData
/VectorIndex
: two 1D datasets, one with values, and the other with RegionReference
s into that. as a bonus, now indices don't need to be contiguous:
vectordata = h5f.create_dataset('vectordata', data=np.arange(100))
vectorindex = h5f.create_dataset('vectorindex', data=[vectordata.regionref[0:10], vectordata.regionref[10:25]], dtype=h5py.ref_dtype)
DynamicTableRegion
: a VectorData
instance with a 1D array and dtype: { target_type: DynamicTable, reftype: region }
(no longer needs a VectorIndex
for ragged arrays, as the region reference can be singular or ragged, and also no longer needs table
attribute since the table references are per-cell).TimeSeriesReferenceVectorData
: any dataset with a 1D array and dtype: { target_type: TimeSeries, reftype: region }
So then compound dtypes can be removed, and TimeSeriesReferenceVectorData
is not a special case but can be kept to it can be neurodata_type_inc
'd as a shorthand for that pattern. (While we're at it we can remove links
and turn those into attributes
with an object reference. i am not sure if they are supposed to behave differently than object references, but they seem to have a sort of mixed implementation, sometimes attributes, sometimes groups, etc. but anyway that's another separate convo).
Currently I am not exactly sure how to implement a TimeSeriesReferenceVectorData
with an _index
(as per TimeIntervals
), since it would be a (potentially) ragged array of (potentially) ragged arrays, which can then be referred to by another DynamicTableRegion
column in a different table, etc. and the compositionality becomes hard with the switching between different modes of indexing. reducing to a single region reference type would simplify that considerably i think!
As usual i'm probably way off here, but just trying to brainstorm how we might simplify this bc it's definitely been the most complex part of implementing the format, and i'm thinking about the future of a very interoperable NWB, and ease of implementing bridges/adapters is key to that <3
TimeSeriesReferenceVectorData is working in PyNWB to the best of my knowledge. When selecting elements in a TimeSeriesReferenceVectorData, it returns TimeSeriesReferenceobjects
aha, i wasn't sure this was how it was supposed to be since VectorIndex
directly materializes the values. i think the need for a specific TimeSeriesResponses
class for this is sorta illustrative of what i'm talking about here (though i get why it exists).
The concept of referencing TimeSeries was first introduced when the TimeIntervals type (which is also DynamicTable) was added.
thanks for the historical note, makes sense and i always love hearing about how the needs and constraints were approached given the state of the format at the time <3
implementing these now alongside dynamictables and it seems like these are mostly artifacts of history atp that persist bc of the relative rarity of intracellular ephys data nowadays.
Usage
compound dtypes are used 4 times in the schema:
TimeSeriesReferenceVectorData
: https://github.com/NeurodataWithoutBorders/nwb-schema/blob/63ac845b2e1afdf6fc79466c1feafdd34538bb10/core/nwb.base.yaml#L9ElectrodeGroup.position
: https://github.com/NeurodataWithoutBorders/nwb-schema/blob/63ac845b2e1afdf6fc79466c1feafdd34538bb10/core/nwb.ecephys.yaml#L243PlaneSegmentation.pixel_mask
: https://github.com/NeurodataWithoutBorders/nwb-schema/blob/63ac845b2e1afdf6fc79466c1feafdd34538bb10/core/nwb.ophys.yaml#L156PlaneSegmentation.voxel_mask
: https://github.com/NeurodataWithoutBorders/nwb-schema/blob/63ac845b2e1afdf6fc79466c1feafdd34538bb10/core/nwb.ophys.yaml#L175TimeSeriesReferenceVectorData
is used in 4 places (but actually 2)TimeIntervals.timeseries
: https://github.com/NeurodataWithoutBorders/nwb-schema/blob/63ac845b2e1afdf6fc79466c1feafdd34538bb10/core/nwb.epoch.yaml#L25IntracellularRecordingsTable
IntracellularStimuliTable.stimulus
: https://github.com/NeurodataWithoutBorders/nwb-schema/blob/63ac845b2e1afdf6fc79466c1feafdd34538bb10/core/nwb.icephys.yaml#L287IntracellularStimuliTable.stimulus_template
: https://github.com/NeurodataWithoutBorders/nwb-schema/blob/63ac845b2e1afdf6fc79466c1feafdd34538bb10/core/nwb.icephys.yaml#L290IntracellularResponsesTable.response
: https://github.com/NeurodataWithoutBorders/nwb-schema/blob/63ac845b2e1afdf6fc79466c1feafdd34538bb10/core/nwb.icephys.yaml#L304Purpose
The function of compound dtypes are to group multiple values together into something that behaves like a tuple.
The function of
TimeSeriesReferenceVectorData
is to be a 1-d vector into arbitrary contiguous spans in arbitrary other datasets.Degeneracy
compound dtypes are equivalently well-served by datasets with multiple attributes. There should be a single
Position
type and it should havex, y, z
as attributes. the hdmf/nwb schema language is actually relatively uniquely able to do this with its*_type_inc
that behaves both like a template inclusion as well as inheritance.TimeSeriesReferenceVectorData
is equivalent toDynamicTableRegion
in functionality - Both are capable of representing ragged indexes into arbitrary objects.Specifically this:
is functionally equivalent to this
with slightly different indexing semantics, and other more complex examples could make them entirely equivalent. This is especially true since
DynamicTableRegion
can also have aVectorIndex
so it is a ragged array of indices into another table.Problem
The many ways that it is possible to express references is the single greatest point of complexity in NWB and the nwb schema. it's possible to do a more or less linear translation of nwb schema language and nwb schema into other systems like linkml with the exception of references. Currently
TimeSeriesReferenceVectorData
is not even working in pynwb as far as i can tell, as an intracellular ephys response table just returns the literal HDMF dataset within a dataframe rather than the values of that dataset.Relatively few backends support the idea of compound dtypes, causing problems like this where the special-cased compound dtype cause a tool-breaking perf problem. we should probably expect that to persist and hold us back if we want this format to survive into the future, since most other schema systems would just have a means of representing compound dtypes as a type with multiple attributes.
It seems to me like this was an older idea that has been surpassed by the dynamictable system, and keeping both around when they do the same thing probably multiplies the labor cost of maintaining the format by a nontrivial factor, as well as limits the number of contemporary formats it can be translated into. ik y'all are strapped for time but i think this would be one set of deprecations that would be net-positive for y'all in terms of "complex stuff for ya to deal with" <3