NeurodataWithoutBorders / nwb-schema

Data format specification schema for the NWB neurophysiology data format
http://nwb-schema.readthedocs.io

Storage schema for spike trains #117

Closed ajtritt closed 6 years ago

ajtritt commented 6 years ago

Use three datasets:

  1. id - 1D dataset (unit identifier for entities)
    • this will point to a global table or other data structure providing information about this item
  2. pointer - 1D dataset ((start, stop) into 'value' for the entity identified by 'id')
    • this could be a region reference, pending benchmarking
  3. value - 1D dataset (values for entities identified by 'id')

Provide a mechanism for creating index tables for querying. Index tables can be created for arbitrary categories. (A small sketch of the three-dataset layout follows below.)
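For concreteness, a minimal numpy sketch of one reading of this layout, where 'pointer' holds only each unit's start offset and the stop is the next unit's start; the ids, offsets, and times below are made up for illustration and are not part of any schema:

import numpy as np

# Ragged spike-train storage: three flat 1D arrays.
ids = np.array([101, 102, 103])          # unit identifiers ('id')
pointer = np.array([0, 4, 7])            # start offset of each unit's spikes in 'value'
value = np.array([0.1, 0.5, 0.9, 1.2,    # spikes of unit 101
                  0.2, 0.7, 1.1,         # spikes of unit 102
                  0.3, 0.8])             # spikes of unit 103

def spike_times(unit_id):
    # Look up the row for this unit, then slice its span out of 'value'.
    i = int(np.where(ids == unit_id)[0][0])
    start = pointer[i]
    stop = pointer[i + 1] if i + 1 < len(pointer) else len(value)
    return value[start:stop]

print(spike_times(102))   # [0.2 0.7 1.1]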

oruebel commented 6 years ago

Shouldn't pointer be a 1D dataset of region references into value?

bendichter commented 6 years ago

example code without region references:

import numpy as np
import random

from pynwb.spec import NWBDatasetSpec, NWBNamespaceBuilder, NWBGroupSpec, NWBDtypeSpec
from pynwb import get_class, load_namespaces, NWBFile, NWBHDF5IO
from pynwb.form.backends.hdf5 import H5DataIO

from datetime import datetime

# define extensions
ns_path = "soltesz.namespace.yaml"
ext_source = "soltesz.extensions.yaml"

gid_spec = NWBDatasetSpec(doc='global id for neuron',
                          shape=(None, 1),
                          name='cell_index', dtype='int')
data_val_spec = NWBDatasetSpec(doc='Data values indexed by pointer', 
                                shape=(None, 1), 
                                name='value', dtype='float')
data_pointer_spec = NWBDatasetSpec(doc='Pointers that index data values', 
                                   shape=(None, 1),
                                   name='pointer', dtype='int')

gid_pointer_value_spec = [gid_spec, data_val_spec, data_pointer_spec]

PopulationSpikeTimes = NWBGroupSpec(neurodata_type_def='PopulationSpikeTimes',
                                    doc='Population Spike Times',
                                    datasets=gid_pointer_value_spec,
                                    neurodata_type_inc='NWBDataInterface')

CatCellInfo = NWBGroupSpec(neurodata_type_def='CatCellInfo',
                           doc='Categorical Cell Info',
                           datasets=[gid_spec,
                                     NWBDatasetSpec(name='indices',
                                                    doc='indices into values for each gid in order',
                                                    shape=(None, 1),
                                                    dtype='int'),
                                     NWBDatasetSpec(name='values',
                                                    doc='list of unique values',
                                                    shape=(None, 1), dtype='str')],
                           neurodata_type_inc='NWBDataInterface')

ContCellInfo = NWBGroupSpec(neurodata_type_def='ContCellInfo', 
                            datasets=[gid_spec,
                                NWBDatasetSpec(name='data',
                                               doc='continuous values',
                                               shape=(None, None), dtype='float')],
                           doc='Continuous Cell Info',
                           neurodata_type_inc='NWBDataInterface')

ns_builder = NWBNamespaceBuilder('soltesz extensions', "soltesz")
ns_builder.add_spec(ext_source, PopulationSpikeTimes)
ns_builder.add_spec(ext_source, CatCellInfo)
ns_builder.add_spec(ext_source, ContCellInfo)
ns_builder.export(ns_path)

#######

# generate and save fake data
duration = 10
nspikes = 1000
nunits = 5

times = np.random.rand(nspikes) * duration
data_pointer = np.hstack([0,np.sort(random.sample(range(len(times)), nunits - 1))])
gids = np.arange(nunits).astype('int')
cell_types_vals, cell_types_inds  = np.unique(['ME']*3+['LE']*2, return_inverse=True)
pos = np.random.randn(nunits, 3)

## pull types from yaml files
load_namespaces(ns_path)
PopulationSpikeTimes = get_class('PopulationSpikeTimes', 'soltesz')
CatCellInfo = get_class('CatCellInfo', 'soltesz')
ContCellInfo = get_class('ContCellInfo', 'soltesz')

pop_data = PopulationSpikeTimes(name='example_population_spikes', source='source',
                                cell_index=gids, value=times, pointer=data_pointer)
cell_types = CatCellInfo(name='cell_types',source='source',
                         values=cell_types_vals, indices=cell_types_inds, cell_index=gids)
cell_pos = ContCellInfo(name='cell_x_pos', source='source', data=pos, cell_index=gids)

## write to file
f = NWBFile(file_name='tmp.nwb',
            source='me',
            session_description='my first synthetic recording',
            identifier='EXAMPLE_ID',
            session_start_time=datetime.now(),
            experimenter='Dr. Bilbo Baggins',
            lab='Bag End Laboratory',
            institution='University of Middle Earth at the Shire',
            experiment_description='empty',
            session_id='LONELYMTN')

population_module = f.create_processing_module(name='0', source='source',
                                               description='description') 
population_module.add_container(pop_data)
population_module.add_container(cell_types)
population_module.add_container(cell_pos)

io = NWBHDF5IO('tmp.nwb', mode='w')
io.write(f)
io.close()
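For what it's worth, a hedged read-back check using plain h5py; the HDF5 path below is an assumption about where pynwb places a processing module named '0' and its containers, so adjust it as needed:

import numpy as np
import h5py

with h5py.File('tmp.nwb', 'r') as f:
    grp = f['processing/0/example_population_spikes']  # assumed path
    gids = grp['cell_index'][:].ravel()
    pointer = grp['pointer'][:].ravel()
    value = grp['value'][:].ravel()

    # 'pointer' holds each unit's start offset into 'value'; the stop is the
    # next unit's start (or the end of 'value' for the last unit).
    starts = np.append(pointer, len(value))
    for gid, start, stop in zip(gids, starts[:-1], starts[1:]):
        print(gid, value[start:stop])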

bendichter commented 6 years ago

@oruebel @ajtritt

Hey guys, I'm trying to implement RegionReferences here and I could use your help. I'm looking at http://docs.h5py.org/en/latest/refs.html, and at first glance RegionReferences look pretty straightforward. All I would need to do is:

values = myfile.create_dataset('values', (n, 1))
pointers_obj = [values.regionref[start_pointer:end_pointer] for start_pointer, end_pointer in pointers]

However, I am running into two obstacles. First, the h5py docs describe storing a single reference object in a dataset, but not a list of reference objects as a single dataset, which is what we would like for the "pointer" slot. Is that possible with HDF5/h5py? Second, it looks like I need to create the referenced dataset first and only then create the reference to it, which I think makes this impossible to express in the extension framework, because all of the datasets are specified at the same time.
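Regarding the first obstacle, plain h5py does appear to support a dataset whose elements are region references, via a special dtype. A rough standalone sketch (the file name, dataset names, and (start, stop) spans are made up for illustration, and this bypasses pynwb entirely):

import numpy as np
import h5py

with h5py.File('regionref_demo.h5', 'w') as f:
    # the dataset being referenced must exist before the references are made
    times = np.sort(np.random.rand(100))
    values = f.create_dataset('value', data=times)

    # illustrative (start, stop) spans, one per unit
    pointers = [(0, 40), (40, 75), (75, 100)]

    # a 1D dataset whose element type is "region reference"
    ref_dtype = h5py.special_dtype(ref=h5py.RegionReference)
    pointer_ds = f.create_dataset('pointer', (len(pointers),), dtype=ref_dtype)
    for i, (start, stop) in enumerate(pointers):
        pointer_ds[i] = values.regionref[start:stop]

    # dereferencing: indexing 'value' with a stored region reference
    # returns only that unit's spike times
    unit0_times = values[pointer_ds[0]]
    print(len(unit0_times))  # 40

That only speaks to the first point; the ordering constraint within the extension framework is a separate question.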

ajtritt commented 6 years ago

@bendichter The bookkeeping for all of this is handled in HDF5IO. We just need to create the schema and the front-end classes, so don't worry about needing to create the referenced dataset first.