shashwatsridhar opened this issue 3 years ago
Hi @shashwatsridhar. On a first read this sounds like you are falling into the let-me-introduce-yet-another-standard trap. I think the NixIO_fr might help you, as it allows reading NIX files in a lazy mode, provided the neo structure is raw-compatible (i.e. has the same number of channels across segments). Since you are also planning to switch to the latest version of the BlackrockIO, this would be the case anyway for the data you are going to load in the future. Alternatively, if you would still like to separate the different types of data into different files as you describe above, it would make sense to use an existing format (e.g. openephys, exdir) and extend neo's capabilities for it. Of these, the first format can already be read by neo, but not yet written, and the latter is on the list to be included in neo at some point in the future.
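For illustration, here is a minimal sketch of what lazy, channel-wise access through the raw-compatible NIX reader could look like, assuming a recent neo where the reader is importable from `neo.io.nixio_fr` and exposes the usual proxy-object API (`data.nix` is a placeholder filename; exact names may differ between neo versions):

```python
# Sketch only: lazy loading via neo's raw-compatible NIX reader.
# Assumes a neo version where the reader lives in neo.io.nixio_fr and
# read_block(lazy=True) returns proxy objects; "data.nix" is a placeholder.
from neo.io.nixio_fr import NixIO as NixIOFr

reader = NixIOFr(filename="data.nix")
block = reader.read_block(lazy=True)      # proxy objects only, no data in memory yet
segment = block.segments[0]

# Load a single channel of an analog signal without touching the rest of the file.
signal_proxy = segment.analogsignals[0]
one_channel = signal_proxy.load(channel_indexes=[0])

# Spike trains can likewise be materialised one at a time, waveforms on demand.
spiketrain = segment.spiketrains[0].load(load_waveforms=True)
```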
Given the recent renewed interest in Swan, I was trying to get Swan to work with a neo IO class that is compatible with newer versions of neo (currently we only support blackrockio_v4). My idea was to create a pipeline wherein the user converts her data to a common intermediary format that is easy to write to (e.g. npy), and then to provide a conversion script that turns the intermediary files into a neo-compatible file format.
The problem is, I couldn't find any format in neo that 1) supports writing blocks, AND 2) supports lazy loading / channel-by-channel loading. For example, NixIO and PickleIO can write blocks just fine, but the resulting files cannot be loaded lazily or channel-by-channel. For users with many sessions to analyze, this would quickly become intractable due to memory limitations.
One solution would be to have the user first split their data into channel-by-channel intermediary files and then convert each channel into a single .pkl file. While this would work, it is not a very elegant solution, requiring two conversion steps to get the data into a Swan-compatible format (see the sketch below).
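As a rough illustration of why this feels clunky, the per-channel workaround would amount to something like the following (a sketch only; `spike_times`, `waveforms`, and `channels` are hypothetical arrays the user has already extracted from their recording):

```python
# Sketch of the two-step, per-channel .pkl workaround (not the proposed solution).
# spike_times (N,), waveforms (M, N) and channels (N,) are hypothetical arrays
# the user has already extracted from their recording in a first conversion step.
import os
import pickle
import numpy as np

def dump_per_channel(spike_times, waveforms, channels, out_dir="session_01"):
    """Write one pickle per channel so each channel can be loaded independently."""
    os.makedirs(out_dir, exist_ok=True)
    for ch in np.unique(channels):
        mask = channels == ch
        payload = {"spike_times": spike_times[mask],
                   "waveforms": waveforms[:, mask]}
        with open(os.path.join(out_dir, f"channel_{int(ch):03d}.pkl"), "wb") as f:
            pickle.dump(payload, f)
```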
An alternative solution would be to create a `SwanNumpyIO` class based on `neo.BaseFromRaw` and `neo.BaseRawIO`. This would read in folders corresponding to individual sessions, each containing certain required numpy files, and use numpy's memmap functionality to read in data channel-by-channel (see the sketch after the proposed format below). This has three advantages that I can see:

1) the users only need to convert their data to numpy (with the structure I propose below),
2) Swan is then relatively independent of neo release cycles, allowing for quicker bug fixes and improvements in data IO, and
3) data can be loaded quickly, channel-by-channel.
The numpy format I propose is as follows:
Each session is stored in a folder, whose name corresponds to the dataset name. The folder contains four files:
- `spikes.npy` - a 3xN array, with one column per spike, structured as follows:

  | row | contents |
  | --- | --- |
  | spiketimes | time of each spike |
  | labels | unit/cluster label of each spike |
  | channels | channel on which each spike was recorded |

  - contains all information about the spikes and units in that dataset
  - easy to read and convert to neo Groups corresponding to units/clusters
- `waveforms.npy` - an MxN array containing all waveforms corresponding to the spike times in the `spikes.npy` file; M is the dimension of each waveform
- `events.npy` - a 2xN array recording the timestamps and names of all experimental events in the segment (one row of timestamps, one row of event names)
- `metadata.json` - any additional metadata corresponding to the data, stored in the form of nested dicts (I'm still not sure of the precise structure of this file)
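To make the proposal concrete, here is a rough sketch of how a session folder in this layout could be written and then read back channel-by-channel via numpy's memmap. This is not a `BaseRawIO` implementation and all function names are hypothetical; it only illustrates the file layout and the kind of per-channel lookup the proposed `SwanNumpyIO` would do internally:

```python
# Sketch only: writing a session folder in the proposed layout and reading it back
# channel-by-channel via memory mapping. All names here are hypothetical; this is
# not part of neo or Swan, just an illustration of the proposed format.
import json
import os
import numpy as np

def write_session(folder, spiketimes, labels, channels, waveforms,
                  event_times, event_names, metadata):
    """Write spikes.npy (3xN), waveforms.npy (MxN), events.npy (2xN), metadata.json."""
    os.makedirs(folder, exist_ok=True)
    spikes = np.vstack([spiketimes, labels, channels])          # 3 x N, one column per spike
    np.save(os.path.join(folder, "spikes.npy"), spikes)
    np.save(os.path.join(folder, "waveforms.npy"), waveforms)   # M x N, same column order
    # Mixing timestamps and names in one 2xN array forces a string dtype here.
    events = np.vstack([np.asarray(event_times, dtype=str),
                        np.asarray(event_names, dtype=str)])
    np.save(os.path.join(folder, "events.npy"), events)
    with open(os.path.join(folder, "metadata.json"), "w") as f:
        json.dump(metadata, f, indent=2)

def load_channel(folder, channel_id):
    """Return spike times, labels and waveforms of a single channel.

    With mmap_mode='r' only the parts of the arrays that are actually indexed
    are pulled from disk, which is what makes channel-by-channel loading cheap.
    """
    spikes = np.load(os.path.join(folder, "spikes.npy"), mmap_mode="r")
    waveforms = np.load(os.path.join(folder, "waveforms.npy"), mmap_mode="r")
    mask = spikes[2] == channel_id          # third row holds the channel IDs
    return {"spiketimes": np.asarray(spikes[0][mask]),
            "labels": np.asarray(spikes[1][mask]),
            "waveforms": np.asarray(waveforms[:, mask])}
```

Whether this masking logic ends up inside a proper `BaseRawIO` subclass or stays a thin standalone loader is exactly the design question raised above.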
I have never implemented a neo IO class, so I might be misjudging the complexity of the task itself. I was hoping @JuliaSprenger and @mdenker could share their thoughts and insights here. Do you think it's worth it?