INCF / neuroscience-data-structure

Space for discussion of a standardized structure (directory layout + metadata) for experimental data in systems neuroscience, similar to the idea of BIDS in neuroimaging

File formats for data and metadata #2

Open apdavison opened 4 years ago

apdavison commented 4 years ago

We propose to limit the file formats that are allowed to be placed within the directory structure. All allowed formats should have open, non-proprietary specifications.

Some suggestions to get the ball rolling:

samuelgarcia commented 4 years ago

A quick comment on Andrew's suggestions.

jcolomb commented 4 years ago

About format restriction:

For the metadata format, I think what drove the wide adoption of BIDS was the possibility of having metadata in the .tsv format (spreadsheets). Most researchers still think a computer is an advanced typewriter and will just run away when they hear NIX or NWB :)

chrisvdt commented 4 years ago

This may be one of the most important issues for establishing a new standard: how to deal with metadata. Metadata structures are so overwhelmingly large and diverse, owing to the diverse types of experimental research in neuroscience, that I do not think we will come up with one standard to fit all. So in general I would propose that researchers save metadata in their preferred format, but associate it with a script that retrieves this metadata for further use. Such a "Metadataread" step could be a formal building block of this new data structure.

Aside from this, you can define very global metadata that can be stored in a readable JSON file with each recording session: global identifiers that are necessary for any type of research. For example, for any experimental recording I would like to know which project it is associated with, what subject (animal or preparation), and what method was used (but not sampling rate or pixel depth). So I would propose: start from the top with the most general things we can think of, work slowly down, and leave the rest to "Metadataread".
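
As an illustration only, here is a minimal sketch in Python of what such a session-level JSON file and its "Metadataread" building block could look like; all field names (project, subject, method, session_id) are hypothetical placeholders, not part of any agreed specification.

```python
import json

# Hypothetical "global" session-level metadata: only identifiers that apply
# to any kind of experiment, regardless of modality.
session_metadata = {
    "project": "my-project-id",        # which project the recording belongs to
    "subject": "mouse-017",            # animal or preparation identifier
    "method": "extracellular-ephys",   # acquisition method, without device details
    "session_id": "2021-03-15_001",
}

# One human-readable JSON file stored next to each recording session.
with open("session.json", "w") as f:
    json.dump(session_metadata, f, indent=2)

# A "Metadataread" building block could then be as simple as:
def read_session_metadata(path="session.json"):
    with open(path) as f:
        return json.load(f)
```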

chrisvdt commented 4 years ago

An interesting thought: could we think in terms of object-oriented programming? First define a base class that defines the basic metadata and folder structure, with virtual functions (e.g. to read metadata and select datasets). This base class could be used within ever more complicated child classes that express different types of research, but always need to implement these base functions and metadata, thus giving users of any dataset access to these basic functions and metadata.
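
To make the idea concrete, here is a minimal sketch in Python of what such a base class might look like; the class and method names are purely illustrative, not an agreed design.

```python
from abc import ABC, abstractmethod

class BaseDataset(ABC):
    """Hypothetical base class: fixes the folder layout and the minimal
    metadata that every dataset must expose, whatever the modality."""

    def __init__(self, root):
        self.root = root  # top-level directory of the dataset

    @abstractmethod
    def read_metadata(self):
        """Return the minimal session-level metadata (project, subject, method...)."""

    @abstractmethod
    def select_datasets(self, **filters):
        """Return the recordings matching the given filters."""

class EphysDataset(BaseDataset):
    """A modality-specific child class adds details while still implementing
    the base functions above."""

    def read_metadata(self):
        ...  # e.g. parse session.json, probe files, etc.

    def select_datasets(self, **filters):
        ...  # e.g. select by probe, brain area, or stimulus
```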

SylvainTakerkart commented 4 years ago

> An interesting thought: could we think in terms of object-oriented programming? First define a base class that defines the basic metadata and folder structure, with virtual functions (e.g. to read metadata and select datasets). This base class could be used within ever more complicated child classes that express different types of research, but always need to implement these base functions and metadata, thus giving users of any dataset access to these basic functions and metadata.

I guess using a proper schema (like the JSON schema mentioned here https://github.com/INCF/neuroscience-data-structure/issues/4#issuecomment-690383638) would be a first step in the right direction, right?
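
For instance, a minimal sketch using the Python jsonschema package (field names hypothetical, as above) to validate such session-level metadata could look like this:

```python
from jsonschema import validate  # pip install jsonschema

# Minimal, hypothetical schema for the "global" session-level metadata.
session_schema = {
    "type": "object",
    "required": ["project", "subject", "method"],
    "properties": {
        "project": {"type": "string"},
        "subject": {"type": "string"},
        "method": {"type": "string"},
    },
}

metadata = {"project": "my-project-id", "subject": "mouse-017", "method": "extracellular-ephys"}

# Raises jsonschema.ValidationError if a required field is missing or has the wrong type.
validate(instance=metadata, schema=session_schema)
```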

SylvainTakerkart commented 4 years ago

In general, a question that emerges from some of the posts above seems to be "restricting to a few formats vs. leaving this open". We've had this discussion at our institute as well, and because we'd like to have support for several modalities (ephys, various forms of optical imaging, etc.), we tend towards leaving open the possibility to store the raw data in a proprietary format... our opinion is that:

samuelgarcia commented 3 years ago

About metadata: at some point we will certainly need to discuss the probe geometry.

Here is a link to a new project that handles this: https://probeinterface.readthedocs.io/en/main/

It could be embedded in our structure.
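
For example, a rough sketch of how a probe geometry could be described and saved with probeinterface, based on its documented API (exact function arguments may differ between versions):

```python
# pip install probeinterface
import numpy as np
from probeinterface import generate_linear_probe, write_probeinterface, read_probeinterface

# Describe a simple 32-channel linear probe (20 µm pitch) and map it to device channels.
probe = generate_linear_probe(num_elec=32, ypitch=20)
probe.set_device_channel_indices(np.arange(32))

# The geometry is stored as a small, human-readable JSON file that could
# live next to the raw data inside the directory structure.
write_probeinterface("probe.json", probe)
probegroup = read_probeinterface("probe.json")
```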

robertoostenveld commented 3 years ago

Regarding probes (or grids or shafts, as they are called in ECoG and sEEG respectively): for iEEG there are already some elements in the specification for that. Perhaps not as detailed as required for animal electrophysiology, but I recommend trying out how the current BIDS specification would work for some common animal probes. Being able to look at an example "dataset" (no actual data needed) with the metadata will help to identify where the current version is lacking.

bendichter commented 3 years ago

@samuelgarcia and @SylvainTakerkart I have some thoughts on standardizing based on API vs. format.

When I started with NWB, I thought every lab would have a different neurophysiology format. In fact, it's even worse: every individual researcher has a different format, and it can be hard for people to collaborate even within a lab! This is compounded as labs advance from one technology to another, and the knowledge of how to read old data is gradually lost. Trust me, the space of all neurophys data formats is enormous, and it causes a lot of friction! There are two possible approaches to wrangling this heterogeneity: standardizing via API and standardizing via data format. The API approach may seem attractive at first, because it requires the least up-front work and does not require copying any data, but it has major downsides that will become huge problems as it scales.

Don't get me wrong, NEO is a very useful, important, and high quality tool. We rely heavily on NEO for our NWB conversions, and it has allowed us to move much faster in supporting conversion for a large variety of proprietary formats. However, working with NEO is a constant development process where we are always working to support more formats and format versions.

The first problem is validation of supported formats. There are always new data formats, and new versions of old formats, and they are often not very well documented (sometimes not documented at all). Therefore, it is imperative that we be able to validate that any contributed file follows one of the supported versions of the supported file formats. Confirming this would require building validators for every allowable data format, which would be impractical for a large number of formats. In contrast, standardizing based on a small number of formats would require a manageable number of validators.

Second of all, if you are standardizing based on an API, you are locking users into a single programming language. According to our surveys, about half of neurophysiologists use MATLAB. You might consider this a forward-thinking initiative that is unconcerned with the MATLAB laggards and wants to push them into Python. I think it is a mistake for this initiative to prescribe data analysis patterns instead of responding to them, but even if you do want this format to push the field forward, you have the problem that you are locking users into Python. What if users want to use Julia, or some other language in the future? Are you going to re-create all of NEO for Julia? In NWB we have run into several applications where users want to access NWB files from outside of our supported APIs, and they have built APIs in C++, C#, and R. They were able to do this because there is a standard file format. I love Python, but I do not want us to feel bound to Python 5-10 years from now.

The third problem is that each of these varied formats has different metadata. This is really the crux of NWB, which at its core is essentially a metadata dependency tree: if you have electrophysiology voltage traces, you'll need to say which electrodes they were recorded from, which means you need an electrode table. Then each electrode needs to be assigned to an electrode group, and each group to a device, etc. This is designed to ensure that all of the metadata necessary for re-analysis is in the NWB file. It is built to handle multiple co-occurring streams of data with different time bases. Proprietary formats, on the other hand, are generally not designed to contain all of the metadata necessary for re-analysis, but rather to report all of the relevant data from a particular acquisition system. The problem is that the gap between these two sets is different for every format.
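
To illustrate what that dependency tree looks like in practice, here is a rough PyNWB sketch (argument lists abbreviated; details may differ between pynwb versions):

```python
from datetime import datetime
import numpy as np
from pynwb import NWBFile
from pynwb.ecephys import ElectricalSeries

nwbfile = NWBFile(session_description="example session",
                  identifier="session-001",
                  session_start_time=datetime.now().astimezone())

# Voltage traces require electrodes -> electrode group -> device.
device = nwbfile.create_device(name="my-probe")
group = nwbfile.create_electrode_group(name="shank0",
                                       description="example shank",
                                       location="CA1",
                                       device=device)
for i in range(4):
    nwbfile.add_electrode(x=0.0, y=float(i) * 20.0, z=0.0, imp=np.nan,
                          location="CA1", filtering="none", group=group)

region = nwbfile.create_electrode_table_region(region=[0, 1, 2, 3],
                                               description="all electrodes")
traces = ElectricalSeries(name="raw", data=np.zeros((1000, 4), dtype="int16"),
                          electrodes=region, rate=30000.0, starting_time=0.0)
nwbfile.add_acquisition(traces)
```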

The only way around these problems is to restrict to a closed set of allowable data formats that can be validated. It doesn't have to be constrained to NWB and NIX, but it does need to be constrained to some set.

There are also downsides to standardizing on the format: you need to copy the data, you need to convert it, and you may be throwing out some acquisition-system-specific data that could be important. I remember hearing of a compromise where there would be three folders: source (from the acquisition system), raw (converted to some standard), and processed data. I think this would provide the best of both worlds, because we could allow users to store their original data while providing a way to ensure that it is readable in an archive format.

> Having NWB or NIX force to have the data as float32 whereas many raw data are int16.

NWB is capable of storing data for any standard data type, and most raw data shared in NWB is int16, copied directly from the source file. In addition, NWB can apply chunk-wise lossless compression to datasets, which in our hands has reduced the dataset size by up to 66% for Neuropixel electrophysiology voltage traces. This is an HDF5 feature, so it is in theory possible for NIX as well, though I don't know whether their API exposes this feature.
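
As an illustration of the underlying HDF5 feature, here is a sketch with plain h5py (PyNWB exposes the same options through its H5DataIO wrapper; the sizes and compression ratios here are arbitrary):

```python
import numpy as np
import h5py

traces = np.random.randint(-2000, 2000, size=(300_000, 64), dtype=np.int16)

with h5py.File("traces.h5", "w") as f:
    # Chunk-wise lossless (gzip) compression: each chunk can be read and
    # decompressed independently, and the int16 dtype of the source is preserved.
    f.create_dataset("traces",
                     data=traces,
                     dtype="int16",
                     chunks=(30_000, 64),
                     compression="gzip")
```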

samuelgarcia commented 3 years ago

Hi all. @bendichter: I totally understand your thoughts about API vs format.

As a neo/spikeextractors dev, of course I like the API approach a lot; I won't develop my thoughts here.

I think the debate here is not format vs. API but which formats we allow in this BIDS: NIX/NWB, or NIX/NWB/raw?

Here are some pros/cons of adding the "raw format" to the list of possible formats.

CONS:

  1. Need to have a small txt-based header alongside the binary file with:

    • byte offset (in the file)
    • dtype
    • number of channels
    • sampling_rate
    • optionally channel gains

    This is very easy to manage, in any language! (see the sketch after this list)

PROS:

  1. parallel read/write is super easy in any language
  2. no need for a lib (h5)
  3. this would cover the main already-existing formats in the field:
    • openephys
    • spikeglx
    • blackrock
  4. spikeinterface, when doing sorting, internally copies the dataset into a binary file (this is the case for kilosort, for instance) unless it is already in binary. Having HDF5 storage for traces would lead to this: raw binary from the device > HDF5 conversion > raw binary for sorting.
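
To make point 1 of the CONS concrete, here is a minimal sketch (all filenames and header fields are illustrative) of the raw-binary-plus-small-header idea, readable with numpy alone:

```python
import json
import numpy as np

# --- writing: raw binary traces plus a tiny sidecar header -------------------
traces = np.zeros((30_000, 64), dtype="int16")   # samples x channels
traces.tofile("traces.raw")

header = {
    "byte_offset": 0,
    "dtype": "int16",
    "num_channels": 64,
    "sampling_rate": 30000.0,
    "channel_gains": [0.195] * 64,   # optional, µV per bit
}
with open("traces.json", "w") as f:
    json.dump(header, f, indent=2)

# --- reading: any language can do the equivalent of this ---------------------
with open("traces.json") as f:
    h = json.load(f)
data = np.memmap("traces.raw", dtype=h["dtype"], mode="r",
                 offset=h["byte_offset"]).reshape(-1, h["num_channels"])
```
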
bendichter commented 3 years ago

@samuelgarcia, you know there is a need for representing different time bases. That's exactly what HDF5 and NIX are built for! I can't speak to the details of NIX, but NWB can handle multiple streams that are started at potentially different times, as well as multiple disjoint segments of recording from the same device.

Let me explain some of the features of HDF5 that I prefer over raw binary. HDF5 and PyNWB do support parallel I/O. See the tutorial here. They also support features like chunking and lossless compression by chunk, which can save a lot of space and time for large datasets and are not available for stand-alone binary files. We have seen some Neuropixels datasets reduce in size by 66% when using these tools! You also don't necessarily need to use the h5 library to read HDF5 files. We are working on a Zarr library that reads HDF5 NWB files without touching h5 here. If they are non-chunked, reading an HDF5 dataset is as simple as passing the offset, shape, and data type to np.memmap. We have also developed a way to stream data directly from datasets stored on s3, which can be used to download pieces of large datasets on the DANDI archive. See details about this here.
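
For the non-chunked case, here is a rough sketch of that trick with h5py and numpy; it only works for contiguous, uncompressed datasets, and the file/dataset names are placeholders:

```python
import numpy as np
import h5py

# Create a small, contiguous (non-chunked, uncompressed) dataset for the example.
with h5py.File("contiguous.h5", "w") as f:
    f.create_dataset("traces", data=np.zeros((1000, 4), dtype="int16"))

with h5py.File("contiguous.h5", "r") as f:
    dset = f["traces"]
    offset = dset.id.get_offset()     # byte offset of the raw data within the file
    shape, dtype = dset.shape, dset.dtype

# With offset/shape/dtype in hand, the data can be mapped without the HDF5 library.
data = np.memmap("contiguous.h5", mode="r", dtype=dtype, offset=offset, shape=shape)
```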

SylvainTakerkart commented 3 years ago

just referencing another discussion (https://github.com/bids-standard/bep021/issues/1) because there's some interesting overlap with the one here!

apdavison commented 3 years ago

I think we agree that it is desirable to have data in a standardized format that allows rich metadata annotation, like NWB. The question is whether we will make more rapid progress if we require use of such a format rather than just encouraging it?

The main disadvantage of requiring it is that people who might otherwise have used a standardized directory layout with simple, minimal metadata will decide it is too much work and not share at all, or just dump everything in a zip file.

The main disadvantages of not requiring it are that (i) important metadata will be lost to the passage of time (due to data providers forgetting details, leaving the field, moving labs, etc.); (ii) the possibilities for automation of subsequent data analysis pipelines are reduced.

We could imagine having a two-tier validator: datasets using a recommended format are Tier 1 / gold. Datasets with only the source/raw format are still valid, but Tier 2 / silver.
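
A toy sketch of what such tiering logic could look like; the extension lists are placeholders, and the policy itself is exactly what is still under discussion:

```python
from pathlib import Path

RECOMMENDED = {".nwb", ".nix"}          # Tier 1 / gold: standardized, rich metadata
TOLERATED = {".raw", ".bin", ".dat"}    # Tier 2 / silver: source/raw formats only

def dataset_tier(dataset_root):
    """Return 'gold' if any recommended-format file is present, else 'silver'."""
    suffixes = {p.suffix.lower() for p in Path(dataset_root).rglob("*") if p.is_file()}
    return "gold" if suffixes & RECOMMENDED else "silver"
```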

(as an aside, referencing Ben's comment above, Neo was originally intended to be language-independent. This is why the Github repo is called "python-neo". We planned a "matlab-neo", but never had the resources to work on it. "julia-neo" would also be interesting).

satra commented 3 years ago

> The question is whether we will make more rapid progress if we require use of such a format rather than just encouraging it?

indeed i think that is the key question. the issue is reusability, and this is one of those FAIRness concepts that is often overlooked by a lab, mostly because the lab's focus is not on use by others but on its own use, and presently there is relatively little reuse of others' data (in comparison to neuroimaging). also, in neuroimaging, the adoption of NIfTI significantly enhanced reusability and is one of the key reasons why a platform like nipype exists, as people could mix and match software knowing that each of the tools could read NIfTI.

the situation with neurophysiology is undoubtedly more complex at the moment. however, the role bids and the archives have played is to push towards that common space through validation. i don't think we should allow a random format as part of a standard, otherwise what kind of a standard is it, and how does an arbitrary consumer of the data read it? so i suspect "raw" in this case is still not raw, but has some structure like dimensions, datatypes, etc. if people are going down this road, i would at least suggest considering zarr as a potential option, as ben indicated, as it still provides a bit of a model to structure the data. however, zarr presently does not have any matlab bindings, except through using python in matlab :)
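
For reference, a minimal sketch of what storing traces in zarr could look like (array name, chunking, and attribute choices are arbitrary here):

```python
import numpy as np
import zarr

traces = np.zeros((300_000, 64), dtype="int16")

# A zarr array keeps dtype, shape, chunking and compression as explicit,
# language-independent metadata (small JSON files stored next to the chunks).
z = zarr.open("traces.zarr", mode="w", shape=traces.shape,
              chunks=(30_000, 64), dtype="int16")
z[:] = traces
z.attrs["sampling_rate"] = 30000.0
```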

jgrethe commented 3 years ago

I agree that random formats should not be part of a specification. However, there can be a set of formats that are open or easily converted. Someone could still include data in a random format in a "raw" space as long as the primary data is in a usable format as Satra mentions.

jbpoline commented 3 years ago

Also agree, and I remember that some of these same discussions happened with BIDS; it was eventually decided to go for a common NIfTI format because of the arguments laid out above.

apdavison commented 3 years ago

@jgrethe there are a few tools available that can read a wide range of ephys formats, e.g. Neo, SpikeInterface, SpykingCircus, so one option would be to support any format that can be read/converted by at least two different tools (while still recommending open, standard formats like NWB).
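
As an example of that criterion in practice, here is a sketch of reading such a format with Neo (with the SpikeInterface equivalent in a comment); the file name is a placeholder and the function names follow their documentation, which may change between versions:

```python
import neo

# Neo guesses the appropriate IO class from the file / directory name.
io = neo.io.get_io("recording.ns5")        # e.g. a Blackrock file
block = io.read_block()
signal = block.segments[0].analogsignals[0]

# The same data could be read with a second, independent tool, e.g.:
#   import spikeinterface.extractors as se
#   recording = se.read_blackrock("recording.ns5")
```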

jcolomb commented 3 years ago

just want to comment again on re-usability, as I have spent the last 30 minutes trying to open a .nwb file, and failed so far.

I imagine people converting their behavior data into the NWB format with the idea of ensuring re-use, and in the end it would make it more difficult for people to re-use it. While I can see why one should invest time in using these tools, they are not suited to all types of data (while the directory structure should be).

yarikoptic commented 3 years ago

@jcolomb I share your pain, and FWIW I shared it (if I guess right the reasons for the pain) with NWB developers and they are working to address it all

bendichter commented 3 years ago

@jcolomb I'm sorry to hear you are having trouble opening an NWB file. I know that can be frustrating. I'd be happy to help if you could give me enough information to diagnose the problem, e.g. what file you are trying to open, what commands you are using, and what error you are getting. I'll send you a message on the NWB slack channel, as I think this is off-topic for the current thread.

jcolomb commented 3 years ago

thanks @yarikoptic for the laugh on this Friday morning, it does help a lot. I will of course discuss my issue in an NWB-specific place (slack); my point was not the pain of using NWB, but the adoption issue. I think we got a bit lost in ephys data details, and should also consider other data types (live and structural imaging, DNA/RNA experiments, behavior, proteomics, surveys) whose analysis is in most cases not done in MATLAB or Python.

But maybe we need to be more constructive and start collecting information: what kinds of data should the standard support, what raw data exist, what open standards exist, and what tools can read what? I am starting such a collection for a data management plan I have to write...

SylvainTakerkart commented 3 years ago

hi @jcolomb ! yes, I agree, we've collectively (I mean, in this group and at this moment) drifted towards focusing on the ephys case; that goes with the practical progress we've made and our current proposal to extend BIDS for ephys data recorded in animals... but overall, a consensus has emerged that trying to go with BIDS / extend BIDS for every modality where it's possible might be a solution worth pursuing!

so, fyi, in parallel to our BIDS extension proposal for animal ephys (BEP032), there is another one dedicated to microscopy (BEP031) which should cover most of the imaging needs; between these two BEPs, we share the need of supporting animal data in BIDS, which is now discussed at the global BIDS level here

for behavior and omics, there are also other BIDS extension proposals I think... @yarikoptic @satra ?

all this of course does not mean that BIDS is the only solution, but moving forward in practice with this solution should be beneficial for the community ;)

jbpoline commented 3 years ago

@SylvainTakerkart : not sure for behaviour, but definitely for omics : led by C. Pernet IIRC

satra commented 3 years ago

at present behavior could be encoded in nwb (https://pynwb.readthedocs.io/en/latest/overview_nwbfile.html#processing-modules - see behavior) - at least that's what we are suggesting to dandi users. regarding omics, it's at a completely different scale. bids has some support for omics (https://bids-specification.readthedocs.io/en/stable/04-modality-specific-files/08-genetic-descriptor.html), which essentially is a simple metadata structure with a pointer to an external resource housing the data. for the brain initiative, the nemo data portal is housing transcriptomics and dandi is housing some proteomics (through immunostaining via BEP032; example dataset here: https://dandiarchive.org/dandiset/000026).
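
For instance, a rough sketch of encoding a tracked position as a behavior processing module in PyNWB, following the pattern in the linked docs (arguments abbreviated; data values are placeholders):

```python
from datetime import datetime
import numpy as np
from pynwb import NWBFile
from pynwb.behavior import Position, SpatialSeries

nwbfile = NWBFile(session_description="behavior example",
                  identifier="behav-001",
                  session_start_time=datetime.now().astimezone())

position = Position(spatial_series=SpatialSeries(
    name="position",
    data=np.zeros((100, 2)),               # x, y coordinates over time
    timestamps=np.arange(100) / 30.0,      # e.g. 30 Hz tracking
    reference_frame="top-left corner of the arena"))

behavior_module = nwbfile.create_processing_module(
    name="behavior", description="processed behavioral data")
behavior_module.add(position)
```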

satra commented 3 years ago

for surveys and other kinds of data (e.g., voice recordings) that we can collect online, we have been building reproschema via the ReproNim project, a JSON-LD-based specification for both the questionnaire side and the response side. and for actigraphy and other data there are some efforts to consolidate in other projects.

SylvainTakerkart commented 3 years ago

I just want to link here to a related discussion that takes place around the BIDS specs...

https://github.com/bids-standard/bids-specification/issues/197