bids-standard / bids-specification

Brain Imaging Data Structure (BIDS) Specification
https://bids-specification.readthedocs.io/

[ENH] Proposal for multidimensional array file format #197

Open tyarkoni opened 5 years ago

tyarkoni commented 5 years ago

At the BIDS-ComputationalModels meeting, it became pretty clear that a wide range of applications require (or would benefit considerably from) the ability to read in generic n-dimensional arrays from a binary file. There are at least two major questions that should be discussed here, and then we should move to draft a PR modifying the BIDS-Raw specification:

  1. What file format should we use? This should be something generic enough that it can be easily read on all common platforms and languages. The main proposals that came up at the meeting were numpy's .npy or HDF5 containers (.h5). While .npy is technically a Python-native format, it's sufficiently simple and well-supported that there appear to be available libraries for the major languages. Please suggest other potential solutions.

  2. How and where should we represent associated metadata? The generic file format (and naming conventions, etc.) will eventually be described in the BIDS-Raw spec, alongside all of the other valid formats (.tsv, nifti, etc.). But some applications are likely to require fairly specific interpretations of the data contained in the file. There appears to be some convergence on the notion of representing the relevant metadata in relevant sections of the BIDS-Derivatives spec (or current BEPs)—i.e., every major use case would describe how data in the binary array format should be interpreted when loaded. We could also associate suffixes with use cases, so that a tool like PyBIDS can automatically detect which rules/interpretations to apply at load time. But if there are other proposals (e.g., a single document describing all use cases), we can discuss that here.

I'm probably forgetting/overlooking other relevant aspects of the discussion; feel free to add to this. Tagging everyone who expressed interest, or who I think might be interested: @johngriffiths @maedoc @effigies @yarikoptic @satra.

effigies commented 5 years ago

Note that there are two versions of the npy format, so compatibility with both version 1 and version 2 should be assessed.

My primary concern with npy is that it is not compressed; npz is just a zip of a directory of npy files, which almost certainly won't handle random read access as well as HDF5.

My primary concern with HDF5 is that it's just a container, and we will find ourselves defining formats. Perhaps just saying it contains only a single dataset named /data, or similar, will resolve that.
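For concreteness, a constrained layout along those lines might look like the sketch below (using h5py; the file name, array and compression choice are just placeholders):

```python
import h5py
import numpy as np

arr = np.random.rand(10, 64, 64)  # arbitrary example array

# Write: the file holds nothing but one dataset, here named "data"
with h5py.File("example_array.h5", "w") as f:
    f.create_dataset("data", data=arr, compression="gzip")

# Read: any HDF5 reader can grab "/data" without knowing anything else about the file
with h5py.File("example_array.h5", "r") as f:
    whole = f["data"][()]       # the full array
    one_slice = f["data"][0]    # or a random-access slice
```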

satra commented 5 years ago

in a different world, but probably related to computational models, BRAIN has funded development of the NWB standard. to the extent that needs may become similar, it may be worthwhile thinking about supporting NWB in BIDS.

this will make the metadata world both easier (included in the NWB file) and harder (non-conformant with BIDS), depending on your point of view. however, the NWB folks are also considering alternatives like exdir, which is like HDF5 but with external metadata and binary blobs stored as numpy files.

arokem commented 5 years ago

Sorry: could I ask for a bit more context? What kind of data will be stored in these files? If it's large enough to justify parallel processing of its contents, allow me to throw in a plea to consider zarr compatibility. I think that HDF5 could be made to play nice with zarr.

effigies commented 5 years ago

@satra In principle that seems fine, but their format looks basically like HDF5 + some mandatory metadata, so if flexibility is a potential downside, it persists.

If it's not a downside, then I have no principled objection.

@arokem The issue driving us here is less the size of the data than the dimensionality. That said, there's no reason the files couldn't get large enough for random and parallel access to be concerns, which is why HDF5 is my inclination (despite my above-noted reservations). The goal is wide interoperability (in particular, C, R, MATLAB and Python) and not reinventing the wheel, so if that format fits, I for one am happy to consider it.

satra commented 5 years ago

@arokem - the NWB folks are also considering zarr compatibility, especially with the N5 API, which would also constrain HDF5, since N5 doesn't support all aspects of it.

arokem commented 5 years ago

Yup. For reference: https://github.com/NeurodataWithoutBorders/pynwb/issues/230

yarikoptic commented 5 years ago

On one hand I am strongly in favor of reusing someone else's "schema" and possibly "tooling" on top of the HDF5 container! NWB might be a good choice (I do not know how well it aligns with the needs of ComputationalModels metadata). Import/export "compatibility" with other computation-oriented formats (like zarr) might be a plus.

BUT thinking in conjunction with 2. -- if we choose a "single file" container format to absorb both data and metadata, we would step a bit away from the "human accessibility" of BIDS. We already have an issue of metadata location duality, e.g. it being present in the data file headers (nii.gz) -- "for machines" -- and some (often additional, but sometimes just duplicate) metadata in sidecar files -- "for machines and humans" (related: the recent #196). Sure, the bids-validator could assure consistency, but we are subconsciously trying to avoid such redundancy, and I wonder if that might still be the way to keep going. Maybe there is a lightweight format (or some basic "schema" for HDF5) which would not aim to store any possible metadata, but just store the minimum sufficient for easy and unambiguous IO of multi-dimensional arrays (if that is the goal here). And then pair it up with a sidecar .json file for convenient access to metadata (defined in BIDS, if there is no existing schema for "ComputationalModels" elsewhere to reuse; not duplicated in the actual data file) for easy human and machine use (without requiring opening the actual data file, which would require tooling). If we end up with a single file format containing both, I think we might need to extract/duplicate metadata into a sidecar file anyway for easier human (and at times tool) consumption.

tyarkoni commented 5 years ago

@yarikoptic sorry, I realize on re-read that I wasn't clear, but your proposed approach (putting metadata in the JSON sidecar and only the raw ndarray in the binary file) is exactly what we seemed to converge on at the end of the BIDS-CM meeting. (I.e., the sidecar would supply the metadata needed to interpret the common-format array appropriately for the use case specified in the suffix.)
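For concreteness, a minimal sketch of that pairing might look like this (the file names and metadata fields here are hypothetical, not settled conventions):

```python
import json
import numpy as np

# Binary file: only the raw n-dimensional array
stimulus = np.zeros((600, 128, 128), dtype=np.float32)  # e.g. time x height x width
np.save("sub-01_task-movie_stim.npy", stimulus)

# JSON sidecar: everything needed to interpret the array, human-readable
sidecar = {
    "Dimensions": ["time", "row", "column"],  # hypothetical metadata fields
    "SamplingFrequency": 30.0,
    "Units": "normalized luminance",
}
with open("sub-01_task-movie_stim.json", "w") as f:
    json.dump(sidecar, f, indent=2)
```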

yarikoptic commented 5 years ago

@yarikoptic sorry, I realize on re-read that I wasn't clear, but your proposed approach (putting metadata in the JSON sidecar and only the raw ndarray in the binary file) is exactly what we seemed to converge on at the end of the BIDS-CM meeting. (I.e., the sidecar would supply the metadata needed to interpret the common-format array appropriately for the use case specified in the suffix.)

I am delighted to hear that, like-minded as we are, we independently decided to contribute the XXXX-th model of the wheel to humanity!

FWIW, I ran into https://news.ycombinator.com/item?id=10858189 on https://cyrille.rossant.net/moving-away-hdf5/ (even @chrisgorgo himself commented there) -- it seems a good number of groups/projects ended up switching from HDF5 to some ad-hoc data blob + metadata files "format". Maybe it would be worth stating the desired features (I think those weren't mentioned)? e.g. which among the following would be most important?

  • portability and library support -- probably a must...
  • efficient random access / slicing / ... - desired or not?
    • relates to parallel processing etc. if just a "good to have" then probably not worth jumping to anything fancy
  • memory mapping - desired or not?
  • compression - desired or not? optional?

or in other words - aiming for processing or archival? if aiming for archival - probably compression is heavily desired... maybe it could be optional (we already have both .nii and .nii.gz supported IIRC, so could be .blob[.gz])... kinda boils down to .npy - which was also the choice at https://cyrille.rossant.net/moving-away-hdf5/ ;-)

satra commented 5 years ago

@yarikoptic - be careful with that blog post (i think it leads a lot of people astray), and do read all the threads that have emanated from it. for every such use case it's easy to point to MATLAB and say that they use it for their base data format. also there are enough posts out there saying that people who moved away ended up requiring many of the facilities of hdf5 and switching back to it. finally, you should take a look at exdir and zarr as well, as pointed out in earlier comments, and at this followup to cyrille's original post and its comments, including the earliest one by Konrad Hinsen (https://cyrille.rossant.net/should-you-use-hdf5/).

at the end of the day it's mostly about blobs and metadata. what one uses to house and link these things is indeed going to keep evolving depending on use cases. so i think the important thing is to think of the use cases, in both short term and to the extent possible longer term.

i like the questions that you have raised, and i think more than the format itself, the thought process should be around those required features, including archiving.

i'm not saying hdf5 is the answer here, nor am i saying hdf5 is issue free, but i have also used it through MATLAB and Python over many years, for my use cases, without an issue. i would need to know the specific goals, applications, and use cases here to make an informed judgment.

maedoc commented 5 years ago

We've made simple use of HDF5 (often just one or two datasets) for heavy numerical data (well, MB to TBs) in TVB, a computational modeling environment, for the last 7 years, without the problems cited in Rossant's blog post, mainly by keeping usage simple and heavily vetting library usage prior to version changes. I'd expect transparent compression (lz4 has nearly no CPU overhead) and memmapping would be particularly useful for BIDS CM.

effigies commented 5 years ago

I've asked the participants in the computational models meeting to contribute their specific use cases, but I'll try to summarize according to my memory.

1) Visual stimuli, which are 2D arrays of luminance/RGB (or similar) values + time. NIfTI has been used to include these, but it's somewhat of an abuse of the format.

2) Machine-learning training corpora, which will have an item dimension that will often be shuffled on multiple training runs, and other dimensions that have meaningful structure such as space or time which should be preserved.

3) Simulation state variables. Environment states will look similar to corpora, with some spatial structure, a time dimension, and potentially many runs. Simulated states may or may not be spatially ordered, but still don't fit NIfTI well.

4) Per-ROI covariance matrices. In the general discussion of statistical outputs, per-voxel statistics are easily represented in NIfTI, and even covariance matrices can be packed into dimensions 5 and 6 of NIfTI. For ROI-based outputs, we have the morphometry and timeseries examples to go by for packing single statistics or time series into TSVs, but multiple dimensions per entry would not work easily. We can get around it by having one file per matrix, and that would presumably be an option, but for large numbers of variables or ROIs, a multidimensional array structure would be useful.
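A rough nibabel sketch of that voxel-wise packing, purely for illustration (the file name is made up):

```python
import nibabel as nib
import numpy as np

n_params = 5
# x, y, z, t(=1), then the covariance matrix packed into dimensions 5 and 6
cov = np.zeros((64, 64, 40, 1, n_params, n_params), dtype=np.float32)
img = nib.Nifti1Image(cov, affine=np.eye(4))
img.to_filename("parameter_covariance.nii.gz")
```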

I think there were a couple other examples, but as it became clear that some kind of multidimensional array would likely be the result, we did not compile a specific enumeration of all the needed properties, so hopefully we'll get some feedback.

Perhaps @maedoc can clarify the TVB uses that aren't suited to TSV/NIfTI, and what their minimal criteria and additional desiderata are.

maedoc commented 5 years ago

TVB uses that aren't suited to TSV/NIfTI

Surfaces & sparse matrices come to mind; these have straightforward serializations to arrays, so I would specify conventions for the serialization (e.g. faces, triangles, 0-based; sparse CSR, 0-based) instead of worrying about a new format.
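A rough sketch of what such serialization conventions amount to (array and file names are only illustrative):

```python
import numpy as np
from scipy import sparse

# Sparse connectivity matrix -> three flat arrays (CSR, 0-based indices)
conn = sparse.random(100, 100, density=0.05, format="csr")
np.save("conn_data.npy", conn.data)
np.save("conn_indices.npy", conn.indices)
np.save("conn_indptr.npy", conn.indptr)

# Triangulated surface -> vertex coordinates + triangle indices (0-based)
vertices = np.random.rand(1000, 3)                      # x, y, z per vertex
triangles = np.random.randint(0, 1000, size=(2000, 3))  # indices into vertices
np.save("surface_vertices.npy", vertices)
np.save("surface_triangles.npy", triangles)
```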

effigies commented 5 years ago

Surfaces will be covered in GIFTI. What do you currently use HDF5 for?

maedoc commented 5 years ago

What do you currently use HDF5 for?

We don't use HDF5 for relational metadata, which is stored in an SQL DB and sidecar XML files, but just about everything else.

effigies commented 5 years ago

Okay.

To get back to @yarikoptic's desiderata:

  • portability and library support -- probably a must...

Agreed, this is most important IMO.

  • efficient random access / slicing / ... - desired or not?
    • relates to parallel processing etc. if just a "good to have" then probably not worth jumping to anything fancy
  • memory mapping - desired or not?

I see these three as basically related. Whether you want slicing for parallel access or just to avoid loading a ton of memory, if this isn't provided, the thing people are going to do is immediately convert to something that can be chunked for good performance over the desired slices and memory-mapped. Maybe they'll do it out of love for BIDS, but conversions are an adoption hurdle, to my mind.

  • compression - desired or not? optional?

I guess I'd say it should be an option. There are dense data that are difficult to compress where mmap access is going to be a higher priority, but there's also going to be sparse data that would be ridiculous to store without compression.
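To make the trade-off concrete, assuming .npy/.npz (file names are placeholders):

```python
import numpy as np

dense = np.random.rand(100, 32, 32, 32)

# Uncompressed .npy: memory-mappable, so slicing touches only part of the file
np.save("dense.npy", dense)
view = np.load("dense.npy", mmap_mode="r")
first_volume = view[0]                        # lazily read from disk

# Compressed .npz: much smaller for sparse/repetitive data, but no mmap access
mostly_zeros = np.zeros((100, 32, 32, 32))
np.savez_compressed("mostly_zeros.npz", data=mostly_zeros)
loaded = np.load("mostly_zeros.npz")["data"]  # decompressed fully into memory
```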


I may be prematurely pessimistic, but I don't see much hope for pleasing even a simple majority of people with any of the choices discussed here. (I may be projecting and it is just the case that I won't be pleased by my prediction of the majority's choice.) Another option to consider is not requiring a specific binary format, letting apps deal with the choice, and wait for some consensus to emerge in the community. If in a few years all MD arrays are, say, .npy/.npz files, then we can just acknowledge it in BIDS 2.0.

I would then add these conditions:

  1. One MD array per file (or directory, if exdir is used)
  2. Future-proofing
    1. Open formats
    2. Optional lossless compression with an open codec
  3. Standard BIDS metadata
    1. JSON sidecars, with metadata to be defined for each data type
    2. In-file metadata must match JSON metadata where duplication occurs
maedoc commented 5 years ago

I don't see much hope for pleasing even a simple majority of people with any of the choices discussed here

JSON is hardly ideal, but once it's chosen, use cases and implementations can get done, exploring the positives/negatives of the choice. You should just declare a fiat format (import random; random.choice(…)), with the provision that other contenders will have their chance in future iterations.

effigies commented 5 years ago

Well, if we can consider JSON an acceptable choice, then I would probably just push on with .npy/.npz, for the simple reasons that it doesn't depend on a decimal serialization, it's mmap-able, can only hold one MD array (and thus doesn't permit complexity), and people have written parsers for MATLAB and R.

fangq commented 5 years ago

I just want to let everyone know I am currently working on a new neuroimaging data interchange format, called JNIfTI.

My current draft of the file specification can be found at

https://github.com/fangq/jnifti/

There you can also find a MATLAB NIfTI-1/2 to JNIfTI converter and jnii/bnii data samples:

https://github.com/fangq/jnifti/blob/master/lib/matlab/nii2jnii.m https://github.com/fangq/jnifti/tree/master/samples

The basic idea is to use JSON and binary JSON (UBJSON) formats to store complex scientific data and completely get rid of a rigid, difficult-to-extend binary header. This makes the data readable, easy to extend, and easy to mix with scientific data from other domains (like multi-modal data, physiology recordings, or computational models). There are also numerous JSON/UBJSON parsers out there, so, without writing any new code, a JNIfTI file can be readily parsed by existing code.
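As a generic illustration of the idea -- not the actual JNIfTI/JData tag names; the keys below are made up for the example -- an N-D array can be carried in plain JSON as nested lists plus explicit type/shape fields:

```python
import json
import numpy as np

arr = np.arange(24, dtype=np.float32).reshape(2, 3, 4)

doc = {
    "DataType": str(arr.dtype),  # hypothetical key names
    "Shape": list(arr.shape),
    "Values": arr.tolist(),      # nested lists, row-major
}
text = json.dumps(doc)

# Any JSON parser can round-trip the array
decoded = json.loads(text)
restored = np.array(decoded["Values"], dtype=decoded["DataType"]).reshape(decoded["Shape"])
assert np.array_equal(arr, restored)
```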

JNIfTI is designed with a compatibility layer to 100% translate the NIFTI-1/2 header/data/extension to the new structure, but once it is moved to JSON, you gain enormous flexibility to add new metadata and header info, organize multiple datasets inside one document, etc. I'd love to hear from this community what additional information is currently lacking, and I am happy to accept proposals on defining new "required" metadata headers in this format. My preference is to gradually shift the main metadata container from the NIFTIHeader structure to the "_DataInfo_" and "Properties" subfields in NIFTIData as the primary containers for metadata. This provides an opportunity to completely redesign the header entries.

https://github.com/fangq/jnifti/blob/master/JNIfTI_specification.md#structure-form

look forward to hearing from you.

PS: The foundation of the JNIfTI format is another specification called JData - a proposal to systematically serialize complex data structures, such as N-D arrays, trees, graphs, linked lists etc. The JData specification is currently in Draft 1 and can be found at

https://github.com/fangq/jdata/

CPernet commented 4 years ago

I'm also all for @yarikoptic's approach. Note that electrophys derivatives have the same issue, with processed data typically in a different format, and we need a common ground. I discussed HDF5 with @GaelVaroquaux, who has a strong opinion against it (maybe he can comment on that).

I'm sure @jasmainak already made a table of pros and cons of various formats - but I cannot find it?

CPernet commented 4 years ago

as an additional point, I was wondering if you should state somewhere in the specification that any derived data that can be stored using the native format must do so (e.g. keep nii as long as possible and do not start using whatever other format we decide to support as well)

GaelVaroquaux commented 4 years ago

I discussed HDF5 with @GaelVaroquaux, who has a strong opinion against it

I don't have a strong opinion against it. I just look at the past. A format using it was proposed years ago in the community. It was rejected by major actors because of the cost of supporting it.

effigies commented 4 years ago

@CPernet I'm not hearing anybody clamoring for HDF5, and several voices at least wary of it. My inclination at this point is to push on with .npy, since there wasn't really any push-back against it.

If we do want to resume consideration of options, I can start a list of pros/cons:

HDF5

Pros:

  • libhdf5 exists with bindings in many languages

Cons:

  • It is a generic container, so the spec would need to constrain what may go inside
  • An independent implementation (e.g. for the JavaScript validator) would be non-trivial

The former can be addressed by the spec and easily validated. And it's possible that parsing an HDF5 file with a single data blob would not be very problematic for an independent implementation.

npy

Pros:

  • Simple, openly documented format: mmap-able, holds exactly one MD array (and thus doesn't permit complexity), with parsers available for MATLAB, R and other languages

Cons:

  • Compression XOR memory mapping
  • Possible (somewhat justified) perception of Python-preference baked into standard


as an additional point, I was wondering if you should state somewhere in the specification that any derived data that can be stored using the native format must do so (e.g. keep nii as long as possible and do not start using whatever other format we decide to support as well)

I think that might be going a bit far. For instance, per-ROI time series could be encoded in NIfTI, but not very naturally. TSV would make more sense, but a strict reading of this proposed rule would lend itself to contorting to keep things in NIfTI.

But the overall sentiment seems reasonable. I think a simple statement along those lines, but with a SHOULD, such that any deviation would need to be made with good reason, would be useful guidance.

maedoc commented 4 years ago

Pro: libhdf5 exists with bindings in many languages

This is offset by HDF5 being a single, C-based, strictly versioned API/ABI implementation, e.g. a browser-based app can't ingest these files, a JVM app has to go through JNI, Julians who want pure Julia stuff won't be happy, etc.

Compression XOR memory mapping

This is offset by the simple format; asking for simple, fast & small is greedy (have you ever listened to the clock tick while running xz?)

Possible (somewhat justified) perception of Python-preference baked into standard

You don't have to call it NumPy if you reproduce the definition as part of the standard; NumPy "compatibility" falls out as a happy side effect. If the NumPy project decides to change formats down the line, you avoid another problem.

CPernet commented 4 years ago

Following @GaelVaroquaux's 'weak' opinion :-) if maintenance is an issue we should not go for HDF5. I have nothing against numpy arrays, but you have to consider that SPM is still the most used software for fMRI, that MEEG is mostly Matlab (EEGLAB, FieldTrip, Brainstorm), and many users won't be familiar with it -- if .npy, then also .mat; otherwise, a language-agnostic format.

CPernet commented 4 years ago

as an additional point, I was wondering if you should state somewhere in the specification that any derived data that can be stored using the native format must do so (e.g. keep nii as long as possible and do not start using whatever other format we decide to support as well)

I think that might be going a bit far. For instance, per-ROI time series could be encoded in NIfTI, but not very naturally. TSV would make more sense, but a strict reading of this proposed rule would lend itself to contorting to keep things in NIfTI.

But the overall sentiment seems reasonable. I think a simple statement along those lines, but with a SHOULD, such that any deviation would need to be made with good reason, would be useful guidance.

Happy with having a statement and using SHOULD (I was not actually thinking of .nii that much, but of .edf for electrophys)

gllmflndn commented 4 years ago

A few quick comments:

and compression via a zip file. I think we should aim at something a bit better than that.

effigies commented 4 years ago

@maedoc Thanks for those thoughts.

This is offset by HFD5 being a single, C-based, strictly versioned API/ABI implementation deal, e.g. a browser based app can't ingest these files, a JVM app has to go through JNI, Julians who want pure Julia stuff won't be happy, etc.

This is a pretty strong argument against HDF5, IMO. The Javascript validator is critical BIDS infrastructure, so specifying something it can't validate seems like a bad move. There are NodeJS bindings, so one option would be for the browser to warn on ndarrays and say "Use the CLI to fully validate." I don't really like it, but that's an option.

I'm not sure that a distaste for C bindings among some language partisans should be a significant criterion. It's obviously not ideal, but I don't think there are ideal solutions, here.

You don't have to call it NumPy if you reproduce the definition as part of the standard; NumPy "compatibility" falls out as a happy side effect. If NumPy project decides to change formats down the line, you avoid another problem

We haven't done something like this up to this point. Referencing existing standards has been BIDS' modus operandi, and I think changing that shouldn't be done lightly. We can specify a given version of the .npy format, if we aren't comfortable depending on their posture toward backwards compatibility.

@CPernet

have nothing against numpy arrays, but you have to consider that SPM is still the most used software for fMRI, that MEEG is mostly Matlab (EEGLAB, FieldTrip, Brainstorm), and many users won't be familiar with it -- if .npy, then also .mat; otherwise, a language-agnostic format

Unfortunately, there isn't really a language-agnostic format for basic, typed, n-dimensional arrays. .npy is probably the closest there is, and that's because it's so simple that reimplementing it in another language is very easy: Matlab, C++, R, Julia, Rust
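To give a sense of just how simple it is, a minimal pure-Python reader for an uncompressed version 1.0 .npy file fits in a dozen lines (a sketch only; a real implementation should follow the official format description and handle version 2.x as well):

```python
import ast
import struct
import numpy as np

def read_npy_v1(path):
    """Read an uncompressed .npy file, format version 1.0 only."""
    with open(path, "rb") as f:
        assert f.read(6) == b"\x93NUMPY", "not an .npy file"
        major, minor = f.read(1)[0], f.read(1)[0]
        assert (major, minor) == (1, 0), "only version 1.0 handled here"
        (header_len,) = struct.unpack("<H", f.read(2))  # little-endian uint16
        header = ast.literal_eval(f.read(header_len).decode("latin1"))
        data = np.frombuffer(f.read(), dtype=np.dtype(header["descr"]))
        order = "F" if header["fortran_order"] else "C"
        return data.reshape(header["shape"], order=order)

np.save("demo.npy", np.arange(12).reshape(3, 4))
print(read_npy_v1("demo.npy"))
```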

jasmainak commented 4 years ago

@CPernet the table in question is here, although perhaps the discussion here is already more sophisticated than what the table offers. I do remember that support for .npy in Matlab was experimental at the time we wrote the table, although this may have changed.

gllmflndn commented 4 years ago

@effigies

This is a pretty strong argument against HDF5, IMO. The Javascript validator is critical BIDS infrastructure, so specifying something it can't validate seems like a bad move. There are NodeJS bindings, so one option would be for the browser to warn on ndarrays and say "Use the CLI to fully validate." I don't really like it, but that's an option.

Not that I'm too keen on HDF5, but can't we expect this to be solved with WebAssembly? And this made me come across yet another project...

effigies commented 4 years ago

@gllmflndn


Apologies if the many responses are dominating the conversation. I'm glad to see activity here, and hope we can keep this going for a little bit.

satra commented 4 years ago

as a slight side-note, HDF5 is the underlying data format for NWB (a brain initiative standard for neurophysiology, just like BIDS is for MR) at present. we are building a brain archive around it (focused on cellular neurophysiology data: nd-arrays of various kinds, at sizes that are often 10-1000x a nifti file), so we expect some of the tooling to become easier.

the javascript validation is a bit of a concern at present, but not a full blown technological concern. there are readers (https://github.com/usnistgov/jsfive), i just don't know how robust they are. and coming from nist there may be some longer term support.

the bigger problem with HDF5 is that it is a generic container for almost anything. i can turn an entire BIDS dataset into HDF5, or a NIFTI/CIFTI file into HDF5. so from a BIDS perspective one has to consider what level of granularity a specification would entail. it's the scoping of the structure inside that counts from a specification perspective.

zarr precisely separates this: there is no metadata in the npy file, just an nd-array; how to read and interpret that array, or how to link different metadata pieces, is in the yml counterpart. this is of course similar to openjdata, with formatting differences between json+bson vs yml+npy. header + image has been around for a while in many formats, whether in a single file or not. the hard part is often what's in the header and how you lay out the blob, rather than the format itself.

CPernet commented 4 years ago

HDF5 is the underlying data format for NWB (a brain initiative standard for neurophysiology, just like BIDS is for MR)

@satra you mean just like for brain imaging EEG-BIDS MEG-BIDS iEEG-BIDS thank you :-D

fangq commented 4 years ago

@effigies, just a little clarification: OpenJData/JNIfTI is not a "brand new format"; the .jnii (text-based JNIfTI) file is basically a plain JSON file - BIDS already uses JSON and I am sure everyone already has a JSON parser (in python/pandas/perl/c/javascript/matlab/...), so you can load it directly. The JNIfTI specification does not invent a new format, but relies on an existing, widely supported format, and rather focuses on semantics - defining specific JSON name-tags to encode specific neuroimaging data, exactly like how metadata is currently encoded in BIDS. The other rationale is that JNIfTI aims for 100% compatibility with NIfTI-1/2. This way, it has minimal impact on the tool-chain downstream. By the way, the JNIfTI toolbox is now available on Fedora/NeuroFedora. I am going to package it for Debian/Ubuntu next.

regarding HDF5, being super general and versatile is, IMHO, not a drawback. What is missing is an explicit protocol like the JData specification (for JSON, which is even more general and versatile), so that one can define specific data containers for specific data types. For example, HDF5 does not directly support complex numbers - various toolboxes store a complex array as a compound dataset with "r" as the real part and "i" as the imaginary part - this is quite arbitrary and can potentially cause problems when we share data. But this can be solved by making a data specification just like JData, but for HDF5 (actually I don't see why JData's constructs cannot be extended to HDF5).
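For example, one common ad-hoc layout looks roughly like this (a sketch using h5py; the "r"/"i" field names are a convention, not part of HDF5 itself):

```python
import h5py
import numpy as np

z = np.random.rand(4, 4) + 1j * np.random.rand(4, 4)

# Store the complex array as a compound dataset with real/imaginary fields
compound = np.zeros(z.shape, dtype=[("r", "<f8"), ("i", "<f8")])
compound["r"], compound["i"] = z.real, z.imag

with h5py.File("complex_demo.h5", "w") as f:
    f.create_dataset("kspace", data=compound)

# A reader has to know the convention to reassemble the complex values
with h5py.File("complex_demo.h5", "r") as f:
    stored = f["kspace"][()]
restored = stored["r"] + 1j * stored["i"]
assert np.allclose(z, restored)
```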

here are my two cents regarding HDF5 as I kept working on the EasyH5 toolbox in the past month (for reading/writing SNIRF data):

  1. it is very fast and versatile, with complex APIs aiming for high performance for large sized binary datasets
  2. it has overhead (and overkills) for saving lightweight metadata records (like JSON)
  3. there is a learning curve on using the APIs, even for simple use-case
  4. it lacks explicit complex data structure support (unlike JSON) - by "complex data structure", I don't mean hierarchical data, but constructs (groups/datasets) to encode common data structures such as complex-valued ND arrays, sparse arrays, linked lists, trees, tables and graphs.
  5. it lacks an intuitive way to store arrays of groups (think of cell arrays or struct arrays in matlab): every group element in the HDF5 tree is a "named" object and does not associate with an index (unless one defines customized data in the attributes). in other words, compared to JSON, it only has "{}" constructs but not "[]".
  6. an annoyance in HDF5 is that data records lose their creation order and are automatically sorted alphabetically after saving/reading, unless you use specific tags and ways to read and write them.
fangq commented 4 years ago

Just want to bring this onto everyone's radar - MessagePack is a JSON-like binary format that has also attracted broad support. It supports strongly-typed binary hierarchical data (as general and versatile as JSON), with extremely fast parsers/writers for dozens of languages, and is supported in tools such as Pandas. My JSONLab toolbox also supports reading/writing msgpack files for MATLAB.

Compared to Universal Binary JSON (UBJSON, http://ubjson.org) - the choice I made for binary OpenJData - msgpack files are slightly more compact, but slightly more complex to decode/encode due to the support for single-byte data records and more data types. Still, it is a very simple construct, like JSON. I slightly lean towards UBJSON due to its human-readability (despite being binary) compared to msgpack.

However, like native JSON, it uses nested array constructs to encode N-D arrays (which is fine, as long as the reader/writer processes rectangular data). I proposed a new grammar to store packed/typed N-D arrays in msgpack, but it is still under discussion:

https://github.com/msgpack/msgpack/issues/268

satra commented 4 years ago

@CPernet

you mean just like for brain imaging EEG-BIDS MEG-BIDS iEEG-BIDS

in some ways yes, but i should emphasize - cellular neurophysiology - which EEG, MEG are not, but iEEG can come close to.

CPernet commented 4 years ago

@jasmainak table copied here

| Data format | Pros | Cons |
| --- | --- | --- |
| .mat | Open specification; well-supported I/O in both Matlab and Python | Proprietary format; allows for highly complex data structures that might need further documentation; v7.3, which is based on the HDF5 format (not proprietary), is not supported in Python |
| .npy | Open specification; well-supported I/O in Python and C++; allows only n-dimensional arrays, limited complexity and thus not easily abused | Experimental support for Matlab |
| .txt | Simple and easy I/O | Large memory footprint; inaccurate numeric representation |
satra commented 4 years ago

just a quick note that while there is no direct support like loadmat/savemat, the 7.3 matlab format can be read and written in python.

robertoostenveld commented 4 years ago

I am a bit late to this party, but reading through the thread of comments I see that a lot of considerations have been addressed and many good comments made. There seem to be two main issues that make it hard to reach consensus.

The first is whether a flat structure of a single N-D array with type, shape and other metadata (presumably small and formatted as text) is sufficient or whether a hierarchical structure is to be allowed. This is to me at the core of the discussion of npy versus hdf5.

The second is that of a more lightweight, programming-language-agnostic format versus one depending on bindings to a C library. The latter raises concerns about the data being cross-platform compatible with sandboxes and containers in more restricted compute environments (such as a web browser).

I will not attempt to resolve these two here. What I do want to bring up is the process of identifying possible formats; for the electrophysiology formats we used a poll to tap into the community's wisdom. One format that I think needs serious (re)consideration is NIfTI. The reason for the present discussion is probably that it was considered not to be suitable, but that is something I would like to question.

NIfTI allows for storing N-D arrays of most data types (although I am not sure about complex numbers) with up to 8 dimensions. Four of them are fixed: x, y, z, t, and 4 are free to be specified by the user. This was exploited in the HCP project for the CIfTI specification, which builds on nifti2 (to address the file size limitation of nifti1). Furthermore, metadata of the CIfTI specification is stored in a header extension (in this case it happens to be XML formatted), which is a well defined feature of both nifti1 and nifti2. Although the CIfTI specification speaks about a "file format", CIfTI files are technically only NIfTI files with a specific header extension, specific file naming scheme, and a .cii extension.

Are there concerns that the 8 dimensions supported by nifti are not sufficient? Or, in case the first 4 (xyzt) cannot be used, that the remaining 4 are too limited? In case the N-D structure of nifti is not appropriate because it is non-hierarchical, then that would also apply to the npy format. The main advantage of nifti is that it is a format that is already part of BIDS, we are all familiar with it, and there exists a lot of tooling for it. It also ties in with the comment of @CPernet about keeping derived data files - where possible - in their original format.

Perhaps there are good reasons discussed elsewhere why the nifti format is not appropriate. If so, I think it would be good to keep those more explicitly into account in this discussion.

Besides considering nifti for generic N-D data (with N up to 8, or up to 4 when skipping the first 4 dimensions), as exemplified by cifti, I also want to draw your attention to a "file format" that has already been adopted in the BIDS specification for physiological and other continuous recordings: a TSV without headers, accompanied by a JSON. The arguments for that are listed as improving compatibility with existing software (FSL PNM) as well as making support for other file formats possible in the future. That future might be now. I value the (historical) wisdom that went into previous decisions for the BIDS specification, and hence want to consider reusing the construct of a TSV+JSON pair, for example as an NPY+JSON pair. Similar to the "headerless" constraint on the TSV file, a constraint on the NPY descr/dtype could be imposed to exclude the possibility of storing e.g. pickled objects.
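Such a constraint would also be easy to enforce on the loading side, for example (file name only illustrative):

```python
import numpy as np

np.save("roi_timeseries.npy", np.random.rand(200, 90))    # e.g. time x ROI
data = np.load("roi_timeseries.npy", allow_pickle=False)  # refuses pickled Python objects
```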

I see great value in keeping BIDS compact; extending it with data formats that only serve specific use cases risks too many other use cases claiming the same privilege and the specification getting bloated, which may result in fracturing the specification across different development groups and/or modalities.

satra commented 4 years ago

i would strongly object to repurposing nifti arbitrarily for computational models. nifti carries a lot of baggage from functional and structural neuroimaging (e.g., x,y,z,t). if what is being simulated/computed on is a structural/functional timeseries, then by all means use Nifti, but let's not expand the scope of nifti here.

also the BIDS iEEG standard already has support for a number of formats: https://bids-specification.readthedocs.io/en/stable/04-modality-specific-files/04-intracranial-electroencephalography.html#ieeg-recording-data

so from a bids compactness standpoint, perhaps we can limit things to the set of formats that already cover a wide range.

robertoostenveld commented 4 years ago

I read up on http://reproducibility.stanford.edu/bids-computational-models-summary, but it is not clear to me whether this actually corresponds to the effort of BEP002 (link according to the bids homepage). BEP002 to me seems still closely linked to imaging, which triggered me to consider the full nifti spec.

Note that I don't want to push for nifti, just that it gets considered (and possibly rejected on basis of good reasons). Idem for the JSON+NPY pair that I also suggested above.

tyarkoni commented 4 years ago

BEP002 applies only to statistical models for imaging, and is different from BIDS-ComputationalModels.

robertoostenveld commented 4 years ago

where can I read more about BIDS-ComputationalModels? Or is it not yet in a state of a BIDS extension proposal?

effigies commented 4 years ago

@robertoostenveld There are two documents I have that came out of that meeting.

1) Summary document: https://docs.google.com/document/d/1hoLFzQYw-VqU5nuDjVz7NEOgJUGa4zmX1mA67CjM-UA
2) Some provisional BEP text: https://docs.google.com/document/d/1oaBWmkrUqH28oQb1PTO-rG_kuwNX9KqAoE9i5iDh1xw

The goal is to write up at least 2 BEPs, as well as have contributors join existing BEP efforts where there was overlap. And this issue was presumed to be too small for a BEP, so a quick (!) issue and PR would result.

There may also have been something that came out of the ModelGraph discussion that was headed up by @tyarkoni and Jon Cohen, but I don't have a link to it that I can find...

tyarkoni commented 4 years ago

There's a provisional implementation of the JSON standard in PsyNeuLink, though it doesn't include full documentation:

https://princetonuniversity.github.io/PsyNeuLink/json.html

PeerHerholz commented 4 years ago

@robertoostenveld, here's also the Pre-BEP issue and the team. @effigies I'm getting an "Access denied" for the provisional BEP text. In case it's not my fault and it's okay to be shared, could you maybe change the access rights?

effigies commented 4 years ago

@PeerHerholz Thanks for the heads up. I've made it world-commentable, though it should be understood to be more notes from the meeting than a specific proposal.

effigies commented 4 years ago

To come back to the actual thread:

I also would not like to repurpose NIfTI. Mainly because this would in effect be another custom format. If we're going to do this, we might as well use something more generic.

Regarding hierarchy, I also want to explicitly state my opposition to MDA files being hierarchical, even if the container has the capability for hierarchy. BIDS provides a great deal of hierarchy already.

Where I would like to get to is this: When I want to load an N-dimensional array, I determine its location in the filesystem according to BIDS rules. Any metadata I want about the data is in human-readable JSON sidecars. When I load the array, I unambiguously get (across languages) a single data array of a given size, shape and type.

This is most appropriately seen as an extension of the selection of TSV for 1- or 2-dimensional arrays, which obviously breaks down beyond that point.

How that last is to be done is a function of whatever format(s) we choose, but requires some restriction on almost all of them (I think the only exception being an uncompressed .npy file).

I suspect we've made the arguments there are to be made at this point, but there is no clear consensus. I'll make a few comments after this, that people can vote on with reactions:

Vote Reaction
Yes :+1:
No :-1:
Abstain :eyes:

For those less familiar with GitHub, reactions can be found in the top right of each comment:

[Screenshot: the reaction picker in the top-right corner of a GitHub comment]

Edit: I posted three proposals. I think they'll help frame the discussion of which format(s) to proceed with.

robertoostenveld commented 4 years ago

@PeerHerholz thanks for those links, that clarifies. I recommend following the BEP Lead Guidelines (link from the BIDS homepage) to ensure community feedback and consensus.

effigies commented 4 years ago

Proposal: A multidimensional array file is to contain a single, n-dimensional array.