[ENH] Accept Parquet (`<entities>_<suffix>.parquet`) as alternative to `.tsv` and `.tsv.gz` formats

effigies commented 7 months ago

Your idea

BIDS has generally followed the convention of adopting human-readable or widely-adopted standards for its files. At 1.0, we used .tsv for all tabular files except physiological and stimulus recordings, which use a headerless .tsv.gz format. In 1.9, we added a headerless motion.tsv file, which is quite large. The eye-tracking BEP (#1128) is underway, which is having to cope with some limitations in the TSV options.

As of 2024, the Apache Parquet format has been under development for over a decade. The format specification is open, and the Apache Arrow project includes native libraries or bindings for Python, MATLAB, R, Julia, Java, JavaScript and C, among others.

For data that do not benefit from human readability (TSV files > ~1k lines), Parquet offers advantages such as typed columns and chunked compression, and it avoids round-trips between floating-point and ASCII decimal representations.

I propose the following:

1) Allow .parquet files anywhere that a TSV or TSV-GZ file is currently permitted.
2) RECOMMEND to use .tsv for high-level metadata tables, such as participants.tsv, *_sessions.tsv and *_scans.tsv as well as *_channels.tsv, *_electrodes.tsv and similar metadata files.
3) Requirements on column orderings, types, uniqueness should be unchanged.
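To make points 1 and 2 concrete, here is a minimal sketch of how a validator might classify file names under this proposal. It is only an illustration: `classify_tabular_file`, the return labels, and the `METADATA_SUFFIXES` set are hypothetical names, and the sketch simplifies by treating every tabular suffix as if all three extensions were permitted for it.

```python
from pathlib import PurePosixPath

# Suffixes for which the proposal RECOMMENDS plain .tsv (illustrative subset,
# not an exhaustive list from the specification).
METADATA_SUFFIXES = {"participants", "sessions", "scans", "channels", "electrodes"}

# Longest extension first, so ".tsv.gz" is matched before ".tsv".
TABULAR_EXTENSIONS = (".tsv.gz", ".tsv", ".parquet")


def classify_tabular_file(path: str) -> str:
    """Return a hypothetical validation label for a tabular file name."""
    name = PurePosixPath(path).name
    ext = next((e for e in TABULAR_EXTENSIONS if name.endswith(e)), None)
    if ext is None:
        return "not-tabular"
    # The BIDS suffix is the last underscore-separated token before the extension.
    suffix = name[: -len(ext)].rsplit("_", 1)[-1]
    if suffix in METADATA_SUFFIXES and ext != ".tsv":
        return "allowed-but-tsv-recommended"
    return "allowed"


for example in (
    "sub-01/func/sub-01_task-rest_physio.parquet",
    "participants.parquet",
    "sub-01/sub-01_scans.tsv",
):
    print(example, "->", classify_tabular_file(example))
```

Point 3 is deliberately absent from the sketch: column ordering, types and uniqueness would be checked on the table contents, independently of the container format.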

This is pulled out of https://github.com/bids-standard/bids-specification/issues/197, which is about N-dimensional data. I am excerpting the relevant recent posts here:

@satra (https://github.com/bids-standard/bids-specification/issues/197#issuecomment-1941761949)

it may be good to revive this discussion as i'm seeing a few upcoming use cases that will require a more sophisticated consideration for many things that are now in TSVs. here is a temporary proposal to narrow down the conversation.

apache parquet for table like formats (the reason i'm separating this out is that there are significant efficiencies in not considering this a subset of n-d array). ...

@effigies (https://github.com/bids-standard/bids-specification/issues/197#issuecomment-1941816055)

I am +1 for parquet to be adopted for any TSV data files (physio, stim, motion, blood). It's an open spec with broad implementation and readily available command-line tools for inspection. I think it should probably be discouraged if not prohibited for metadata files (participants.tsv, samples.tsv, sessions.tsv and scans.tsv, electrodes.tsv, channels.tsv), which benefit from human readability. I think it will often be a poor choice for events.tsv, but I wouldn't rule it out.

I am not sure that there is an actual "to-do" here for N-dimensional named arrays except to adopt them in principle so that a BEP that needs this structure can use it. I do not think there is any call to allow an events.zarr file with 2D onsets or 3D durations. HDF5 and Zarr are both already present in NWB, SNIRF and OME-Zarr.

@bendichter (https://github.com/bids-standard/bids-specification/issues/197#issuecomment-2053852626)

Another +1 for the usage of parquet for tabular data, e.g. physio, stim, motion, etc. I like to call these types of data "measurements" and call e.g. participants.tsv, electrodes.tsv, etc. "records." The current TSVs have some problems that are limiting for measurements:

- you need to truncate decimals which means you lose precision
- they are very space-inefficient
- you don't have direct/random access to data

Gzipping the TSVs doesn't really solve any of these issues. Parquet is more performant in read, write, and storage volume, and is an open standard with large cross-platform support.
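The precision and size points can be illustrated with a small standard-library sketch. It does not use Parquet itself; `struct` stands in for any format that stores float64 values in binary, and the helper names and the `.6f` format are assumptions chosen for illustration.

```python
import csv
import io
import struct

values = [0.1 + 0.2, 1.0 / 3.0, 2.0 ** 0.5]


def tsv_round_trip(vals, fmt):
    """Write floats as text with the given format spec, then parse them back."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    for v in vals:
        writer.writerow([format(v, fmt)])
    buf.seek(0)
    return [float(row[0]) for row in csv.reader(buf, delimiter="\t")]


def binary_round_trip(vals):
    """Store floats as little-endian float64 (8 bytes each) and read them back."""
    payload = struct.pack(f"<{len(vals)}d", *vals)
    return list(struct.unpack(f"<{len(vals)}d", payload))


truncated = tsv_round_trip(values, ".6f")   # typical truncated decimals
full_text = tsv_round_trip(values, ".17g")  # 17 significant digits
binary = binary_round_trip(values)

print("truncated text is exact:", truncated == values)
print("17-digit text is exact: ", full_text == values)
print("binary is exact:        ", binary == values)
print("characters per value at 17 digits:", [len(format(v, ".17g")) for v in values])
```

Text can preserve a float64 exactly if 17 significant digits are written, but each value then takes more than twice the 8 bytes of its binary form, so the precision and space problems trade off against each other.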

We are looking at adopting BIDS for neurophysiology applications. Without a binary-style filetype option, we would need to convert our efficient data storage solutions into TSV which is a much less efficient/performant file type than the current solution. Being able to use parquet for physio etc. would make me much more comfortable with adopting BIDS.

bendichter commented 7 months ago

Well put. I would just add that the new PR to include HDF5 and/or Zarr (https://github.com/bids-standard/bids-specification/pull/1614), plus your comments about how this could be extended to handle tabular data (https://github.com/bids-standard/bids-specification/pull/1614/files#r1499517851), ameliorates most of my concerns. I still like parquet here, but I also feel that my concerns could be mostly addressed by using HDF5 and/or Zarr in the way you propose there.

oesteban commented 7 months ago

I am +1 on the proposal and +1 on how Chris describes it (totally agree with that "Well put" by @bendichter).

Just a nuance:

The eye-tracking BEP (#1128) is underway, which is having to cope with some limitations in the TSV options.

The problems with TSV in BEP 020 have more to do with the not-quite-explicit (but not merely implicit) requirement in BIDS that TSV.GZ files MUST encode only continuous, regularly sampled, single-epoch data. This could easily be worked around by:

- Allowing 'SamplingFrequency' to take a value other than a number, to signal that the file is not regularly sampled
- Adding an index column (like the 'onset' column of events files) to encode the sampling points

I would imagine that this issue is orthogonal to the actual data format.
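As a rough sketch of that workaround, the reader below derives sample times either from a numeric 'SamplingFrequency' or from an index column. Everything specific here is an assumption made for illustration: the `"n/a"` marker, the `timestamp` column name, the sample values, and the `sample_times` helper are hypothetical and not part of BIDS or BEP 020.

```python
import csv
import gzip
import io

# Hypothetical sidecar: a non-numeric SamplingFrequency signals irregular sampling.
sidecar = {
    "SamplingFrequency": "n/a",
    "Columns": ["timestamp", "x", "y"],
}

# A headerless, gzipped TSV built in memory with irregularly spaced samples.
rows = [(0.000, 0.51, 0.49), (0.004, 0.52, 0.48), (0.013, 0.55, 0.47)]
raw = gzip.compress(
    "".join(f"{t}\t{x}\t{y}\n" for t, x, y in rows).encode("utf-8")
)


def sample_times(gz_bytes, meta, start_time=0.0):
    """Return one time per row, from the sampling rate or from the index column."""
    with gzip.open(io.BytesIO(gz_bytes), mode="rt", encoding="utf-8", newline="") as fh:
        table = [[float(c) for c in row] for row in csv.reader(fh, delimiter="\t")]
    freq = meta["SamplingFrequency"]
    if isinstance(freq, (int, float)):
        # Regular sampling: times are implied by the row index.
        return [start_time + i / freq for i in range(len(table))]
    # Irregular sampling: times are read from the (hypothetical) index column.
    idx = meta["Columns"].index("timestamp")
    return [row[idx] for row in table]


print(sample_times(raw, sidecar))
```

The same logic would apply unchanged to a Parquet table with a `timestamp` column, which is in line with the point that the sampling question is separate from the file format.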

bendichter commented 7 months ago

I agree with @oesteban that an optional timestamps column would be helpful, though I think that's a separable issue from the file type discussion. Maybe we could discuss it in a new issue?

[ENH] Accept Parquet (`<entities>_<suffix>.parquet`) as alternative to `.tsv` and `.tsv.gz` formats #1792
