Open effigies opened 7 months ago
Well put. I would just add that the new PR to include HDF5 and/or Zarr (https://github.com/bids-standard/bids-specification/pull/1614), plus your comments about how this could be extended to handle tabular data (https://github.com/bids-standard/bids-specification/pull/1614/files#r1499517851), ameliorates most of my concerns. I still like parquet here, but I also feel that my concerns could be mostly addressed by using HDF5 and/or Zarr in the way you propose there.
I am +1 on the proposal and +1 on how Chris describes it (I totally agree with that "Well put" by @bendichter).
Just a nuance:
The eye-tracking BEP (#1128) is underway, which is having to cope with some limitations in the TSV options.
The problems with TSV in BEP 020 are more about the not-very-explicit-but-not-implicit enforcement in BIDS that TSV.GZ files MUST encode only continuous, regularly sampled, single-epoch data. This could easily be worked around by:
I would imagine that this issue is orthogonal to the actual data format.
I agree with @oesteban that an optional timestamps column would be helpful, though I think that's a separable issue from the file type discussion. Maybe we could discuss it in a new issue?
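To make the separable point concrete, here is a minimal sketch of why a headerless, regularly sampled table cannot represent gaps without an explicit timestamps column. It assumes the row time is reconstructed from sidecar-style `SamplingFrequency` and `StartTime` values; the function name and the sample values are hypothetical and not taken from any BEP.

```python
def implicit_times(n_samples, sampling_frequency, start_time=0.0):
    """Row times implied by regular, continuous sampling (no gaps)."""
    return [start_time + i / sampling_frequency for i in range(n_samples)]


# Hypothetical recording at 250 Hz with a pause after the third sample,
# i.e. two epochs rather than one continuous run.
explicit = [0.000, 0.004, 0.008, 0.250, 0.254]

# What a reader would infer if only SamplingFrequency and StartTime were known.
implied = implicit_times(len(explicit), sampling_frequency=250.0)

# Largest disagreement between the true and the implied time of a row.
max_error = max(abs(a - b) for a, b in zip(explicit, implied))
print(f"largest timing error without a timestamps column: {max_error:.3f} s")
```

The mismatch arises in the same way whether the rows are stored as TSV, TSV.GZ or Parquet, which is consistent with treating the timestamps question separately from the file type.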
Your idea
BIDS has generally followed the convention of adopting human-readable or widely-adopted standards for its files. At 1.0, we used .tsv for all tabular files except physiological and stimulus recordings, which use a headerless .tsv.gz format. In 1.9, we added a headerless motion.tsv file, which is quite large. The eye-tracking BEP (#1128) is underway, which is having to cope with some limitations in the TSV options.
In 2024 we now have over a decade of development of the Apache Parquet format. The format specification is open, and there is a project (Arrow) which includes native libraries or bindings for Python, MATLAB, R, Julia, Java, JavaScript and C, among others.
For data that do not benefit from human readability (TSV files > ~1k lines), Parquet offers advantages such as typed columns, chunked compression, as well as not requiring round-trips between floating point and ASCII decimal representations.
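The floating-point round-trip point can be illustrated without Parquet at all. The sketch below uses Python's `struct` module as a stand-in for a binary double column and a hypothetical writer that emits six decimals; it is an illustration of the general issue, not of how any BIDS tool actually writes TSV.

```python
import struct

value = 0.1 + 0.2  # stored as the float64 0.30000000000000004

# Text path: a hypothetical TSV writer that formats with six decimals.
as_text = f"{value:.6f}"
from_text = float(as_text)

# Binary path: the 8 bytes of the float64 are stored unchanged.
as_bytes = struct.pack("<d", value)
from_bytes = struct.unpack("<d", as_bytes)[0]

# Text is only lossless when the writer emits enough digits
# (17 significant digits always suffice for float64).
as_full_text = f"{value:.17g}"
from_full_text = float(as_full_text)

print(from_text == value)       # short text form changes the value
print(from_bytes == value)      # binary form preserves it
print(from_full_text == value)  # full-precision text preserves it
print(len(as_full_text), "characters vs", len(as_bytes), "bytes")
```

A text file can therefore be exact, but only if every writer is careful about precision, and the exact form costs more characters per value than the binary one.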
I propose the following:
1) Allow .parquet files anywhere that a TSV or TSV-GZ file is currently permitted.
2) RECOMMEND to use .tsv for high-level metadata tables, such as participants.tsv, *_sessions.tsv and *_scans.tsv, as well as *_channels.tsv, *_electrodes.tsv and similar metadata files.
3) Requirements on column orderings, types, uniqueness should be unchanged.
This is pulled out of https://github.com/bids-standard/bids-specification/issues/197, which is about N-dimensional data. I am excerpting the relevant recent posts here:
@satra (https://github.com/bids-standard/bids-specification/issues/197#issuecomment-1941761949)
@effigies (https://github.com/bids-standard/bids-specification/issues/197#issuecomment-1941816055)
@bendichter (https://github.com/bids-standard/bids-specification/issues/197#issuecomment-2053852626)
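As a rough illustration of points 1 and 2 of the proposal above, a validator-style check might look like the following. Everything here is hypothetical: the function names, the messages and the list of metadata suffixes are assumptions for the sketch, and it deliberately ignores which of .tsv and .tsv.gz the current specification permits for a given file.

```python
from pathlib import PurePosixPath

# Extensions treated as interchangeable tabular containers (point 1).
TABULAR_EXTENSIONS = (".tsv", ".tsv.gz", ".parquet")

# High-level metadata tables for which .tsv stays RECOMMENDED (point 2).
METADATA_SUFFIXES = ("participants", "sessions", "scans", "channels", "electrodes")


def split_tabular(name):
    """Return (stem, extension) for a tabular file name, else None."""
    for ext in sorted(TABULAR_EXTENSIONS, key=len, reverse=True):
        if name.endswith(ext):
            return name[: -len(ext)], ext
    return None


def check(path):
    """Classify a file path under the proposed rules."""
    parts = split_tabular(PurePosixPath(path).name)
    if parts is None:
        return "not tabular"
    stem, ext = parts
    suffix = stem.rsplit("_", 1)[-1]  # last underscore-separated entity
    if suffix in METADATA_SUFFIXES and ext != ".tsv":
        return "allowed, but .tsv is RECOMMENDED"
    return "allowed"


for example in (
    "participants.parquet",
    "sub-01/sub-01_scans.tsv",
    "sub-01/func/sub-01_task-rest_physio.parquet",
    "sub-01/func/sub-01_task-rest_bold.nii.gz",
):
    print(example, "->", check(example))
```

Point 3 needs no code of its own in this sketch: column-level requirements would be checked on the loaded table regardless of which container it came from.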