bids-standard / bids-2-devel

Discussions and suggestions of backwards incompatible changes to BIDS
https://bids.neuroimaging.io/
Creative Commons Attribution 4.0 International

Uniformize .tsv and .tsv.gz: both to have header + both have "Columns" dictionary (!not list) in sidecar .json #71

Open yarikoptic opened 2 months ago

yarikoptic commented 2 months ago

This issue collates two aspects, but I think that is warranted. If desired, we could split it into two.

BIDS 1.x situation

ATM, .tsv.gz files are not just compressed .tsv files, the way .nii.gz relates to .nii: they are *special* in that they do not carry a header as .tsv files do.

The "specialty" extends into side-car .json files

If I got it right (@effigies can correct me), the header was excluded from .tsv.gz as "not readily readable". Maybe some folks also remember further details? IMHO that argument is weak, since it is just a matter of an adequate "file opener" abstraction, as is done e.g. in Python. But even if we set that aspect aside, I think we would benefit from a more harmonious approach, which might require only one extra check in the validator:

BIDS 2 proposal

  1. both the .tsv and .tsv.gz should carry a header. .gz would only signal compression.
  2. both .tsv.gz and .tsv should be supported interchangeably across uses
    • it would be up to the user to choose the most appropriate form for the use case
    • it would be RECOMMENDED to use the .tsv form where immediate user readability is desired (subjects.tsv, sessions.tsv, etc.) unless the file is prohibitive in size (e.g. a subjects.tsv for 10000 subjects with 100 columns, or something like that)
  3. the .json for either .tsv or .tsv.gz MAY describe columns within a Columns field, which would be a dict of records conforming to the current set of fields we reserve for the .json sidecars of .tsv files, plus 1 OPTIONAL field (which may be RECOMMENDED for .tsv.gz): Index, which would provide ordering information. bids-validator could easily ensure that it corresponds to the column order in the .tsv or .tsv.gz.
    • I didn't look into the JSON specification/libraries to see whether we could also simply "enforce" that the key order corresponds to Index, i.e. whether dicts are ordered as they now are in Python.
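A minimal sketch of how points 1 and 3 could fit together, assuming Python. The helper names `open_tsv`, `read_header` and `check_columns` are hypothetical, and the sidecar layout used here (a `Columns` dict keyed by column name, with a 0-based `Index`) is only an assumption for illustration, not settled spec:

```python
import gzip
from pathlib import Path


def open_tsv(path):
    """Open a .tsv or .tsv.gz file as text; .gz only signals compression."""
    path = Path(path)
    if path.suffix == ".gz":
        return gzip.open(path, "rt", encoding="utf-8", newline="")
    return open(path, "r", encoding="utf-8", newline="")


def read_header(path):
    """Return the column names from the first line of the file."""
    with open_tsv(path) as f:
        return f.readline().rstrip("\r\n").split("\t")


def check_columns(header, sidecar):
    """Return a list of mismatches between the header and sidecar "Columns".

    Assumes "Columns" is a dict keyed by column name and that the
    OPTIONAL "Index" is a 0-based position (both are assumptions).
    """
    problems = []
    for name, meta in sidecar.get("Columns", {}).items():
        if name not in header:
            problems.append(f"{name}: described in sidecar but not in header")
        elif "Index" in meta and header.index(name) != meta["Index"]:
            problems.append(
                f"{name}: Index {meta['Index']} but header position {header.index(name)}"
            )
    return problems
```

With something like this, the "1 extra check" reduces to `check_columns(read_header(path), sidecar)` returning an empty list, regardless of whether `path` ends in .tsv or .tsv.gz.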

Cons

Pros

arnodelorme commented 2 months ago

I would vote for uniformity between .tsv and .tsv.gz

effigies commented 2 months ago

.tsv.gz was a pragmatically useful choice for working with existing non-BIDS tools (e.g., FSL's PNM) that expected separately-entered column identification and accepted compressed data. One possibility to compromise here would be something like:

| Extension | Headers | Compression | Examples |
|---|---|---|---|
| .tsv | First line | None | events.tsv |
| .tsv.gz | First line | gzip | New |
| .bare.tsv | Sidecar | None | motion.tsv -> motion.bare.tsv.gz |
| .bare.tsv.gz | Sidecar | gzip | physio.tsv.gz -> physio.bare.tsv.gz |
| .parquet | In-file | Optional | New |

We could state that any of these is acceptable (perhaps with a preference in some use cases), assume people will use one that matches the typical use case for their dataset, and make a simple tool available to convert among them.
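As a rough illustration of how small such a conversion tool could be, here is a sketch in Python covering only the headered and "bare" TSV variants (not .parquet). The function names are hypothetical, and the column names are assumed to come from, or go back into, the sidecar:

```python
import gzip
import shutil


def _open(path, mode):
    """Open plain or gzip-compressed text, chosen by the .gz suffix."""
    if str(path).endswith(".gz"):
        return gzip.open(path, mode + "t", encoding="utf-8", newline="")
    return open(path, mode, encoding="utf-8", newline="")


def bare_to_headered(bare_path, columns, out_path):
    """Write out_path as a header line followed by the rows of bare_path.

    `columns` is the ordered list of column names taken from the sidecar.
    """
    with _open(bare_path, "r") as src, _open(out_path, "w") as dst:
        dst.write("\t".join(columns) + "\n")
        shutil.copyfileobj(src, dst)


def headered_to_bare(path, out_path):
    """Drop the header line; return the column names for the sidecar."""
    with _open(path, "r") as src, _open(out_path, "w") as dst:
        columns = src.readline().rstrip("\r\n").split("\t")
        shutil.copyfileobj(src, dst)
    return columns
```

Compression is decided independently for input and output by the file suffix, so the same two functions would cover all four TSV rows of the table.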