Relax extension requirements, maybe rename to formats

TheChymera commented 8 months ago

With some file formats supporting data across modalities (any volumetric data can be NIfTI, any raster image can be TIFF, anything at all can be ZARR), I wonder if it makes sense to restrict these “extensions”.

I'm also wondering whether the terminology shouldn't be renamed to “formats”.

More generally, I'm also not sure why the emergence of a new format would need to be “accepted” by BIDS first before a dataset using it can be BIDS-compliant. Is there any reason why we would ever say no? If not, why not allow any data format?

I'm mentioning data format specifically, because for metadata files, which BIDS as a standard controls the contents of, we can't just have people using participants.xlsx. But BIDS does not control the analysis of TIFF, or NWB, or MNAF (my new amazing format), so why not let people use whatever fits their use case?

I see some utility in discouraging bad practices, such as proprietary or .m files for everything, or compressed .jpeg for optical imaging — so maybe allowing anything would go too far. But in any case I think open formats with no compression could be globally accepted.

sappelhoff commented 8 months ago

Is there any reason why we would ever say no? If not, why not allow any data format?

the main reason is to provide an incentive for people to flock around a few data formats and make them better ... as opposed to everyone "brewing their own thing". The latter increases the burden of maintaining IO software, and also the burden on end users, who have to "know" more data formats.

TheChymera commented 8 months ago

the main reason is to provide an incentive for people to flock around a few datasets and make them better

Did you mean “data formats” instead of “datasets”? If not I don't think that's something I ever saw as a goal of BIDS, i.e. consolidating a small number of datasets as opposed to allowing better access to as many as possible — nor do I see what it has to do with data formats.

If you meant “data formats”... it's not just that we're limiting the total number of formats we support, but also constraining them per modality for reasons that are more historical than anything. In a sense that constrains the IO software. Lots of data can be represented as NIfTI, and thereby analyzed with the rich NIfTI tools, so why restrict that? Blanket permitting all/some formats would allow formats to better spread across use cases based on the tooling support they have.

sappelhoff commented 8 months ago

Did you mean “data formats” instead of “datasets”?

yes, sorry.

but also constraining them per modality for reasons that are more historical than anything.

I wouldn't say that we do that for "historical" reasons. To my understanding we do that to reflect the most common practices in the field where a particular modality is used. For example, NIfTI is used in MRI ... but not in EEG, even though you probably could somehow encode EEG data in NIfTI.

Blanket permitting all/some formats would allow formats to better spread across use cases based on the tooling support they have.

yes, but it will also invite edge cases, where a single dataset curator is exceptionally well versed in a particular data format and uses/applies it ... however the large majority of the community won't be able to use it because they lack the tools/skills.

I am playing a bit of a devil's advocate here. I personally don't have a big horse in this race. But I do think that fewer, rather than more, data formats are a good idea. I am saying this coming from a project like MNE-Python, where every few months somebody requests support for yet another data format that is entirely unnecessary, as the data could be represented in an already existing (open) format.

yarikoptic commented 8 months ago

The whole point of any standardization is to minimize variability. BIDS minimized variability not only in how people name their files, but also in which file formats they use. Hence you, @TheChymera, can always open participants.tsv and not some participants.xyz of an unknown nature. Allowing any file format immediately opens up unlimited variability, and thus makes the standard much less valuable. Hence in BIDS we limit ourselves to the most common format(s).

Someone in turn could establish some "BIDS naming convention" or "BIDS naming principles" which would allow arbitrary file formats and just promote use of the schema and the rest of the logic behind file organization. But it would be a different project.

I think it is time for us to add some indicators to issues so we could get some kind of a sense on which ones to keep open or close, so :-1: this one as IMHO I do not think it would be wanted/result in being implemented. apparently I have already added on that 7 months ago in README.md ;)

poldrack commented 8 months ago

+1 to closing this.

TheChymera commented 8 months ago

can always open participants.tsv and not some participants.xyz of an unknown nature

@yarikoptic but that's exactly not what I meant. I was referring to data formats specifically. I even gave the exact same example:

I'm mentioning data format specifically, because for metadata files, which BIDS as a standard controls the contents of, we can't just have people using participants.xlsx.

Yes, the metadata files, like the file naming conventions, are optimized for easy browsing, readability, and (maybe on purpose maybe incidentally) are very convenient to manipulate with GNU coreutils or other ubiquitous CLI packages. The point is data files are different because they require additional tooling anyway:

Proposal 1: So if we already “support” a format, why not support it wherever the experimenter might find it useful?
Proposal 2: If we do not support a format yet, why not auto-support any data format? Do we have any examples where we have vetoed a file format?

I already mentioned that proposal 2 was probably not as good as proposal 1, because there are some reasons to exclude e.g. proprietary formats.
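
For illustration, the difference between a per-modality restriction and proposal 1 can be sketched as two lookup rules. This is only a minimal sketch: the table below is a made-up toy mapping, not the actual BIDS schema, and both function names are hypothetical.

```python
# Toy allow-list: modality -> accepted data-file extensions.
# Illustrative assumption only; this is NOT the real BIDS schema.
ALLOWED_BY_MODALITY = {
    "mri": {".nii", ".nii.gz"},
    "microscopy": {".tif", ".ome.zarr"},
}


def allowed_per_modality(modality: str, extension: str) -> bool:
    """Per-modality rule: a format is valid only where it is explicitly listed."""
    return extension in ALLOWED_BY_MODALITY.get(modality, set())


def allowed_anywhere(extension: str) -> bool:
    """Proposal 1: a format accepted for any modality is accepted for all."""
    all_extensions = set().union(*ALLOWED_BY_MODALITY.values())
    return extension in all_extensions


if __name__ == "__main__":
    # The NIfTI-for-microscopy case from this thread.
    print(allowed_per_modality("microscopy", ".nii.gz"))
    print(allowed_anywhere(".nii.gz"))
```

Under the toy table, NIfTI for microscopy is rejected by the per-modality rule but accepted under proposal 1, while a format listed nowhere is rejected by both. Proposal 2 would amount to dropping the lookup altogether.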

@sappelhoff

To my understanding we do that to reflect the most common practices in the field where a particular modality is used.

But should a dataset be “invalid” for using an uncommon practice, even if it's still open source and useful to the experimenter?

yes, but it will also invite edge cases, where a single dataset curator is exceptionally well versed in a particular data format and uses/applies it ... however the large majority of the community won't be able to use it because they lack the tools/skills.

Isn't that an edge case we want? Think of the following: MRI expert wants to integrate data with microscopy, and use NIfTI for everything, including the microscopy data, so it can all be handled in the same space with the same tools. Why block that?

I am playing a bit of a devil's advocate here. I personally don't have a big horse in this race. But I do think that fewer, rather than more, data formats are a good idea.

In a sense, that's addressed by proposal 1. The guy from the NIfTI example is me. I'd like to use NIfTI for more things. Broader acceptance of formats that are already accepted could lead to consolidation around fewer formats. I also think there are other people who would like to .zarr everything.

sappelhoff commented 8 months ago

Do we have any examples where we have vetoed a file format?

during several BEP processes (e.g., EEG, iEEG) several file formats have been vetoed

TheChymera commented 8 months ago

@sappelhoff oh, I was unaware, thanks for telling me. Do you remember which ones they were or have a link to the discussions? I'm curious what demonstrably disqualifies a format.

sappelhoff commented 8 months ago

Most of these discussions happened on the old BEP006 Google Doc, and there was a community survey in 2018 about the data formats used in the community. The survey results used to be reported here: https://bids.berkeley.edu/news/bids-megeegieeg-data-format-survey. Unfortunately this archive did not preserve images: https://web.archive.org/web/20230130152808/https://bids.berkeley.edu/news/bids-megeegieeg-data-format-survey ... but perhaps you can do some digging and find something.

I'm curious what demonstrably disqualifies a format.

we wanted the file formats to:

have an open specification and be usable (in terms of a potential license)
be widely used in the community (no niche formats)
be able to hold most data the community would possibly want to save (and all data considering the union of all accepted formats)
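
The three criteria above could be written down as a simple checklist, roughly as follows. This is a hedged sketch, not how any BEP actually evaluated formats: the `FormatCandidate` class, the `failed_criteria` function, and the values given for "MNAF" (the hypothetical format mentioned earlier in this thread) are all assumptions made for illustration. The "union of all accepted formats" part of the third criterion concerns the whole set of formats and is not modeled here.

```python
from dataclasses import dataclass


@dataclass
class FormatCandidate:
    """Hypothetical record of how a format fares against the criteria."""
    name: str
    open_specification: bool  # the specification is openly available
    usable_license: bool      # the license permits practical use
    widely_used: bool         # not a niche format
    covers_most_data: bool    # can hold most data the community wants to save


def failed_criteria(candidate: FormatCandidate) -> list[str]:
    """Return the labels of the criteria the candidate does not meet."""
    checks = {
        "open specification": candidate.open_specification,
        "usable license": candidate.usable_license,
        "widely used in the community": candidate.widely_used,
        "holds most community data": candidate.covers_most_data,
    }
    return [label for label, ok in checks.items() if not ok]


if __name__ == "__main__":
    # Made-up values for a brand-new, open, but not yet adopted format.
    mnaf = FormatCandidate(
        name="MNAF",
        open_specification=True,
        usable_license=True,
        widely_used=False,
        covers_most_data=True,
    )
    print(failed_criteria(mnaf))
```

With these made-up values, the new format fails only on community adoption, which is the criterion that no newly emerged format can meet on day one.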

bids-standard / bids-2-devel

Relax extension requirements, maybe rename to formats #64