bids-standard / bids-2-devel

Discussions and suggestions of backwards incompatible changes to BIDS
https://bids.neuroimaging.io/
Creative Commons Attribution 4.0 International

Use-case(s): BIDS-inspired/like standards #62

Open yarikoptic opened 4 months ago

yarikoptic commented 4 months ago

Quite often projects do not adopt BIDS due to complexity, imperfect fit, (???) and establish new standards/file layouts while saying they are "BIDS-like". The likeness varies greatly. Quite often it extends only to having folders and file names with some metadata in them; those approaches are probably not worth listing. But there is a good number of BIDS-like standards (in my words: formalized descriptions adopted by a considerable number of people) which are worth reviewing and analyzing for what could minimize divergence between them and current BIDS, possibly by introducing missing but reasonable and desired features into BIDS 2.0.

This issue will be used to collect pointers and, where possible, to summarize (or point to summaries of) the rationale behind each one.

DANDI layout

Established by me and @satra for https://dandiarchive.org, primarily due to complete lack of usable .

Notable divergences:

PsychDS

https://psych-ds.github.io/ https://github.com/psych-ds/psych-DS (attn @mekline and @bleonar5 - would appreciate details/feedback alike for DANDI here or in a dedicated issue/doc)

NeuroBlueprint

https://neuroblueprint.neuroinformatics.dev/specification.html . Request for summarization of rationale/divergences: https://github.com/neuroinformatics-unit/NeuroBlueprint/issues/51

SPARC Data Structure (SDS)

The SPARC platform has a preprint online that describes a BIDS inspired data structure: https://www.biorxiv.org/content/10.1101/2021.02.10.430563v2. Rough details are also on their wiki.

TemplateFlow

https://www.templateflow.org/usage/archive/#acceptable-data-types

A related discussion "has happened" in

Brain-Development.org Atlas

https://brain-development.org/brain-atlases/atlases-from-the-dhcp-project/cortical-surface-template/ describes itself as "using BIDS conventions", and proceeds to define custom entities and metadata.

NiPoppy

A study-level layout which includes a BIDS dataset and uses some conventions (like a derivatives/ subfolder with a more clearly defined naming convention)...

satra commented 4 months ago

ping @tgbugs for SPARC info on bids-like

dorahermes commented 4 months ago

SPARC Data Structure (SDS)

The SPARC platform has a preprint online that describes a BIDS inspired data structure: https://www.biorxiv.org/content/10.1101/2021.02.10.430563v2. Rough details are also on their wiki.

edit by @yarikoptic : thanks -- added

satra commented 4 months ago

also pinging @saskiad and @dyf for bids inspired data container for the allen institute for neural dynamics

yarikoptic commented 4 months ago

also pinging @saskiad and @dyf for bids inspired data container for the allen institute for neural dynamics

after review with @dyf we agreed that it was a little too distant from BIDS, at most indeed just "inspired" ;), so we have it as

satra commented 4 months ago

that's good. although, i'm not quite sure that use-case is too distant. practically speaking, we run into that issue with ukbiobank and any of the large datasets where we need to process a few subjects with bids-apps. we have had to create sub-bids to satisfy bids and hence tools like fmriprep. one could say that bids should not care about that. however, a self-contained single subject/session subset would be a relevant use-case in bids i believe.

ps. the challenge has been that bids has different levels of consolidation of information: grouped (participants, sessions, etc.), inheritance (via jsons), and single (the files). this necessitates a connected structure that relies on those pieces of information. the advantages of bids are efficiency (for the grouped files; although longitudinal data is inefficient), deduplication (for the inheritance), and readability (single path+filename).

these different use cases may be good to consider in bids 2.
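The "sub-bids" workaround described above can be sketched roughly as follows. This is a minimal illustration, not an existing tool: the function name `extract_subject` is hypothetical, and the sketch assumes that `participants.tsv` has `participant_id` as its first column and that all inherited metadata lives in top-level files. It touches all three levels of consolidation: top-level files are copied so inheritance still resolves, the grouped file is filtered, and the subject's own files are copied unchanged.

```python
import csv
import shutil
from pathlib import Path


def extract_subject(src: Path, dst: Path, subject: str) -> None:
    """Copy one subject from a BIDS-style tree into a standalone dataset.

    Hypothetical helper for illustration; fails if dst/<subject> already exists.
    """
    dst.mkdir(parents=True, exist_ok=True)

    # Top-level files (dataset_description.json, inherited sidecars, ...) are
    # copied as-is so that inheritance still resolves inside the subset.
    for item in src.iterdir():
        if item.is_file() and item.name != "participants.tsv":
            shutil.copy2(item, dst / item.name)

    # Single level: the subject's own directory tree.
    shutil.copytree(src / subject, dst / subject)

    # Grouped level: keep only the header and this subject's row.
    # Assumption: participant_id is the first column.
    participants = src / "participants.tsv"
    if participants.exists():
        with participants.open(newline="") as f:
            rows = list(csv.reader(f, delimiter="\t"))
        if rows:
            kept = [rows[0]] + [r for r in rows[1:] if r and r[0] == subject]
            with (dst / "participants.tsv").open("w", newline="") as f:
                csv.writer(f, delimiter="\t", lineterminator="\n").writerows(kept)
```

Sidecars placed at intermediate levels, sessions files, and derivatives are not handled here, which is part of why a self-contained subset is harder than it first looks.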

effigies commented 4 months ago

Will also add Templateflow: https://www.templateflow.org/usage/archive/#acceptable-data-types

niksirbi commented 4 months ago

Our NeuroBlueprint specification is definitely in the BIDS-inspired category. Together with @JoeZiminski we are working on putting together a list of divergences from BIDS and the rationale behind them.

yarikoptic commented 4 months ago

however, a self-contained single subject/session subset would be a relevant use-case in bids i believe. ... these different use cases may be good to consider in bids 2.

@satra see

Maybe we should add/promote upvoting via :+1: on the issues, so "go wild" ;)

bleonar5 commented 4 months ago

I can share some context on Psych-DS if it's helpful, but I'm unsure how to comment on DANDI. Our data standard is pretty explicitly modeled on BIDS and our validator tool is essentially a very pared-down fork of BIDS' recent deno implementation.

Definitely our reasoning for diverging instead of using BIDS directly or creating some sort of module within it has to do with complexity. A big part of the ethos of our project is simplicity, since we're trying to bring researchers with a lack of experience with explicit data standards into the fold of producing FAIR datasets. We designed our standard to be the minimal set of conventions for producing consistently-structured, machine-readable datasets with linked metadata, and we avoided the impulse to include additional advanced/optional conventions or conventions governing the internal content of datafiles, because we figured that even the presence of this additional material in our documentation could scare off our target audience.

On a technical level, I noticed that the BIDS deno-based validator only applied rules to files that actually appeared within datasets, with no functionality to produce errors/issues in cases where certain elements were absent, and since this notion of presence/absence was important in our schema, that was one initial additional impetus for diverging with our own tool. Additionally, with Psych-DS 1.0 at least, we only wanted to validate simple tabular CSV data, and a lot of the structure of the BIDS validator had to do with applying different rules and conventions depending on datatype.

We followed BIDS' lead when it came to our usage of linkML for creating a structured model of our schema, and I definitely used BIDS' examples explicitly when developing our stack of tools, which was extremely helpful.

These are just a few random thoughts and pieces of context. Feel free to ask me anything specific and I can answer in detail. Also, I should mention that @mekline is on maternity leave until sometime this June/July and @ianchandlercampbell is our interim director for the project.

yarikoptic commented 4 months ago

Those are very valuable insights, @bleonar5, thanks for sharing! Given that you seem to use the BIDS schema, my short overarching summary would be: re-use the BIDS schema formalization to derive a customized subset of the BIDS standard to simplify domain-specific use and adoption. If I am totally off -- please correct me ;)

I noticed that the BIDS deno-based validator only applied rules to files that actually appeared within datasets, with no functionality to produce errors/issues in cases where certain elements were absent, and since this notion of presence/absence was important in our schema, that was one initial additional impetus for diverging with our own tool.

Could you elaborate on this here (or in a dedicated issue against bids-validator, which could simply be closed if it turns out not to be pertinent)? I am not fully grasping it, since to my understanding bids-validator must error whenever any REQUIRED component (metadata or file) is missing.

Additionally, with Psych-DS 1.0 at least, we only wanted to validate simple tabular CSV data, and a lot of the structure of the BIDS validator had to do with applying different rules and conventions depending on datatype.

This also sounds very intriguing, and like something that could be generally applicable to BIDS. Could you elaborate more?

We followed BIDS' lead when it came to our usage of linkML for creating a structured model of our schema

do you have a link to linkML models handy?

TheChymera commented 4 months ago

@bleonar5 some of the links on the PsychDS README are broken, could you share a tree view of a dataset?

reasoning for diverging [...] has to do with complexity. A big part of the ethos of our project is simplicity,

I think that's also part of the ethos of BIDS, perhaps we could look into simplifying BIDS for 2. as well. What are some of the key complexity concerns which made BIDS 1. less attractive?

bleonar5 commented 4 months ago

@yarikoptic My first response was a bit cursory and based on my memory of our initial rationales for diverging, so I'll try to dig in a bit deeper here. I think your summary of our rationale was correct: we wanted to provide similar structures and standards to those that BIDS provides, for researchers that deal with behavioral data rather than complex physiological data. @satra informed us about the BIDS team's development of a structured schema model in linkML, and this satisfied one of our core desiderata for the project, which was to have an externalized, structured schema that we could reference across validator tools in multiple frameworks (node, R, python). So, we used the combination of pruned-down versions of BIDS' in-development Deno validator and linkML schema as (very helpful) jumping-off points for our own development, with proper citations and acknowledgements, of course.

@mekline has had a much longer history with the development of Psych-DS as an independent entity, and could possibly speak to our rationales for divergence much better, and she may be able to share more detailed thoughts on her return. One crucial element that I'm remembering now is a technical difference between the structure of most physio data and the behavioral data that we're interested in. Physiological data is so rich and often tied to multiple measurements over time, that it seems to be a standard assumption (and I think this is reflected in the BIDS spec) that datafiles will be organized around individual subjects/sessions. In a lot of behavioral datasets, this is not the case, as the whole set of responses for a given subject may be representable in a single row, and one datafile may represent the data gathered from an entire experiment. BIDS is complex and my knowledge of/experience with it only extends to the research I did prior to beginning development of the Psych-DS validator, but it seemed to us that following some kind of subject-oriented system of data organization would be necessary for compliance with BIDS, and this was a major rationale for divergence. (@TheChymera, I think this paragraph is the most relevant answer to your second question)

As for the matter of presence/absence of files/directories that I mentioned previously, I think this is actually just an issue with the deno-based validator rather than the older, public-facing validator. And the deno validator is still in development, so it may just be that I mistook a bug/unfinished component for an actual aspect of the BIDS spec. Basically, if you provide an otherwise-valid BIDS dataset that is missing an element (such as the dataset_description.json file) to the web validator, it produces an error as expected (DATASET_DESCRIPTION_JSON_MISSING). If you do the same with the deno validator, it outputs a VALID_DATASET result and does not report the absence of the required file. This is because the validator crawls the filetree of the dataset, finds whatever files/directories are present, and runs a series of checks on them based on the rules in the linkML schema. But if a core file is missing, the crawler never encounters the file in question, so the relevant rules that would assert the necessity of the file's presence are never applied. I could certainly create an issue for this if it's helpful, but I was unsure if it's appropriate given the fact that the validator has not been publicly released, and this feature may be scheduled for later in the development plan.
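The failure mode described above can be reproduced in a few lines. This is a minimal sketch, not the actual BIDS or Psych-DS validator code: the rule table `REQUIRED_FILES`, both function names, and the issue codes are hypothetical. It contrasts a crawler that only checks files it encounters with one that adds a second pass driven by the rule table itself.

```python
from pathlib import Path

# Hypothetical rule table, for illustration only; not the BIDS schema format.
REQUIRED_FILES = {"dataset_description.json"}


def crawl_only(root: Path) -> list[str]:
    """Run per-file checks on whatever the crawler encounters."""
    issues = []
    for path in sorted(root.rglob("*")):
        # Example per-file rule: files must not be empty.
        if path.is_file() and path.stat().st_size == 0:
            issues.append(f"EMPTY_FILE: {path.relative_to(root).as_posix()}")
    return issues


def crawl_with_presence_pass(root: Path) -> list[str]:
    """Same per-file checks, plus a pass that iterates over the rules."""
    issues = crawl_only(root)
    # A missing file is never visited by the crawler, so the check has to
    # start from the list of required names rather than from the file tree.
    for name in sorted(REQUIRED_FILES):
        if not (root / name).is_file():
            issues.append(f"MISSING_REQUIRED_FILE: {name}")
    return issues
```

On a dataset that lacks `dataset_description.json`, `crawl_only` reports nothing while `crawl_with_presence_pass` reports the missing file.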

Here is a link to our linkML schema model as it currently stands (in development). At the moment it is not really intended to be used with the standard linkML validator library, and is more being used as just a structured, machine-readable implementation of our schema.

bleonar5 commented 4 months ago

@TheChymera Here is a minimal Psych-DS file structure, from the Psych-DS spec document, whose contents we are in the process of integrating into a more holistic readthedocs site for the project/schema/validator

[Screenshot: minimal Psych-DS file structure]

Thank you for the heads up about the dead links, I will do a once-over on our README and take care of those ASAP.

yarikoptic commented 4 months ago

Thank you @bleonar5 !!

In a lot of behavioral datasets, this is not the case, as the whole set of responses for a given subject may be representable in a single row, and one datafile may represent the data gathered from an entire experiment.

A similar aspect is relevant to phenotype data, per our discussion with @surchs. If I recall correctly, we arrived at (or I forced? ;) ) the conclusion that there could be a "nominal data representation": a per sub/ses representation (even if a single row) + a derived composition somewhere else -- after all, the notion of the "derivative" dataset is steadily becoming less of an ugly duckling in the BIDS world. But also it might relate to the discussion of

Sorry if this feels "too jumpy", but I think there is a common pattern emerging here across different aspects ;)
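The "nominal data representation" idea above (one file per subject, plus a derived composite elsewhere) could look roughly like this. It is only a sketch under stated assumptions: the function names, the `participant_id` column default, and the `_responses.tsv` file name are hypothetical and not taken from BIDS or Psych-DS, and each subject is assumed to occupy exactly one row of the experiment-wide table.

```python
import csv
from pathlib import Path


def split_by_subject(table: Path, out_root: Path,
                     id_column: str = "participant_id") -> list[Path]:
    """Write one single-row TSV per subject from an experiment-wide table."""
    written = []
    with table.open(newline="") as f:
        reader = csv.DictReader(f, delimiter="\t")
        fieldnames = reader.fieldnames
        for row in reader:
            subject_dir = out_root / row[id_column]
            subject_dir.mkdir(parents=True, exist_ok=True)
            # Hypothetical file name; a second row for the same subject
            # would overwrite the first.
            target = subject_dir / f"{row[id_column]}_responses.tsv"
            with target.open("w", newline="") as out:
                writer = csv.DictWriter(out, fieldnames=fieldnames,
                                        delimiter="\t", lineterminator="\n")
                writer.writeheader()
                writer.writerow(row)
            written.append(target)
    return written


def compose(files: list[Path]) -> list[dict]:
    """Rebuild the experiment-wide table (the derived composite)."""
    rows = []
    for path in sorted(files):
        with path.open(newline="") as f:
            rows.extend(csv.DictReader(f, delimiter="\t"))
    return rows
```

The round trip is lossless for one-row-per-subject tables, which is what makes the composite a derivative rather than a second source of truth.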

I could certainly create an issue for this if it's helpful

Please do, or let me know if I should -- it does sound like a true bug, since the validator must error out if any of the required files in files/common/core.yaml is missing.

niksirbi commented 2 weeks ago

Hey @yarikoptic

Sorry for the delay in replying, we (me and @JoeZiminski) were aiming to write a full post on this, intended for our website, but things have been busier than expected. Please find below a summary of the logic behind NeuroBlueprint, where it diverges from BIDS, in what ways BIDS is not fulfilling our requirements, and how this could be remedied with BIDS 2.0.

For context, we recently wrote two blog posts motivating NeuroBlueprint and the related data-management tool datashuttle.

NeuroBlueprint motivation

The main motivation for NeuroBlueprint is to provide a version of folder standardisation with a very low barrier to entry, mostly focused on the data acquisition stage of a project. We found that BIDS, while necessarily detailed with the aim of full standardisation and reproducibility, can be too detailed for researchers who are very busy in the early stages of a project. For our purposes, at this stage we just want researchers' data to be in a predictable location, for ingestion into analysis pipelines.

A more minor consideration was that BIDS is somewhat biased towards techniques used in human subjects (MRI, EEG, MEG), while NeuroBlueprint is more geared towards systems neuroscience (animal subjects), similar to NWB. While BIDS is slowly moving towards accommodating such data (most notably with BEP032 for animal ephys), the "human legacy" still informs much of its design and terminology.

The founding idea of NeuroBlueprint is that some standardisation is preferable to no standardisation. Our initial goal was to present systems neuroscientists with a small subset of BIDS requirements (those that are easy to adhere to at the data acquisition stage), which would make it easier for researchers to transition to "full BIDS" (or NWB) later, at the stage of paper publication and data sharing. However, while iterating on the NeuroBlueprint spec we realised that we had to break with BIDS in some areas, even within the subset we mandate.

Divergences from BIDS

Unlike BIDS, we currently make no requirements on metadata, file names, file format, etc. Essentially the only things we do require at present are a BIDS-style folder hierarchy and naming, specifically:

Datatypes and modalities

We have ended up 'actively' diverging from BIDS in this area. For researchers creating file / folder names manually, the BIDS 'modality' concept, in which different modalities are distinguished by suffix, was not very convenient. For example, if collecting two types of microscopy data stored in a micr folder, the researcher would need to make sure to never forget to add the correct suffix to a filename, lest they get their data types confused. This convention seems like an optimisation better suited for the data-publication stage of a project (which we aim to write converters for in future) whereas in the acquisition stage it is easier to separate things into distinct folders.

Moreover, the existing BIDS datatypes do not map well onto the methods typically used in systems neuroscience labs. For example, BIDS reserves anat for structural MRI data and func for fMRI, neither of which are frequently acquired in our field. Another example is that BIDS employs micr for microscopy data, but the spec seems to be designed primarily for structural microscopy and doesn't adequately cover in vivo functional imaging techniques ('optical physiology', e.g. calcium imaging), which are abundant in animal neuroscience. Though we could possibly "coerce" such data to fit within micr, we are not sure that's necessarily desirable, as structural and functional imaging data are likely to be pre-processed and analysed in completely different ways (analogous to MRI's anat and func).

As such, we have taken liberties with datatype names, and at present we only mandate 4 datatypes (behav, ephys, funcimg, anat). We use the anat datatype for any kind of anatomical (structural) imaging, and funcimg for any kind of functional imaging. We are planning to extend these, but will almost certainly require everything be separated into different datatype-level folders instead of relying on modality-specific suffixes. There's an ongoing discussion on this topic here and we'd love to hear your thoughts.

Although this seems like a major divergence, we thought it would be relatively easy to reconcile once a project is complete, with appropriate converters. It would essentially involve moving/renaming datatypes and adding appropriate modality suffixes as needed.
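As a rough illustration of the kind of converter mentioned above, the sketch below plans the rename from a datatype-folder layout to a suffix-based one. Everything in it is an assumption made for illustration: the `DATATYPE_MAP` table, the choice of `micr` as the target folder, and the suffix names are hypothetical and are not taken from either the NeuroBlueprint or the BIDS specification.

```python
from pathlib import PurePosixPath

# Hypothetical mapping: acquisition-stage datatype folder ->
# (target datatype folder, suffix appended to the file name).
DATATYPE_MAP = {
    "funcimg": ("micr", "funcimg"),
    "anat": ("micr", "anat"),
}


def plan_move(path: str) -> str:
    """Map <sub>/<ses>/<datatype>/<name><ext> to a suffix-based location.

    Only computes the new relative path; nothing is moved on disk.
    """
    p = PurePosixPath(path)
    datatype = p.parent.name
    if datatype not in DATATYPE_MAP:
        return path  # leave unknown datatypes untouched
    target, suffix = DATATYPE_MAP[datatype]
    ext = "".join(p.suffixes)  # keeps compound extensions such as .ome.tif
    stem = p.name[: len(p.name) - len(ext)] if ext else p.name
    return str(p.parent.parent / target / f"{stem}_{suffix}{ext}")
```

For example, `plan_move("sub-001/ses-001/funcimg/sub-001_ses-001_run-1.tif")` yields `sub-001/ses-001/micr/sub-001_ses-001_run-1_funcimg.tif`. Separating the planning step from the actual move makes it easy to review the result before touching any data.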

Wish list for BIDS 2.0

As is apparent from above, what we'd love to see in BIDS 2.0 is a re-thinking of the datatype/modality concept, or at least some room for flexibility in defining/naming datatypes.

The absolute dream would be to have BIDS 2.0 consist of a set of specific and atomic rules, the way linters like ruff are organized. In this way, it would be very easy for others like us to adopt a specific subset of rules, by specifying which to include/exclude etc. We could even have "rulesets", i.e. pre-specified sets of rules that are useful in specific scenarios (e.g. "data acquisition", "uploading to OpenNeuro/DANDI", etc.). I'm aware that this is too much to ask for, technically challenging, and likely out-of-scope for BIDS, but one can dream.
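To make the idea concrete, here is a minimal sketch of what atomic rules with named rulesets might look like. It is purely illustrative: the rule codes (`DS001`, `SUB001`), the ruleset names, and the `validate` function are hypothetical and do not correspond to anything in BIDS or its validator. The `select`/`ignore` parameter names are borrowed from ruff's configuration options.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable


@dataclass(frozen=True)
class Rule:
    code: str                      # hypothetical rule identifier
    description: str
    check: Callable[[Path], bool]  # returns True when the dataset passes


RULES = [
    Rule("DS001", "dataset_description.json exists",
         lambda root: (root / "dataset_description.json").is_file()),
    Rule("SUB001", "at least one sub-* folder exists",
         lambda root: any(p.is_dir() for p in root.glob("sub-*"))),
]

# Hypothetical pre-specified rulesets for different project stages.
RULESETS = {
    "acquisition": {"SUB001"},
    "sharing": {"DS001", "SUB001"},
}


def validate(root: Path, select: set[str],
             ignore: frozenset[str] = frozenset()) -> list[str]:
    """Return the codes of selected, non-ignored rules that fail."""
    active = select - ignore
    return [r.code for r in RULES if r.code in active and not r.check(root)]
```

A lab could then call `validate(root, RULESETS["acquisition"])` during data collection and switch to `RULESETS["sharing"]` before upload, without either ruleset needing to know about the other.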

Conclusion

We absolutely love what BIDS has done for the neuroimaging community and we are on board with extending its benefits to other research communities. NeuroBlueprint is still young and many of its points are still amenable to change, as long as we stick to our main design consideration, which is to keep the spec minimal and easy to adopt. Let's keep the conversation going!