Use-case(s): BIDS-inspired/like standards

yarikoptic commented 9 months ago

Quite often projects do not adopt BIDS due to its complexity or an imperfect fit, and then establish a new file layout and/or metadata convention/standard while saying they are "BIDS-like". The likeness varies greatly. Quite often it amounts to simply having folders and file names with some metadata in them; those are not worth mentioning here. But there is a good number of BIDS-like standards (in my words: formalized descriptions adopted by a considerable number of people) which are worth reviewing and analyzing for what could minimize divergence between them and current BIDS, possibly by introducing missing but reasonable and desired features into BIDS 2.0.

This issue would be used to collect pointers and possibly summarize rationale and major features behind them.

DANDI layout

Established by me and @satra for https://dandiarchive.org, primarily due to the complete lack of a usable standard/layout at that earlier point in time.

There is no specification, and layout is largely "enforced" via dandi organize command on a set of .nwb files.
dandiset.yaml schema is defined within pydantic model in https://github.com/dandi/dandi-schema/blob/master/dandischema/models.py#L1405
see https://github.com/dandi/dandi-cli/issues/1498

Notable divergences:

- no `dataset_description.json`: metadata is in the "in-house" `dandiset.yaml`
- no `ses-*/` level subfolder, but there is a `ses-` entity in the target filenames (convergence possible through #54)
- no datatype (AKA modality) subfolder (related: #55, convergence possible through #54)
- main file format is `.nwb` (convergence through https://bids.neuroimaging.io/bep032 for animal ephys data, discussed/not (yet) accepted in BIDS 1.0 for `micr/`: https://github.com/bids-standard/bids-specification/pull/1632)
- entity labels/values could contain `-` and `+`. (TODO: ref BIDS PR)
- Suffix can contain `+` (no PR yet I think)
  - Suffix contains (often multiple) data modalities contained within the single .nwb file, concatenated with `+`
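To make the naming divergences concrete, here is a minimal sketch of how such a file name could be split into entities and a `+`-joined suffix. The function name and the example file name are illustrative assumptions, not part of `dandi organize`, and the sketch assumes a single extension such as `.nwb`.

```python
def parse_dandi_like_name(filename: str) -> dict:
    """Split a DANDI-style file name into entities, modalities and extension.

    Hypothetical helper for illustration only. Assumes underscore-separated
    key-value entities followed by one suffix, and a single extension.
    """
    stem, _, extension = filename.rpartition(".")
    *entity_parts, suffix = stem.split("_")
    entities = {}
    for part in entity_parts:
        # Split on the FIRST "-" only, so labels may themselves contain "-".
        key, sep, label = part.partition("-")
        if not sep or not key or not label:
            raise ValueError(f"not a key-value entity: {part!r}")
        entities[key] = label
    return {
        "entities": entities,
        # The suffix lists the modalities stored in the file, joined by "+".
        "modalities": suffix.split("+"),
        "extension": "." + extension,
    }


parsed = parse_dandi_like_name(
    "sub-YutaMouse20_ses-YutaMouse20-140321_behavior+ecephys.nwb"
)
print(parsed["entities"])
print(parsed["modalities"])
```

Note that a BIDS 1.x parser would treat the `-` inside the `ses` label and the `+` in the suffix as invalid, which is the point of the last three bullets.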

PsychDS

https://psych-ds.github.io/ https://github.com/psych-ds/psych-DS (attn @mekline and @bleonar5 - would appreciate details/feedback alike for DANDI here or in a dedicated issue/doc)

Overall motivation: Re-use BIDS schema formalization to derive a customized subset of the BIDS standard to simplify domain specific use and adoption.

NeuroBlueprint

https://neuroblueprint.neuroinformatics.dev/specification.html . Request for summarization of rationale/divergences: https://github.com/neuroinformatics-unit/NeuroBlueprint/issues/51

SPARC Data Structure (SDS)

The SPARC platform has a preprint online that describes a BIDS inspired data structure: https://www.biorxiv.org/content/10.1101/2021.02.10.430563v2. Rough details are also on their wiki.

TemplateFlow

https://www.templateflow.org/usage/archive/#acceptable-data-types

A related discussion "has happened" in https://github.com/bids-standard/bids-specification/issues/1281

Brain-Development.org Atlas

https://brain-development.org/brain-atlases/atlases-from-the-dhcp-project/cortical-surface-template/ describes itself as "using BIDS conventions", and proceeds to define custom entities and metadata.

NiPoppy

Study-level description which includes a BIDS dataset and uses some conventions (like a derivatives/ subfolder with a more clearly defined naming convention).

https://github.com/bids-standard/bids-specification/pull/1861 attempts to converge NiPoppy and BIDS

CAPS

ClinicA Processed Structure:

When the development of Clinica started in 2015, the BIDS specifications did not provide specific rules for the processed data. As a result, the goal of CAPS (designed by the Aramis Lab) was to define a hierarchy for the data processed using Clinica. The idea is to include in a single folder all the results of the different pipelines and organize the data following the main patterns of the BIDS specification.

Several differences exist between the BIDS and CAPS specifications.

Instead of the BIDS derivatives/ folder, the processed data are stored in the CAPS folder.

CAPS assumes that the session is always present in a BIDS dataset even though there is a single session. In other words, all datasets are considered longitudinal, even when they have only one session.

NIfTI files generated by a Clinica pipeline are always compressed. Compression is recommended by the BIDS specification but not mandatory.

TODOs:

[ ] represent/quantify divergences from BIDS in form of a table. Dimensions/features
- need for extra entities
- need for alternative folders hierarchy
  - need for new modalities
- need for other file formats (is there a MINC BIDS?)

satra commented 9 months ago

ping @tgbugs for SPARC info on bids-like

dorahermes commented 9 months ago

SPARC Data Structure (SDS)

The SPARC platform has a preprint online that describes a BIDS inspired data structure: https://www.biorxiv.org/content/10.1101/2021.02.10.430563v2. Rough details are also on their wiki.

edit by @yarikoptic : thanks -- added

satra commented 9 months ago

also pinging @saskiad and @dyf for bids inspired data container for the allen institute for neural dynamics

yarikoptic commented 9 months ago

also pinging @saskiad and @dyf for bids inspired data container for the allen institute for neural dynamics

after review with @dyf we agreed that it was a little too distant from BIDS, at most indeed just "inspired" ;), so we have it as a separate issue: #60

satra commented 9 months ago

that's good. although, i'm not quite sure that use-case is too distant. practically speaking, we run into that issue with ukbiobank and any of the large datasets where we need to process a few subjects with bids-apps. we have had to create sub-bids to satisfy bids and hence tools like fmriprep. one could say that bids should not care about that. however, a self-contained single subject/session subset would be a relevant use-case in bids i believe.

ps. the challenge has been that bids has different levels of consolidation of information: grouped (participants, sessions, etc), inheritance (via jsons), and single (the files). this necessitates a connected structure that relies on those pieces of information. the advantage of bids is efficiency (for the grouped files; although longitudinal data is inefficient), deduplication (for the inheritance), and readability (single path+filename).

these different use cases may be good to consider in bids 2.

effigies commented 9 months ago

Will also add Templateflow: https://www.templateflow.org/usage/archive/#acceptable-data-types

https://brain-development.org/brain-atlases/atlases-from-the-dhcp-project/cortical-surface-template/ describes itself as "using BIDS conventions", and proceeds to define custom entities and metadata.

niksirbi commented 9 months ago

Our NeuroBlueprint specification is definitely in the BIDS-inspired category. Together with @JoeZiminski we are working on putting together a list of divergences from BIDS and the rationale behind them.

yarikoptic commented 9 months ago

however, a self-contained single subject/session subset would be a relevant use-case in bids i believe. ... these different use cases may be good to consider in bids 2.

@satra see

https://github.com/bids-standard/bids-2-devel/issues/59 which was born out of the #60 I have mentioned, and
https://github.com/bids-standard/bids-2-devel/issues/54

Maybe we should add/promote upvoting via :+1: on the issues, so "go wild" ;)

bleonar5 commented 9 months ago

I can share some context on Psych-DS if it's helpful, but I'm unsure how to comment on DANDI. Our data standard is pretty explicitly modeled on BIDS and our validator tool is essentially a very pared-down fork of BIDS' recent deno implementation.

Definitely our reasoning for diverging instead of using BIDS directly or creating some sort of module within it has to do with complexity. A big part of the ethos of our project is simplicity, since we're trying to bring researchers with a lack of experience with explicit data standards into the fold of producing FAIR datasets. We designed our standard to be the minimal set of conventions for producing consistently-structured, machine-readable datasets with linked metadata, and we avoided the impulse to include additional advanced/optional conventions or conventions governing the internal content of datafiles, because we figured that even the presence of this additional material in our documentation could scare off our target audience.

On a technical level, I noticed that the BIDS deno-based validator only applied rules to files that actually appeared within datasets, with no functionality to produce errors/issues in cases where certain elements were absent, and since this notion of presence/absence was important in our schema, that was one initial additional impetus for diverging with our own tool. Additionally, with Psych-DS 1.0 at least, we only wanted to validate simple tabular CSV data, and a lot of the structure of the BIDS validator had to do with applying different rules and conventions depending on datatype.

We followed BIDS' lead when it came to our usage of linkML for creating a structured model of our schema, and I definitely used BIDS' examples explicitly when developing our stack of tools, which was extremely helpful.

These are just a few random thoughts and pieces of context. Feel free to ask me anything specific and I can answer in detail. Also, I should mention that @mekline is on maternity leave until sometime this June/July and @ianchandlercampbell is our interim director for the project.

yarikoptic commented 9 months ago

Those are very valuable insights @bleonar5, thanks for sharing! Given that you seem to use the BIDS schema, my short overarching summary would be: Re-use BIDS schema formalization to derive a customized subset of the BIDS standard to simplify domain specific use and adoption. If I am totally off, please correct me ;)

I noticed that the BIDS deno-based validator only applied rules to files that actually appeared within datasets, with no functionality to produce errors/issues in cases where certain elements were absent, and since this notion of presence/absence was important in our schema, that was one initial additional impetus for diverging with our own tool.

Could you elaborate more on this here (or even in a dedicated issue against bids-validator, which could be closed if not pertinent)? I am not fully grasping it, since to me bids-validator must error whenever any REQUIRED component (metadata or file) is missing.

Additionally, with Psych-DS 1.0 at least, we only wanted to validate simple tabular CSV data, and a lot of the structure of the BIDS validator had to do with applying different rules and conventions depending on datatype.

also sounds very intriguing and like something that could be generally applicable to BIDS. Could you elaborate more?

We followed BIDS' lead when it came to our usage of linkML for creating a structured model of our schema

do you have a link to linkML models handy?

TheChymera commented 9 months ago

@bleonar5 some of the links on the PsychDS README are broken, could you share a tree view of a dataset?

reasoning for diverging [...] has to do with complexity. A big part of the ethos of our project is simplicity,

I think that's also part of the ethos of BIDS, perhaps we could look into simplifying BIDS for 2. as well. What are some of the key complexity concerns which made BIDS 1. less attractive?

bleonar5 commented 9 months ago

@yarikoptic My first response was a bit cursory and based on my memory of our initial rationales for diverging, I'll try to dig in a bit deeper here. I think your summary of our rationale was correct: we wanted to provide similar structures and standards to those that BIDS provides, for researchers that deal with behavioral data rather than complex physiological data. @satra informed us about the BIDS' team's development of a structured schema model in linkML, and this satisfied one of our core desiderata for the project, which was to have an externalized, structured schema that we could reference across validator tools in multiple frameworks (node, R, python). So, we used the combination of pruned-down versions of BIDS' in-development Deno validator and linkML schema as (very helpful) jumping off points for our own development, with proper citations and acknowledgements, of course.

@mekline has had a much longer history with the development of Psych-DS as an independent entity, and could possibly speak to our rationales for divergence much better, and she may be able to share more detailed thoughts on her return. One crucial element that I'm remembering now is a technical difference between the structure of most physio data and the behavioral data that we're interested in. Physiological data is so rich and often tied to multiple measurements over time, that it seems to be a standard assumption (and I think this is reflected in the BIDS spec) that datafiles will be organized around individual subjects/sessions. In a lot of behavioral datasets, this is not the case, as the whole set of responses for a given subject may be representable in a single row, and one datafile may represent the data gathered from an entire experiment. BIDS is complex and my knowledge of/experience with it only extends to the research I did prior to beginning development of the Psych-DS validator, but it seemed to us that following some kind of subject-oriented system of data organization would be necessary for compliance with BIDS, and this was a major rationale for divergence. (@TheChymera, I think this paragraph is the most relevant answer to your second question)

As for the matter of presence/absence of files/directories that I mentioned previously, I think this is actually just an issue with the deno-based validator rather than the older, public-facing validator. And the deno validator is still in development, so it may just be that I mistook a bug/unfinished component for an actual aspect of the BIDS spec. Basically, if you provide an otherwise-valid BIDS dataset that is missing an element (such as the dataset_description.json file) to the web validator, it produces an error as expected (DATASET_DESCRIPTION_JSON_MISSING). If you do the same with the deno validator, it outputs a VALID_DATASET result and does not report the absence of the required file. This is because the validator crawls the filetree of the dataset, finds whatever files/directories are present, and runs a series of checks on them based on the rules in the linkML schema. But if a core file is missing, the crawler never encounters the file in question, so the relevant rules that would assert the necessity of the file's presence are never applied. I could certainly create an issue for this if it's helpful, but I was unsure if it's appropriate given the fact that the validator has not been publicly released, and this feature may be scheduled for later in the development plan.
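To illustrate the behaviour I'm describing, here is a minimal sketch (not the actual deno validator code; the rule, the issue codes, and the required-file list are made up for illustration). A crawl-style pass can only complain about files it encounters, whereas a presence pass iterates over the requirements themselves.

```python
# Illustrative list of required files; NOT the real BIDS or Psych-DS schema.
REQUIRED_FILES = ("dataset_description.json",)


def crawl_only_issues(present: set[str]) -> list[str]:
    """Apply per-file rules to the files that exist in the dataset."""
    issues = []
    for path in sorted(present):
        # Hypothetical per-file rule: JSON file names must not contain spaces.
        if path.endswith(".json") and " " in path:
            issues.append(f"BAD_NAME: {path}")
    # A missing file is never visited, so it can never produce an issue here.
    return issues


def presence_issues(present: set[str]) -> list[str]:
    """Iterate over the requirements instead of over the file tree."""
    return [
        f"MISSING_REQUIRED_FILE: {name}"
        for name in REQUIRED_FILES
        if name not in present
    ]


files = {"README", "sub-01/sub-01_data.csv"}
print(crawl_only_issues(files))   # nothing to report: no rule ever fired
print(presence_issues(files))     # reports the absent required file
```

With only the first pass, the dataset above would be reported as valid even though a required file is absent.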

Here is a link to our linkML schema model as it currently stands (in development). At the moment it is not really intended to be used with the standard linkML validator library, and is more being used as just a structured, machine-readable implementation of our schema.

bleonar5 commented 9 months ago

@TheChymera Here is a minimal Psych-DS file structure, from the Psych-DS spec document, whose contents we are in the process of integrating into a more holistic readthedocs site for the project/schema/validator

Thank you for the heads up about the dead links, I will do a once-over on our README and take care of those ASAP.

yarikoptic commented 9 months ago

Thank you @bleonar5 !!

In a lot of behavioral datasets, this is not the case, as the whole set of responses for a given subject may be representable in a single row, and one datafile may represent the data gathered from an entire experiment.

similar aspect relevant to phenotype data, per our discussion with @surchs. If I recall correctly we arrived at (or I forced ? ;) ) the conclusion that there could be a "nominal data representation": per sub/ses representation (even if a single row) + derived composition somewhere else. After all, the notion of the "derivative" dataset is steadily becoming less of an ugly duckling in the BIDS world. But it might also relate to the discussion of #36, where similarly a decision could be made to aggregate metadata at the higher levels since it is the most useful and sensible representation. And needless to say, it does relate to #54, since if there is no sub- folder, data per subject should reside somewhere/somehow ;)

Sorry if this feels "too jumpy", but I think there is a common pattern emerging here across different aspects ;)

I could certainly create an issue for this if it's helpful

please do, or let me know if I should do -- since it does sound like a true bug since validator must error out if any of the required files in files/common/core.yaml is missing.

niksirbi commented 5 months ago

Hey @yarikoptic

Sorry for the delay in replying, we (me and @JoeZiminski) were aiming to write a full post on this, intended for our website, but things have been busier than expected. Please find below a summary of the logic behind NeuroBlueprint, where it diverges from BIDS, in what ways BIDS is not fulfilling our requirements, and how this could be remedied with BIDS 2.0.

For context, we recently wrote two blog posts motivating NeuroBlueprint and the related data-management tool datashuttle.

NeuroBlueprint motivation

The main motivation for NeuroBlueprint is to provide a version of folder standardisation with a very low barrier for entry, mostly focused on the data acquisition stage of a project. We found BIDS, while necessarily detailed with the aim of full standardisation and reproducibility, can be too detailed for researchers very busy in the early stages of a project. For our purposes, we at this stage just want to know where researchers' data are in a predictable way, for ingestion into analysis pipelines.

A more minor consideration was that BIDS is somewhat biased towards techniques used in human subjects (MRI, EEG, MEG), while NeuroBlueprint is more geared towards systems neuroscience (animal subjects), similar to NWB. While BIDS is slowly moving towards accommodating such data (most notably with BEP032 for animal ephys), the "human legacy" still informs much of its design and terminology.

The founding idea of NeuroBlueprint is that some standardisation is preferable to no standardisation. Our initial goal was to present systems neuroscientists with a small subset of BIDS requirements (those that are easy to adhere to at the data acquisition stage), which would make it easier for researchers to transition to "full BIDS" (or NWB) later, at the stage of paper publication and data sharing. However, while iterating on the NeuroBlueprint spec we realised that we had to break with BIDS in some areas, even within the subset we mandate.

Divergences from BIDS

Unlike BIDS, we currently make no requirements on metadata, file names, file format, etc. Essentially the only things we do require at present are a BIDS-style folder hierarchy and naming, specifically:

- We adopt the BIDS distinction between rawdata vs derivatives at the top level (though there's no sourcedata).
- We mandate the hierarchical organisation of rawdata into subject > session > datatype sub-folders (though unlike BIDS, the session level can't be skipped).
- The naming of subject- and session-level folders follows the BIDS convention of key-value pairs separated by underscores, but unlike BIDS we don't define a strict set of entities for users to choose from. Instead, we only mandate that the first entity is sub-<index> for subject- and ses-<index> for session-level folders (to at least ensure that these folders can be uniquely identified).
- The biggest break with BIDS comes at the level of data types and modalities (see section below).
- We suggest, but do not require, that files are named following BIDS-like key-value pair conventions. Similarly, we recommend, but do not require, that tabular metadata are saved in BIDS-style TSV files.
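As a rough illustration of the folder-naming rule above, here is a minimal sketch of a name check. It is not taken from datashuttle; the function name is hypothetical, and it assumes that `<index>` means one or more digits and that any further key-value pairs are alphanumeric.

```python
import re

# Assumption: extra entities are alphanumeric "key-value" pairs.
_PAIR = re.compile(r"[A-Za-z0-9]+-[A-Za-z0-9]+")


def is_valid_folder_name(name: str, level: str) -> bool:
    """Check a subject- or session-level folder name (illustrative only)."""
    prefix = {"subject": "sub", "session": "ses"}[level]
    first, *rest = name.split("_")
    # The first entity must be sub-<index> or ses-<index>.
    if not re.fullmatch(rf"{prefix}-\d+", first):
        return False
    # Any remaining entities are free-form key-value pairs.
    return all(_PAIR.fullmatch(pair) for pair in rest)


print(is_valid_folder_name("sub-001_id-mouse7", "subject"))
print(is_valid_folder_name("date-20240101_ses-01", "session"))
```

The second call fails because the `ses-<index>` entity is not first, which is the only ordering constraint we impose.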

Datatypes and modalities

We have ended up 'actively' diverging from BIDS in this area. For researchers creating file / folder names manually, the BIDS 'modality' concept, in which different modalities are distinguished by suffix, was not very convenient. For example, if collecting two types of microscopy data stored in a micr folder, the researcher would need to make sure to never forget to add the correct suffix to a filename, lest they get their data types confused. This convention seems like an optimisation better suited for the data-publication stage of a project (which we aim to write converters for in future) whereas in the acquisition stage it is easier to separate things into distinct folders.

Moreover, the existing BIDS datatypes do not map well onto the methods typically used in systems neuroscience labs. For example, BIDS reserves anat for structural MRI data and func for fMRI, neither of which are frequently acquired in our field. Another example is that BIDS employs micr for microscopy data, but the spec seems to be designed primarily for structural microscopy and doesn't adequately cover in vivo functional imaging techniques ('optical physiology', e.g. calcium imaging), which are abundant in animal neuroscience. Though we could possibly "coerce" such data to fit within micr, we are not sure that's necessarily desirable, as structural and functional imaging data are likely to be pre-processed and analysed in completely different ways (analogous to MRI's anat and func).

As such, we have taken liberties with datatype names, and at present we only mandate 4 datatypes (behav, ephys, funcimg, anat). We use the anat datatype for any kind of anatomical (structural) imaging, and funcimg for any kind of functional imaging. We are planning to extend these, but will almost certainly require everything be separated into different datatype-level folders instead of relying on modality-specific suffixes. There's an ongoing discussion on this topic here and we'd love to hear your thoughts.

Although this seems like a major divergence, we thought it would be relatively easy to reconcile once a project is complete, with appropriate converters. It would essentially involve moving/renaming datatypes and adding appropriate modality suffixes as needed.

Wish list for BIDS 2.0

As is apparent from above, what we'd love to see in BIDS 2.0 is a re-thinking of the datatype/modality concept, or at least some room for flexibility in defining/naming datatypes.

The absolute dream would be to have BIDS 2.0 consist of a set of specific and atomic rules, the same way that linters like ruff do. In this way, it would be very easy for others like us to adopt a specific subset of rules, by specifying which to include/exclude etc. We could even have "rulesets", i.e. pre-specified sets of rules that are useful in specific scenarios (e.g. "data acquisition", "uploading to OpenNeuro/DANDI", etc.). I'm aware that this is too much to ask for, technically challenging, and likely out-of-scope for BIDS, but one can dream.
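To sketch what I mean by atomic rules and rulesets, here is a toy example of prefix-based selection, loosely in the spirit of how linters let you select and ignore rule codes. Everything in it (rule codes, ruleset names, the checks themselves) is hypothetical and only meant to show the shape of the idea.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Rule:
    code: str
    check: Callable[[list[str]], bool]  # returns True when the dataset passes


# Hypothetical atomic rules operating on a list of relative paths.
RULES = [
    Rule("LAYOUT001", lambda paths: any(p.startswith("rawdata/") for p in paths)),
    Rule(
        "LAYOUT002",
        lambda paths: all(
            "/ses-" in p for p in paths if p.startswith("rawdata/sub-")
        ),
    ),
    Rule("META001", lambda paths: "dataset_description.json" in paths),
]

# Hypothetical pre-specified rulesets, expressed as code prefixes.
RULESETS = {"acquisition": ["LAYOUT"], "sharing": ["LAYOUT", "META"]}


def run(paths: list[str], select: list[str], ignore: tuple[str, ...] = ()) -> list[str]:
    """Return the codes of selected, non-ignored rules that fail."""
    failed = []
    for rule in RULES:
        selected = any(rule.code.startswith(prefix) for prefix in select)
        ignored = any(rule.code.startswith(prefix) for prefix in ignore)
        if selected and not ignored and not rule.check(paths):
            failed.append(rule.code)
    return failed


paths = ["rawdata/sub-001/ses-01/behav/trials.csv"]
print(run(paths, RULESETS["acquisition"]))
print(run(paths, RULESETS["sharing"]))
```

A project like ours could then adopt the "acquisition" ruleset during data collection and switch to a stricter one before sharing, without forking the spec.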

Conclusion

We absolutely love what BIDS has done for the neuroimaging community and we are on board with extending its benefits to other research communities. NeuroBlueprint is still young and many of its points are still amenable to change, as long as we stick to our main design consideration, which is to keep the spec minimal and easy to adopt. Let's keep the conversation going!

tgbugs commented 2 months ago

As a placeholder for deeper discussion, here are some links that provide a partial overview of SDS. I have a poster on this at Neuroinformatics, so I can update with that here as well when it is ready.

How SDS models ontological participants that was inspired by the discussion in https://github.com/bids-standard/bids-specification/issues/779.

Changelog for the latest version of SDS and the actual release.

The most important change in this context would be our move to accommodate data management processes where most or all file metadata is stored in a manifest file of some kind, that might look like mapping file names, object ids, checksums, s3 object paths, etc. to metadata records in a separate system in addition to the traditional folder naming conventions. I think this touches on

The manifest changes also relate to how BIDS-like standards can enforce data modality file type requirements (e.g. that mri should have nifti files, ephys should have nwb, microscopy ome-tiff, etc., while allowing png files elsewhere in the dataset) without necessarily having to have folders for each modality, which can result in an n×m-fold increase in directories for n subjects and m modalities.
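As a rough sketch of the idea (this is not the SDS manifest format; the column names `filename` and `modality`, the extension table, and the function name are assumptions made for illustration), file type requirements can be checked per manifest record rather than per folder:

```python
import csv
import io

# Illustrative mapping from modality to allowed extensions (an assumption).
ALLOWED_EXTENSIONS = {
    "mri": (".nii", ".nii.gz"),
    "ephys": (".nwb",),
    "microscopy": (".ome.tif", ".ome.tiff"),
}


def manifest_violations(manifest_text: str) -> list[str]:
    """Return file names whose extension does not fit their declared modality."""
    violations = []
    reader = csv.DictReader(io.StringIO(manifest_text), delimiter="\t")
    for row in reader:
        allowed = ALLOWED_EXTENSIONS.get(row["modality"])
        if allowed is None:
            # No modality declared: the file is unconstrained (e.g. a png).
            continue
        if not row["filename"].endswith(allowed):
            violations.append(row["filename"])
    return violations


manifest = "filename\tmodality\nscan.nii.gz\tmri\nrec.mat\tephys\nfigure.png\t\n"
print(manifest_violations(manifest))
```

Here the modality lives in the manifest record, so the directory tree does not need one folder per subject per modality.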

With regard to interoperability between standards, there is discussion in the changelog about one way to make file system conventions (data set standards) nestable using a file called .dss. For example, if you have data collected across modalities that already have standards, such as MRI with BIDS, ephys with DANDI, peripheral physiology with SDS, etc., it is possible to combine them all without conflict in a single hierarchy.
