BEP for participant-level "mega-analysis" of multiple datasets (possibly with non-compliant derivative folders)

spisakt commented 3 years ago

Hi Everyone, I would like to initiate a discussion about how the BIDS principles could be applied to participant-level meta-analyses (often called mega-analyses, see e.g. this). Such analyses are typically collaborative efforts, encompassing numerous studies and dealing with heterogeneous, possibly (and sometimes typically) non-BIDS-compliant derivative folders.

The main challenges:

the overarching outcomes-of-interest of a mega-analysis likely follow different naming conventions across the participating studies (e.g. different names for the same experimental condition)
data of participating studies are shared in highly heterogeneous formats (from fully BIDS-compliant to an arbitrary collection of files), especially when the mega-analysis deals with derivatives (e.g. beta-maps, like in the linked study)
full conversion to BIDS would in most cases be overkill; retaining the original data structure as much as possible is advantageous (only the authors of the single studies possess the full know-how for interpreting all data and metadata, which remains more accessible if the format is unchanged)

With my colleagues in the Placebo Imaging Consortium, we have come up with a relatively lightweight solution to these challenges (it is already deployed as the database is extended with new studies). The core of our approach lies in indexing non-compliant (or possibly even compliant) derivative directories with a new json sidecar (that is, handling them as an "atomic" data object), plus maintaining some overarching (mega-analysis related) metadata in an optional top-level folder.

Our solution is meant to be fully backward compatible with the original BIDS specification.

As some of the challenges being tackled here must be quite common, we thought it might make sense to share it with the community and start discussing it as a potential BEP.

For more details, please see our draft:

BIDS-MEGA

Comments and feedback are welcome!

Best wishes, Tamas

effigies commented 3 years ago

Awesome! Mega-analyses and multiverse analyses are something I've been thinking about a bit lately, so it's great to see something that already exists. I assume you've seen work on single-study GLMs?

cc @bids-standard/models-stats

andersonwinkler commented 3 years ago

Hi Tamas,

I find this very interesting and an important effort. We recently completed one mega-analysis and have 3 others ongoing (for the one completed, the methods are described here; the results are about to appear in Translational Psychiatry). For all these, the sites sent us the original, raw (unprocessed) data, and we standardized to BIDS. We did this for nearly all sites, except for our own data (we are also a contributing site) and for recent publicly available datasets, such as ABCD and CMI/HBN. For the one completed, with 4000+ subjects, it took 1 person 2 days of work (structural MRI only). For one that is ongoing (fMRI), with some 3000 subjects, it took another person about 3-4 days, with some extra days to sort out the "IntendedFor" fields for the sites that also sent us the fieldmaps. For another that is ongoing, with fewer than 1000 subjects, we expect 1 person to be able to do it in 1 day (next Tuesday, in fact!).

Hence, from this experience, we can say that, while it does require some time, and also good shell skills, it isn't that difficult. Plus, the process of converting was informative and helped to identify and act early on various inconsistencies discovered along the way, such as mismatches between the imaging data and the pairing covariates files, files that were corrupted during the transfer, etc, and we could then reach out to sites for clarifications immediately. We were also able to name BIDS entities such as session, run, rec, and task the way we wanted (ses-1, ses-2,..., or ses-YYYYMMDD).

There is of course interest in preserving/documenting the mapping between the original file structure (however organised it is) and BIDS. We dealt with the problem by storing the mapping in a .csv file with two columns, original name and new name. But in fact, we didn't have to: since there's no editing of the files we received (kept in a separate directory, read-only), the mapping can always be recovered by matching the file hashes or, if the copy in BIDS changes, image subtraction or simply a diff of .json files when these also came.
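The hash-based recovery described above can be sketched as follows. This is a minimal illustration, not the script used in the projects mentioned: the function names, the CSV header names (`original_name`, `new_name`) and the choice of SHA-256 are assumptions, and only files that stayed byte-identical during conversion are matched.

```python
import csv
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def recover_mapping(original_dir, bids_dir):
    """Pair files in original_dir with byte-identical files in bids_dir.

    Returns a sorted list of (original_relative_path, bids_relative_path).
    Files whose content changed during conversion are not matched.
    """
    original_dir, bids_dir = Path(original_dir), Path(bids_dir)
    by_hash = {}
    for path in sorted(original_dir.rglob("*")):
        if path.is_file():
            relative = path.relative_to(original_dir)
            by_hash.setdefault(sha256_of(path), []).append(relative)
    pairs = []
    for path in sorted(bids_dir.rglob("*")):
        if path.is_file():
            for source in by_hash.get(sha256_of(path), []):
                target = path.relative_to(bids_dir)
                pairs.append((source.as_posix(), target.as_posix()))
    return sorted(pairs)


def write_mapping_csv(pairs, csv_path):
    """Write the two-column mapping (original name, new name) to a CSV file."""
    with open(csv_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["original_name", "new_name"])  # hypothetical header names
        writer.writerows(pairs)
```

Calling `recover_mapping("received_data", "bids")` (both directory names are placeholders) and passing the result to `write_mapping_csv` would regenerate the two-column file from the read-only originals. Files that were modified on the BIDS side need a different comparison, such as the image subtraction or .json diff mentioned above.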

From this, my conclusion is that a new standard may not add that much benefit, and in fact, could add some extra work. Nonetheless, once the BIDS-MEGA is ready, our group will definitely embrace it.

Another point is, because the derivatives can contain potentially anything, I wonder if the indexing couldn't be used to subvert future versions of BIDS in the following manner: put non-compliant raw data in the derivatives folder, index such that these files are mapped into valid BIDS raw files, and done, we have non-compliant disguised as compliant, and that can potentially be understood as such, if not yet by FMRIPREP or MRIQC, potentially by these or other tools in the future. Is this bad? I don't know, only something to think about.

Regardless, this is important work. I'll try to read the proposal in more detail in the upcoming week and invite some collaborators to contribute.

All the best,

Anderson

spisakt commented 3 years ago

Thanks. This is supposed to be a first step towards being able to store the raw data properly (raw data for the mega-analysis might be derivative from the point of view of the single studies); not much thought has been given yet to how the analysis results should be structured (except specifying the parent directory). But the current idea provides strong analogies with how BIDS handles single-study-level derivatives, with possibilities to adapt e.g. BEP002 to mega-analyses or even going multiverse...

spisakt commented 3 years ago

Hi Anderson,

Thanks for the valuable feedback and for your very helpful comments in the proposal doc. Please see our replies there, and let me give a somewhat longer reply to the general issues you raised here (later to be incorporated into the proposal draft, as well).

Regarding the example cases you mention: I couldn't agree more that mega-analysis datasets dealing with raw data should be "fully" transformed into BIDS-raw, as BIDS, by design, provides advantages that massively outweigh the (not too heavy) costs of the data transformation. That is what we also do when collecting, e.g., raw T1-images. Please note that - as our proposal draft is backward compatible with the current BIDS version - it trivially covers storing BIDS-format raw data, as available in such - usually centrally orchestrated - projects, with unrestricted data sharing possibilities. The only extension we propose in such cases is an (optional) "top level" folder, encompassing the single studies' BIDS-folders and the overarching metadata. This top-level directory structure is a very straightforward (almost trivial) extension to the current specification and introduces several useful analogies between "participants" and "studies", making the whole meta/mega-analysis dataset instantly digestible to everyone who has some prior experience with BIDS-raw. The extra costs of implementing the proposed structure are really minimal (literally no more than creating 4-5 files/directories) and are largely outweighed by its benefits.

Regarding the other key part of our proposal - i.e. providing a standard "proxy" for handling noncompliant derivative folders: indeed, this feature might be unnecessary for projects where having noncompliant derivative folders can be fully avoided (as it is in the projects you mention).

However - and this might not have been clear enough in the proposal - having noncompliant derivative folders is inevitable in some other - very realistic - cases:

First, while the BIDS community has made amazing progress in extending the scope of BIDS to various kinds of derivatives, I don't think that BIDS will ever cover all possible kinds of derivatives. (And I also don't think it should.)
Second, even if a new type of derivative is to be integrated into BIDS, there will always be some delay in integrating the latest techniques into the specification.
A third issue is faced when working with - possibly incomplete - output of old software versions (as often shared by the collaborators in our projects). It is unlikely that there will ever be much support for converting such data to a fully BIDS-compliant format. Trying to come up with a general solution to "fully" convert such datasets - especially in a large, heterogeneous mega-analysis - raises new questions from dataset to dataset, is hard to atomize and likely results in ad-hoc solutions that might be prone to errors.

Datasets affected by any of these three points do end up with noncompliant derivatives, anyway. (I think that was very well recognized when noncompliant derivatives were made part of the original specification).

The proposed "indexing" (with noncomp.json files) is nothing more than an optional "standard interface" to such inevitable noncompliant folders, providing the chance to host them in a self-contained, human-readable way. Having such an option is really essential in our use-case and might, presumably, be very helpful in many other projects in the future. As noncompliant derivative folders are allowed in BIDS anyway, the proposed solution can very peacefully sit "on top" of the existing BIDS specification (for instance, a BIDS-dir with such indexed noncompliant folders is already recognized as a valid dataset by the current BIDS-validator).
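To make the idea of an index sidecar more concrete, here is a rough sketch of how a tool might write and read one. Everything about the file content is an assumption made for illustration: the `ConditionsOfInterest` key, the glob-pattern values and both function names are hypothetical and are not taken from the BIDS-MEGA draft; only the `noncomp.json` filename comes from the discussion above.

```python
import json
from pathlib import Path


def write_index(derivative_dir, conditions, filename="noncomp.json"):
    """Write a sidecar mapping condition names to lists of glob patterns.

    The "ConditionsOfInterest" key is a made-up placeholder, not a field
    defined in the BIDS-MEGA draft.
    """
    derivative_dir = Path(derivative_dir)
    sidecar = derivative_dir / filename
    sidecar.write_text(json.dumps({"ConditionsOfInterest": conditions}, indent=2))
    return sidecar


def files_for_condition(derivative_dir, condition, filename="noncomp.json"):
    """Resolve the files that the sidecar links to one condition of interest."""
    derivative_dir = Path(derivative_dir)
    index = json.loads((derivative_dir / filename).read_text())
    patterns = index["ConditionsOfInterest"][condition]
    matches = set()
    for pattern in patterns:
        # Patterns are interpreted relative to the indexed derivative folder.
        matches.update(p for p in derivative_dir.glob(pattern) if p.is_file())
    return sorted(p.relative_to(derivative_dir).as_posix() for p in matches)
```

In this sketch the derivative folder itself is left untouched; the sidecar only records which of its files belong to which overarching condition, so an analysis script could call `files_for_condition(folder, "placebo")` (a placeholder condition name) without knowing the study's internal naming scheme.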

You raise a very important question regarding the potential of such indexed noncompliant folders to subvert their "normal" BIDS-compliant siblings. We have also extensively discussed this issue while working on the draft and came to the conclusion that there are multiple reasons to think this won't be a real problem in the long run. First, in contrast to raw data specifications, derivative specifications are primarily to be adopted not by the "end-users", but by the developers of the corresponding tools, who can be expected to be well-informed enough to understand the priority of using "fully compliant" derivatives (we have a very concrete disclaimer in the proposal, to this end). Second, the proposed indexing technique is - by design - way too goal-oriented for generic use, as it links the data to specific "conditions of interest", unique to the analysis at hand. This - intentionally - renders indexed noncompliant derivative folders a very unintuitive and unappealing way of storing general-purpose metadata. I am sure the specification can be further improved to make it obvious that this option aims to establish semantic links between datasets and overarching conditions of interest at a higher level and is simply not appropriate for storing general-purpose metadata.

Actually, I think that providing such an option for BIDS users might eventually even foster the more widespread adoption of BIDS, as without it, many researchers might turn to alternative solutions (I know concrete examples).

I understand that some of these questions go far beyond simply agreeing on a standard solution for handling very heterogeneous datasets and have a rather "strategic" aspect for the whole BIDS initiative. Therefore, for this proposal it is essential to get as much input as possible (i) from the BIDS core team regarding such strategic considerations, (ii) from colleagues with extensive and unique meta-/mega-analysis experience (like you), to find the solutions that are satisfying for the majority of the use cases, and (iii) from the general community, to ensure that the final solution remains clean, concise and appealing. I am positive that consensus will be quickly reached on the above questions, so that we will be able to empower BIDS to tackle custom derivatives and heterogeneous mega-analysis datasets while preserving its simplicity and clarity.

Some of these key thoughts - and especially the relevance to various types of mega-analyses - might not be clearly put in the current proposal draft (or might get lost among the many specification details). We aim to enhance the intro part with all these relevant aspects soon. So please consider the current version of the doc simply as a starting point for further discussion and feel free to come up with any alternatives to, or enhancements of, the proposed extensions. Inviting collaborators is also very welcome, many thanks for that!

Best wishes,

Tamas

on behalf of the draft moderators

spisakt commented 2 years ago

Hi Everyone,

Our proposal has been revised:

modular format
example-driven, concise summaries
detailed specification proposal
two example datasets (third in progress)

Click to read it: https://docs.google.com/document/d/1tFRNumQyIgjXBNC3brFDLO9FaikjL84noxK6Om-Ctik/edit?usp=sharing

Best wishes, Tamas

bids-standard / bids-specification

BEP for participant-level "mega-analysis" of multiple datasets (possibly with non-compliant derivative folders) #880
