Restructuring organization for participant level grouping

satra commented 3 years ago

In BIDS thus far, the notion of source data versus derived data is a little contrived/vague. For example, a multi-echo T1-weighted recon that comes out of the scanner from a MEMPRAGE sequence is considered source data, while the FA image that comes out of the scanner is not.

As scanners and other instruments get more advanced and start generating what we traditionally call derivatives (think GPU based processing on the scanner), this will lead to questions of where data goes.

To simplify consideration, the possibility I would like the BIDS community to consider is to separate data not by source vs derivatives, but by participant vs ~aggregate~ non-individual. As examples:

Participant

source dicoms
freesurfer recon
fmriprep output
meg windows around individual stimuli
average ERP response ...

~Aggregate~ Non-individual

Templates
group statistical maps
(partial) correlations ...

This makes it, in my opinion, simpler to consider with regard to both metadata and with respect to provenance.
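
As a rough illustration of the split proposed above, the sketch below sorts dataset-relative paths into participant vs non-individual, based only on whether a `sub-<label>` directory appears in the path. The function name `classify`, the example paths, and the rule itself are assumptions made for illustration; none of this is part of BIDS.

```python
import re
from pathlib import PurePosixPath

# Assumption: a path holds participant data if any component is exactly sub-<label>.
SUB_DIR = re.compile(r"^sub-[A-Za-z0-9]+$")


def classify(relpath: str) -> str:
    """Return 'participant' or 'non-individual' for a dataset-relative path."""
    parts = PurePosixPath(relpath).parts
    return "participant" if any(SUB_DIR.match(p) for p in parts) else "non-individual"


# Hypothetical example paths, not taken from any real dataset.
examples = [
    "sub-01/anat/sub-01_T1w.nii.gz",
    "sub-01/freesurfer/surf/lh.pial",
    "templates/tpl-study_T1w.nii.gz",
    "group/stat-t_statmap.nii.gz",
]
for path in examples:
    print(f"{classify(path):15s} {path}")
```

Under this rule, raw files, freesurfer recons, and fmriprep outputs for one person all land in the same bucket, which is the point of the reframing.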

Would love to hear thoughts on this potential reframing.

tsalo commented 3 years ago

Since this proposal is for 2.0, would this issue perhaps be a better fit for bids-standard/bids-2-devel? BTW, I know that there are a couple of issues there that also propose massive restructuring (e.g., https://github.com/bids-standard/bids-2-devel/issues/28, https://github.com/bids-standard/bids-2-devel/issues/37).

satra commented 3 years ago

i don't have the authorization to transfer, but i think it would be a good place for this to go.

poldrack commented 3 years ago

to me the distinction between participant and aggregate seems equally contrived. for example, we aggregate across individual images to analyze a timeseries, aggregate across runs or sessions within participant, etc. I agree that raw vs. derived is also a bit contrived, but it seems to fit better with the usual researcher's workflow. perhaps better to think of it as they do in Psych-DS, where there are "source data" i.e. data that came directly from the measurement instrument, and then various levels of derivation from that, some of which are "primary" (e.g. nifti images derived from dicoms) and others are derived (e.g. fmriprep outputs). it seems inevitable that some of these concepts will be contrived, since they are meant to reflect as well as possible the usual scientist's workflow. the bigger challenge is that as BIDS expands, there is a broader set of scientists with a broader range of workflows, so the "usual" scientist becomes a contrived notion as well.

-- Russell A. Poldrack Albert Ray Lang Professor of Psychology Building 420 Stanford University Stanford, CA 94305

poldrack@stanford.edu http://www.poldracklab.org/

satra commented 3 years ago

@poldrack - this phrase is exactly the reason i wrote this.

"source data" i.e. data that came directly from the measurement instrument

we don't treat these things consistently (see the MEMPRAGE and FA example above). and with new tools developing that do significant processing in the scanner itself (e.g., label regions and compute volumes), we would have to as part of the source processing make determinations as to where things would go.

for example, we aggregate across individual images to analyze a timeseries, aggregate across runs or sessions within participant, etc.

but these are still individual-specific. perhaps aggregate is not the right term i meant to use. individual vs non-individual is what i wanted to convey.

tsalo commented 3 years ago

I believe that BEP001 does propose symlinking scanner-computed "derivatives" (like FA maps) from the "raw" dataset to the derivatives folder. This isn't a complete solution, but it does explicitly support derivatives coming directly from the scanner.

tyarkoni commented 3 years ago

I agree with @poldrack that any organization we try to impose is going to be intuitive for some applications and problematic for others. I don't feel I have a good sense of which of these two schemes would be preferable, and I'd suggest that we stack these kinds of proposals and then at some point do a UX survey/study asking people what they (think they) prefer.

That said, as a practical matter, I think we should try to maintain backwards compatibility with BIDS 1.0 wherever possible, unless we have a really good reason not to. So, e.g., if 80% of users say that @satra's proposal would make their life considerably easier, then sure, let's break the BIDS 1.0 structure. But if, say, 55% prefer @satra's proposal and 45% prefer the existing scheme, I'd argue that that doesn't really justify having to introduce major changes to the entire tooling ecosystem, break people's habits, etc.

satra commented 3 years ago

@tsalo - i think using symlinks is not a good option moving forward as storage providers move more towards object stores (so won't work on s3 for example).

@tyarkoni - in general i have always seen bids as a view, and a darn useful organized view, on a more complex underlying information flow model. so yes, there is no perfect view, just a pragmatic one that addresses a large set of use cases. i really like the idea of doing some A/B testing, but in general before we even implement something like this, i would like a discussion of considerations as to how many folks would find the view useful.

so here are some use cases where the participant-centered view can be useful.

aggregation of individuals across datasets
sharing/removing individual participants
decisions about where to find information about an individual
longitudinal applications within an individual
privacy protection (everything in a subject folder is subject to privacy considerations)
provenance (most things are derived from other things within a participant object)

ps. i haven't yet commented on the hierarchy principle issue, but will do so sometime soon. it's a complex issue and relates to this proposal as well.

yarikoptic commented 3 years ago

Sorry -- my reply came out long, but I think the issue touches on many largely orthogonal issues and should be broken into separate ones. So I added some sectioning.

raw-vs-derived -- everything is derived!

In BIDS thus far the notion of source data and derived data is a little contrived/vague.

I can only repeat an idiom I think BIDS should just accept and promote: any BIDS data(set) is derived data(set). Accepting it would IMHO resolve the aforementioned contradiction. It is exemplified by many already existing provisions in BIDS mentioned above and by the simple fact that BIDS provisions for sourcedata/ -- in my view anything which has "source (data)" it came from is "derived (data)". I think such an idiom is not in conflict with the notion that a BIDS 1.x dataset contains "raw" data (as close to the origin of the data as possible, merely harmonized to conform to BIDS). Taking it further, a common derivatives dataset is just an enhancement on top of BIDS 1.x "raw" - it is a possible overlay on top of it (i.e. it can be the original "raw" + processed files where necessary). I think a possible way forward is to provision in dataset_description.json a field listing the "tiers" (or "features" or ... ?) of the dataset: "raw", "common-derivatives" (or simply "processed"), and to provide guidance on when to augment "raw" BIDS with derived data (anonymization, close-to-raw preprocessing etc), and when to produce "separate" derivatives (big pipelines output).
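
A minimal sketch of what such a "tiers" annotation could look like, assuming a field named `Tiers` in dataset_description.json. That field name and the `KNOWN_TIERS` set are hypothetical and do not exist in the BIDS specification; only `Name` and `BIDSVersion` are real fields.

```python
import json

# Hypothetical: "Tiers" is NOT a field in the BIDS specification; it stands in
# for the "tiers"/"features" idea described above.
description = {
    "Name": "Example dataset",
    "BIDSVersion": "1.4.0",
    "Tiers": ["raw", "common-derivatives"],
}

# Assumed vocabulary, taken from the two tier names suggested in the comment.
KNOWN_TIERS = {"raw", "common-derivatives"}


def unknown_tiers(desc: dict) -> list:
    """Return declared tiers that are not in the assumed vocabulary."""
    return [tier for tier in desc.get("Tiers", []) if tier not in KNOWN_TIERS]


print(json.dumps(description, indent=2))
print("unknown tiers:", unknown_tiers(description))
```

A tool could use a check like `unknown_tiers` to decide whether it understands every layer present in a dataset before processing it.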

participant-vs-aggregate -- orthogonal issue, can be BIDS 1.x compatible

... to consider is to separate data not by source vs derivatives, but by participant vs aggregate non-individual.

I think it is largely an orthogonal aspect to raw-vs-derived (again -- everything in BIDS is derived IMHO ;)). Even though hardware ATM does not produce "aggregates", I do not see why it hypothetically couldn't, and my wild prediction would be that at some point it might produce population templates per study etc. So I would add it as an additional "feature", either explicitly (again annotated in dataset_description.json with e.g. "levels": ["subject", "subject/session", "session", "study"] - ATM there are just the implicit "subject" and "subject/session" levels) or implicitly (just by the fact of being present under a standardized location, e.g. agg-<label>/ folders accompanied by aggregates.json describing the aggregates, or ses- for aggregates for sessions across subjects; sub-*/ without a ses-/ subfolder for aggregates across sessions within a subject, with all the "derivatives" annotation). BIDS would then standardize layout/naming in those folders to follow the overall BIDS naming approach (which would largely be "drop the sub- and/or ses- prefix depending on the group level" + introduce missing entities to standardize composition annotation). And most likely it could be worked out in a "backward compatible" way with BIDS 1.x, and thus even be introduced prior to BIDS-2.
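
The implicit variant could be sketched roughly as below: the grouping level is guessed from the leading folders of a dataset-relative path. The function `infer_level` is hypothetical, and mapping `agg-<label>/` to the "study" level is an assumption; the comment above does not pin that down.

```python
from pathlib import PurePosixPath


def infer_level(relpath: str) -> str:
    """Guess the grouping level of a dataset-relative path from its leading folders.

    Assumes the leading components are directories, not files.
    """
    parts = PurePosixPath(relpath).parts
    first = parts[0] if parts else ""
    second = parts[1] if len(parts) > 1 else ""
    if first.startswith("sub-"):
        # sub-*/ses-*/ is per session; sub-*/ alone aggregates across sessions.
        return "subject/session" if second.startswith("ses-") else "subject"
    if first.startswith("ses-"):
        return "session"  # aggregate for one session across subjects
    if first.startswith("agg-"):
        return "study"  # assumption: agg-<label>/ holds study-level aggregates
    return "unknown"


# Hypothetical example paths.
for path in [
    "sub-01/ses-01/anat/sub-01_ses-01_T1w.nii.gz",
    "sub-01/anat/sub-01_T1w.nii.gz",
    "ses-01/anat/ses-01_T1w.nii.gz",
    "agg-template/anat/T1w.nii.gz",
]:
    print(f"{infer_level(path):16s} {path}")
```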

Composition -- yet for BIDS to standardize a bit more

Another aspect which I think is discussed above without giving it an explicit name is "composition": we have not reached an ultimate agreement and thus have not provided definite guidance on how BIDS datasets are composed together. Yes -- it was improved significantly with common derivatives adding a 2nd "alternative" composition in common-principles. But IMHO my_dataset there should be promoted to the explicit scope of a "study" (or at least described as typically corresponding to a "study level"), which makes sense since there could be multiple ways to combine/process etc "raw collected data" for any study. IMHO largely due to this absent "study" level standardization (just an "alternative" now), "raw" BIDS originally provisioned having derivatives/ only as a subfolder within raw BIDS. BIDS itself has sourcedata/ and it "scales" to the derivatives as well: any derivative dataset can have sourcedata/ pointing to one (or many -- see below on SourceDatasets) source (possibly BIDS) datasets, thus allowing them to be "also" instantiated (installed/uninstalled in DataLad land) under the corresponding sourcedata/ within rawdata/ and derivatives/.

Note that a "study" even emerged naturally while preparing the fmriprep Nature Protocols paper, where there was $STUDY/{ds000003/,derivatives/}, thus not sticking all derivatives within the BIDS dataset. One approach to general (non-BIDS specific really) composition is the YODA principles (see e.g. the reused YODA figure in ReproNim/containers).

Linking+Provenance -- platform specific features should be avoided in BIDS but "acknowledged"

The decision on how to "compose" would affect "provenance" and thus the possibility/fragility of any type of "linking" across datasets/modules. E.g. under YODA principles, all necessary components for dataset generation should be reachable "under" that dataset boundary/directory. So you could make a cut at the $STUDY level and have everything needed to produce that study. You could take a derivative/* dataset and have everything needed to produce that derivative (by it having source BIDS datasets "referenced" and "instantiatable" under e.g. sourcedata/). IMHO BIDS is almost there (see above on composition and SourceDatasets) but it should embrace and promote such an idiom more (instead of having it merely as an alternative).
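
One way to picture the "everything reachable under the dataset boundary" idea is a check that every source a derivative declares has a matching folder under its sourcedata/. In the sketch below, `SourceDatasetFolders` is a hypothetical folder-keyed field invented for illustration (the existing `SourceDatasets` field is a list), and the example.org URLs and folder names are placeholders.

```python
import json
import tempfile
from pathlib import Path


def missing_sources(dataset_root: Path) -> list:
    """List sourcedata/ subfolders named in the description but absent on disk."""
    desc = json.loads((dataset_root / "dataset_description.json").read_text())
    # Hypothetical key: a folder -> source mapping, not part of the BIDS spec.
    mapping = desc.get("SourceDatasetFolders", {})
    return [name for name in mapping if not (dataset_root / "sourcedata" / name).is_dir()]


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "derivatives" / "fmriprep"
    (root / "sourcedata" / "raw").mkdir(parents=True)  # only "raw" is instantiated
    desc = {
        "Name": "Example derivative",
        "BIDSVersion": "1.4.0",
        "DatasetType": "derivative",
        "GeneratedBy": [{"Name": "fMRIPrep"}],
        "SourceDatasetFolders": {
            "raw": {"URL": "https://example.org/raw-dataset"},
            "atlas": {"URL": "https://example.org/atlas-dataset"},
        },
    }
    (root / "dataset_description.json").write_text(json.dumps(desc, indent=2))
    result = missing_sources(root)
    print("not instantiated:", result)
```

Here only the `atlas` source is reported, since `sourcedata/raw` exists and `sourcedata/atlas` does not.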

The aforementioned composition talks about the "dataset(s)" level. Discussions on "symlinking" (e.g. the relevant non-completed discussion in BEP001) probably could be addressed by

allowing derived data to be placed alongside "raw" (see above on "overlay"):
- whether a file is a regular file or a symlink (where the file system allows) on a particular dataset instance should not matter to BIDS!
- tools reading BIDS datasets must not dereference and follow symlinks (that is something to annotate for in BIDS, ref: never finished PR)
- if the distribution/archival platform and the receiving file system allow for symlinks -- they could be used/preserved, if not -- de-referenced (either by the distribution or by the receiving tool, needs investigation).
- the main point is that symlinks, if chosen to be used for structuring "internal" to a dataset deployment, should not cross the boundaries of the dataset ("module" in YODA terms) in its distribution (tarball, git repo, etc).
provenance annotation on how any particular file was produced (generic provenance, applicable also to >90% of "raw" files)
referencing "subdatasets". common-derivatives already introduced SourceDatasets but it should be polished a bit more: IMHO it should not be just a list but an association for a folder (or subfolder(s) under sourcedata/).
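
The boundary rule for symlinks in the list above can be checked mechanically. The sketch below is one possible approach and not an existing BIDS tool: the function name `escaping_symlinks` and the demo file names are made up, and it assumes a file system that supports symlinks.

```python
import os
import tempfile
from pathlib import Path


def escaping_symlinks(dataset_root: Path) -> list:
    """Return symlinks under dataset_root whose targets resolve outside it."""
    root = dataset_root.resolve()
    escaping = []
    # os.walk does not descend into symlinked directories by default.
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = Path(dirpath) / name
            if path.is_symlink():
                target = path.resolve()
                if target != root and root not in target.parents:
                    escaping.append(path.relative_to(root).as_posix())
    return sorted(escaping)


with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp).resolve()
    outside = base / "elsewhere" / "fa.nii.gz"
    outside.parent.mkdir()
    outside.write_text("placeholder")
    ds = base / "dataset"
    (ds / "sub-01" / "dwi").mkdir(parents=True)
    (ds / "derivatives").mkdir()
    inside = ds / "sub-01" / "dwi" / "sub-01_FA.nii.gz"
    inside.write_text("placeholder")
    # Internal link: stays within the dataset boundary.
    (ds / "derivatives" / "internal_FA.nii.gz").symlink_to(inside)
    # External link: crosses the dataset boundary.
    (ds / "derivatives" / "external_FA.nii.gz").symlink_to(outside)
    found = escaping_symlinks(ds)
    print(found)
```

For distribution to a target without symlink support, `shutil.copytree` with its default `symlinks=False` copies the link targets as regular files, which is one way to do the de-referencing mentioned above.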

Many of the aforementioned aspects do not even need to wait for 2.0 IMHO, i.e. they could be introduced in a backward compatible way.

robertoostenveld commented 3 years ago

+1 for "everything is derived"!

The others are too subtle to simply give a positive or negative vote.

poldrack commented 3 years ago

on the one hand I agree that technically everything is derived. on the other hand, most researchers in the field are comfortable with the idea of "raw" data - i.e. the data as they are delivered by the measurement device. We need to balance technical accuracy in the terminology with usability - I think this issue definitely requires more discussion with users in addition to developers.

yarikoptic commented 2 months ago

@satra Given that we formalized the operational definition of "derivative" to be a "BIDS dataset derived from other BIDS dataset(s)", could/should we consider this issue overall addressed? Note that we also have specific issues which IMHO relate, such as

54
59

and others slated for BIDS 2.0 in https://github.com/orgs/bids-standard/projects/10 .

If not resolved/sufficiently covered by other issues -- what specific changes would you propose?

satra commented 1 month ago

@yarikoptic - i think the intent of this issue was primarily asking if some aspects of organization are participant/session/cohort/group specific. some of it would indeed benefit from simply having the provenance, but others would need some notion of separating grouping of derivatives, e.g. something like a group average connectome would be different from individual connectomes. i think you note all of these in your response above, but i'm not sure they are mapped to specific other issues.
