yarikoptic opened this issue 6 months ago
@yarikoptic Thanks for this thoughtful issue and for the @bids-standard/bep036 team. I can do a check in a month or so of how inheritance is currently being used in OpenNeuro, thanks to DataLad's `datalad clone ///openneuro`-ability, of course!
I mostly like the idea above, except maybe I'm confused about one thing. Let's say up top there's a `task-rest_bold.json` with most of the parameters put out by dcm2niix. Then down below, in 20 subjects' func directories, there's a disagreement in FlipAngle or EchoTime (or both) between those 20 and a separate 5 subjects' task-rest bold JSONs. I believe you're saying that EVERY subject's JSON in this scenario has to have FlipAngle and EchoTime in it, even though only 5 subjects differ. dcm2niix doesn't care that you shouldn't duplicate fields at lower levels. So you end up needing to filter most of your JSONs in the subjects' func folders for common metadata... This might be difficult, especially for many new users.
I think whether we move forward with the inheritance principle OR the summarization principle, the call is for tools to support either of them. If one small set of tools could be created to support either, whichever one makes it to the gate first could be most easily adopted. This is why creating a set of inheritance software tools has been on my BIDS maintainers' desirables list for a long time. Thoughts?
> Let's say up top there's a `task-rest_bold.json` with most of the parameters put out by dcm2niix.
to be precise: it is not dcm2niix
which places/creates such a file at the top level. It is a BIDS dataset "owner" who decides to take all or some fields from dcm2niix-produced sidecar .json for a specific .nii.gz and place that selection at the top of the dataset. So it is for some script/user to decide which fields to do copy to that file, dcm2niix
is not really a "player" here.
> I believe you're saying that EVERY subject's JSON in this scenario has to have FlipAngle and EchoTime in it though only 5 subjects differ.
Correct. Even if a single subject differs, such metadata should not be present at a level where it is not common to all levels below. Possible solution: `task-rest_bold.json` at the top level with all common metadata, `sub-X/task-rest_bold.json` with subject-specific metadata, and `sub-X/ses-Y/task-rest_bold.json` with subject/session-specific metadata (but a field must not be present with different values across levels).

> So you end up with a need to filter most of your JSONs in the subjects func folders for common metadata... This might be difficult for many new users especially.

No -- new users just should not bother creating a top-level `task-rest_bold.json` with anything which is not common to all files underneath, and they would stay "BIDS compliant". Then they could make use of some tools (e.g. here is a function in heudiconv -- `populate_aggregated_jsons`) to collect/rewrite top-level .json files with only common metadata, or just write what they know is common (the validator would verify that no conflicting/differing values are present).
Re support -- correct, tool support would be needed... BUT the "summarization" principle is just a more restricted case of inheritance, if I see it right, so in principle any tool supporting current inheritance should work with "summarization" without any change.
The `.tsv`s we have are pretty much a case of summarization (as placing in a tabular structure within a single file) for entries where metadata could be different (e.g. age for a subject)... i.e. in `participants.tsv` we summarize commonalities and differences between participants. Overall we get `{entities}.tsv` summarizing a (flat list of) metadata fields typically (but not necessarily) different between values for that entity. In `{entity}-{value}_{suffix}.json` files we provide what is common for that `{value}` (paired with datatype `{suffix}`), and typically when we do not have such `{entity}-{value}/` folder-level separation (related #54), since then we would place common data and metadata under that folder.
Overall, a "gather metadata for `{entity}` of `{value}`" algorithm should load metadata from `{entities}.tsv` and all applicable `{entity}-{value}*.json` files. Any inconsistency in values makes the "order of loading" important and thus possibly ambiguous. It also makes it mandatory to read all the files to get the ultimate value, as opposed to the case proposed here, where the first loaded value is "good enough" since they all must be the same: the age of a participant from `participants.tsv` should be consistent with any other age loaded from e.g. somewhere in `phenotype/` (shhh about multiple sessions etc...).
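To make the difference concrete, here is a minimal sketch (the helper name `gather_metadata` is hypothetical, and locating the applicable sidecars is assumed to have happened already) of the "gather" step under the summarization rule: loading order no longer matters, because a field repeated across levels must carry an identical value, and any conflict is simply a validation error rather than an override.

```python
import json
from pathlib import Path


def gather_metadata(applicable_jsons):
    """Merge metadata from all applicable sidecars.

    Under summarization there is no overloading: a key seen twice with
    different values is an error, not an override, so the first value
    encountered is always "good enough".
    """
    merged = {}
    for path in applicable_jsons:
        for key, value in json.loads(Path(path).read_text()).items():
            if key in merged and merged[key] != value:
                raise ValueError(
                    f"{key!r} differs across levels: "
                    f"{merged[key]!r} vs {value!r} (in {path})"
                )
            merged.setdefault(key, value)
    return merged
```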
> completely disallow overloading the value at lower (deeper in hierarchy) levels.
I'm a fan. This would:
> do you think such "simplification" (removal of "value overload") of inheritance would simplify and remain usable?
I would perhaps pose a different question. There's a bifurcation in opinions on the inheritance principle. I've personally been pushing for making it more powerful, which required improvement to the definition of current behaviour in order to facilitate the subsequent augmentation. Others would prefer that the whole principle disappear entirely, and all metadata relevant to a data file be present in the sidecar file.
The way I would therefore look at this proposal is: if the capacity for value overloading (specifically a present value at a higher level being overridden at a lower level) were to be removed, would this sway those previously opposed to the inheritance principle toward its preservation? So that's actually a question directed not at me but at others.
@Lestropie, a birdie said that you might be participating in the BIDS hackathon (if only virtually)? Would you be interested in working on this one? It can already be done as a PR against
similar to WiP I just started
I'm not so much aiming to participate in the hackathon and looking for a project as wanting to take on automating the use of inheritance, and I see the hackathon as a potential way to motivate project commencement and get other people on board. I want to write it up as a proposal somewhere, but wasn't quite sure where would be best: it's not yet guaranteed that I'll be able to do the hackathon, and what I have in mind is also not specific to BIDS 2.0. Maybe I should create an empty repository and start listing issues there.
See https://github.com/Lestropie/IP-me/issues for my current intentions on the topic.
I know that it looks like the general consensus is that we should keep or improve the inheritance/summarization principle. However, I have yet to encounter a single dataset in which I found any use for this principle, but I have encountered several datasets in which this principle caused headaches, hard maintenance work, and ugly/hacky codebases. If it were up to me, I would throw out the whole principle and always store the complete metadata with the data. It costs nearly nothing in terms of disk space and I think it would make everybody's life easier. TL;DR: choose the KISS principle, not the inheritance principle.
The way I see it is that the inheritance principle comes down to implementing a poor man's solution for a relational database on the filesystem level
@marcelzwiers the entire BIDS is "RDB on the filesystem level", so not surprising that pybids caches parsed structure in a local sql DB ;-)
and as a "paradigm" it is pretty much what `participants.tsv`, `sessions.tsv` etc. are about -- summarization of metadata for underlying data in the hierarchy.
But besides "paradigm" applicability, I am not sure I have seen (though I never looked) application of it to `.tsv` files. @effigies @Lestropie, are you aware of some good example uses of inheritance for .tsv files? Moreover, the inheritance principle is somewhat specific for .tsv and .bval/.bvec files in that there is no "inheritance" -- the lowest level in the hierarchy is taken (.json accumulates from higher levels), and that plays better with "summarization".
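The asymmetry described above can be stated in a couple of lines. A hedged sketch (the `resolve` helper is hypothetical; file discovery and entity-based applicability are elided) of how the two file classes are treated:

```python
def resolve(candidates, extension):
    """Pick which applicable files contribute metadata.

    ``candidates`` are the applicable files ordered from the dataset
    root down to the data file's own directory.
    """
    if extension == ".json":
        # JSON sidecars accumulate: every level contributes fields.
        return list(candidates)
    # .tsv / .bvec / .bval: no accumulation, the deepest file wins.
    return candidates[-1:]
```

Under the summarization rule the distinction matters less: for .json the result is the same whichever file you read first, and for .tsv there is only ever one file to take.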
Inheritance is baked into `channels.tsv` and `electrodes.tsv`, as these are generally expected to be constant within sessions, so they have fewer entities than the data files they apply to. We are having to add entities because some dataset curators want to duplicate them for every data file, which was not previously excluded by the validator. While this allows curators to decrease their reliance on the inheritance principle, for tools it increases it, as they now must look for the same file in more potential locations.
For TSV files, I think the equivalent of the summarization principle would be that there must be exactly one applicable TSV file of a given type. So you could have a `channels.tsv` for each data file, but that would be mutually exclusive with one for the entire session. Likewise, you could have one `task-nback_events.tsv` at the root level, but then it must not be overridden by a specific run.
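That "exactly one applicable TSV" rule would be straightforward for a validator to enforce. A minimal sketch (the function name is hypothetical; collecting the candidate files for a given data file is assumed to have happened already):

```python
def check_single_applicable_tsv(candidates, data_file):
    """Enforce the proposed rule: at most one applicable TSV of a given
    type per data file (e.g. one channels.tsv, either per data file or
    per session, never both)."""
    if len(candidates) > 1:
        raise ValueError(
            f"{data_file}: {len(candidates)} applicable TSV files found "
            f"({', '.join(candidates)}); expected at most one"
        )
    return candidates[0] if candidates else None
```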
> the entire BIDS is "RDB on the filesystem level"

I don't agree with that; as I see it, BIDS is a study format, nothing relational about it. True, for some data there are two locations for storing things, either in the subject folder or e.g. in the `participants.tsv` file. But then you always choose one or the other; there is never a relation between them (like there is a relation between JSONs adhering to the inheritance principle). And the fact that 20% of the OpenNeuro datasets use it is a description of the situation, not an argument for its benefits :-) (it would have been trivially easy for these studies to store it all at the sub/ses level)
> Moreover inheritance principle is somewhat specific for .tsv and .bval/.bvec files in that there is no "inheritance" -- lowest level in hierarchy is taken (.json - accumulates from higher levels), and that plays better with "summarization".
I agree with that, and I do support your "summarization" proposal as an improvement over the inheritance principle... It meets the goal of metadata deduplication, while reducing ambiguities and overly complex schemes, e.g. when pooling data
> it would have been trivially easy for these studies to store it all on the sub/ses level

Trivial to some but not to all: remember that some of the people who create the datasets barely know what a JSON file is or how to interact with it in Python or MATLAB. So creating (and especially updating) a single file at the root of the dataset will be a lot easier for them than having to manually edit many, many JSON files.
> Trivial to some but not to all: remember that some of the people who create the datasets barely know what a json file is or how to interact with it with python or matlab.

I actually fear for them dealing properly with the inheritance principle. I think it would be better to have such people use tools for editing/maintaining BIDS datasets, such as CuBIDS?
> And the fact that 20% of the openneuro datasets use it is a description of the situation, not an argument for its benefits :-)

It was a response to your:

> However, I have yet to encounter a single dataset in which I have found any use for this principle

Not an argument, although it could easily become one if expanded, e.g. "I and others, as shown by the above example, find it extremely useful". But this issue is not about that topic. If you would like to discuss the inheritance principle's cons, please chime in instead on
> Likewise, you could have one `task-nback_events.tsv` at the root level, but then that must not be overridden by a specific run.

In principle, I think this should be OK in the "summarization" formulation, as "overridden" would be replaced with "duplicated". In practice it would be tricky/impossible, since for `_events.tsv` there is really nothing which could constitute the "identity" of an event; unless an event row is duplicated exactly, it would just be another added event (possibly with the same onset/duration but different metadata), so it is impossible to identify it and warn the user that there might be an inconsistency etc.
I completely agree with the above proposal as it eliminates value overloading (having a value at one level override a value at a different level).
Some notes from a BIDS curation perspective in a clinical environment. Non-technical staff does really well working with human readable files with simple rules and avoiding the use of additional software packages: either a file exists at the top, or at the individual level. This works most of the time. When we have to change a field at the individual level in hundreds of files, they often reach out to someone who can code.
One example use-case is the channels.tsv file for EEG/iEEG. This file exists for every data file and bad channels can be annotated there as they can differ across sessions and runs. The columns are the same across all subjects but include some optional user specified columns. If a channels.json file can exist only at the top level to specify these columns (that are the same across all subjects) that is convenient. The proposal described here would strongly facilitate this use case, which is extremely common for us, if I am correct.
I think we may be getting off-topic (feel free to hide this comment as off-topic if you agree), but I'm confused by the following:
> In practice it would be tricky/impossible since for `_events.tsv` there is really nothing which could constitute the "identity" of an event, so unless an event row duplicated exactly, it would be just another added event (possibly for the same onset/duration but different metadata), so impossible to identify and to warn user that there might be inconsistency etc.
TSV files are not merged, they are located. Unless you are proposing this change, nobody would try to merge events.tsv files found at multiple levels. Given that you say it would be tricky/impossible, I don't think you're proposing it...
Now, TSV files can be joined, but those are specific ones. For example, `participants.tsv`, `sub-*_sessions.tsv` and `sub-*[_ses-*]_scans.tsv` can be joined on the `participant_id` and `session_id` columns in order to provide metadata for each scan file, but this isn't a merging of two files with the same suffix.
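That join can be sketched with the standard library alone (a hypothetical `left_join` over rows already read, e.g. with `csv.DictReader`; pandas' `merge` would do the same in one call):

```python
def left_join(rows, lookup_rows, keys):
    """Join two TSV tables (lists of dict rows) on shared key columns,
    e.g. sessions rows enriched with participants.tsv metadata."""
    index = {tuple(r[k] for k in keys): r for r in lookup_rows}
    joined = []
    for row in rows:
        match = index.get(tuple(row[k] for k in keys), {})
        # The row's own values come last, so its columns are preserved;
        # overlapping columns are the join keys anyway.
        joined.append({**match, **row})
    return joined
```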
@dorahermes re `channels.tsv` -- could you elaborate more, maybe point to an example dataset?

> The columns are the same across all subjects but include some optional user specified columns.

This sounds like requiring common columns provided in a top-level `channels.tsv` and then per-subject/session `sub-*_ses-*_channels.tsv` files providing additional columns... If I get it right, it would go "against" the current inheritance rule and our discussion above with @effigies on that:
> TSV files are not merged, they are located. Unless you are proposing this change, nobody would try to merge events.tsv files found at multiple levels. Given that you say it would be tricky/impossible, I don't think you're proposing it...
I am "considering" or "approaching" it ;-) And as @dorahermes points out above (if I got her right) we might want to not just "append" but "extend" (more like we do for json if we consider json to be a simple single row, and tsv is a list of such rows). Overall, I think it could be very beneficial if we could generalize principle so it doesn't differ in handling .tsv and .json files.
Note that while we have `participant_id` and `session_id`, we only have `name` and not `channel_id` within `channels.tsv`.
Re `_channels.tsv`: a note that we do not force uniqueness on the "name" of a channel. Also there are no entities for those suffixes, such as `channel` and `event`, thus no `_id`s, the name of which is `{entitylongname}_id` and the value `{entityshortname}-{value}`, and which are already defined for (NB "edited" for the difference in name/value):
```
❯ grep '_id:$' objects/columns.yaml
  desc_id:
  participant_id:
  sample_id:
  session_id:
```
but I guess it could be generalized for any entity (context: #54). So inheritance/summarization could be easily extended to support loading from multiple .tsv files, "appending" (rows) and/or "extending" (columns), for files with `_id` columns ensuring alignment etc.
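A sketch of the "extending (columns)" half of that idea, aligning rows on a shared `*_id` column and treating a conflicting overlap as an error rather than an override (the helper name is hypothetical):

```python
def extend_columns(base_rows, extra_rows, id_col):
    """Widen a summary table with columns from a deeper-level table.

    Rows are aligned on ``id_col``; a column present in both tables
    must agree for a given id, mirroring the no-overload rule for JSON.
    """
    by_id = {r[id_col]: r for r in extra_rows}
    out = []
    for row in base_rows:
        extra = by_id.get(row[id_col], {})
        for col, val in extra.items():
            if col in row and row[col] != val:
                raise ValueError(
                    f"conflicting value for {col!r} at {row[id_col]!r}"
                )
        out.append({**extra, **row})
    return out
```

"Appending (rows)" would be the dual operation: concatenating rows while requiring that any row repeated across levels is repeated identically.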
Another issue I haven't seen much discussion on (but correct me if I'm wrong, as I also missed the previous discussion on the inheritance cons, thank you @yarikoptic), is what I would call the file collection/grouping problem. So how to deal with e.g.:
```
[summarize.json]
sub-01
`-anat
  |-sub-01_run-1_acq-foo_MP2RAGE.nii.gz
  |-sub-01_run-1_acq-foo_UNIT1.nii.gz
  |-sub-01_run-2_acq-foo_MP2RAGE.nii.gz
  |-sub-01_run-2_acq-foo_UNIT1.nii.gz
  |-sub-01_acq-bar_MP2RAGE.nii.gz
  `-sub-01_acq-bar_UNIT1.nii.gz
```
How many summarize JSON sidecars would you make? Obviously you would not make one for each run, but would you make one for each `acq` value? Would it be useful to have something like an `IntendedFor` field in the summarize JSON (with support for wildcards, so you don't have to include an explicit list, just the semantics, e.g. `{"IntendedFor": "bids::sub-*/anat/sub-*_run-*_acq-foo_*.nii.gz"}`)?
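Resolving such a wildcard `IntendedFor` would amount to shell-style pattern matching against dataset-relative paths. A sketch, assuming (as the proposal above does, not current BIDS) that the `bids::` prefix denotes a path relative to the dataset root:

```python
from fnmatch import fnmatchcase


def intended_for_matches(pattern, relative_path):
    """Check whether a dataset-relative file path is covered by a
    wildcard IntendedFor pattern (case-sensitive, shell-style '*')."""
    if pattern.startswith("bids::"):
        pattern = pattern[len("bids::"):]
    return fnmatchcase(relative_path, pattern)
```

A tool would evaluate this against every data file in the dataset to materialize the group the sidecar applies to.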
> ... I am not sure I saw (but I never looked) application of it for .tsv files.... are you aware of some good example uses of inheritance for .tsv files?
I don't deal with a wide breadth of different BIDS data from which to generate examples, but one that always irks me is complex DWI data. It is increasingly recommended to export magnitude & phase data for DWI as it facilitates superior denoising. In the absence of inheritance, this means that the diffusion gradient table (which is currently exclusively `.bvec`/`.bval` rather than `.tsv` as originally requested, but could be `.tsv` following https://github.com/bids-standard/bids-specification/pull/352) would need to be exactly duplicated across the magnitude and phase component images. Defining these data once, omitting the `_part-(mag|phase)` entity, to me makes far more sense. But as with other discussion here, this is purely use of the IP (here: Inheritance Principle) to avoid duplication, not to supersede.
> I know that it looks like that the general consensus is that we should keep or improve the inheritance/summarization principle.
The discussion on this Issue might skew differently to community opinion. I've been told on multiple occasions that there are many who would prefer for it to be removed entirely. I don't have my finger on the pulse on exactly what those proportions might look like.
One concern I have is that a naive community poll may skew toward removal because of a) an expectation of manual curation of such and b) consideration of raw BIDS data only, whereas community opinions following a) creation of a tool for automated application and b) consideration of the complexities of derivative data may yield a different result. So I'd like to at least create a compelling case.
> So how to deal with e.g.:
All depends on the metadata contents, and more esoterically on whether JSON files without suffixes are permitted. At the extreme end, I could imagine:

- `sub-01.json` containing all fields applicable to all images
- `sub-01/anat/sub-01_MP2RAGE.json` containing any fields consistent across all `_MP2RAGE` images
- `sub-01/anat/sub-01_UNIT1.json` containing any fields consistent across all `_UNIT1` images (eg. units?)
- `sub-01/anat/sub-01_acq-foo.json` containing any fields applicable only to `acq-foo` (ie. how it differs from `acq-bar`)
- `sub-01/anat/sub-01_acq-bar.json` containing any fields applicable only to `acq-bar` (ie. how they differ from `acq-foo`)
- `sub-01/anat/sub-01_run-1_acq-foo.json` containing any fields applicable only to run 1 of `acq-foo` (ie. what differs from run 2, maybe acquisition time?)
- `sub-01/anat/sub-01_run-2_acq-foo.json` containing any fields applicable only to run 2 of `acq-foo` (ie. what differs from run 1)

This actually ends up with more metadata files than there are data files. But unlike exclusively using sidecars, it is immediately discoverable exactly what it is that differs between eg. entity-linked file collections `acq-foo` and `acq-bar`, by the contents of the respectively named metadata files. These may be more obscure use cases in the context of BIDS Raw, but in my experience with trying to develop complex BIDS Derivatives I think that cases like these are going to be increasingly prevalent in time.

Regardless of my own opinion, I don't see the debate progressing in an informed way in the absence of tangible examples of what data look like with vs. without the IP, or in the absence of software to use or not use the IP (complex examples like the above I would never expect a human to manually curate). Hence why I invested some time and effort in generating an Issue list for such: https://github.com/Lestropie/IP-freely/issues.
Regardless of my own opinion, I don't see the debate progressing in an informed way in the absence of tangible examples of what data look like with vs. without the IP, or in the absence of software to use or not use the IP (complex examples like that above I would never expect a human to manually curate). Hence why I invested some time and effort in generating an Issue list for such: https://github.com/Lestropie/IP-freely/issues.
> But unlike exclusively using sidecars, it is immediately discoverable exactly what it is that differs between eg. entity-linked file collections acq-foo and acq-bar, by the contents of the respectively named metadata files.
Yes, that's nice, but I think that this level of complexity, just to deduplicate to the bitter end, can be hard to grasp and would harm the acceptance / proper use of BIDS among average neuroscientists. The inheritance principle makes things much less human-readable and simple. For instance, I cannot just inspect a sidecar file anymore; I need tooling to search for data in the file-tree hierarchy to get a complete view. So before deciding on a solution, I think we should clearly define which users the inheritance principle tries to target. Is it the neuroscientist who manually edits/curates their BIDS data? Is it the programmer who makes BIDS-derivatives processing pipelines? And we need to consider whether the benefits for one group of users really outweigh the downsides for the other users. I believe the summarize proposal of @yarikoptic is aimed as a middle ground?
Fully appreciate the argument for IP abolition. There's a good reason there's no consensus on the topic.
The question of "not just inspecting a sidecar ... needing tooling to search for (meta)data in the file tree hierarchy" has a natural converse, being something like "metadata are not unique ... need tooling to determine what data in the file tree hierarchy take the same values". There's complex relationships between metadata across data files regardless of how you cut it, it's a question of what types of operations you want to best facilitate.
What's landed me on the pro-IP side is that I'm further along in attempting to standardise complex derivatives. Consider the second of the two cases above. In a BIDS raw dataset, if two data files have the same value for some metadata field, that might be interesting, or it might not be. I would personally argue that it communicates the natural hierarchical nature of the data, but agree it comes with a complexity cost if stored explicitly in such a way. But with BIDS raw, data files are generally pretty independent of one another (with the exception of entity-linked file collections, which I'll come back to). With BIDS Derivatives, it will be more common for there to be more strongly linked file collections: a "singular" computational outcome is often by necessity spread across multiple data files. Here, shared metadata across sidecars is not mere happenstance or an opportunity for storage compression: the dataset would be considered corrupt were those metadata to not be exactly identical across data files. Moreover, within a dataset containing many files in a modality directory, human discernment of what data files encode the results of that particular computation vs. encode something else becomes increasingly difficult; a metadata file containing the relevant fields that is applicable via IP only to data files encoding the outcome of that computation would clearly communicate that grouping.
This is really just the existing entity-linked file collections concept, only more strongly asserted. Enhancing the IP, particularly by removing the restriction of one applicable metadata file per filesystem level, would greatly enhance this concept. Currently, there's no way to really "encode" an entity-linked file collection. Different data files may have more or less metadata fields that are equal or different between them, and more or less mutual vs. distinct entities, but it's all quite "fuzzy". Defining a metadata file that is applicable to multiple data files, containing only the mutually shared metadata fields, and named based on only the mutual entities, would be what defines that entity-linked file collection.
Also, given the principle is not a novel proposal for 2.0 but has stood throughout 1.x, I think there's a need for better tooling regardless of what happens for 2.0. Any software for reading / writing BIDS data really should by now be fully IP-aware. And I think there's moreover a need for software dedicated to the IP. I think having such tooling at hand might help inform that decision making process.
> I think we should clearly define who the users are that the inheritance principle tries to target? Is it the neuroscientist that manually edits/curates their BIDS data? Is it the programmer that makes BIDS-derivatives processing pipelines?
For anyone doing exclusively manual curation of a BIDS dataset, I would expect that curation to almost exclusively omit the IP. Most commonly they'll be running something like `dcm2niix`, which gives a NIfTI & JSON per DICOM series, followed by filesystem-level renaming. Introducing IP usage would be more manual effort and would only increase the likelihood of errors. So the only case where someone might manually utilise the IP is if they are forced to define all of their metadata manually. Even in this scenario, use of the IP is not compulsory: if a user understands the principle and their data, they can exploit it; if not, they can omit it.
I think longer-term the more prevalent "users" will be App developers / those who interpret the outputs from those Apps. Writing shared metadata once to one file, appropriately named, is slightly more concise in code than having a base shared dictionary and duplicating it with minor changes across multiple output metadata files, though it's a pretty subtle difference. To me it's moreso about communication of the relationships between data files. For a BIDS Derivatives dataset, not all data files in a modality directory are equally distinct from one another; some are more strongly related than others, and the IP is one way of communicating those relationships.
> I believe the summarize proposal of @yarikoptic is aimed as a middle ground?
I think that proposing to change the name of the principle may be misleading as to the scope of that change. The proposal is only to forbid having some data file with multiple applicable metadata files where some field takes different values across such files. That I think would be an unambiguously good change, would simplify both lay and systematised descriptions of the principle, and would be more algorithmically compatible with automated approaches. But it wouldn't resolve any of the concerns you have yourself raised here.
Attn @dorahermes @Lestropie and others -- if in general you consider this issue/idea good -- please upvote by :+1: . If you consider it a bad idea -- downvote with :-1: .
I would appreciate it if general discussion of IP "disadvantages" took place elsewhere, e.g. in issue #36, if you feel strongly that the IP "must die". But as far as I see it, the IP is here to stay in some form, which might potentially shed some aspects (e.g. as the summarization here removes overloading), and/or be enriched with additional tooling or principles (like IntendedFor for grouping). For those I would also advise starting projects (like @Lestropie did) or other issues and cross-linking back here.
In this issue, and later in a PR against the bids-2.0 branch, I would appreciate more specific/targeted feedback or assistance with this idea. E.g.
It is a next step to the discussion which happened in
On a recent road-trip with @effigies we briefly discussed it, and so far did not see a show-stopper, but it would require more minds to analyze.
ATM one of the problems of the inheritance principle is unclear semantics in the case of a value being modified down the hierarchy: the order can be unclear in case of multiple "candidate" files, it is unclear how to "remove" a value, etc. And overall, for a human it is cumbersome to "gather" the final value, since for a file down the hierarchy someone needs to go through all possibly inherited files to arrive at the final value. But what if we take my suggestion in the aforementioned issue further:
It will be a (now doable) job for a validator to ensure that all duplicated (across levels, if any) metadata is consistent.
As a result we would provide the user the convenience that looking at a top-level metadata file gives "guaranteed" correct metadata across all subjects/sessions, which is not the case currently, since values can change following the order of inheritance.

- `task-*_bold.json` files collate all identical values across subjects/sessions -- makes it easy to see what is common (e.g. scanner ID etc.)
- `participants.tsv` summarizes metadata across participants, and we expect it to be consistent with possible other phenotypic information to be found for the subject/sessions (`phenotype/` folder).

Attn @Lestropie, as he has spent the most time improving the Inheritance principle definition, and @dorahermes, who is an active proponent and user of it: do you think such "simplification" (removal of "value overload") of inheritance would simplify things and remain usable? Or maybe I do not see some common use case such an additional "restriction" would disallow?
I think it might be worth writing some checker and applying it across all OpenNeuro datasets to see if we run into such data "overloads". What would be a tool/functionality which already implements the inheritance principle "closest to the bible", e.g. which would pretty much return a list of lists of .json/.tsv files in their "inherited" bundles? (specific code examples would be welcome)
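A rough sketch of such a checker (simplified, and the function name is hypothetical: it matches sidecars only on their trailing suffix, ignoring the full entity-based applicability rules, so it would over- and under-report on real datasets):

```python
import json
from pathlib import Path


def find_overloads(dataset_root):
    """Report fields whose top-level value is changed ("overloaded")
    by a deeper sidecar carrying the same suffix."""
    root = Path(dataset_root)
    tops = {p.name: json.loads(p.read_text()) for p in root.glob("*.json")}
    overloads = []
    for deep in root.glob("sub-*/**/*.json"):
        deep_meta = json.loads(deep.read_text())
        for top_name, top_meta in tops.items():
            suffix = top_name.rsplit("_", 1)[-1]  # e.g. "bold.json"
            if not deep.name.endswith(suffix):
                continue
            for key, top_val in top_meta.items():
                if key in deep_meta and deep_meta[key] != top_val:
                    overloads.append((str(deep), key, top_val, deep_meta[key]))
    return overloads
```

Run across OpenNeuro clones, an empty result for a dataset would mean it already satisfies the proposed "summarization" restriction.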