yarikoptic opened this issue 6 months ago
@yarikoptic Thanks for this thoughtful issue and for the @bids-standard/bep036 team. I can do a check in a month or so of how inheritance is currently being used in OpenNeuro, thanks to DataLad's `datalad clone ///openneuro`-ability, of course!
I mostly like the idea above, except maybe I'm confused about one thing. Let's say up top there's a `task-rest_bold.json` with most of the parameters put out by dcm2niix. Then down below, in 20 subjects' func directories, there's a disagreement in FlipAngle or EchoTime (or both) between those 20 and a separate 5 subjects' task-rest bold JSONs. I believe you're saying that EVERY subject's JSON in this scenario has to have FlipAngle and EchoTime in it, even though only 5 subjects differ. dcm2niix doesn't care that you shouldn't duplicate fields at lower levels. So you end up needing to filter most of your JSONs in the subjects' func folders for common metadata... This might be difficult, especially for many new users.
I think whether we move forward with the inheritance principle OR the summarization principle, the call is for tools to support either of them. If one small set of tools could be created to support either, whichever one makes it to the gate first could be most easily adopted. This is why creating a set of inheritance software tools has been on my BIDS maintainers' desirables list for a long time. Thoughts?
> Let's say up top there's a `task-rest_bold.json` with most of the parameters put out by dcm2niix.
to be precise: it is not dcm2niix
which places/creates such a file at the top level. It is a BIDS dataset "owner" who decides to take all or some fields from dcm2niix-produced sidecar .json for a specific .nii.gz and place that selection at the top of the dataset. So it is for some script/user to decide which fields to do copy to that file, dcm2niix
is not really a "player" here.
> I believe you're saying that EVERY subject's JSON in this scenario has to have FlipAngle and EchoTime in it though only 5 subjects differ.
Correct. Even if a single subject differs, such metadata should not be present at a level where it is not common to all levels below. Possible solution: `task-rest_bold.json` at the top level with all common metadata, `sub-X/task-rest_bold.json` with subject-specific metadata, and `sub-X/ses-Y/task-rest_bold.json` with subject/session-specific metadata (but a field must not be present with different values across levels).

> So you end up with a need to filter most of your JSONs in the subjects func folders for common metadata... This might be difficult for many new users especially.

No -- new users just should not bother creating a top-level `task-rest_bold.json` with anything which is not common to all files underneath, and they would stay "BIDS compliant". Then they could make use of some tools (e.g. here is a function in heudiconv -- `populate_aggregated_jsons`) to collect/rewrite top-level .json files with only common metadata, or just write what they know is common (the validator would verify that no conflicting/differing values are present).
Re support -- correct, tool support would be needed... BUT the "summarization" principle is just a more restricted case of inheritance, if I see it right, so in principle any tool supporting current inheritance should work with "summarization" without any change.
The `.tsv`s we have are pretty much a case of summarization (as placing in a tabular structure within a single file) for entries where metadata could be different (e.g. age for a subject)... i.e. in `participants.tsv` we summarize commonalities and differences between participants. Overall we get `{entities}.tsv` summarizing a (flat list of) metadata fields typically (but not necessarily) different between values for that entity. In `{entity}-{value}_{suffix}.json` files we provide what is common for that `{value}` (paired with datatype `{suffix}`), and typically when we do not have such `{entity}-{value}/` folder-level separation (related #54), since then we would place common data and metadata under that folder.
Overall, a "gather metadata for `{entity}` of `{value}`" algorithm should load metadata from `{entities}.tsv` and all applicable `{entity}-{value}*.json` files. Any inconsistency in values makes the "order of loading" important and thus possibly ambiguous. It also makes it mandatory to read all the files to get the ultimate value, as opposed to the case proposed here, where the first loaded value is "good enough" since they all must be the same: the age of a participant from `participants.tsv` should be consistent with any other age loaded from e.g. somewhere in `phenotype/` (shhh about multiple sessions etc...).
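To make the difference concrete, here is a minimal sketch (the helper name `gather_metadata` is hypothetical, and locating the applicable sidecars is assumed to have happened already) of the "gather" step under the summarization rule: loading order no longer matters, because a field repeated across levels must carry an identical value, and any conflict is simply a validation error rather than an override.

```python
import json
from pathlib import Path


def gather_metadata(applicable_jsons):
    """Merge metadata from all applicable sidecars.

    Under summarization there is no overloading: a key seen twice with
    different values is an error, not an override, so the first value
    encountered is always "good enough".
    """
    merged = {}
    for path in applicable_jsons:
        for key, value in json.loads(Path(path).read_text()).items():
            if key in merged and merged[key] != value:
                raise ValueError(
                    f"{key!r} differs across levels: "
                    f"{merged[key]!r} vs {value!r} (in {path})"
                )
            merged.setdefault(key, value)
    return merged
```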
> completely disallow overloading the value at lower (deeper in hierarchy) levels.
I'm a fan. This would:
> do you think such "simplification" (removal of "value overload") of inheritance would simplify and remain usable?
I would perhaps pose a different question. There's a bifurcation in opinions on the inheritance principle. I've personally been pushing for making it more powerful, which required improvement to the definition of current behaviour in order to facilitate the subsequent augmentation. Others would prefer that the whole principle disappear entirely, and all metadata relevant to a data file be present in the sidecar file.
The way I would therefore look at this proposal is: if the capacity for value overloading (specifically a present value at a higher level being overridden at a lower level) were to be removed, would this sway those previously opposed to the inheritance principle toward its preservation? So that's actually a question directed not at me but at others.
@Lestropie, a birdie said that you might be participating in the BIDS hackathon (if only virtually)? Would you be interested in working on this one? It can already be done as a PR against
similar to WiP I just started
I'm not so much aiming to participate in the hackathon and looking for a project as wanting to take on automating the use of inheritance, and I see the hackathon as a potential way to motivate project commencement and get other people on board. I want to write it up as a proposal somewhere, but wasn't quite sure where would be best: it's not yet guaranteed that I'll be able to do the hackathon, and what I have in mind is also not specific to BIDS 2.0. Maybe I should create an empty repository and start listing issues there.
See https://github.com/Lestropie/IP-me/issues for my current intentions on the topic.
I know that it looks like the general consensus is that we should keep or improve the inheritance/summarization principle. However, I have yet to encounter a single dataset in which I found any use for this principle, but I have encountered several datasets in which this principle caused headaches, hard maintenance work, and ugly/hacky codebases. If it were up to me, I would throw out the whole principle and always store the complete metadata with the data. It costs nearly nothing in terms of disk space and I think it would make everybody's life easier. TL;DR: choose the KISS principle, not the inheritance principle.
The way I see it is that the inheritance principle comes down to implementing a poor man's solution for a relational database on the filesystem level
@marcelzwiers the entire BIDS is "RDB on the filesystem level", so not surprising that pybids caches parsed structure in a local sql DB ;-)
and as a "paradigm" it is pretty much what `participants.tsv`, `sessions.tsv` etc. are about -- summarization of metadata for underlying data in the hierarchy.
But besides "paradigm" applicability, I am not sure I have seen (though I never looked) application of it to `.tsv` files. @effigies @Lestropie, are you aware of some good example uses of inheritance for .tsv files? Moreover, the inheritance principle is somewhat specific for .tsv and .bval/.bvec files in that there is no "inheritance" -- the lowest level in the hierarchy is taken (.json accumulates from higher levels), and that plays better with "summarization".
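The asymmetry described above can be stated in a couple of lines. A hedged sketch (the `resolve` helper is hypothetical; file discovery and entity-based applicability are elided) of how the two file classes are treated:

```python
def resolve(candidates, extension):
    """Pick which applicable files contribute metadata.

    ``candidates`` are the applicable files ordered from the dataset
    root down to the data file's own directory.
    """
    if extension == ".json":
        # JSON sidecars accumulate: every level contributes fields.
        return list(candidates)
    # .tsv / .bvec / .bval: no accumulation, the deepest file wins.
    return candidates[-1:]
```

Under the summarization rule the distinction matters less: for .json the result is the same whichever file you read first, and for .tsv there is only ever one file to take.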
Inheritance is baked into `channels.tsv` and `electrodes.tsv`, as these are generally expected to be constant within sessions, so they have fewer entities than the data files they apply to. We are having to add entities because some dataset curators want to duplicate them for every data file, which was not previously excluded by the validator. While this allows curators to decrease their reliance on the inheritance principle, for tools it increases it, as they now must look for the same file in more potential locations.
For TSV files, I think the equivalent of the summarization principle would be that there must be exactly one applicable TSV file of a given type. So you could have a `channels.tsv` for each data file, but that would be mutually exclusive with one for the entire session. Likewise, you could have one `task-nback_events.tsv` at the root level, but then it must not be overridden by a specific run.
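That "exactly one applicable TSV" rule would be straightforward for a validator to enforce. A minimal sketch (the function name is hypothetical; collecting the candidate files for a given data file is assumed to have happened already):

```python
def check_single_applicable_tsv(candidates, data_file):
    """Enforce the proposed rule: at most one applicable TSV of a given
    type per data file (e.g. one channels.tsv, either per data file or
    per session, never both)."""
    if len(candidates) > 1:
        raise ValueError(
            f"{data_file}: {len(candidates)} applicable TSV files found "
            f"({', '.join(candidates)}); expected at most one"
        )
    return candidates[0] if candidates else None
```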
> the entire BIDS is "RDB on the filesystem level"

I don't agree with that; as I see it, BIDS is a study format, nothing relational about it. True, for some data there are two locations for storing things, either in the subject folder or e.g. in the `participants.tsv` file. But then you always choose one or the other; there is never a relation between them (like there is a relation between JSONs adhering to the inheritance principle). And the fact that 20% of the OpenNeuro datasets use it is a description of the situation, not an argument for its benefits :-) (it would have been trivially easy for these studies to store it all at the sub/ses level)
> Moreover inheritance principle is somewhat specific for .tsv and .bval/.bvec files in that there is no "inheritance" -- lowest level in hierarchy is taken (.json - accumulates from higher levels), and that plays better with "summarization".
I agree with that, and I do support your "summarization" proposal as an improvement over the inheritance principle... It meets the goal of metadata deduplication, while reducing ambiguities and overly complex schemes, e.g. when pooling data
> it would have been trivially easy for these studies to store it all on the sub/ses level

Trivial to some but not to all: remember that some of the people who create the datasets barely know what a JSON file is or how to interact with it in Python or MATLAB. So creating (and especially updating) a single file at the root of the dataset will be a lot easier for them than having to manually edit many, many JSON files.
> Trivial to some but not to all: remember that some of the people who create the datasets barely know what a json file is or how to interact with it with python or matlab.

I actually fear for them dealing properly with the inheritance principle. I think it would be better to have such people use tools for editing/maintaining BIDS datasets, such as CuBIDS?
> And the fact that 20% of the openneuro datasets use it is a description of the situation, not an argument for its benefits :-)

It was a response to your:

> However, I have yet to encounter a single dataset in which I have found any use for this principle

Not an argument, although it could easily become one if expanded, e.g. "I and others, as shown by the above example, find it extremely useful". But this issue is not about that topic. If you would like to discuss the inheritance principle's cons, please chime in instead on
> Likewise, you could have one `task-nback_events.tsv` at the root level, but then that must not be overridden by a specific run.

In principle, I think this should be OK in the "summarization" formulation, as "overridden" would be replaced with "duplicated". In practice it would be tricky/impossible, since for `_events.tsv` there is really nothing which could constitute the "identity" of an event; unless an event row is duplicated exactly, it would just be another added event (possibly with the same onset/duration but different metadata), so it is impossible to identify it and warn the user that there might be an inconsistency etc.
I completely agree with the above proposal as it eliminates value overloading (having a value at one level override a value at a different level).
Some notes from a BIDS curation perspective in a clinical environment. Non-technical staff does really well working with human readable files with simple rules and avoiding the use of additional software packages: either a file exists at the top, or at the individual level. This works most of the time. When we have to change a field at the individual level in hundreds of files, they often reach out to someone who can code.
One example use-case is the channels.tsv file for EEG/iEEG. This file exists for every data file and bad channels can be annotated there as they can differ across sessions and runs. The columns are the same across all subjects but include some optional user specified columns. If a channels.json file can exist only at the top level to specify these columns (that are the same across all subjects) that is convenient. The proposal described here would strongly facilitate this use case, which is extremely common for us, if I am correct.
I think we may be getting off-topic (feel free to hide this comment as off-topic if you agree), but I'm confused by the following:
> In practice it would be tricky/impossible since for `_events.tsv` there is really nothing which could constitute the "identity" of an event, so unless an event row duplicated exactly, it would be just another added event (possibly for the same onset/duration but different metadata), so impossible to identify and to warn user that there might be inconsistency etc.
TSV files are not merged, they are located. Unless you are proposing this change, nobody would try to merge events.tsv files found at multiple levels. Given that you say it would be tricky/impossible, I don't think you're proposing it...
Now, TSV files can be joined, but those are specific ones. For example, `participants.tsv`, `sub-*_sessions.tsv` and `sub-*[_ses-*]_scans.tsv` can be joined on the `participant_id` and `session_id` columns in order to provide metadata for each scan file, but this isn't a merging of two files with the same suffix.
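That join can be sketched with the standard library alone (a hypothetical `left_join` over rows already read, e.g. with `csv.DictReader`; pandas' `merge` would do the same in one call):

```python
def left_join(rows, lookup_rows, keys):
    """Join two TSV tables (lists of dict rows) on shared key columns,
    e.g. sessions rows enriched with participants.tsv metadata."""
    index = {tuple(r[k] for k in keys): r for r in lookup_rows}
    joined = []
    for row in rows:
        match = index.get(tuple(row[k] for k in keys), {})
        # The row's own values come last, so its columns are preserved;
        # overlapping columns are the join keys anyway.
        joined.append({**match, **row})
    return joined
```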
@dorahermes re `channels.tsv` -- could you elaborate more, maybe point to an example dataset?

> The columns are the same across all subjects but include some optional user specified columns.

This sounds like requiring common columns provided in a top-level `channels.tsv` and then per-subject/session `sub-*_ses-*_channels.tsv` files providing additional columns... If I get it right, it would go "against" the current inheritance rule and our discussion above with @effigies on that:
> TSV files are not merged, they are located. Unless you are proposing this change, nobody would try to merge events.tsv files found at multiple levels. Given that you say it would be tricky/impossible, I don't think you're proposing it...
I am "considering" or "approaching" it ;-) And as @dorahermes points out above (if I got her right) we might want to not just "append" but "extend" (more like we do for json if we consider json to be a simple single row, and tsv is a list of such rows). Overall, I think it could be very beneficial if we could generalize principle so it doesn't differ in handling .tsv and .json files.
Note that while we have `participant_id` and `session_id`, we only have `name` and not `channel_id` within `channels.tsv`.
Re `_channels.tsv`: a note that we do not force uniqueness on the "name" of a channel. Also there are no entities for those suffixes, such as `channel` and `event`, thus no `_id`s, the name of which is `{entitylongname}_id` and the value `{entityshortname}-{value}`, and which are already defined for (NB "edited" for the difference in name/value):
```
❯ grep '_id:$' objects/columns.yaml
  desc_id:
  participant_id:
  sample_id:
  session_id:
```
but I guess it could be generalized for any entity (context: #54). So inheritance/summarization could be easily extended to support loading from multiple .tsv files, "appending" (rows) and/or "extending" (columns), for files with `_id` columns ensuring alignment etc.
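A sketch of the "extending (columns)" half of that idea, aligning rows on a shared `*_id` column and treating a conflicting overlap as an error rather than an override (the helper name is hypothetical):

```python
def extend_columns(base_rows, extra_rows, id_col):
    """Widen a summary table with columns from a deeper-level table.

    Rows are aligned on ``id_col``; a column present in both tables
    must agree for a given id, mirroring the no-overload rule for JSON.
    """
    by_id = {r[id_col]: r for r in extra_rows}
    out = []
    for row in base_rows:
        extra = by_id.get(row[id_col], {})
        for col, val in extra.items():
            if col in row and row[col] != val:
                raise ValueError(
                    f"conflicting value for {col!r} at {row[id_col]!r}"
                )
        out.append({**extra, **row})
    return out
```

"Appending (rows)" would be the dual operation: concatenating rows while requiring that any row repeated across levels is repeated identically.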
Another issue I haven't seen much discussion on (but correct me if I'm wrong, as I also missed the previous discussion on the inheritance cons, thank you @yarikoptic), is what I would call the file collection/grouping problem. So how to deal with e.g.:
```
[summarize.json]
sub-01
`-anat
  |-sub-01_run-1_acq-foo_MP2RAGE.nii.gz
  |-sub-01_run-1_acq-foo_UNIT1.nii.gz
  |-sub-01_run-2_acq-foo_MP2RAGE.nii.gz
  |-sub-01_run-2_acq-foo_UNIT1.nii.gz
  |-sub-01_acq-bar_MP2RAGE.nii.gz
  `-sub-01_acq-bar_UNIT1.nii.gz
```
How many summarize JSON sidecars would you make? Obviously you would not make one for each run, but would you make one for each `acq` value? Would it be useful to have something like an `IntendedFor` field in the summarize JSON (with support for wildcards, so you don't have to include an explicit list, just the semantics, e.g. `{"IntendedFor": "bids::sub-*/anat/sub-*_run-*_acq-foo_*.nii.gz"}`)?
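Resolving such a wildcard `IntendedFor` would amount to shell-style pattern matching against dataset-relative paths. A sketch, assuming (as the proposal above does, not current BIDS) that the `bids::` prefix denotes a path relative to the dataset root:

```python
from fnmatch import fnmatchcase


def intended_for_matches(pattern, relative_path):
    """Check whether a dataset-relative file path is covered by a
    wildcard IntendedFor pattern (case-sensitive, shell-style '*')."""
    if pattern.startswith("bids::"):
        pattern = pattern[len("bids::"):]
    return fnmatchcase(relative_path, pattern)
```

A tool would evaluate this against every data file in the dataset to materialize the group the sidecar applies to.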
> ... I am not sure I saw (but I never looked) application of it for .tsv files.... are you aware of some good example uses of inheritance for .tsv files?
I don't deal with a wide breadth of different BIDS data from which to generate examples, but one that always irks me is complex DWI data. It is increasingly recommended to export magnitude & phase data for DWI as it facilitates superior denoising. In the absence of inheritance, this means that the diffusion gradient table (which is currently exclusively `.bvec`/`.bval` rather than `.tsv` as originally requested, but could be `.tsv` following https://github.com/bids-standard/bids-specification/pull/352) would need to be exactly duplicated across the magnitude and phase component images. Defining these data once, omitting the `_part-(mag|phase)` entity, to me makes far more sense. But as with other discussion here, this is purely use of the IP (here: Inheritance Principle) to avoid duplication, not to supersede.
> I know that it looks like that the general consensus is that we should keep or improve the inheritance/summarization principle.
The discussion on this Issue might skew differently to community opinion. I've been told on multiple occasions that there are many who would prefer for it to be removed entirely. I don't have my finger on the pulse on exactly what those proportions might look like.
One concern I have is that a naive community poll may skew toward removal because of a) an expectation of manual curation of such and b) consideration of raw BIDS data only, whereas community opinions following a) creation of a tool for automated application and b) consideration of the complexities of derivative data may yield a different result. So I'd like to at least create a compelling case.
> So how to deal with e.g.:
All depends on the metadata contents, and more esoterically on whether JSON files without suffixes are permitted. At the extreme end, I could imagine:

- `sub-01.json` containing all fields applicable to all images
- `sub-01/anat/sub-01_MP2RAGE.json` containing any fields consistent across all `_MP2RAGE` images
- `sub-01/anat/sub-01_UNIT1.json` containing any fields consistent across all `_UNIT1` images (eg. units?)
- `sub-01/anat/sub-01_acq-foo.json` containing any fields applicable only to `acq-foo` (ie. how it differs from `acq-bar`)
- `sub-01/anat/sub-01_acq-bar.json` containing any fields applicable only to `acq-bar` (ie. how they differ from `acq-foo`)
- `sub-01/anat/sub-01_run-1_acq-foo.json` containing any fields applicable only to run 1 of `acq-foo` (ie. what differs from run 2, maybe acquisition time?)
- `sub-01/anat/sub-01_run-2_acq-foo.json` containing any fields applicable only to run 2 of `acq-foo` (ie. what differs from run 1)

This actually ends up with more metadata files than there are data files. But unlike exclusively using sidecars, it is immediately discoverable exactly what it is that differs between eg. entity-linked file collections `acq-foo` and `acq-bar`, by the contents of the respectively named metadata files. These may be more obscure use cases in the context of BIDS Raw, but in my experience with trying to develop complex BIDS Derivatives I think that cases like these are going to be increasingly prevalent in time.

Regardless of my own opinion, I don't see the debate progressing in an informed way in the absence of tangible examples of what data look like with vs. without the IP, or in the absence of software to use or not use the IP (complex examples like the above I would never expect a human to manually curate). Hence why I invested some time and effort in generating an Issue list for such: https://github.com/Lestropie/IP-freely/issues.
Regardless of my own opinion, I don't see the debate progressing in an informed way in the absence of tangible examples of what data look like with vs. without the IP, or in the absence of software to use or not use the IP (complex examples like that above I would never expect a human to manually curate). Hence why I invested some time and effort in generating an Issue list for such: https://github.com/Lestropie/IP-freely/issues.
> But unlike exclusively using sidecars, it is immediately discoverable exactly what it is that differs between eg. entity-linked file collections acq-foo and acq-bar, by the contents of the respectively named metadata files.
Yes, that's nice, but I think that this level of complexity, just to deduplicate to the bitter end, can be hard to grasp and would harm the acceptance / proper use of BIDS among average neuroscientists. The inheritance principle makes things much less human-readable and simple. For instance, I cannot just inspect a sidecar file anymore; I need tooling to search for data in the file-tree hierarchy to get a complete view. So before deciding on a solution, I think we should clearly define which users the inheritance principle tries to target. Is it the neuroscientist who manually edits/curates their BIDS data? Is it the programmer who makes BIDS-derivatives processing pipelines? And we need to consider whether the benefits for one group of users really outweigh the downsides for the other users. I believe the summarize proposal of @yarikoptic is aimed as a middle ground?
Fully appreciate the argument for IP abolition. There's a good reason there's no consensus on the topic.
The question of "not just inspecting a sidecar ... needing tooling to search for (meta)data in the file tree hierarchy" has a natural converse, being something like "metadata are not unique ... need tooling to determine what data in the file tree hierarchy take the same values". There's complex relationships between metadata across data files regardless of how you cut it, it's a question of what types of operations you want to best facilitate.
What's landed me on the pro-IP side is that I'm further along in attempting to standardise complex derivatives. Consider the second of the two cases above. In a BIDS raw dataset, if two data files have the same value for some metadata field, that might be interesting, or it might not be. I would personally argue that it communicates the natural hierarchical nature of the data, but agree it comes with a complexity cost if stored explicitly in such a way. But with BIDS raw, data files are generally pretty independent of one another (with the exception of entity-linked file collections, which I'll come back to). With BIDS Derivatives, it will be more common for there to be more strongly linked file collections: a "singular" computational outcome is often by necessity spread across multiple data files. Here, shared metadata across sidecars is not mere happenstance or an opportunity for storage compression: the dataset would be considered corrupt were those metadata to not be exactly identical across data files. Moreover, within a dataset containing many files in a modality directory, human discernment of what data files encode the results of that particular computation vs. encode something else becomes increasingly difficult; a metadata file containing the relevant fields that is applicable via IP only to data files encoding the outcome of that computation would clearly communicate that grouping.
This is really just the existing entity-linked file collections concept, only more strongly asserted. Enhancing the IP, particularly by removing the restriction of one applicable metadata file per filesystem level, would greatly enhance this concept. Currently, there's no way to really "encode" an entity-linked file collection. Different data files may have more or less metadata fields that are equal or different between them, and more or less mutual vs. distinct entities, but it's all quite "fuzzy". Defining a metadata file that is applicable to multiple data files, containing only the mutually shared metadata fields, and named based on only the mutual entities, would be what defines that entity-linked file collection.
Also, given the principle is not a novel proposal for 2.0 but has stood throughout 1.x, I think there's a need for better tooling regardless of what happens for 2.0. Any software for reading / writing BIDS data really should by now be fully IP-aware. And I think there's moreover a need for software dedicated to the IP. I think having such tooling at hand might help inform that decision making process.
> I think we should clearly define who the users are that the inheritance principle tries to target? Is it the neuroscientist that manually edits/curates their BIDS data? Is it the programmer that makes BIDS-derivatives processing pipelines?
For anyone doing exclusively manual curation of a BIDS dataset, I would expect that curation to almost exclusively omit the IP. Most commonly they'll be running something like `dcm2niix`, which gives a NIfTI & JSON per DICOM series, followed by filesystem-level renaming. Introducing IP usage would be more manual effort and would only increase the likelihood of errors. So the only case where someone might manually utilise the IP is if they are forced to define all of their metadata manually. Even in this scenario, use of the IP is not compulsory: if a user understands the principle and their data, they can exploit it; if not, they can omit it.
I think longer-term the more prevalent "users" will be App developers / those who interpret the outputs from those Apps. Writing shared metadata once to one file, appropriately named, is slightly more concise in code than having a base shared dictionary and duplicating it with minor changes across multiple output metadata files, though it's a pretty subtle difference. To me it's moreso about communication of the relationships between data files. For a BIDS Derivatives dataset, not all data files in a modality directory are equally distinct from one another; some are more strongly related than others, and the IP is one way of communicating those relationships.
> I believe the summarize proposal of @yarikoptic is aimed as a middle ground?
I think that proposing to change the name of the principle may be misleading as to the scope of that change. The proposal is only to forbid having some data file with multiple applicable metadata files where some field takes different values across such files. That I think would be an unambiguously good change, would simplify both lay and systematised descriptions of the principle, and would be more algorithmically compatible with automated approaches. But it wouldn't resolve any of the concerns you have yourself raised here.
Attn @dorahermes @Lestropie and others -- if in general you consider this issue/idea good -- please upvote by :+1: . If you consider it a bad idea -- downvote with :-1: .
I would appreciate it if general discussion of IP "disadvantages" took place elsewhere, e.g. in issue #36, if you feel strongly that the IP "must die". But as far as I see it, the IP is here to stay in some form, which might potentially shed some aspects (e.g. as the summarization here removes overloading), and/or be enriched with additional tooling or principles (like IntendedFor for grouping). For those I would also advise starting projects (like @Lestropie did) or other issues and cross-linking back here.
In this issue, and later in a PR against the bids-2.0 branch, I would appreciate more specific/targeted feedback or assistance with this idea. E.g.
It is a next step to the discussion which happened in
On a recent road-trip with @effigies we briefly discussed it, and so far did not see a show-stopper, but it would require more minds to analyze.
ATM one of the problems of the inheritance principle is unclear semantics in the case of a value being modified down the hierarchy: the order can be unclear in case of multiple "candidate" files, it is unclear how to "remove" a value, etc. And overall, for a human it is cumbersome to "gather" the final value, since for a file down the hierarchy someone needs to go through all possibly inherited files to arrive at the final value. But what if we take my suggestion in the aforementioned issue further:
It will be a (now doable) job for a validator to ensure that all duplicated (across levels, if any) metadata is consistent.
As a result we would provide the user the convenience that looking at a top-level metadata file gives "guaranteed" correct metadata across all subjects/sessions, which is not the case currently, since values can change following the order of inheritance.

- `task-*_bold.json` files collate all identical values across subjects/sessions -- makes it easy to see what is common (e.g. scanner ID etc.)
- `participants.tsv` summarizes metadata across participants, and we expect it to be consistent with possible other phenotypic information to be found for the subject/sessions (`phenotype/` folder).

Attn @Lestropie, as he has spent the most time improving the Inheritance principle definition, and @dorahermes, who is an active proponent and user of it: do you think such "simplification" (removal of "value overload") of inheritance would simplify things and remain usable? Or maybe I do not see some common use case such an additional "restriction" would disallow?
I think it might be worth writing some checker and applying it across all OpenNeuro datasets to see if we run into such data "overloads". What would be a tool/functionality which already implements the inheritance principle "closest to the bible", e.g. which would pretty much return a list of lists of .json/.tsv files in their "inherited" bundles? (specific code examples would be welcome)
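A rough sketch of such a checker (simplified, and the function name is hypothetical: it matches sidecars only on their trailing suffix, ignoring the full entity-based applicability rules, so it would over- and under-report on real datasets):

```python
import json
from pathlib import Path


def find_overloads(dataset_root):
    """Report fields whose top-level value is changed ("overloaded")
    by a deeper sidecar carrying the same suffix."""
    root = Path(dataset_root)
    tops = {p.name: json.loads(p.read_text()) for p in root.glob("*.json")}
    overloads = []
    for deep in root.glob("sub-*/**/*.json"):
        deep_meta = json.loads(deep.read_text())
        for top_name, top_meta in tops.items():
            suffix = top_name.rsplit("_", 1)[-1]  # e.g. "bold.json"
            if not deep.name.endswith(suffix):
                continue
            for key, top_val in top_meta.items():
                if key in deep_meta and deep_meta[key] != top_val:
                    overloads.append((str(deep), key, top_val, deep_meta[key]))
    return overloads
```

Run across OpenNeuro clones, an empty result for a dataset would mean it already satisfies the proposed "summarization" restriction.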