Open tsalo opened 4 years ago
@yarikoptic wrote:
Inheritance is the weakest part of BIDS and opens the door to a host of problems.
it is indeed weak but also quite powerful and descriptive since e.g. makes it possible to quickly assess details for tasks from top level task-*json files. What makes it weak IMHO is lack of clear procedural description of how it applies. See https://github.com/bids-standard/bids-specification/issues/102 on which unfortunately nobody chimed in
+100 on clarifying inheritance "procedural description" (instead of removing).
I am highly in favor of removing this principle. At one point I had offered to attempt to program in some of the inheritance functionality, but quickly found that it's actually quite complicated to implement correctly. You can see where this was discussed briefly: https://github.com/bids-standard/pybids/issues/550. I had offered back-channel to help @leej3 with this but unfortunately didn't follow through due to a combination of business and difficulty, and had forgotten until now. Additionally, I have had conversations with several other developers who have found this a frustrating bottleneck for retrieving information. (Sorry @leej3 !) I also agree with @ccraddock that it is quite likely people will want to cherry-pick subjects who match their needs rather than take entire projects. An additional point is that while in theory it is nice to have a "universal" .json file, it's possible that things go wrong at consoles, people type incorrectly, and subjects differ. In my view it is best to accept each scan as having its own acquisition parameters and instead pulling them directly. Projects which require conformance to a strict standard can easily check for that conformance if the sidecars are required to be present, since they have to harvest each file anyway.
I don't consider inheritance so important for machine-written JSON files pertaining to raw data (e.g. by dcm2niix) which can easily be replicated but do think they are of value for human-written data dictionaries. An example of that is the events.json
that can accompany the events.tsv
to explain how the values in the columns are coded. Or a scans.json
to explain the non-default columns in the scans.tsv
files.
See e.g. https://github.com/bids-standard/bids-examples/blob/master/eeg_matchingpennies/task-matchingpennies_events.json. Consider that you would have to worry about that potentially being different for each and every subject/session/run, that would make subsequent analysis quite challenging.
I think that the suggestion to remove inheritance from the structure seems a good direction to go.
I think this topic came up a bit briefly towards the end of a hackathon, and we discussed how this seemed to turn the file structure into a databasing structure, and it would make something that should be simple (getting information about a file/dataset) much more complicated, likely requiring external functionality to do so.
I appreciate that one idea in favor of inheritance is that if a study-wide parameter is mistakenly recorded, then it only has to be corrected in one place. However, this entails a large cost of data access on situations where there aren't study-wide errors (hopefully the more common situation!). It is true that text file sizes will increase... but likely not a huge amount compared to the accompanying volumes themselves? And the above points about the usage of BIDS-contained datasets are very persuasive: these won't always be used as single structures-- they will be used in pieces and/or as parts of larger wholes. Inheritance greatly reduces the flexibility of the researcher, and that is quite a large cost.
I am also not a fan of the inheritance principle, as it seems to introduce spooky action at a distance. It is not obvious to the user, and it also requires tools to trawl through the directory to see if details exist. I certainly appreciate that any format relies on consensus, and there may be some who see this as being incredibly useful. However, if I was asked to vote, I would vote for removing this feature of the standard.
The inheritance principle is a policy that makes life considerably easier for (some) data generators, at the expense of additional complexity for data processors. It imposes no cost on data generators, because they can, if they wish, always ignore inheritance entirely and create a sidecar for every file. And with respect to data processors, there's no reason why every data processor needs to come up with their own implementation of inheritance; that's what dependencies are for. E.g., in Python, PyBIDS gives you easy access to the inherited metadata for every file. This doesn't strike me as particularly different from the situation with respect to, e.g., the nifti file format: the fact that the nifti file format could in principle be redefined in a way that might make it easier for someone to implement a new nifti reader is irrelevant, because NiBabel exists.
The main argument I see against inheritance in the thread above that existing tooling doesn't solve is the use case where one wants to work with a partial dataset, e.g., to send a single file to someone else. But that seems like it could be easily solved via new tooling as well; e.g., it wouldn't be very difficult to write a utility that rewrites a project that uses inheritance so that all files have sidecars.
To be clear, I'm not saying we shouldn't drop inheritance; I'm just not really seeing arguments in this thread that seem to justify taking that step, as opposed to motivating the development of more/better tooling.
I am with Tal on this one - it would be very straightforward to develop tools to generate a dataset that has the full set of files, and the benefits of inheritance in terms of efficiency for data generators is substantial.
On Fri, Aug 7, 2020 at 3:07 PM Tal Yarkoni notifications@github.com wrote:
The inheritance principle is a policy that makes life considerably easier for (some) data generators, at the expense of additional complexity for data processors. It imposes no cost on data generators, because they can, if they wish, always ignore inheritance entirely and create a sidecar for every file. And with respect to data processors, there's no reason why every data processor needs to come up with their own implementation of inheritance; that's what dependencies are for. E.g., in Python, PyBIDS gives you easy access to the inherited metadata for every file. This doesn't strike me as particularly different from the situation with respect to, e.g., the nifti file format: the fact that the nifti file format could in principle be redefined in a way that might make it easier for someone to implement a new nifti reader is irrelevant, because NiBabel exists.
The main argument I see against inheritance in the thread above that existing tooling doesn't solve is the use case where one wants to work with a partial dataset, e.g., to send a single file to someone else. But that seems like it could be easily solved via new tooling as well; e.g., it wouldn't be very difficult to write a utility that rewrites a project that uses inheritance so that all files have sidecars.
To be clear, I'm not saying we shouldn't drop inheritance; I'm just not really seeing arguments in this thread that seem to justify taking that step, as opposed to motivating the development of more/better tooling.
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/bids-standard/bids-2-devel/issues/36#issuecomment-670734671, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAGUVEEDREBVQUMW6MQRUODR7R3IZANCNFSM4PTLXWAQ .
-- Russell A. Poldrack Albert Ray Lang Professor of Psychology Building 420 Stanford University Stanford, CA 94305
poldrack@stanford.edu http://www.poldracklab.org/
Well, the rigidity of inheritance pushes BIDS into a database format, away from being a convenient data structure for organizing files.
It does necessitate the creation and maintenance of a bevy of tools for validating, updating, querying and simply using files (as @neurolabusc , even searching for details can become highly nontrivial). This creates significant overhead for programmers and data users. What happens when people perform a simple change on the data that might no longer mesh with the original inheritance structure-- for example, resampling the data, so the phase encoding has to be described differently? As @tsalo and others noted above, the inheritance structure will create consistency issues and worries in merging and subsampling datasets. That makes it harder to use public data, instead of building on the strength of data sharing.
Data generation seems to happens once (ignoring fixes; and I understand that making fixes in one spot is a strength of inheritance); however, data processing and analysis are constantly occurring and re-occurring sets of actions at all times-- different groups will use the same starter data differently; people fix problems in the datasets; better methods lead to reanalysis; people drop subjects and/or merge datasets... Data generators are often programmers; data processors can be new or relatively new to programming. These kinds of steps require a huge amount of flexibility, and there are already lots of consistency considerations (coordinate frames, spaces, slice timing, left-right correctness, etc.) to manage just in dealing with individual file headers. Given how much analysis has changed and adapted over the years, we can only expect new methods of both analysis and data acquisition.
Flexibility and relative simplicity, as noted in several of the above comments, are highly desirable features in data processing. Inheritance seems to push BIDS into a different realm than data structures, and in the process creates practical difficulties, requires significant software overhead, and inserts a "middleman" between the researcher and their data.
@mrneont but it remains the case that that all of the problems you mention could be easily eliminated just by having a simple command-line tool that takes a BIDS dataset as input and produces a copy of that dataset, or rewrites it in place, so that it doesn't use inheritance... right? That would be just a single small tool. What you describe as "a bevy of tools for validating, updating, querying and simply using files" does exist, but very little of that has to do with inheritance, and it would continue to exist and be widely used even if inheritance went away. And the mere existence of a dependency in a neuroimaging package doesn't seem to me like much of a barrier; as I mentioned above, if you're working in the Python ecosystem, there's a 98% chance you're already using at minimum NiBabel, and quite possibly a large number of other packages. The kind of thing we're talking about here seems pretty trivial by comparison.
More generally, I don't think it's inheritance that pushes us towards databasing; it's the fundamental nature of the problem being solved: BIDS datasets often have thousands of files, with complex associations (e.g., contained metadata; formal relationships between files defined in the spec; etc.), and a relational DB is the natural way to solve this problem. It's true that a lot of the overhead feels unnecessary if all you want to do is retrieve the metadata for a specific file; but again, developers who don't want to use utility libraries could in principle just call a "de-inheritance" tool once, and then work with raw JSON files themselves if they really wanted to. But as soon as you step away from very simple queries, it becomes extremely useful to have a more robust approach to the problem. For example, using PyBIDS, you can, in a fraction of a second, and a single line of code, return all files that have extension .tsv, are only for task 'nback', and have a TR of 2 seconds (a value read from either the image header or the JSON sidecar). If you didn't have a utility library to do this for you, you'd have to write a lot more code, likely involving multiple nested loops and a lot of runtime file-reading, and it would likely be much slower unless you'd already done a lot of work yourself to optimize the data representation. And if you did do a lot of work to optimize the data representation, you'd very likely end up just using a relational DB and creating your own helper library, which is basically where we already are.
@tyarkoni I agree that finding properties of files does not require inheritance. And if datasets are pre-indexed and rigidly organized, then it might even be slightly faster to find information in some cases. But I don't think parsing JSON files and file headers is that complicated... For example, in a world where every EPI dset has its own information in its header and perhaps an accompanying sidecar, then it strikes me that it is possible to write a shell or Python script that combines "find" or globbing, and looping with a program that reads+tests dataset and JSON field headers-- every software has something like that. And from the user point of view, that is just a single program into which one inputs a data tree name and a list of search criteria. That is a small cost compared to those larger ones mentioned above. Inheritance is not the only way to provide searchability, but inheritance does pretty much ensure that only that external tool can provide that searchability (because it will have to be maintained+adapt with the inheritance structure, and programmers won't know where to find various pieces of data without it).
What happens when things that are inherited today, stop being inherited properties tomorrow due to some practical consideration or change within the field, and so the whole tree has to be redone to be compatible with BIDS? How will that new inheritance structure be passed along to all other datasets that have been distributed? Does everyone have to update all their data?
Maybe it is possible that someone generates a set of data with inheritance, for their own convenience in case they want to fix something, and then they de-inherit it for BIDSifying and distributing to The World (note that I am suuuure someone will then want a re-inheritance program, and thaaat will surely be complicated...). But should BIDS require inheritance? For the above reasons (namely, stability, flexibility and simplicity of data structures), I don't think so. And having different flavors of BIDS with and without inheritance seems somewhat self-defeating to the purpose of a single data structure.
@mrneont :
I appreciate that one idea in favor of inheritance is that if a study-wide parameter is mistakenly recorded, then it only has to be corrected in one place.
@robertoostenveld
I don't consider inheritance so important for machine-written JSON files pertaining to raw data (e.g. by dcm2niix) which can easily be replicated but do think they are of value for human-written data dictionaries.
Besides elevating the burden for human data producers, FWIW, in heudiconv we "take advantage" of inheritance principle by "bubbling up" all common per task fields to the top level. Empowered by knowledge of such heudiconv-specific behavior, a user needs to look only at the single top level .json to see e.g. if the same scanner, RepetitionTime, etc was used across all subjects/sessions for the study. If some expected field is missing from that file -- then git grep
across other json files to see how/where it is different.
As of now it is heudiconv specific and not validated by validator, so is not of blind reliance on. So I would not mind if inheritance principle was e.g. "strengthened" to state that to be inherited higher level file must list only the field/values which are common to all files which inherit from it. Such a rule could be easily validated by the validator.
@mrneont
... the inheritance structure will create consistency issues and worries in merging and subsampling dataset
Indeed for any but trivial dataset "merges" tooling would be necessary anyways so I would say - "serializing" inherited metadata into each .json and then "bubbling up" back after all the other merging operations (adding/harmonizing sessions, harmonizing _acq-
etc fields, etc) could provide an easy algorithmic way to propagate advantages from inheritance to a "merged" dataset.
For subsampling, with an additional rule stated above, no additional action would be needed since "higher level" file would list only the common fields - the only step which might be needed is removal of such no-longer pertinent files (e.g. if subsample includes only anat
files, so all task-
files are to not be included).
re @jbteves et al similar concerns:
At one point I had offered to attempt to program in some of the inheritance functionality, but quickly found that it's actually quite complicated to implement correctly.
IMHO it is just that inheritance principle should be elaborated clearly in a procedural manner so implementation of it in software should become trivial. Canonical implementations (e.g. in PyBIDS as @tyarkoni pointed out) could assist further but IMHO should not be strictly needed.
It is convenient to have information at the highest level possible, and it is convenient to have it at the lowest level possible. Writing software to handle either case, or to move it from one to the other, is not a major issue.
Reliability/verifiability. The place where the information is most reliable is at the most original place, at the lowest level. Moving information up means trusting that the operation was a success, and that it wasn't later corrupted by new data. Leave the ability to verify, rather than the requirement to trust.
I would echo comments by @mrneont @neurolabusc . Inheritance makes datasets highly monolithic and moves us closer to a more fragile and less transparent database representation. Prioritizing write convenience over read reliability I think is also backwards with respect to open data and data reuse.
The suggestion that “tooling is trivial” (in addition to being wrong, as per @jbteves and my experience — and that doesn't even cover the issues arising from alternative implementations) I think misses an important point. Even if it were true, it would be equivalently easy to create tooling which writes data into sidecar files. It is the same logic. Further, in as far as the logic is not very straight-forward, edge cases can be manually accounted for on write, since write is done more infrequently than read.
I think the foremost strength of BIDS is how accessible it is to GNU coreutils, making archives particularly easy to handle for any data scientist, and ensuring very-long-term reliability. In as far as computing tools are ever “finished”, cp
, mv
, cat
, rm
, etc. are as close as it gets. The more we degrade the capacity of such tools to manipulate and inspect datasets, and the more additional tools with an uncertain future we require, particularly for read operations, the more we degrade the viability of the standard.
Having said all that, there are very good arguments for keeping the inheritance principle, namely backwards-compatibility and experimenter convenience. I think a good compromise would be to venture into the proposal that “tooling is easy” and offer users of inheritance (1) a (semi-)automatic tool to re-cast their data into a form without inheritance, and (2) easily make edits across multiple files at once. The latter in particular could be a simple bash script or one-liner depending on the operation aimed at.
As a data curator who uses BIDS in all data collected in the lab and now has generated a large iEEG dataset that uses the inheritance principle, it would be a problem if we now need to go back and create the files for every run. While we can duplicate files, the main issue is that if 1 subject has 20 runs (think dense data in single subjects) and each one needs it's own .json sidecar for both the events and channels that have extra columns, adding 40 files makes the basic file structure explode.
Also, by having one channels.tsv file across the data and data review finds one channel to be bad, eliminating the inheritance principle will require changing this in 20 files, rather than one file. This increases the chance for error. Managing BIDS data without the inheritance principle is thus more difficult if not all the data management is 100% dependent on code based tools.
On re-reading this I'm also thinking inheritance breaks the notion of a “sidecar” file. It's more of a distributed dataset-wide metadata specification which just happens to support documentation at the lowest level... This also clashes with files which support metadata (e.g. the NIfTI header), where we wouldn't put a file with just the header at the top level just as a static overview.
There are some things which are by necessity dataset-specific, such as the BIDS version, the DOI for the dataset release, or the subjects (hopefully we rename it to that, for now it's the “participants”) list. There are top-level files for that. Things which refer to individual files and not the study as a whole should go into the sidecar.
I wonder if we could arrive at changes to inheritance principle which would satisfy the actually present needs/benefits (as e.g. described above by @dorahermes ). E.g. with that use case:
@dorahermes do you think that would work for you?
I agree with the solution by @yarikoptic that sidecar .json or .tsv files should not overwrite or complement. This solution would keep BIDS curation typically simple (not requiring bash scripts or make datasets explode) and would also make it such that the current iEEG-BIDS examples remain valid as far as I know. Imaging browsing this BIDS example if the _events.json file was added for every individual run...
https://groups.google.com/forum/#!topic/bids-discussion/TjxOKEB1DD4
Right now I would say that BIDS is "experiment" centric, where experiment is some collection of data from many participants from a single site that is somehow defined by the experiment (or other justification) that was used to collect the data (openfmri calls these datasets, but that word can be applied at many scales so I am avoiding it). As a consequence, much of the data is "bunched" together at the experiment level, for example participant.tsv and the inheritance principle allowing scan parameters at the highest level. In the future, I would expect that more and more users of shared data will be aggregating data across many different datasets, which is a very common mode for FCP/INDI data. When they do this, it is unlikely that users will want to download all of the data for every experiment, but rather cherry pick the data that they want from the many experiments. I think that this would be much easier to accomplish if we changed the focus of BIDS from the experiment bundle to the lowest denominator - the participant/session level.
This would entail requiring that sidecar JSON and task description files be stored adjacent to the nifti file that they complement and get rid of inheritance. Also, this would involve moving the participant pheno, demo, and assessment data into the session folder. The result of this would be that ALL of the information that is necessary to analyze the dataset is available in the sub/session folder. This enables a more flexible file structure, where users can access only the data that they want.
Inheritance is the weakest part of BIDS and opens the door to a host of problems. I understand that it has the potential of considerably reducing the number of files, but comes at a large cost in complexity and inflexibility. If we do decide to maintain inheritance in the future, it would be much better to explicitly list the inheritance structure in the JSON files, rather than relying on the file structure. This will also make it much more efficient to find meta data (less time probing file systems for files that may or may not exist).
Another reason to remove inheritance is that, while it may be clear how to inherit values in a JSON file, inheriting values in other meta-data file types may be more complex and ambiguous. As raised in https://github.com/bids-standard/bids-specification/issues/337, a single named TSV column requires other columns to be interpreted (i.e. onset and duration, likely event_type) and so certain rows of other columns will also have to be inherited.
Perhaps a practical way forward is to limit the meta-data file types that can participate in inheritance.
Original authors: @ccraddock