ENH: run-procedure for BIDS dataset configuration

jsheunis commented 1 year ago

I'm wondering if it would be useful to add a run-procedure to this extension to configure BIDS+datalad datasets such that all files in the root BIDS directory are committed to git while all the rest of the files, irrespective of type, go to the annex?

I'm thinking of use-cases related to distributed dataset-level metadata extraction and catalog generation. Data in the annex (typically all subfolders of the root BIDS directory) would need to be protected because of data privacy concerns, while data in the root directory (participants.tsv, dataset_description.json, any json sidecar files defined at the root level, any additional dataset-level metadata added at root level) are typically considered non-sensitive or have specifically been edited to be so, and can therefore be considered safe to commit to git.

Configuring a dataset like that (as opposed to annexing all files in the dataset) would allow sufficient metadata extraction on any clone without requiring access to the annex.

The run procedure would add something like this to .gitattributes:

* annex.largefiles=anything
/* annex.largefiles=nothing
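
As a rough sketch of what such a procedure could boil down to, the snippet below appends the two rules to a dataset's .gitattributes using only the Python standard library. The function name `add_rootfiles2git_rules` is hypothetical, and a real procedure would presumably go through DataLad's own repository API and save the change afterwards; that part is left out here.

```python
from pathlib import Path

# The two rules from the proposal. Order matters: for a given attribute,
# a later matching line in .gitattributes overrides an earlier one.
RULES = [
    "* annex.largefiles=anything",  # everything goes to the annex ...
    "/* annex.largefiles=nothing",  # ... except files directly in the root
]


def add_rootfiles2git_rules(dataset_path):
    """Append RULES to <dataset_path>/.gitattributes (hypothetical helper).

    Returns True if the file was changed, False if the rules were
    already the last two lines of the file.
    """
    attrs = Path(dataset_path) / ".gitattributes"
    text = attrs.read_text() if attrs.exists() else ""
    if text.splitlines()[-2:] == RULES:
        return False
    if text and not text.endswith("\n"):
        text += "\n"
    # Always append both rules together so their relative order is kept.
    attrs.write_text(text + "\n".join(RULES) + "\n")
    return True
```

Because the rules are appended at the end, they also take precedence over any earlier annex.largefiles lines already present in the file, which may or may not be what a given dataset wants.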

The procedure (let's call it rootfiles2git) would be available in this extension because it seems (to me) like it could be generally applicable to BIDS datasets collected in the EU (because of GDPR).

WDYT @yarikoptic @bpoldrack @mslw @cpernet @loj

CPernet commented 1 year ago

That's the 'standard' way to approach a BIDS dataset; it makes sense to have the root directory info visible (= in git) while the rest goes into the annex (it also makes for an easy catalog :-)) 👍🏻

bpoldrack commented 1 year ago

Generally, I think it does make sense, but the problem lies in

"or have specifically been edited to be so"

Editing something to be so implies that there was a state before that, one which must never have been datalad save'd. Such a setup doesn't really allow for mistakes, since you can't easily get things out of git again (retaining history is kinda the point of version control). That's why I'd hesitate to recommend a specific config from the start. It really depends on when in your workflow you'd want to apply that.

jsheunis commented 1 year ago

Fair point, although that problem/challenge exists whether one applies a run-procedure or not. It is something that the people managing the data would need to consider in any case when they turn it into a datalad dataset.

bpoldrack commented 1 year ago

Yes, but a default that annexes everything doesn't lead you into a trap.

Public and restricted content can still be separated in terms of storage. It may be a little less convenient, but you don't get into a situation that is really hard to fix.

To be fair: the existence of a procedure isn't exactly a default. I'm a bit worried, though, that it goes the way of text2git: pointed out as a convenience in a toy example in the documentation, and then everybody starts using it without realizing its disadvantages.

mslw commented 1 year ago

I think this is a sane approach, with two caveats (though keep in mind that my knowledge of the BIDS spec might not be up to date):

1. With the inheritance principle for BIDS metadata, there is no guarantee that a metadata file in the top-level directory describes all matching data, as values defined at the top level can be overridden by files deeper in the file tree. E.g. fMRI task information (TaskName, RepetitionTime, SliceTiming, etc.) in ...task-xyz_bold.json can be defined at any level (either at the top level or right next to the specific _bold.nii file). It seems to me that it has become a fairly common practice to promote these to the top level (and for good reason), but technically there is no guarantee of dataset scope.
2. Speaking of participants.tsv, this is a recommended file, and "commonly used optional columns in participants.tsv files are age, sex, handedness, strain, and strain_rrid". I wonder what the status of these is.
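
To make the first caveat concrete, here is a deliberately simplified sketch of inheritance-style sidecar resolution. The function names (`entities`, `resolve_metadata`) are made up for illustration, and the matching rule (same suffix, sidecar entities a subset of the data file's entities, deeper directories override shallower ones) is an approximation of the inheritance principle, not a full implementation of the BIDS spec.

```python
import json
from pathlib import Path


def entities(filename):
    """Split 'sub-01_task-xyz_bold.json' into ({'sub-01', 'task-xyz'}, 'bold')."""
    stem = filename.split(".", 1)[0]
    parts = stem.split("_")
    return set(parts[:-1]), parts[-1]


def resolve_metadata(root, data_file):
    """Merge applicable JSON sidecars from the dataset root down to the file.

    Simplifying assumption: a sidecar applies if it has the same suffix and
    its entities are a subset of the data file's entities.
    """
    root, data_file = Path(root), Path(data_file)
    want_entities, want_suffix = entities(data_file.name)

    # Directories to visit, ordered from the root to the file's own folder.
    dirs = [root]
    for part in data_file.relative_to(root).parent.parts:
        dirs.append(dirs[-1] / part)

    merged = {}
    for directory in dirs:
        for sidecar in sorted(directory.glob("*.json")):
            ents, suffix = entities(sidecar.name)
            if suffix == want_suffix and ents <= want_entities:
                # Values found deeper in the tree override earlier ones.
                merged.update(json.loads(sidecar.read_text()))
    return merged
```

If a root-level task-xyz_bold.json sets RepetitionTime to 2.0 and sub-01/func/sub-01_task-xyz_bold.json sets it to 1.5, the resolved value for that subject's run is 1.5. A clone that only has the root-level files (because the subject-level sidecar is annexed) would therefore report a value that does not hold for that file.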

loj commented 1 year ago

My biggest concern with this approach is when participants need to be removed. If the participants.tsv file or any other top-level file that contains participant data is saved to git, this becomes problematic.

mih commented 1 year ago

I agree, I would be hesitant to put anything other than a README and a LICENSE into git by default.

Code is another candidate, but only if the file identifiers are at minimum pseudonymized.

yarikoptic commented 1 year ago

"I agree, I would be hesitant to put anything other than a README and a LICENSE into git by default."

and CHANGE(S|LOG), with all sensible/supported extensions; that is indeed the "safest"! Worth something like cfg_minimal2git or alike (it probably isn't really BIDS-specific).

There is always a "hard to strike" balance in what to put into git and what into git-annex. For heudiconv, all .json and .tsv files go into git, except for the _scans.tsv files since those are to contain full dates. The minimal set above would be "safest", but then forget about lovely git grep etc., which I do like to use quite often in BIDS and similar datasets.
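
The two policies mentioned in this comment can be compared with a small model. This is only an illustration of the rules as described in the thread, not heudiconv's actual configuration and not an existing DataLad procedure. The function names are hypothetical, and restricting the minimal set to root-level files is an assumption.

```python
import fnmatch
from pathlib import PurePosixPath

# "Minimal" set: README, LICENSE, CHANGES/CHANGELOG with any extension.
MINIMAL_PATTERNS = ("README*", "LICENSE*", "CHANGES*", "CHANGELOG*")


def in_git_minimal(path):
    """True if `path` would go to git under the minimal policy.

    Assumption: only files directly in the dataset root qualify.
    """
    p = PurePosixPath(path)
    return len(p.parts) == 1 and any(
        fnmatch.fnmatchcase(p.name, pattern) for pattern in MINIMAL_PATTERNS
    )


def in_git_heudiconv_like(path):
    """True if `path` would go to git under the heudiconv-like policy:
    all .json and .tsv files, except *_scans.tsv (which hold full dates)."""
    name = PurePosixPath(path).name
    if name.endswith("_scans.tsv"):
        return False
    return name.endswith((".json", ".tsv"))


for example in ("README.md", "participants.tsv", "sub-01/sub-01_scans.tsv"):
    print(example, in_git_minimal(example), in_git_heudiconv_like(example))
```

The interesting case is participants.tsv: it stays in the annex under the minimal policy but lands in git under the heudiconv-like one, which is exactly where the participant-removal concern raised earlier in the thread applies.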

jsheunis commented 1 year ago

Thanks for everyone's input!

FYI @CPernet there is already a standard BIDS config that does the above process to an extent. See here for an update: https://github.com/datalad/datalad-neuroimaging/pull/115.

datalad / datalad-neuroimaging

ENH: run-procedure for BIDS dataset configuration #114