bids-standard / bids-bep016

BEP016: diffusion derivatives
Creative Commons Attribution 4.0 International
6 stars 7 forks source link

Decisions on BIDS derivatives structure #50

Closed Lestropie closed 4 months ago

Lestropie commented 2 years ago

While I have written a lot of text in various locations regarding core decisions that need to be made regarding the definitions of filesystem paths for DWI derivatives, they may be too verbose or DWI-specific and therefore not be appropriate for widespread community engagement.

It is my intention to first post what I believe to be the viable solutions to these issues. Others are free to comment and even make alternative suggestions. Once the set of viable solutions is established, I will then construct polls to evaluate the degree of community consensus.

The example

We have a hypothetical DWI model called ABC. This model is represented using parameters X and Y. X and Y are of fundamentally different data types, such that it is not possible to store both in a single NIfTI image, and they must be split across multiple images.

For metadata, there is information that is relevant to model ABC as a whole, and there is additionally information that is specific to parameter X and parameter Y separately.

Following fitting of the model to the empirical data, it is possible to derive from X and Y another parameter of interest Z. This may in and of itself require metadata to explain how it was calculated.

Decision 1: Directory structure

(For the sake of discussion of directory structure, I will assume the existence of a new entity with key "model", and two new suffixes: "model", and "mdp" (model-derived parameter). This corresponds to decision 2, option 1 "few suffixes", but is used for demonstrative purposes in the context of decision 1 only, and the two decisions should be considered independent)

Option 1: "Complex inheritance"

sub-01/
    dwi/
        sub-01_model-abc_param-x_model.nii.gz
        sub-01_model-abc_param-x_model.json
        sub-01_model-abc_param-y_model.nii.gz
        sub-01_model-abc_param-y_model.json
        sub-01_model-abc_param-z_mdp.nii.gz
        sub-01_model-abc_param-z_mdp.json
        sub-01_model-abc_model.json

Advantages:

Disadvantages:

Option 2: "No inheritance"

sub-01/
    dwi/
        sub-01_model-abc_param-x_model.nii.gz
        sub-01_model-abc_param-x_model.json
        sub-01_model-abc_param-y_model.nii.gz
        sub-01_model-abc_param-y_model.json
        sub-01_model-abc_param-z_mdp.nii.gz
        sub-01_model-abc_param-z_mdp.json

Advantages:

Disadvantages:

Option 3: Directory hierarchy

sub-01/
    dwi/
        sub-01_model-abc_model/
            sub-01_model-abc_param-x_model.nii.gz
            sub-01_model-abc_param-x_model.json
            sub-01_model-abc_param-y_model.nii.gz
            sub-01_model-abc_param-y_model.json
            sub-01_model-abc_param-z_mdp.nii.gz
            sub-01_model-abc_param-z_mdp.json
        sub-01_model-abc_model.json

Advantages:

Disadvantages:

Option 4: Tarballs

Option 4a: Tarball with separate JSON

sub-01/
    dwi/
        sub-01_model-abc_model.tar
        sub-01_model-abc_model.json

Contents of file sub-01_model-abc_model.tar:

sub-01_model-abc_param-x_model.nii.gz
sub-01_model-abc_param-x_model.json
sub-01_model-abc_param-y_model.nii.gz
sub-01_model-abc_param-y_model.json
sub-01_model-abc_param-z_mdp.nii.gz
sub-01_model-abc_param-z_mdp.json

Advantages:

Disadvantages:

PS. Apparently there's been a prior discussion regarding tarballing of non-conforming derivatives in BIDS datasets; can anyone provide a link?

Option 4b: Tarball with embedded json

sub-01/
    dwi/
        sub-01_model-abc_model.tar

Contents of file sub-01_model-abc_model.tar:

sub-01_model-abc_param-x_model.nii.gz
sub-01_model-abc_param-x_model.json
sub-01_model-abc_param-y_model.nii.gz
sub-01_model-abc_param-y_model.json
sub-01_model-abc_param-z_mdp.nii.gz
sub-01_model-abc_param-z_mdp.json
sub-01_model-abc_model.json

Advantages (relative to 4a):

Disadvantages (relative to 4a):

Option 5: Hierarchy restricted to JSON

sub-01/
    dwi/
        sub-01_model-abc_param-x_model.nii.gz
        sub-01_model-abc_param-y_model.nii.gz
        sub-01_model-abc_param-z_mdp.nii.gz
        sub-01_model-abc_model.json

Contents of file sub-01_model-abc_model.json:

{
    "param-x_model": {
        ...
    },
    "param-y_model": {
        ...
    },
    "param-z_mdp": {
        ...
    },
    "ModelURL": "...",
    ....
}

Advantages:

Disadvantages:


Decision 2: File names

(Note that for the sake of these examples, decision 1 option 1 "complex inheritance" is utilised; this is however purely for the sake of generation of examples, and the two decisions should be considered independent)

Option 1: "Few suffixes"

sub-01/
    dwi/
        sub-01_model-abc_param-x_model.nii.gz
        sub-01_model-abc_param-x_model.json
        sub-01_model-abc_param-y_model.nii.gz
        sub-01_model-abc_param-y_model.json
        sub-01_model-abc_param-z_mdp.nii.gz
        sub-01_model-abc_param-z_mdp.json
        sub-01_model-abc_model.json

"MDP": "Model-derived parameter" (exact nomenclature can be up for debate)

Advantages:

Disadvantages:

Option 2: "Many suffixes"

sub-01/
    dwi/
        sub-01_model-abc_x.nii.gz
        sub-01_model-abc_x.json
        sub-01_model-abc_y.nii.gz
        sub-01_model-abc_y.json
        sub-01_model-abc_z.nii.gz
        sub-01_model-abc_z.json
        sub-01_model-abc_model.json

Advantages:

Disadvantages:

arokem commented 2 years ago

This is great. Do I understand correctly that poll 2 assumes that poll 1 was already resolved and that option 2 was chosen in that poll? Do we worry that might bias poll 2 somehow? (option 2 is probably my least favorite option in poll 1, fwiw).

Lestropie commented 2 years ago

Added some comments to both decisions 1 and 2, to clarify that in both instances the generation of examples necessitates assuming that one option has been selected from the other decision, but that the two decisions are independent.

francopestilli commented 2 years ago

Option 5: Zipped/Tarball with complete folder sub-01/ dwi/ sub-01_model-abc_model.tar

Contents of file sub-01_model-abc_model.tar:

sub-01_model-abc_param-x_model.nii.gz sub-01_model-abc_param-x_model.json sub-01_model-abc_param-y_model.nii.gz sub-01_model-abc_param-y_model.json sub-01_model-abc_param-z_mdp.nii.gz sub-01_model-abc_param-z_mdp.json sub-01_model-abc_model.json

Advantages:

similar to nii.gz Very very compact storage; multi-resolution view of data Tarballing is also an appealing solution for integration of non-conforming derivatives in a way that is trivially validator compatible

Disadvantages:

Apps would need to have capability to work with tarballs

To clarify my thoughts. Option 4 reminds me of the ANALYZE format which provided a .hdr and a .img set of files. After ANALYZE it became clear that unifying the header (.hdr) and the image (.img) into a single file .nii was convenient.

I wonder whether here we will feel the same. In other words, wha tis the cost of reading the .json inside the tarball?

francopestilli commented 2 years ago

I think Option 4 (or possibly Option 5 with some pros and cons) is the most convenient might provide some speed-ups by allowing search of the info in the top folder name only.

arokem commented 2 years ago

Just to see that I understand: the difference between option 4 and option 5 is that the tarball is not nested under DWI? How would we know that it's related to DWI, and discriminate it from models for FMRI or other modalities? Through the model name?

Lestropie commented 2 years ago

Difference between 4 and 5 is whether the JSON corresponding to model ABC as a whole is or is not embedded within the tarball. Both reside within the dwi/ modality directory. I'll probably reformat it as options 4a and 4b.

Lestropie commented 2 years ago

Option 5: Zipped/Tarball with complete folder

what is the cost of reading the .json inside the tarball?

Just clicked for me (and added to the dot points in the first post): This still really requires the more complex inheritance principle. If you read just one image (regardless of whether it's an intrinsic model output parameter or a model-derived parameter), both the contents of the paired sidecar JSON and the whole-model JSON are applicable.

Lestropie commented 2 years ago

@bids-standard/maintainers: Would very much appreciate any feedback on this thread. I can't keep up with everything happening in BIDS space, so it's possible that similar issues have been encountered elsewhere; also, any decisions made here may set a precedent for many other derivatives BEPs. After feedback from maintainers, if there's no clear consensus I'd like to open up discussion to the wider community.

francopestilli commented 2 years ago

@bids-maintenance We would like to make progress on this issue. We made a proposal, we would like to kindly request attention to allow us to move forward with the DWI-derivatives standard.

@bids-standard/maintainers: Would very much appreciate any feedback on this thread. I can't keep up with everything happening in BIDS space, so it's possible that similar issues have been encountered elsewhere; also, any decisions made here may set a precedent for many other derivatives BEPs. After feedback from maintainers, if there's no clear consensus I'd like to open up a discussion to the wider community.

@PeerHerholz @soichih @bids-standard/derivatives-mri-dwi @effigies

sappelhoff commented 2 years ago

I find Decision 1 Option 4a appealing. Unfortunately I am not aware of the discussion that may have happened on tar balls. Maybe Chris knows, but he'll not be available for the next weeks AFAIK.

For Decision 1 Option 2 you say:

Same as option 1 in that number of files within any given directory could grow very large.

how large are we talking in the worst case?

Based on the discussion of revamping the Inheritance Principle, I am not very fond of Decision 1 Option 1.

Re: Decision 2 --> I think one of the principles in BIDS so far was to use as few suffixes as possible, as many as needed ... so that makes Option 1 appear more favorable for me.

soichih commented 2 years ago

I am not too familiar with BIDS structure, but I'd like to vote for option 2 (or maybe 4a..) on decision 1 for it's simplicity.

I feel that BIDS is becoming too complex already with too many rules that I am not aware of.. I did write a few simple BIDS directory parser for our BIDS data importer library that implements (probably incorrectly) subset of all BIDS structure principles. I assume that I am not the only one who had to write such "broken" parsers as not everyone has access to libraries such as pyBIDS or can use them for their use cases. I also assume that the point of BIDS structure is to make the data structure simple/visible so that it can be used without using a dedicated libraries such a as pyBIDS if they wanted to, otherwise why not just make the whole structure closed within ".bids.tgz" type file format and provide canonical parsers for every programming languages?

No comment on Decision 2.

PeerHerholz commented 2 years ago

Hi folks,

here are my 2 cents.

Re decision 1:

I would either vote for option 1 or 4a, depending on the tarball handling. Did the discussion/link @Lestropie refer to reappear?

Re decision 2:

+1 on @sappelhoff's comment!

Lestropie commented 2 years ago

how large are we talking in the worst case?

Consider two experiments:

The former is perhaps less "exotic".

I think one of the principles in BIDS so far was to use as few suffixes as possible, as many as needed

That's useful. I've been leaning in favour of that myself, hence #46. Would be even better if anyone knows of a link to an explicit statement of such.


Will give a bit more time for maintainers / developers to comment / guide / suggest alternatives, but still like the idea of a community poll.

Lestropie commented 2 years ago

Contra-indication of interest in 4a:

Imagine one fits a model, producing a tarball of core output model parameters, and tarballing. Now one wants to use those parameters to produce a model-derived parameter (eg. an FA map from a tensor model fit). Sidecar information relating to the model fit are still applicable to the model-derived parameter. Would one therefore be obliged to unpack the tarball, add the new file(s), and repackage?

(Added to the list of disadvantages in the original post)

sappelhoff commented 2 years ago

Would one therefore be obliged to unpack the tarball, add the new file(s), and repackage?

as the dataset curator if you do this before finally sharing your dataset ... or when sharing a new version of the data: yes, you'd have to do that and it'd be a bit laborious.

as a user of the dataset, you wouldn't want to edit the dataset anyhow, would you? Wouldn't you save your new outputs elsewhere? For example in a new (derived?) dataset, which would bring us back to the situation above, which is "a bit laborious".

francopestilli commented 2 years ago

Wouldn't you save your new outputs elsewhere?

Good point. I would say yes. That would be preferable. Saved in a new derived dataset.

Lestropie commented 2 years ago

Hmmm, I'm maybe here thinking of a use case outside of that intended. Sometimes I will take a dataset that's been processed using a BIDS App, and do a little bit of subsequent tweaking after the fact, eg. calculating model-derived parameters that weren't calculated by the App, and I'll try to remain vaguely BIDS-compliant when doing so. But that means that the contents no longer reflect the output of that particular App. So tarballing may make such manipulation less convenient, but maybe it's actually a good thing, as such tweaking within the purported output directory of a BIDS App should be discouraged.

The outstanding question would be whether having eg. the model parameters in one derivatives directory, and model-derived parameters in a different derivatives directory. I think that as long as the validator doesn't impose any requirement on model-derived parameters coexisting with the model parameters it should be fine, but some chance I'm overlooking something.

Lestropie commented 2 years ago

Originating from @oesteban

Option 5: Hierarchy restricted to JSON

sub-01/
    dwi/
        sub-01_model-abc_param-x_model.nii.gz
        sub-01_model-abc_param-y_model.nii.gz
        sub-01_model-abc_param-z_mdp.nii.gz
        sub-01_model-abc_model.json

Contents of file sub-01_model-abc_model.json:

{
    "param-x_model": {
        ...
    },
    "param-y_model": {
        ...
    },
    "param-z_mdp": {
        ...
    },
    "ModelURL": "...",
    ....
}

Advantages:

Disadvantages:

Lestropie commented 2 years ago

Additional suggestion from @oesteban

This is described here as an augmentation of option 3; it does not in and of itself solve the complex inheritance problem.

sub-01/
    dwi/
        model-abc1/
            sub-01_model-abc1_param-x_model.nii.gz
            sub-01_model-abc1_param-x_model.json
            sub-01_model-abc1_param-y_model.nii.gz
            sub-01_model-abc1_param-y_model.json
            sub-01_model-abc1_param-z_mdp.nii.gz
            sub-01_model-abc1_param-z_mdp.json
        model-abc1.json
        model-abc2/
            sub-01_model-abc2_param-x_model.nii.gz
            sub-01_model-abc2_param-x_model.json
            sub-01_model-abc2_param-y_model.nii.gz
            sub-01_model-abc2_param-y_model.json
            sub-01_model-abc2_param-z_mdp.nii.gz
            sub-01_model-abc2_param-z_mdp.json
        model-abc2.json
        models.tsv

Must be content within file models.tsv that provides adequate identifying information for each model fit to be described both in a human-readable way, and for the validator to ensure correspondence between the unique identifiers in this file and the directory names.

Edit: models.tsv could be at the dataset root directory, and have a column encoding modality of each indexed model.

rwblair commented 2 years ago

For decision 1 is the following acceptable?:

sub-01/
    sub-01_model-abc_model.json
    dwi/
        sub-01_model-abc_param-x_model.nii.gz
        sub-01_model-abc_param-x_model.json
        sub-01_model-abc_param-y_model.nii.gz
        sub-01_model-abc_param-y_model.json
        sub-01_model-abc_param-z_mdp.nii.gz
        sub-01_model-abc_param-z_mdp.json

This does not address the issue of root level model files needing to use sidecar inheritance inheritance. Also does not deal with large number of files in a directory, but should be valid in the current specification.

Lestropie commented 2 years ago

For decision 1 is the following acceptable?

That precisely replicates the proposed "solution" as it appears in the current specification. However as I argue in https://github.com/bids-standard/bids-specification/pull/1003, it is to me unintuitive, as it involves placing datatype-specific data in a directory that has the specific purpose of disambiguating datatypes.

It could make more sense if alternatively "model-abc_model.json" could appear even higher in the filesystem tree such that it becomes applicable to that model as estimated for multiple subjects; that potentially comes with its own problems (eg. race conditions if parallel participant-level analyses attempt to write to it) but then again is a more advantageous utilisation of the inheritance principle if done properly.

Lestropie commented 4 months ago

Closing following merge of #92.